From YouTube: Kubernetes SIG Node 20210728
Description
Meeting Agenda:
https://docs.google.com/document/d/1j3vrG6BgE0hUDs2e-1ZUegKN4W4Adb1B6oJ6j-4kyPU
A: Hello, it's July 28, 2021, the SIG Node CI subgroup meeting. Hello, everybody. We have a small agenda, and the very top agenda item is our test stability for the release. We have had some test failures recently, and Mike, I just added another agenda item. Mike, do you want to start, so we get it out of the way?
B: Sorry for this. I finished merging this PR. In particular, I was wondering whether it's okay to backport it to 1.22. I understand that the deadline was yesterday, but I'm not sure if that's still possible, or if it's not really critical for this version.
C: I'm not muted, so: backports are anything that's release-critical or release-blocking. Let me see what the PR is; I haven't looked at it.
C: It is important that we eventually backport this, because we don't want things that are potentially failing to fail 1.22 conformance.
A: Okay, let me share my screen and open the link.
C: Yes, we can see your screen. I just pulled in the PR that I opened for the dynamic kubelet config metrics to 1.22, so that's new.
A: We may not need it. My only thought is that it may have happened because master was already bumped to 1.23, and that's why it started failing in master; it may not fail on the 1.22 branch.
A: Okay, well, I mean it's already approved and it's a small change, so likely we can just take it and see what's going on. But it would be interesting to confirm what the behavior is supposed to be for hidden metrics, especially since we cannot easily unhide them and test that end to end.
C: Yeah, well, so I flagged this to Han earlier today, because I think this is honestly the first time that somebody has tried to use this for a feature deprecation, as opposed to just removing metrics and replacing them with something else. I think the deprecation process was designed to work for stable metrics; I haven't seen somebody try to deprecate an alpha metric quite like this before.
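(For reference: this is roughly how deprecation is expressed in the component-base metrics framework that kubelet metrics use. A minimal sketch with an illustrative metric name, not the exact metric from the PR under discussion; a metric deprecated in one release is hidden by default in the next, and can be re-exposed with the --show-hidden-metrics-for-version flag.)

```go
// Minimal sketch (illustrative metric name, not the actual one discussed) of
// how a deprecated alpha metric is declared with k8s.io/component-base/metrics.
package kubeletmetrics

import (
	"k8s.io/component-base/metrics"
	"k8s.io/component-base/metrics/legacyregistry"
)

var exampleDynamicConfigMetric = metrics.NewGauge(
	&metrics.GaugeOpts{
		Subsystem:      "kubelet",
		Name:           "example_dynamic_config_state",
		Help:           "Illustrative metric tied to a feature being deprecated.",
		StabilityLevel: metrics.ALPHA,
		// Deprecated as of 1.22: the framework logs a deprecation notice in
		// 1.22 and hides the metric by default once the component is 1.23+,
		// unless --show-hidden-metrics-for-version=1.22 is set.
		DeprecatedVersion: "1.22.0",
	},
)

func init() {
	legacyregistry.MustRegister(exampleDynamicConfigMetric)
}
```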
C: Basically, I am not entirely sure. We are working in SIG Instrumentation on what conventions for this sort of thing should look like, because we haven't really gone through this much before. So it's possible that the release version has already been bumped to 1.23 in master and that's why some of these tests are failing, but that doesn't explain why we've been unable to debug all of the other test failures that are causing problems.
C: For those on the call who are not sure which test failures I'm talking about: there are two issues here marked release-blocker, and one of them is a SIG Storage test, so I've not been focusing on that one. The other one is the kubelet serial failures, and we need to get that fixed. We're not entirely sure why those are failing, but we have a number of tests that started failing, particularly these dynamic kubelet config reliant tests, that started failing during test freeze.
C: So we need to figure out what's going on with that before we can release, because the signal is not great right now, and because the job is broken overall, it makes it even harder: we can't just say, okay, it's green, great, let's go. So possibly what we should do is mark the two tests that are known failures as flaky; then they'll get skipped, and then we can possibly get the job green. But I have not previously suggested that.
C: Well, what you can do is: if I add a flaky label to them now, or something like that, then later you can take that off. Yeah.
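(For context on what adding a flaky label means mechanically: in the Kubernetes e2e suites the tag is just part of the Ginkgo test description, and CI jobs filter on it, for example with a --ginkgo.skip=\[Flaky\] or a SKIP="\[Flaky\]" setting for node e2e. A minimal sketch with an invented test name, not one of the actual failing tests:)

```go
// Minimal sketch: tagging a node e2e test as [Flaky] so the serial job skips
// it until the failure is understood. The suite and test names are invented.
package e2enode

import "github.com/onsi/ginkgo"

var _ = ginkgo.Describe("[sig-node] Example kubelet behavior [Serial]", func() {
	// Appending "[Flaky]" to the description is the whole mechanism; jobs that
	// run with --ginkgo.skip=\[Flaky\] will no longer execute this test, and
	// removing the tag later re-enables it, as discussed above.
	ginkgo.It("should reconfigure the kubelet without disruption [Flaky]", func() {
		// test body elided
	})
})
```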
C: Okay, yeah, so that sounds good. I can go ahead and do that, and maybe that will help. I know Dims has not been super confident because those jobs are failing, and so when, for example, we did the runc bump, we couldn't get any green signal saying yes, this is good to go, let's ship it. So I don't know if we'll be able to get kubelet serial to green before the end of the week, but I can certainly try.
C: Sergey, do you know if there's any way to get more than two weeks of history on Testgrid, or is that the limit? Because a month would be good at this point. The problem is that test freeze is so long that I can't compare back to July 7th or July 8th anymore.
A: I mean, we have this PR from Matthias to increase the timeouts, right? So maybe we can just...
C: Yeah, but at this point we're not fixing flakes anymore; we're just trying to fix things that are going to break the release. Test flakes are annoying, but you know, they're not release-breaking.
C: And Peter, do you know what's going on with the two, or no, not two, just one: the one CRI-O test failure there?
D: I do actually have a suspicion, and I think it's a change of a default in CRI-O, but I want to check. I think it is also partly a problem with the way the test is written, so I'm trying to see; there might be a quick fix that I'll propose, but it shouldn't have to be a release blocker. Okay.
A: I saw the comment that it may be related to the Fedora version, something like that.
D: Actually, the suspicion is this: in CRI-O we've been trying to drop the pause container by default, and I think the stats summary test is expecting the pause container to exist, or cAdvisor is, so it's failing to get some stats, and I think that's muddling the test. So I need to dig in to see who's at fault there.
D: Yeah, I mean, I also have a PR up to stop that, if people want to look at that now.
C: Our list of 1.22 stuff should now be a decent bit shorter; only four things left.
A: Yeah, I wonder, for this bump of the metric deprecation to 1.23: we don't run serial on 1.22...
C: What do you mean? We have no way to validate that it's working, I mean.
A: We don't run serial tests on 1.22 periodically, right? For 1.22 it's just the features jobs.
C: We do generally want to get things in earlier versions to be more stable now, right? So we've spent a bunch of time trying to unbreak the serial tests. Now is probably not the time to say, by the way, we want to add all this stuff to 1.22, but certainly going forward we can look at adding some of this stuff and adding periodics as soon as we cut 1.23 or something. I think we're making progress on stability here.
C: It's just a matter of getting it done. I am worried, though. Fundamentally, right now the problem is that we had probably a dozen things all land the day of code freeze, and as a result we had a bunch of regressions all starting at code freeze, and it's impossible to tell which was what, so it's challenging to go through. For example, looking at the kubelet serial failures, the new failures, I thought...
C: Sure, there are a few things showing up that I think were legitimate. For example, the thing that we're really trying to track down right now is that the reason a bunch of the tests are failing is we have this one test that fills up the disk, and it was failing, which meant it was not cleaning up after itself. So there were all these files left on the disk, the disk was out of space, and then all the rest of the tests would fail.
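(A minimal sketch of the kind of defensive cleanup that prevents this failure mode, one disk-filling test poisoning everything after it: do the cleanup in an AfterEach so it runs even when the test body fails. The directory name and structure are invented for illustration, not taken from the actual test:)

```go
// Illustrative only: cleanup placed in AfterEach runs whether the test passed
// or failed, so leftover files can't starve the rest of a serial run.
package e2enode

import (
	"os"

	"github.com/onsi/ginkgo"
	"github.com/onsi/gomega"
)

var _ = ginkgo.Describe("[sig-node] Disk-pressure example [Serial]", func() {
	scratchDir := "/tmp/e2e-disk-fill-scratch" // hypothetical scratch location

	ginkgo.AfterEach(func() {
		// Reclaim the space even if the It below failed part-way through.
		gomega.Expect(os.RemoveAll(scratchDir)).To(gomega.Succeed())
	})

	ginkgo.It("should evict pods under disk pressure", func() {
		gomega.Expect(os.MkdirAll(scratchDir, 0o755)).To(gomega.Succeed())
		// ... fill scratchDir and assert the eviction behavior here; if the
		// assertion fails, AfterEach above still removes the files.
	})
})
```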
C: That failure indicated a real problem with static pod storage cleanup. But yeah, with all this other stuff potentially blocking, it's like: oh, why is this failing, and why is that failing? Everything is failing.
A: Yeah, but look at this list: if you can confirm that this thing was caused by the version bump in master. If the version bump in master happened on 7/20, then it's probably going to coincide with the start of the failures, right? So if it happened on that date, then we can tell whether this is the cause or not, and whether we need this fix in 1.22.
A: And then, what will be left? We don't know right now, because we cannot run all the other tests, so we need another fix to mark the test as flaky, right? I mean, mark the tests as flaky, and then we can potentially get a green run.
A: Yeah, my problem is that we're trying to make the serial tests release-blocking, but we never run them periodically for 1.22, plus we never did it before. So far we have only discovered test failures and test inconsistencies. Yeah.
C: So, I mean, for additional context: I think the worry is not merely that, you know... We do want to treat these things seriously, and because we had such a big refactor land in 1.22, I think we want to be really sure that this is not going to cause problems. And I know, at least in the initial testing that we've been doing at Red Hat...
C: We've been having issues with the performance of clusters going from 1.21 to 1.22, particularly with I/O performance. These are very initial results, but concerning enough for me to say that I really want to get these tests green, and I want to try to get an idea of what's going on, so that we don't end up releasing with big blockers like that, or big performance regressions.
A: Yeah, I don't know. I understand the potential critical bugs we could discover, but at this stage I don't know how much we should mark as release-blocking; we never did before.
F: We've also found that there have been multiple regressions that were critical, and so until we've at least investigated all of them, it's kind of worrying for an actual release.
C: Yeah, Danielle, there was the new failure that I found, and you did the deep dive on that one, and that one was again a pretty serious bug that we caught. So that one, I think, was also marked critical, urgent, fix right now.
C: My concern is that if we have a bunch of tests that previously were not failing and are now failing, we need to get rid of whatever is causing them to fail, and make sure: are they failing for legitimate reasons or not? Because we don't have anything else to really go on here. So, yeah.
F: The rest of the eviction failures, by the way, seem to mostly be actual flakiness, as opposed to being broken, in that I can make them all pass individually, but running them all together is kind of a mess. So I'm trying to make those not be trash.
C: Yeah, well, so I think...
C: The eviction stuff is not in the milestone; I think we're kind of going through it. I don't know, maybe my screen is frozen, but if you refresh the 1.22 milestone list, it's much shorter now, because I bumped a bunch of stuff.
C: Oh yeah, yeah, so we've just got the four items. We know that there seems to be some sort of actual subpath regression happening with SIG Storage, and we need to help them if that ends up being on us. And then, similarly, for the serial failures we need to get that test semi-green as soon as possible, so that we can say it's not our problem, nothing's broken, it's all cool. But right now I just don't have that confidence.
G: Maybe it's worth mentioning that we also have the CPU manager, which is broken, and the problem is that it's not only the CPU manager that is broken; all the managers are broken, because of the way placement is managed. But again, we assumed that the buggy behavior was the expected one, and now we should fix all the managers, the CPU, device, and memory managers, in line with Clayton's refactoring. I'm currently working on it.
G: It will not crash the kubelet, like we already discussed in the thread, but again, it can be pretty serious.
C: Yeah, there are definitely some flakes from those sorts of issues. I think those wouldn't be release-blocking; not everything out of the box is going to be using the CPU manager. So they're definitely urgent and we should fix them, but they're probably not going to block the release. Whereas if the kubelet is failing to clean up files from static pods, for example, that would block the release.
C: We're just so deep into test freeze, and it's really long, four weeks. So, yeah.
C: I saw that issue. I punted it from the milestone, or rather, I think I didn't pull it into the milestone because it was filed late. It's definitely a priority, but we have too many P0s, so maybe that's a P1.
C: I don't know; we may want to talk to the release team.
C: Yeah, I was just going to say, yeah, I have no problem with that, because everything that we've been doing at triage is probably going to be lower priority than this. So, yeah.