From YouTube: sig-testing weekly
Description
No description was provided for this meeting.
A: Yeah, so I did some work for Apple at some point, for the retail stores, and they gave us a small window when we could ship files to stores; it was between two and four a.m. or something. But that means something different in every single time zone, and there are 15-minute offset time zones, and there's a leap of 30 minutes, and yeah.
D: Okay, pardon the lack of video; I'm having some issues with Zoom. If we don't have other stuff on the agenda, I'd like to discuss the Working Group Reliability work proposal, kind of moving forward on reliability.
D
Let
me
okay,
sorry,
I've
been
trying
to
get
zoom
working.
Let
me
add
a
link
to
the
agenda.
A: There are a couple of things that came out of some of the leads meetings and the community meeting last week. We need to get some onboarding docs; well, not just docs, we need to figure out onboarding for new contributors when it comes to Prow and TestGrid in general. I think Jordan and I were talking about maybe recording some more videos, so if anyone is interested or has ideas around that, feel free to let us know.
E: Yeah, we're planning to continue adding stuff to the Prow documentation website. I know we have a PR out right now with a detailed diagram of how Tide functions, so we're working on that. That's pretty helpful.
A: Awesome. I also talked to SIG Release this morning, and the CI Signal team is also working on their onboarding documentation and stuff. So yeah, hopefully there's a good effort to get people better educated on how this stuff works.
E: Yeah, I think we have some good existing resources scattered around too; there's a job cookbook that can be pretty useful to people trying to use Prow. I think it's probably just not located well, so having everything on the site should make it a lot more accessible.
D: Yeah, the Zoom client is not working properly and I'm using the web thing, which seems to have hardware problems, so... anyway. We want to move on to the reliability discussion. So WG Reliability came up with a proposal for how to handle Kubernetes' declining reliability.
A: Okay, let's do that then. So, just a reminder: this meeting is recorded, and everything you say will be on the internet. This meeting, as with all Kubernetes and CNCF meetings, abides by the CNCF code of conduct, so please be excellent to each other.
D: Okay, cool. So we have this KEP proposal from the Reliability Working Group about how to handle Kubernetes' lack of improvement, at the least, in overall reliability over the last couple of years.
D: As you can see from the comments, I have some disagreements with the approach, but regardless, it does feel that clearly we as a project need to be doing something to produce continuous reliability improvement, which is not really a thing right now. So I've been taking it around to the community meeting and to some of the key SIGs, which would be Release, Architecture, Enhancements, and now Testing, asking whether people have ideas of feasible steps that we could take that would produce reliability improvement over time, both from the Reliability Working Group's proposal and from anything we conceivably would do.
G: I'd also bring up that we've had the long-standing triage tool that lets you drill into where a test failure is happening. It's pretty powerful.
D: Yeah, and it may be that the primary problem is just that people don't know how to use the tools that we have, in which case it's more of an education campaign, which would be great, honestly, because that's something that Contributor Experience could potentially handle.
G: Yeah, if that is the case, we have some pretty excellent content that Jordan put together. We have it captured in a doc and in a video recording on how to triage flaky tests using these tools.
D: Yeah, okay, so an education program there is something for Contributor Experience, maybe, to put together, to make sure that everybody knows how to connect the dots. But then the other question is this sort of thing of, for example, not having overall improvement in the number of flaky tests, and having a kind of overall decline in test coverage.
D: This is not a new problem for SIG Testing. This is a problem that, as far as I know, we have always had, from like year two of the project, and it seems like y'all would have some ideas about what the rest of the project should be doing to address this.
G: Well, in the past we've instrumented coverage, and we found that we didn't have much success getting anyone to look at it. So at this point I think some of our coverage tracking is just not even fully functional, because I think there's insufficient incentive to care about this, right? Even for the unit tests, we don't really keep that dashboard functional. There's been some specific focus on conformance coverage.
D: If somebody else, Contributor Experience, Steering, whoever was taking on the "hey, we will make the rest of the project care about this", or Working Group Reliability, how difficult would it be to get those dashboards working again?
G: It would work. We've implemented this in some other places, or my team has, where we basically have a post-submit record the coverage amount, and then you can have the pre-submit check against the recorded amount and fail if it dips. I'm not a huge fan of these; I think it sort of just drives checkbox coverage, like, okay, I got something to exercise the code path, now the coverage number will pass. It doesn't mean you actually have good, comprehensive tests, but it's something we could consider.
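(A minimal sketch of the coverage-ratchet idea described above, for concreteness: a post-submit records the total coverage percentage somewhere durable, and a pre-submit fails if the current number dips below it. The file names and the use of a plain "go tool cover" profile are assumptions for illustration, not the project's actual tooling.)

```go
// Illustrative coverage ratchet: compare the current total coverage against a
// recorded baseline and fail on a dip. Paths and inputs are hypothetical.
package main

import (
	"fmt"
	"os"
	"os/exec"
	"strconv"
	"strings"
)

// currentCoverage parses the "total:" line of `go tool cover -func`,
// e.g. "total:  (statements)  71.3%".
func currentCoverage(profile string) (float64, error) {
	out, err := exec.Command("go", "tool", "cover", "-func="+profile).Output()
	if err != nil {
		return 0, err
	}
	lines := strings.Split(strings.TrimSpace(string(out)), "\n")
	fields := strings.Fields(lines[len(lines)-1])
	return strconv.ParseFloat(strings.TrimSuffix(fields[len(fields)-1], "%"), 64)
}

func main() {
	// Baseline written by a hypothetical post-submit job.
	raw, err := os.ReadFile("coverage-baseline.txt")
	if err != nil {
		fmt.Fprintln(os.Stderr, "no baseline recorded:", err)
		os.Exit(1)
	}
	baseline, _ := strconv.ParseFloat(strings.TrimSpace(string(raw)), 64)

	cur, err := currentCoverage("coverage.out")
	if err != nil {
		fmt.Fprintln(os.Stderr, "could not read coverage profile:", err)
		os.Exit(1)
	}
	if cur < baseline {
		fmt.Fprintf(os.Stderr, "coverage dipped: %.1f%% < baseline %.1f%%\n", cur, baseline)
		os.Exit(1)
	}
	fmt.Printf("coverage ok: %.1f%% (baseline %.1f%%)\n", cur, baseline)
}
```

As the discussion notes, a gate like this only proves that something exercises the code path; it says nothing about whether the tests are actually comprehensive.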
D: Yeah, because one of the things that was actually raised during the community discussion was people suggesting that one of the sources of problems is that enhancement submitters are often allowed to postpone submitting e2e tests, and of course postponement sometimes means postponement forever. So I think there might be some appetite for having more stringent requirements.
G: I think that reviewers checking for what their concept of sufficient test coverage is makes more sense. Also, I'd say unit test coverage is actually, I think, pretty good where it can be good, and there are other things that you just don't really unit test. And for e2e coverage, it's not super viable to stick that in pre-submit, because the performance penalty of instrumenting everything is kind of high, and it doesn't directly correlate; there's some noise due to the exact behaviors that happen and things like that. Yeah, it's useful.
G: The number that I'm probably more concerned about myself would be whether we're exercising everything in e2e, and, as well, we're not going to run all of the e2e tests in a given job; there are just too many of them, it would take too long.
D: Okay, okay, yeah. I was just, you know, making the grand ask there, yeah.
D: You know, it's kind of all out there, right? We have the visible parts of the lack of improvement in reliability, which in my experience are primarily (a) flaky tests and test jobs and (b) upgrade/downgrade failures, which we see both in our testing and in the field.
D: Well, I was going to say: if downgrade doesn't work, then upgrade needs to work flawlessly, right?
G: A few folks, like me, are doing this in our very, very spare time, and it's just not sustainable, and it's not great tooling. So for things like upgrade, the folks that are approving the code right now really don't have time to be fixing upgrades, or there's no one responsible for upgrade, and often the problem is actually with the upgrading itself and not necessarily with one of the components, and it's hard to tell.
D: Okay, but you'd agree with my assessment that one of, like, our sort of super critical areas is a kind of lack of meaningful upgrade testing?
D: Speaking as somebody who works for a vendor, I would identify this as a critical area.
G: ...catching up there, which works for them, but means we're typically catching things late. So I would add, I think, from my point of view: we have these lengthy release cycles with a freeze period, and we tag all these betas and alphas, but I don't think anyone's running those builds.
D: No way. One of the to-dos that's been kind of perpetually kept alive by the bot since I was release lead, actually almost two years ago now, is to make it easier for the public to consume alphas and betas.
G: When their tests are failing? Because I thought that was the point of code freeze: that we're not releasing until the tests are passing. That's not necessarily good enough to be catching things, like... we have the release-blocking dashboard; maybe more things need to be in release-blocking.
D: Yeah, well, in release we've done this thing for a number of years where we've said, hey, if this test is going to be in release-blocking, it needs to be not so flaky, and, you know, taking the job back to the SIG that owns the job, and them saying we're not going to fix it.
D: You can see my arguments, and other people's, as to why focusing on this as "we're going to block enhancements" is going to be counterproductive, which is my opinion. So, I mean, I do think, watching and talking to some of the current release team CI people and that sort of thing, that the release team has really been led to the idea that it's their job to make the tests pass, or fudge it,
D: instead of blocking the release. I mean, if I compare how we behaved on the release team two or more years ago versus how the release is run now, the release team has been told to prioritize the timeline over passing the tests, and I'm not sure that was a good decision, and the decision could be revisited.
H: I'd say, as the relatively new guy here: you've got tests that are supposed to be blocking your release that you're taking out of the way, that you're not going to fix, right? And then I kind of lost my train of thought, but essentially it sounds like some of the... and that was the other thing, that the release team was saying it was their job to fudge or fix tests to make them pass, and we started the conversation with: there's a perception that there are reliability issues.
G: I personally am not sure I've seen fudging, but I do feel like they feel responsible for getting the tests to pass, and they have done quite a bit of chasing folks around trying to get things fixed, and it doesn't seem super sustainable. But even stepping further back, there's only so much that we're actually blocking on to begin with, and something like upgrade is not one of those things.
G: You need to keep your job very stable for a period of time to show that it is a reliable signal, that the job can be stable, and that if it is unstable we must have had a bad change. And as someone who's brought some jobs through that, it's really time-consuming. One of the main things I'm doing when I'm getting the kind project ready to be able to be used in our CI is just trying to create the perception that this is actually a reliable way to test, and I have to go running around fixing things, because until it's a blocking job, people make changes that subtly break the job, and you have to go through a huge amount of changes, find what broke it, and get it fixed.
G: And if you don't fix that fast enough, you don't have a green period on your dashboard that you can point to. I would say that to get that viable, you have to really get to a point where you have multiple people paying attention to this and starting to already treat it as sort of pseudo-blocking: if it went red, something's wrong, we need to fix it. Then get it stable, get it into somewhat blocking. If you have this purely in post-submit,
G: that can be extra hard, because there's no signal on a PR that this was going to break this thing. We've had examples of this before, where things that should be totally viable get knocked out and never come back. We used to test with kops on AWS in pre-submit, but we ran into the billing account being suspended there, and it has never recovered. It was a billing issue; it wasn't that it stopped functioning, it was that we weren't able to run CI because the account wasn't paid up, and once the account was paid up again...
G: I mean, it has never come back, and we don't have that coverage now. I think we're moving in a direction where that's not the sort of thing we want; we're not trying to test on more cloud providers or something like that, and that isn't necessarily what we want in pre-submit to begin with, maybe we want to get away from all of them. But it's an example of how everything involved should have been totally stable, yet once you lose being that blocking thing, the momentum needed to get back into blocking is high.
D: Yeah, yeah, and well, the thing is that when we started out with the release team, we had the opposite, right? We had lots of tests that would kind of fail randomly, and their failures didn't mean anything, and so you had the release team scrambling
D: on a pretty much daily basis to figure out whether a test failure was meaningful or not, which would then obscure the test failures that actually were meaningful. But yeah, there's not a lot of motivation for SIGs to promote their own tests, because it's a lot of work and it doesn't solve a problem for them, necessarily, at least not right now.
G: Yeah, there's a very high bar for motivation there. It has to be something where whoever's paying you to spend time on this agrees, presumably, because it costs a really large amount of time; I'd be surprised to see purely volunteers doing these. You have to convince them that you need this to be tested and release-blocking upstream. I've known a few examples at my employer where that's happened, but it's not very common, because a lot of times maybe just having some test signal that you can monitor is good enough.
D: I wanted to answer Eddie's question by explaining one case of a test job becoming "we won't fix it" that happened when I was doing CI Signal for the release team. So this was a while ago, again, a couple of years ago.
D: You would have to ask somebody who's currently on CI Signal for any contemporary examples of this, but really what it comes down to is: we used to have both upgrade and downgrade tests, and a whole battery of upgrade tests, which still exist within SIG Cluster Lifecycle's TestGrid. And what happened over time was that SIG Cluster Lifecycle's pool of contributors became identical to the kubeadm contributors, as in, there were not people who were part of SIG Cluster Lifecycle who didn't work on kubeadm. And so we had a bunch of upgrade tests that were based largely on kops, and they belonged to SIG Cluster Lifecycle because they were upgrade tests, and SIG Cluster Lifecycle said: this is kops, we're not going to maintain or troubleshoot these anymore, because we don't care about kops,
D: only our own. So we said, okay, we'll swap in the kubeadm upgrade tests, right, because you want to maintain kubeadm, which we did for a while. But it turns out that with Cluster Lifecycle's workflow, they were okay with those tests going red for weeks at a time, which was obviously not acceptable if they're going to be master-blocking.
D: So we ended up without upgrade tests at all in master-blocking. I mean, that's an example of the sort of thing that happens. Often it's not so much that you take a job that is clearly owned by a SIG to that SIG and they say we're not going to fix it; a lot of times what happens is you take a job to a SIG that ostensibly owns the job and they say, well...
G: For the GCE tests, I also worked with a team that was like, okay, we need upgrade tests to cover this thing that we're writing, and so they put a bunch of effort into getting the upgrade tests going. Because another part of this is that for those other tests we're talking about, there isn't any special upgrade testing; it's test, upgrade, test, and it's just the usual tests. We actually had some tests that try to do things across the upgrade,
G: like, is an application still running or something. But those are bad, from early in the project, not very worked on, and horribly tied into, like, the GCE clusters and things; they need to be able to talk to the deployer to run the upgrade, and they're not aware of many of them. So getting those running again itself has some limited value, because no one wants to touch those tests, and so the activation energy to get good upgrade testing, I think, is even higher.
A: Again, I was talking to Jordan last week, and I still think that there's an onboarding problem. I think even I, as a maintainer and lead for the past two years, still struggle when I have to debug a test, which I spent yesterday doing, and, you know, it's not a knock against the tools; it's not against our documentation.
A: It's just that, like, take the terms pre-submit and post-submit, for example, right? Those mean nothing to most people who aren't working at Google; that is not a term I've ever had at any of the six companies I've worked at before, we've called it something different. And it's just little things like that: when people get a failing Prow job, most people just /retest and move on, right? They don't know what to do.
A: There are no actionable steps there, other than "hey, run /retest to re-run me", right? So it's a bigger problem, and it's definitely the lack of resources and people that we have to put on making it better. So, I know Paris is putting together... I got a sneak peek of a deck that Paris is writing right now for the CNCF, but yeah, we need more dedicated resources and individuals.
D: So, okay, if folks have other ideas, observations, or commentary on this, please comment either on the KEP or on the thread I opened up,
D: so that your thoughts get shared with the rest of the project. Because I think one of the reasons why SIG Testing is under-resourced is that people expect it to just work and don't really think about the people who are in this SIG, which happens with a lot of things, right? They're focused on their code and they're like, well, somebody else is taking care of the tests.
H: No, that's wrong any way you look at it. The person that authored the test... the test belongs to them forever and ever and ever.
H: They've been... the test should go with the code and the code should go with the test; they're one and the same, and they should never part. One of the, I guess, general observations is that the scope of the group is in the space of governance and process and ownership, versus just straight test tool development. So we... but we built these tools so you can measure, in some aspect, the quality of your code.
G: We help run the infrastructure and tools for the tests. Okay, yeah, one of my back-burner things, it would just be so expensive to do, but someday I would love to see this SIG renamed. Everyone thinks we write tests, and nobody here wants to write tests; we're not going to fix the...
G: We have tools that can let you know that, but you have to opt into being alerted. We've done campaigns to get SIGs to set up, like, an email group that gets these, or things like that. But no, the main group that actually actively tracks that is a subgroup of the release team, the CI Signal group, that is focused on just the release-blocking signal. Those folks will actually go chase people down trying to get tests fixed, but for the entire project at large...
H: You know, I completely think you should not do that. I think if... and I guess I also don't know: is there like a central execution environment that is the gatekeeper for all of this, that this all runs in, and if it doesn't pass there, it didn't pass, period?
G: Yeah, so the vast majority of CI is run on our own homegrown infrastructure, Prow, and we also have a tool, TestGrid, that gives you a display of your results and can email you on failures. And we have some of the CI that prevents things from merging, okay.
G: But beyond that there's not much that we're really using. So, I worked on a program here where you can run your conformance tests and upload results, and if you get them running continuously against the latest Kubernetes code and this is stable, then I'll help you push to get this into release-blocking, because it was hard to get even our base conformance sets in, because then people are like, oh, well...
G: But getting a third party to provide reliable enough results has not been very successful.
G: I think it's important that our contributors are enabled to go fix these things. For the things that are release-blocking, that does actually happen. The problem is, there's a lot more that probably should be in there, and it's hard to scale that up.
H: Right, but, well, you're kind of damned if you do, because then you leave it in the... sorry, I shouldn't use bad words while being recorded. But anyway, you can scale a little bit more effectively when you start involving third parties as well, right, because then it's not all on you to scale up every single platform that's out there.
G: Oh, sure, but I mean, as a community, third parties can come help us scale what we have. When we've had third parties totally run their own thing and just submit results, it actually hasn't been much of a scale-up. It's usually one or two people from the company or whatever, and then they kind of flake out on us and we lose contact and we don't...
A: Eddie, I just wanted to give a quick... I don't know, I want to talk through what I worked on yesterday, right? So this was the issue: it started failing 27 days ago, right? I got pinged by the release team. I don't know when the first comment by us on here was... so, seven days before we noticed it, right?
A: We as a SIG are not looking at all the tagged issues; it's just way too much GitHub noise, right? I have almost all GitHub emails muted. And then, so, we take a look at this, we dig into the result, and there's a problem with these files, stable.txt and stable-1, which are for the current stable commit on master and then back one version.
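(For readers unfamiliar with these markers: they are plain-text files whose entire body is a version string, and jobs resolve them before deciding which builds to test against. The sketch below uses the public release-bucket markers as a stand-in; the exact bucket and marker paths the failing job reads here are an assumption.)

```go
// Illustrative only: resolving version markers like the ones discussed above.
// Each marker is a small text file containing a version string such as
// "v1.23.4"; if a marker was never created, the job fails before running tests.
package main

import (
	"fmt"
	"io"
	"net/http"
	"strings"
)

func resolveMarker(url string) (string, error) {
	resp, err := http.Get(url)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return "", fmt.Errorf("marker %s: HTTP %d", url, resp.StatusCode)
	}
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return "", err
	}
	return strings.TrimSpace(string(body)), nil
}

func main() {
	// Assumed location; the job discussed above reads CI markers, which live elsewhere.
	const base = "https://storage.googleapis.com/kubernetes-release/release/"
	for _, marker := range []string{"stable.txt", "stable-1.txt"} {
		v, err := resolveMarker(base + marker)
		if err != nil {
			fmt.Println(marker, "could not be resolved:", err)
			continue
		}
		fmt.Println(marker, "->", v)
	}
}
```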
A: So this would be the 1.23 branch, right? So those weren't there; they didn't get created by some release script that should have created them, right? So, boom, Stephen Augustus figured that out once it got to him, what is that, six days later. And then another seven days later or so we got pinged that it was still failing, even though those things were there, and this is where I came in to fix this. And this was a failure because... I'm curious, so I'll tell you what the problem was, and you tell me where the hole was. So, we added a feature to 1.24, that service account...
A: The fix here was to backport a test change from 1.24 to 1.23 so that it didn't wait for it in a certain way; you can see the commit that I backported, cherry-picked, there, right? So where was the hole here? Where should this have been caught: in the pre-submits, the post-submits, not master-blocking? Was it a bad test?
G: I think the problem here is, again, that on the whole the community isn't paying attention to upgrade tests, because otherwise we probably would have thought: we had to make this test change, we need to backport it, when we made the change, regardless of CI. But this isn't on the brain, and yeah, I mean, we're not going to be able to put everything that's in release-blocking into pre-submit.
G: Honestly, we have too much, and it's too flaky, in pre-submit today. I'll say the same thing the release team did with reducing the number: there is a lot to be said for having reliable signal and less of it, just because that means people pay better attention to failures when they're expecting things to pass. But in pre-submit, when I send a Kubernetes PR, I'm just like, the tests are failing again, stupid tests; I'm not really thinking, did I break something, I'm thinking, these tests aren't reliable, this is annoying.
G: But, you know, no one's actually doing this right now, and the problem with doing that is then no one's going to go look at where we do run all the flaky tests and say, oh, but I can fix this one and put it back, because, like, why is that your problem? I think most of the folks that would want to do something like this are just under a deluge of things to deal with already, and it's hard for this to rate.
G: We have tests that test before and after; we have some tests that are based on testing around the upgrade, but that's in very limited use, and I'm not sure if anything's actually running today in CI. And then in this case it's a cluster that's running multiple versions, which isn't strictly an upgrade, but that is one of the main motivations for testing it: if I upgrade a node, is my cluster still going to work, so I can have a rolling upgrade, yeah.
G: This has been a really useful discussion, but I'm looking at the clock, and I want to let Andrew get to his topic, yeah.
I: Sure, so yeah, I wanted to spend some time and ask about the current state of kubetest and kubetest2, and the GCE cloud provider. So, for some context:
I: As many of you know, there's this long-running effort to remove the in-tree cloud provider code, and I think we're at a point where we've externalized everything: the CSI plug-in, the controller manager, the credential provider in the kubelet; everything's kind of externalized. And the final stretch, to really convince ourselves that we're okay to remove things, is to have all our tests running with the externalized version of the cloud providers in the test jobs. And this is kind of related to the staffing problems, right? I wouldn't expect anyone in SIG Testing to help with or take on the work of converting those jobs, so I think it's kind of on our SIG to lead the effort and see what we can do to slowly convert those jobs.
I: So yeah, I want to be mindful of the staffing problems and try to get a sense of where a good place would be to start slowly converting tests to use external cloud providers without destabilizing any existing jobs or anything like that. And I did notice there was some effort around kubetest2, and it has a legacy mode and a standard mode to use the external provider, but I wasn't sure what the recommended approach for testing external cloud providers is, on GCE at least.
G: I don't know if I can say a whole lot about that. I can tell you kubetest2 should be in a pretty stable state now; it's being used in CI for e2e tests and should be fine. But that's one of those projects where, hi, I'm like the only active approver right now. It's very high on my to-do list to change that, but even still, it's probably not going to have super active investment, and there's going to be a really high expense.
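(For context on what a kubetest2 run looks like: a deployer such as gce brings a cluster up and tears it down, and a tester such as ginkgo runs the e2e suite against it. The sketch below wraps one such invocation in Go purely for illustration; the flag names, and the legacy/standard external-provider modes mentioned above, are assumptions from memory and may not match the current CLI.)

```go
// Illustrative only: invoking kubetest2 the way a CI job might, with the gce
// deployer and the ginkgo tester. Check kubetest2's own help output for the
// real flags; the ones below are assumptions.
package main

import (
	"log"
	"os"
	"os/exec"
)

func main() {
	cmd := exec.Command(
		"kubetest2", "gce", // deployer
		"--up", "--down", // create the cluster, tear it down afterwards
		"--test=ginkgo", // hand off to the ginkgo e2e tester
		"--", // everything after this goes to the tester
		"--focus-regex=\\[Conformance\\]",
	)
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr
	if err := cmd.Run(); err != nil {
		log.Fatalf("kubetest2 run failed: %v", err)
	}
}
```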
G: The scope of moving all of the things that use the in-tree GCE provider over to not using it is enormous; that's most of our CI. And it also raises some interesting questions, because we're not trying to test this out-of-tree provider, or the provider at all, in most of those jobs; we're just trying to test Kubernetes. And I think we get to cheat today on the stability of that, because, well, everything blocks on it and it's in the kubernetes repo.
G: So it had better be passing; you can't fundamentally break the cloud provider and merge your PR. But once it's out of tree, that gets closer to where we've been with kops, where folks need to be convinced that this is reliable, that there's a stable version that we're using for Kubernetes CI, or else there's going to be a push to remove it everywhere.
I: Because... so, I'm fairly confident that if we were to reconfigure the CI jobs to not use an external cloud provider, but just, like, configure them without a provider, a lot of the tests would break or not work. There's, like...
G: So, we actually have a lot of tests that will work fine; I mean, kind doesn't configure a cloud provider at all. But there are tests that cover things that only work with that, and then there's just the bulk of CI, which is running on, you know, a cluster using disposable GCE VMs, and so we just need the cloud provider to have the cluster. Like, we just won't even have a cluster if we don't have a provider.
I: So I guess what I'm trying to do is... I don't want to be in a state where, when we remove the cloud providers, we also disable or remove a bunch of tests. So I'm trying to get to a state where, when we get there eventually, at the very least there is a set of jobs that have been converted to use external providers, so those tests can continue to run, as opposed to just turning them off.
G: I guess what I'm thinking is: I think most of the tests that we should be running in Kubernetes, to test Kubernetes, should work without any particular provider, except for the part where we need clusters, and I don't think there's been enough push to say, oh, we're going to run all these with, like, single-node kubeadm or kind or something like that, where we don't need that. As soon as we get into multiple VMs, I mean, we need an actual...
G: We need a provider just so that we can bootstrap a cluster together. I think we can even get to a state where the bulk of what we're covering in pre-submit, if we set aside node, which is different and unrelated, will work. But there are other things, like scale testing, where absolutely not; it's not that you're never going to have that, or that it doesn't actually care about the provider,
G: but it's never going to work without having a bunch of real VMs running, with substantial compute capacity, that we're not going to simulate by just sticking kind on a huge VM or something like that, yeah. Those things, I think, are still pretty important, and a huge portion of CI, even if we were like, I don't care about that...
G: That's our default choice for testing Kubernetes in CI: create a GCE cluster, because that's where we have lots and lots and lots of compute. And there are a lot of things where, you know, we haven't made this huge push like, oh, we're just going to run them all on kind or something; we could go one of those directions.
G: I would probably start by picking something like the conformance GCE suite and trying to make sure that that's running. But I think we're going to have to have conversations beyond even our two SIGs, for example with Release, about, you know, should this be release-blocking?
G: What's the motivation? Yeah, you're making a decision. There's also a general direction question, for example, in pre-submit: do we want providers?
G: And if so, do we want to go back to having multiple? And if we do, do we want that to be like a plugin, having lots of them? Or, if we could, I mean, scale is a question, but we could head in the direction of just no providers in pre-submit. Any of those options is going to take,
G: I think, a pretty large investment. And this is one of the things I've been trying to note to people: the part that I think is going to be the hardest part of actually finishing provider extraction, which I don't think we're ready for, is that there's just a huge amount of CI
G: that's just like, I'm just going to use cluster/kube-up.sh and create a GCE cluster. And as someone who has gotten an out-of-tree deployment tool to pass in CI, it's really hard to get to the point where people trust the tool, and keep things functional, and don't just break you with their PRs, and then it's blocking. And we as a project cheat a lot by having the GCE and the cluster scripts in the repo, so that if you change anything in your PR, you will wind up inevitably fixing those.
G: If you remove a flag from a component or something like that, you have to change those scripts. It is going to be very different when none of the tools live in the repo, except sort of kubeadm for now, and there's even talk of moving that out.
I: Yeah, so I think one thing I want to make clear is: I understand there's a ton of work in this, and I am willing to kind of roll up my sleeves and just go into it and start doing some of the work, but I'm trying to figure out the best place
I: to start. And you made a distinction between depending on a cluster configured with a provider versus depending on the test deployer, a provider for the test deployer, and it feels like those can be tackled separately, or maybe we start with one job that is configured with an external provider.
G: I don't want things in pre-submit that actually care about having a GCE provider or something like that; I just want to be testing Kubernetes. That sounds like something that should be tested in the various cloud provider repos, and maybe it's even a common thing they all need to test and we should have a common test, but I still don't care about that blocking my CI. But there are the other things, where I am actually just testing Kubernetes, that do depend on it, and even for those there's just the question of which one we are using, and...
G: how do we keep it stable? I've had some meetings about this in the past, but they haven't really gone anywhere. I can dig up some of that stuff and follow up with you on that. I think I would start by...
G: I believe we actually already have CI for this provider doing some basic tests with kubetest2. I would take a look at those and see what shape they're in, and then probably the next step is, I mean, where do we want those? Do we want to get that into release-blocking? And I think the question remains about how we version it, like, what version are we using; right now, I think they're using the out-of-tree cluster scripts.
I: Makes sense, okay. So then, yeah, that sounds like a good next step, looking at the current jobs that run external providers. If you can send me a link to those, I will see what shape they're in, and yeah, I think that's a good starting point: just stabilizing those and seeing if they can actually validate more of what we want.