From YouTube: Kubernetes SIG Testing - 2020-07-28
A
Okay, good morning, good afternoon, good evening, everybody. Today is Tuesday, July 28th, at least in this time zone, and welcome to the Kubernetes SIG Testing bi-weekly meeting. I am your host, Aaron Crickenberger.
A
So we're all going to adhere to the Kubernetes code of conduct, which basically means we're all going to be our very best selves and not be jerks. This meeting is being publicly recorded and will be posted to YouTube later. So today on the agenda we have a proposal that Jordan and Ben and myself have been working on, suggesting some policies to maybe improve the stability of Kubernetes CI in the short term, and we'll see where the conversation takes us from there.

A
Jordan, did you want to speak to this, or do you want me to speak to it?
C
I can give the one-minute blurb if you want. So, CI issues are not new to Kubernetes; this has been an ongoing problem for years at this point. It is more or less of a problem at different points in time. We see it crop up a lot, usually around code freeze, when lots of things are trying to get in, there's lots of load, and lots of people are very concerned about how quickly their pull requests get merged.
C
This release it was particularly bad. There were two or three real issues that were causing lots of flakes, but they were lost in the noise of other test failures, and so it was hard to find them, it was hard to get the fixes that resolved them in, and it was hard to get signal on release-blocking jobs. There have been a lot of things said over the past few years about needing to do better and hold test owners and component owners accountable for their tests.
C
It's very hard to do that when it can legitimately be said that a lot of the test failures are not the fault of the component owner or the test owner. So this proposal is trying to put in place things that will improve the infrastructure aspect, so that when we see a test failure we can actually open issues for that component owner and test owner and say: your test is failing, and we have a reasonable degree of certainty that it's not an infrastructure problem.
C
You need to fix your test, or else. The "or else" is to be determined, but we envision things like disabling jobs that are permanently failing, or not allowing SIGs to merge features if their test health is demonstrably bad. Like I said, to be determined, but the first step is getting to a point where we can actually reasonably rule out infrastructure issues and say the owner of this test or component is the one who needs to resolve this.
A
How about I mute myself. So, does anybody disagree with Jordan's statement of the problem or our stated goals?
D
Can I just, not to disagree with any of that, add some overall context? It seems kind of bad that things got this bad, right? It seems like there should have been alarm bells ringing earlier and focus brought earlier, just at a very high level, you know, to figure out whether it's infrastructure or, you know, discipline of the component owners or test owners or whatever.
A
I agree with that; it's why I'm happy there are so many people here, because I think we could use everybody's help in figuring out how to motivate everybody properly. How do we align our incentives such that people are inherently motivated to do the right thing? We should be encouraging people to brush their teeth every day; we shouldn't have to drag them to the dentist to get a root canal every year and a half or so.
A
So I can run through the specific policies we've proposed here and see what the group thinks about these. I'm also open to suggestions from the group about what they think could be done as follow-on to these, but I'd at least like to understand if we can get sign-off on some of these things. Does that seem fair to everybody?
A
Yep. So it seems like the biggest problem we're running into at the moment, for whatever reasons, is resource constraints, because the majority of our jobs schedule themselves as best-effort or possibly burstable pods.
A
So these are generally listed in sort of order of importance. I think the most important thing is to get to a place where all release-blocking and all merge-blocking jobs actually declare the resources that they need, such that they can become Guaranteed quality-of-service pods.
A
So when pods have non-zero resource requests for memory and CPU, and they have limits that match, the scheduler sort of guarantees that wherever it schedules that pod, it is guaranteed to get those resources. It's also going to be the highest-priority pod, the one that's left on a node as long as possible should there be any reason to evict pods off of it.
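[Note: a pod is placed in the Guaranteed QoS class only when every container sets CPU and memory requests exactly equal to its limits. A minimal sketch with made-up values, not one of the actual job configs:]

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: example-ci-job        # hypothetical name, for illustration only
spec:
  containers:
  - name: runner
    image: registry.example.com/test-runner:latest   # placeholder image
    resources:
      requests:
        cpu: "4"              # requests and limits match exactly...
        memory: 16Gi
      limits:
        cpu: "4"              # ...so the pod is classed as Guaranteed
        memory: 16Gi
```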
A
So a PR just landed yesterday afternoon to make a best guess at what the CPU and memory limits might be for release-blocking jobs. Those would be the jobs that run on the sig-release-master-blocking dashboard, the jobs that run periodically.
A
We would enforce this, and all of the policies we're suggesting here, with the tests that run against job configs submitted to the test-infra repo. So you wouldn't be able to land a job config and have it be on a release-blocking dashboard unless it actually had the appropriate resource requests and limits. Our suspicion... well, let me stop here. Does that sound reasonable to everybody?
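[Note: a rough sketch of what declaring resources looks like in a Prow periodic job config; the job name, interval, and image here are placeholders rather than a real job:]

```yaml
periodics:
- name: ci-example-e2e                  # placeholder job name
  interval: 6h
  annotations:
    testgrid-dashboards: sig-release-master-blocking
  spec:
    containers:
    - image: gcr.io/k8s-testimages/kubekins-e2e:latest   # placeholder tag
      command:
      - runner.sh
      resources:
        requests:                       # without these the job pod is best-effort
          cpu: "4"
          memory: 16Gi
        limits:                         # matching limits make it Guaranteed QoS
          cpu: "4"
          memory: 16Gi
```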
A
I will show you just real briefly how I took the guesses on the release-blocking pods. So this cluster is a GKE cluster; it's set up in Google Cloud, and with that we get Google Cloud monitoring and logging, based on the metrics that I got out of the box with the built-in monitoring.
A
I'm
not
saying
I
I
liked
this.
This
was
still
very
manual,
but
I
tried
to
see
if
I
could
gather
metrics
that
showed
like
what
a
given
job
was
using
in
terms
of
cpu
or
memory
usage
over
time.
Let's
see
if
this
still
actually
works.
A
I wanted to show this and then say that I'm using this with an account that hasn't even activated its free trial of Google Cloud, and the reason I'm getting away with this is because I'm a member of a group called k8s-infra-prow-viewers, which will give you read-only access to all of the things that are necessary for running Prow over in the WG K8s Infra land.
A
But
I
can't
offer
this
capability
for
the
proud
build
cluster
that
runs
in
google.com,
but
we
can
happily
provide
this
to
anybody
who
wants
to
work
with
the
publicly
funded
and
available
crowdfill
clusters.
This
is
the
cluster
I've
been
moving.
This
is
why
I've
been
trying
to
move
all
the
release
blocking
jobs
over
to
that
cluster.
A
So
if
you
want
to
see
what
I'm
seeing
I'm
open
to
anybody
pr
themselves
into
this
group
so
that
they
can
see
this
level
graph,
which
shows
that
the
kind
ipv6
job
is
using
roughly
whatever,
like
14
gigs
of
ram
at
its
peak,
at
least
as
as
far
as
these
metrics
think,
I
don't
know.
If
these
are
true,
I
don't
know
how
to
trust
them
and
I'm
super
open
to
suggestions
on
other
ways
to
collect
this
kind
of
data.
But
that's
the
best
answer.
I
have
to
alvaro's
answer.
G
So could I ask you, Aaron: how would people request that access that you're talking about there?
A
So, like I said, this is only available for the kubernetes.io cluster, which only runs periodics and release-blocking jobs. You get no visibility into what's going on with pre-submits. It's why I think we should be incentivized to try moving the rest of the release-blocking jobs over to that cluster, and also moving the merge-blocking pre-submits over to that cluster, or another cluster dedicated to it.
A
I'm
open
to
suggestion
from
folks
whether
they
think
this
should
be
a
dedicated.
You
should
have
like
a
very
clear
cluster
for
periodics
and
then
another
one
for
birch
blocking
I'll
caveat.
All
of
this.
That,
again,
all
this
infrastructure
is
in
a
place
where
the
community
is
able
to
run
this,
and
so
I
would
really
like
to
see
other
people
involved
in
spinning
up
these
clusters
using
the
code
and
infrastructure
that's
publicly
available
there.
A
Ideally,
at
the
end
of
all
that,
we
should
have
a
cluster
that
is
appropriately
scaled
and
sized
for
all
of
the
release
blocking
jobs
in
the
event
that
that
doesn't
seem
to
be
enough,
or,
as
we
continue
to
move
things
over.
We
think
that
we
should
also
start
to
enforce
some
better
accountability
and
hygiene
on
all
of
the
other
jobs
for
all
of
the
other
repos
or
all.
A
For
kubernetes
that
are
not
required,
merge,
walking
jobs
or
they're
in
their
periodics
that
just
happen
to
be
informative.
We
think
every
job
should
have
contact
info
associated
with
it.
This
is
something
we
I
mandated
a
while
ago
for
the
release
blocking
jobs
so
that
everybody
would
get
emailed
when
they're
when
their
jobs
broke
and
then,
in
the
interesting
case
of
jobs
that
run
like
a
bunch
of
tests
owned
by
a
bunch
of
different
cigs.
The
release
team
has
sort
of
handled
that
with
the
ci
signal
team.
A
There's
part
of
me
that
wishes
that
less
toil
was
involved
with
that,
but
I
think
that's
also
been
a
relatively
effective
model
for
those
catch
all
jobs
for
jobs
that
are
not
catch-alls
and
are
owned
by
specific
sig
use.
That
states,
I
don't
know
group
as
the
email
address,
or
maybe
they
have
a
dedicated
group
for
alerts.
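[Note: in Prow job configs, that contact info rides along as TestGrid annotations on the job; a minimal sketch, with placeholder names and addresses:]

```yaml
periodics:
- name: ci-example-periodic             # placeholder job name
  interval: 24h
  annotations:
    testgrid-dashboards: sig-example                       # dashboard(s) the job reports to
    testgrid-alert-email: sig-example-alerts@example.com   # who gets mail when it breaks
    testgrid-num-failures-to-alert: "3"                    # consecutive failures before alerting
  spec:
    containers:
    - image: registry.example.com/job-image:latest         # placeholder
      command:
      - runner.sh
```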
A
But
the
idea
is
that,
ultimately,
if
you're
going
to
use
community
resources
like
crowd,
you
should
have
a
point
of
contact
and
then,
if
yeah,
so
we
know
that
alerts
are
being
sent
to
you.
You,
the
point
of
contact
is
being
alerted
that
the
job
is
continuously
failing.
If
the
job
still
never
gets
back
to
a
good
state,
then
we
should
probably
just
go
ahead
and
disable
those
jobs.
So
we
free
up
resources
and
stop.
A
Just to give you a taste of what the state of things is like: this used to be visible on a dashboard, but we do have a query that runs every day, and you can see, for example, that the ci-kubernetes-node-kubelet-serial job has been failing continuously for 934 days in a row. And, not to single out node, there are a bunch of jobs owned by a bunch of different SIGs that do a bunch of different things that have been continuously failing.
A
So I'll stop reading to everybody. Personally I'm more interested in this being a bit of a dialogue. I'm sure there are lots of people here who have opinions or suggestions on what we could do next, so I'm interested in hearing first if everything that's proposed here seems like a good, reasonable first step.
A
Third, when Jordan mailed out this proposal, he talked about the idea that, you know, maybe we shouldn't open the main branch of development back up for anything-goes. Maybe our focus after 1.19 should be continuing to address this test and CI signal pain until we feel like it's actually been properly addressed, as opposed to, you know, having a great angst for like a week or two and then everybody kind of moving on and spamming retests.
H
Hi Aaron, I have a quick question about all the failed tests you just showed us, like the tests that have been failing for several hundred days, several years maybe. So are those the ones that are triggered when a PR is submitted?
A
Some are, some aren't. The naming convention that we stuck to was: if the job has "ci" in front of it, it's most likely a periodic, and if the job name has "pr" or "pull", then it's a pre-submit. The majority of these look like CI jobs, but there are also some failing pull request jobs.
A
They're defined in the test-infra repo; all of the jobs for all of the 150-plus repos across the six different Kubernetes orgs live in the test-infra repo.
H
Okay,
okay,
so,
based
on
my
understanding,
I
think
my
suggestion
here
is
that
we
should
make
a
list
of
the
long
failing
tests,
and
maybe
volunteers
can
get
work
on
some
of
them
and
can
see
there
could
be
batches
and
we
can
see
how
much
resource
we
can
free
up
by
doing
the
most
by
fixing
the
most
frequently
fading
ones.
A
I
I
I
like
that
idea.
However,
like
I
personally,
I
would
love
to
get
a
little
more
confidence
about
like
what
actually
is
the
resource
usage
here,
and
what
would
we
free
up
by
doing
this?
There's
also
a
part
of
me
that
almost
wants
to
default
to
you
like.
If
it's
been
failing
this
long
and
nobody's
noticed,
is
it
really
that
worth
bringing
back
from
the
dead
like
it?
It
may
very
well
be
worth
a
sanity
check.
I
know.
F
The
other
thing
is
that
this
has
been
periodically
brought
up,
that
we
have
jobs
telling
us
long,
and
yet
we
have
ones
that
have
now
reached
something
like
three
years.
I
think
the
other
thing
to
remember
is
that,
even
if
we
delete
a
job,
that's
not
by
any
means
permanent
everything's
recorded
and
get
if
someone
comes
back
later
and
thinks
that
this
was
valuable,
they
could
run
in
the
effort
to
bring
it
back.
But
just
as
a
matter
of
policy,
we
shouldn't
leave
things
failing
for
this
long
and
just
running
and
wasting
resources.
A
So
question
link
to
the
list
that
was
shown.
The
link
is
in
the
proposal
document
which
you
can
get
to
from
the
sig
testing
meeting
notes.
It
also
lives
in
the
testing
for
repo
under
directory
called
metrics.
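[Note: each config in that metrics directory pairs a BigQuery query over the scraped job results with a jq filter that shapes what gets published. Roughly along these lines; the metric name, query, and filter below are illustrative stand-ins, not copied from a real config:]

```yaml
# Sketch of a test-infra metrics config; values are illustrative.
metric: long-failing-jobs
query: |
  -- stand-in query over the public job-results table
  SELECT job, COUNT(*) AS runs
  FROM `k8s-gubernator.build.all`
  WHERE result = 'FAILURE'
  GROUP BY job
jqfilter: |
  [.[] | {job: .job, runs: .runs}]
```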
A
I think, to the point made earlier: any time this continuously-failing-jobs thing has been brought up, nobody has thought to make a giant umbrella issue with a huge checklist and a clearly defined set of work items, such that people go off and do it. We have noticed that that does actually work for simple mechanical things like golint and shellcheck failures and stuff, but I feel like, a, this work is less mechanical, and b, this isn't about one-time fixes.
C
This
is
also
the
reason
that
the
permanently
failing
job
item
was
after
the
contact
info
item.
I
I
suspect
that
a
fair
number
of
these
jobs
are
failing,
because
nobody
knows
that
they're
running,
and
so
the
goal
is
not
to
disable
jobs
that
are
actually
providing
useful
signal.
It's
to
provide
to
disabled
jobs
that
are
not
providing
useful
signal
that
nobody
is
really
looking
at
or
paying
attention
to.
F
I want to point out very quickly that we've talked a lot about resources and we've talked a lot about failing tests, and those are not always necessarily related. While this will all be super useful, getting things to a healthy state will require another series of steps after this: once we have more confidence in the CI, we're going to have to look at the tests.
F
There
are
cases
where
we
totally
fail
to
run
the
test,
or
something
like
that.
There's
also
a
lot
of
flaky
tests
that
are
just
the
tests
of
flaky
or
some
piece
of
kubernetes
isn't
quite
reliable
enough
or
something,
and
that
probably
will
need
its
own
proposal
right
now.
It's
a
little
difficult
to
look
at,
though,
because
ci
is
such
a
mess.
C
Yeah,
I
I
have
a
long
list
of
follow-up
ideas
around
particular
tests
or
particular
ways.
We
could
analyze
the
end-to-end
and
integration
and
verify
test,
runs
and
identify
particular
problems
for
cigs
to
go
fix,
and
so
my
goal
is
to
get
good
enough
signal
at
the
ci
infrastructure
level
so
that
we
can
run
some
of
these
reports
and
then
fan
out
to
the
sig
owners
to
say:
hey,
look:
your
end-to-end
tests
are
consuming
40
of
the
end-to-end
run.
C
Please
reduce
that
to
this
much
either
by
moving
things
to
post
submit
or
by
speeding
up
tests
or
whatever.
Similarly
for
integration,
you
know
here's
a
package
that
takes
10
minutes
to
run.
Can
you
reduce
that
or
parallelize
things
or
collapse,
your
coverage,
but
yeah?
At
that
point,
I
think
we
can
fan
out
and
lean
on
the
community,
like
the
sig
owners
to
drive
that
work.
G
So
so
so
jordan,
how
would
how
would
people
be
able
to
action
such
decisions?
Like
I
mean?
How
would
they
be
able
to
see
or
how
would
you
propose
that
that
the
people
are
able
to
view
reports
indicating
resource
usage?
Is
that
something
that's
there
already
or
something
that
would
have
to
be
built?
I.
C
Mean
for
it
depends
so
for
simple
things
like
test
runtime,
you
can
go
look
and
test
grid
and
show
a
graph
of
how
long
every
test
took,
and
so
you
can
see
like
this
test
takes
eight
minutes
on
a
typical
run,
like
that
seems
excessive
for
a
pre-submit.
There
are
other
tests
where
you
can
look
at
the
graph
and
see
like.
Usually
it
takes,
you
know
half
a
minute,
but
sometimes
it
takes
10
minutes.
A
The other fun thing about TestGrid: it just does this by default, I guess, I don't know if this is a feature, but it sorts by the longest duration. So I can tell you right now that "Basic StatefulSet functionality should perform rolling updates blah blah blah" is the longest test that's currently in the default e2e suite, and it's hovering around six minutes, which is, yeah, actually a little bit over what we said.
A
Well, hopefully I know which SIG to notify, because that's in the test name, and hopefully that SIG has some group of people who are responsive to pull requests and issues related to, you know, test triage, whether that be correctly identifying tests as slow or flaky, or fixing test failures.
A
This
is
something
I
went
through
sort
of
last
december,
I
think
to
we.
We
kicked
out
a
number
of
really
really
slow
tests
and
properly
tagged
them
as
slow.
A
C
Yeah, I'll drop a link in the meeting notes to a doc that just had random ideas. I agree on wanting to automate this. As an initial effort, if people are wanting to help, identifying the top 10 or the top 20 or whatever issues and kind of getting those knocked out manually is great, but I want something where we actually can run a report and say: all right, for these omnibus jobs, what percentage of this job is being consumed by each SIG, and what do we think is reasonable? How long of a test do we want to allow in a pre-submit?
C
How
long
of
a
test
do
we
want
to
allow
per
package
for
for
integration
things
like
that,
and
just
say
what
are
our
goals
if
we
want
pre-submits
to
be
able
to
finish
in
30
minutes,
40
minutes,
30
minutes
20
minutes?
Maybe
what
do
we
have
to
do
to
get
there
and
and
then
having
tools
that
we
run
against
the
test?
Result
say
which
cigs
are
out
of
out
of
bounds
for
where
we
want
to
be
and
publishing
those
metrics.
C
So
I
think
for
sig
testing.
What
we
can
do
here
is
sort
of
figure
out
how
to
get
that
data
out
of
the
current
runs
and
the
current
test
grids
and
get
it
in
a
form.
That's
consumable
for
cigs
to
show
really
clearly
like
these
are
the
sigs
that
are
doing
good.
These
are
the
sigs
that
are
not
doing
good
and
then
that
can
be
input
both
for
those
six
to
prioritize
their
work,
but
also
for
enforcement
things
that
are
not
doing
good.
C
What
what
should
the
consequence
be
if
your
tests
are
unhealthy
and
you're
eating
more
than
your
fair
share
of
the
community's
time?
Should
you
prioritize
that
over
features?
I
would
expect
probably
so.
B
So,
if
there's
a
stated
kind
of
like
you
need
to
get
this
done
by
this
time,
and
it's
not
to
you
know,
make
their
lives
harder.
It's
just
because
it's
you
know
fair
to
the
community
and
also
you
know
they
should
probably
know
what
they're
testing
and
that
sort
of
thing.
So
I
think,
for
any
sigs,
just
having
kind
of
like
this
defined
threshold
for
different
areas
of
quality
makes
it
a
lot
easier
to
encourage
them
to
do
things
and
also
makes
it
clear
to
them.
B
F
C
I
I
think
I
think
it
looks
like
features
get
held
up.
I
that,
that's
probably
I
mean,
and
that's
not
really
like
a
sig
testing
specific
call,
that's
going
to
be
like
testing
and
release
and
architecture.
Maybe
I
don't
know
but
like
right
now
there
are
preconditions
to
adding
things
for
api
changes
like
you
have
to
have
a
proposal
you
have
to
have
the
it
has
to
have
a
migration
plan,
and
things
like
that.
C
I
think
it
is
reasonable
to
say
if
a
component
is
unhealthy
based
on
the
test
reporting,
it's
not
appropriate
to
be
adding
new
things
and
making
changes
to
that
component.
If
we
aren't
certain
that
the
current
function
is
working
properly,
then
we
shouldn't
be
slamming
more
things
into
that
area.
We
should
be
building
confidence
that
the
current
level
of
function
is
working
so
that
the
goal
is
not
to
be
punitive
to
like
actually
release
a
high
quality
thing.
F
Sure
I
just
one-handed
seems
like
that
probably
needs
its
own
discussion.
I
I'm
super
on
board
with
that.
E
C
A
Yeah, frankly speaking, I'm not sure either. I feel like KEPs and production readiness reviews are really great injections of human thought into the process, but I also feel like it's possible for a component to decay at a faster cadence than that, within a release cycle. So how can we appropriately... you know, it's the usual question of, well, how do we decide to revert an enhancement or whatever, that kind of thing.
C
I
I
think
the
first
thing
we
have
to
do
is
just
get
visibility
like
what
what
are
the
standards
and
if
sigs
aren't
meeting
those
standards
get
visibility.
To
that.
I
remember
in
some
releases
we've
had
ci
signal
reports.
You
know
kind
of
the
last
month
of
the
release
that
says
here:
here's
the
top
three
flakes
and
what
sigs
owned
them
and
whether
progress
has
been
made,
and
that
was
like
an
email
that
went
out
to
the
community
once
a
week
and
so
like.
C
On
the
one
hand,
it
adds
visibility
like
maybe
maybe
no
one
knew
that
it
was
going
on.
On
the
other
hand,
it
adds
a
little
bit
of
kind
of
you
don't
want
to
be
on
the
bad
sig
list,
I'm
okay
with
that.
Even
even
if
we
don't
have
a
super
clear,
like
iron,
clad
things
with
this
label
will
not
merge
and
there's
a
bot
enforcing
it
like.
Even
if
we
don't
have
that
story.
C
Yet
just
saying
like
here's,
here's
how
we've
sliced
up
the
resources
we
have
and
here's
how
cigs
are
consuming
those,
and
I
think
that
would
do
a
lot
to
move
the
needle
like.
If,
if
there's
an
email
going
out
to
the
community,
once
a
week
saying
these
things
are
out
of
slo
for
these
things,
I
think
that
would
help.
F
So
one
thing
I
feel
like
we
didn't
quite
get
yet
was,
I
guess
maybe
nobody
had
anything
to
say,
but
do?
Are
there
any
objections
to
the
things
that
are
actually
in
scope
for
the
current
proposal,
as
opposed
to
continuing
to
think
about
what
we'll
do?
Next
after
that.
A
I took a best guess for the release-master-blocking stuff, and I feel like perhaps I did you all a disservice by doing that. I probably did more than my fair share of taking a look at jobs and assessing what was reasonable, and so I feel like it should be on folks here to decide what limits and such they want to set for their pre-submits, and we should go through that together as a community.
A
That
would
be
one
suggestion.
Another
question
I
haven't
heard
raised
by
anybody
here
necessarily
is
like
what
is
the
definition
of
success.
A
We
proposed
a
couple
of
things
that
we're
planning
on
doing,
but
it's
unclear
to
me.
We've
decided
like
what
metric
or
means
of
measurement
we're
going
to
use
to
describe
how
awful
things
are
now
and
how
much
better
they
are
as
a
result
of
what
we've
done.
A
I can speak to that with a little more screen share, if you're interested. So I'm going to browse over to the kubernetes test-infra folder.
A
Eventually
and
I'm
going
to
click
on
the
metrics
directory,
and
so
what
this
is
is
most
of
the
test
results
that
land
in
buckets
that
are
visible
by
testgrid.com
also
end
up
getting
scraped
into
a
publicly
accessible
bigquery
database
that
anybody
is
capable
of
querying.
A
The job flakes query, the complete query, looks something like this. It took me about an hour and a half to two hours to really pick this apart, and it's not perfect, but it does show the flakiest jobs for a given week, and it tries to call out the flakiest tests. So here I can see that pull-kubernetes-e2e-gce has a consistency of 84, so 84 percent of the time it'll run appropriately.
D
Yeah,
I've
looked
at
this.
You
know,
particularly
during
the
past
few
weeks,
when
things
were
really
bad
and
these
numbers
were
like
80.
Something
and
you
know
we
had
jobs,
were
failing
50
90
of
the
time
it
seemed
like
it
to
me
right.
A
So
there's
there's
like
imperfections
in
this
process,
so,
for
example,
data
doesn't
actually
make
it
into
this
pipeline
unless
a
job
has
actually
finished
so
for
all
of
those
jobs
that
we're
failing
to
schedule
because
of
pod
pending
timeout,
which
we
think
is
resource
concerns
like
we
weren't.
None
of
that
data
makes
its
way
here.
This
only
really
becomes
more
useful.
A
What's
the
percentage
chance
of
you
actually
landing
your
pr
in
an
ideal
world
where
every
test
would
pass
to
begin
with
and
like
maybe
that
could
possibly
be
a
metric.
F
I
have
a
really
simple
one:
I'd
like
to
see
I'd
like
to
see
for
the
kubernetes
blocking
and
release
blocking
jobs,
because
that's
an
easy
set
to
say
like
those
as
opposed
to
I'm
not
sure
what
all
is
in
ci.
I
would
like
to
stop
seeing
pod
error
state
in
the
ci,
as
in
the
ci
failed
to
run
like
it
didn't
schedule.
F
It
didn't
finish.
The
init
containers
that
sort
of
thing
there's
still
the
like
the
tests
themselves
need
to
be
healthy
and
that
we
may
continue
to
see
regardless
of
this
work,
and
that
will
be
the
next
step.
F
But
we
can
look
at
the
data
that
is
served
and
browse
a
website
when
you
look
at
a
job
product
case
that
I
have,
and
there
is
a
state
that
gets
recorded
when
we
like
fail
to
schedule
a
pod
or
when
the
pod,
just
like
errors,
or
it
times
out
during
pending
those
things,
reflect
infrastructure
problems,
resource
contention,
inability
to
schedule.
D
Yeah, I agree, there are different sources of problems here. I was kind of moving back to the, you know, main point, which is: adding all of them up, we need to be concerned, but even more importantly, breaking them down so they can be diagnosed and fixed.
F
The
thing
I
want
to
see
is
those
gray
triangles,
so
the
circles
are,
I
think
we
stopped
running
it
because
there's
a
like
a
new
version
of
the
code
to
test
and
we're
moving
on
whenever
you
see
one
of
those
alert
triangles
and
gray,
that's
one
of
the
entries
where
it's
just
we
didn't
manage
to
run
it
successfully
one
way
or
another
at
an
infrastructure
level,
and
though
that
data,
that's
a
json
endpoint.
C
I think, yeah, so that was why I copied the SIG leads list into the thread: because there is going to be work for everybody to go do, in adding contact info to their jobs, and turning down or fixing perpetually failing jobs, and then diving into particular tests that are flaking and dealing with those. So getting the leaders of those areas involved now, to know this is coming. And, I mean, it shouldn't be a surprise; everybody knows everyone.
G
Just a question I have: is it possible to leverage OWNERS files in order to attach ownership and responsibility to tests?
C
Sometimes
sometimes
an
entire
package
is
owned
by
a
sig
for
unit
tests,
that's
more
likely
to
be
true
for
and
actually
for
e
to
e
tests.
That
was
done
as
well.
Most
of
the
ede
tests
live
in
a
package
that
is
sig.
Specific
integration
tests
are
sort
of
all
over
the
map.
So
that's
that
was
one
of
the
items
in
my
follow-up
suggestions
like
we
need
strong
association
of
tests
with
cigs
individual
tests,
not
a
whole
job.
A
It could be worth investigating something like that, because for each of these tests, we as humans can clearly sort of look at the sig-whatever prefix in the test name, but for things like individual integration tests, you might need some other kind of, sort of, manual parsing of that input.
G
Sorry
I
took
it
across
you
there
just
as
a
background
task
on
I've
been
kind
of
dawdling
on
us
on
working
on
bubbling
up
an
error
for
where
a
label
was
configured
via
owner's
files
and
as
part
of
that
work,
I've
been
doing
some
front-end
work
to
make
it
easy
to
look
at
those
files.
And
so
when
I
have
that
finished
and
skinning
up
say
the
flakiness
jason
that
you
you
have
there
coming
out
your
queries
and
making
that
presentable
and
reportable.
That's
something
that
I
could.
A
Anybody else have anything they want to say on this topic, or shall we call this basically agreement that we should move forward with this, and look to see some action-required emails coming out soon?
F
We
did,
but
it
wound
up
being
all
aaron,
which
is
something
we
want
to
avoid.
A
Well,
so
I
I'm
sort
of
trying
out
the
hot
qos
guarantee
thing
for
the
release
blocking
jobs
and
seeing
if
they
sort
of
flake
any
less
randomly.
I
think
it
would
be
cool
to
see
what
or
how
the
ci
circle
team
currently
takes
an
assessment
of
like
how
flaky
things
generally
are.
If
we
could
sort
of
see
like
did
that
substantively
improve
things.
A
If,
if
you
wanted,
you
could
go
ahead
and
start
printing
in
resource
requests
and
limits
for
the
precipitates
as
well,
and
we
could
just
all
find
out
together
what
happens
like
we're
theorizing
in
the
dock,
that
you
know
once
we
start
guaranteeing
resources
for
enough
jobs.
We
suspect
that
some
of
the
other
1500
jobs
for
some
of
the
other
repos
will
not
get
scheduled
for
random
reasons
or
might
be
kicked
off
of
a
pod
for
random
reasons.
I
I
was
thinking
like
okay.
If
I,
if
I
just
pull
a
number
out,
say
400
look
at
the
things
that
have
been
failing
for
400
plus
days.
A
number
of
things
stand
out
to
me,
like
hey,
we're
doing
scalability
testing
against
kubernetes
1.13,
except
it's
been
failing
for
over
400
days.
F
That's probably mislabeled there. Like, for example, there's a Google dashboard that I have on my agenda to remove that has things that are running against the, like, stable-1 tags or whatever.
F
That is also something that should probably be removed from CI entirely, for other reasons. We have some CI that got added at some point that's using the upstream tests and nothing else upstream; I was pretty against that, and now that it's not being maintained, I think we have a clear path to just remove it.
I
Oh, and to be fair, some of them are from my employer also, VMware, so I don't want to sound anti-Amazon there.
F
There
yeah
there's
this
google
dashboard
in
tesco.
That
has
a
bunch
of
nonsense
in
it,
but
I
I
took
a
look,
and
most
of
what's
in
there
turns
out
to
just
be
like
the
the
dashboard
isn't
very
good.
The
the
things
that
are
running
are
actually
not
running
because
of
that
dashboard.
They're
just
included
in
that
dashboard
as
well.
A
So I'm proposing, since it sounds like everybody here is in agreement with all of the policies we're proposing: once we get down to implementing the policy that contact info is required on all jobs, I'm proposing that by some deadline, if a job doesn't have contact info, nobody wants it, so we're going to get rid of it. So how long do we give people to claim their jobs?
G
I'd be inclined to suggest maybe four weeks, but maybe not the four weeks of August; maybe the four weeks of September. I would like to see it sooner, but I was thinking maybe, like, over the next two weeks: please get back to us or start doing this; otherwise, after the next two weeks, others are going to start removing things. I feel like I shouldn't just go out and start removing other people's things, but given a lot of time, if they're not starting to move, maybe I could sort of help by initiating a PR.
A
I
think
it's
reasonable
for
us
to
remember
that
humans
are
involved.
I
just
also
sort
of
feel
like
some
of
the
some
of
the
jobs
that.
F
So it's something we'll think about, sorry. To get people to put contact info on everything, I would give plenty of time. For "this job has been failing for hundreds of days", I would say we're deleting it, and if you're interested in it, you can come undelete it. We should just go ahead and move forward with deleting, and if someone's interested, it's really not a big deal to revert; removing something from git is just some YAML, no big deal, there's not going to be code conflicts or anything like that.
F
They're
separate
chunks
of
configuration,
and
we
can
just
notify
people
that
this
is
being
done
and
if
they
miss
that,
because
they're
out
they
can
come
back
at
it
later.
F
It
should
not
be
a
big
deal
if
something
that's
been
failing
for
900
days
is
gone
for
a
week,
and
then
someone
actually
wanted
it
back,
but
in
terms
of
like
action
required
things
that
give
people
more
time.
These,
I
would
say
we
already
had
action
required.
You
shouldn't
have
left
it
failing
this
long.
A
Okay,
well,
that
puts
us
that
puts
us
over
time.
I
really
appreciate
everybody
showing
up
and
again
all
questions
comments
and
concerns
on
how
to
improve
this
situation
are
welcome.
This
is
only
gonna
work
if
we
all
put
in
the
work.
A
So
that's
a
happy
tuesday.
Everybody
thank
you.
A
Hi, Howard. Can we get to your agenda item in two weeks? Sorry, it wasn't scheduled ahead of time. So if you catch Ben and I offline, or on Slack, maybe we can chat about it in the SIG Testing channel.