From YouTube: Using health of a component/area to increase reliability and guide work

Description

Context:
Discussion on KEP for improving reliability: https://github.com/kubernetes/enhancements/pull/3139#issuecomment-1095771101
Mar. 17th community meeting:
Notes: https://docs.google.com/document/d/1VQDIAB0OqiSjIHI8AWMvSdceWhnz56jNpZrLs6o7NJY/edit#bookmark=id.45wmiyb70mnb
Recording: https://www.youtube.com/watch?v=m1nNW7gnbU0&t=26m55s


Health indicators we already have (and how to improve them)
kind/regression bugs (https://github.com/kubernetes/kubernetes/issues?q=label%3Akind%2Fregression)
AI: label issues/PRs related to regressions in your area
These represent issues about things that used to work and have stopped working. We are starting to look at PRs against release branches to see whether they fix regressions or long-standing bugs. It doesn't matter how awesome new features are if there are regressions in the release that keep users from upgrading.
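
For anyone who wants to track this count over time (e.g. for a periodic report), here is a minimal Go sketch that runs the same search as the link above through GitHub's public issue-search API. Error handling is kept deliberately simple; unauthenticated requests work but are heavily rate-limited.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/url"
)

// searchResult maps just the fields we need from GitHub's issue-search response.
type searchResult struct {
	TotalCount int `json:"total_count"`
	Items      []struct {
		Number  int    `json:"number"`
		Title   string `json:"title"`
		HTMLURL string `json:"html_url"`
	} `json:"items"`
}

func main() {
	// Same query as the kind/regression link above, via the search API.
	q := url.QueryEscape("repo:kubernetes/kubernetes is:issue is:open label:kind/regression")
	resp, err := http.Get("https://api.github.com/search/issues?q=" + q)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	var res searchResult
	if err := json.NewDecoder(resp.Body).Decode(&res); err != nil {
		panic(err)
	}
	fmt.Printf("%d open regressions\n", res.TotalCount)
	for _, it := range res.Items {
		fmt.Printf("  #%d %s (%s)\n", it.Number, it.Title, it.HTMLURL)
	}
}
```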


long-standing + priority/important-* bugs (~trailing indicator)
https://github.com/kubernetes/kubernetes/issues?q=is:open+label%3Akind%2Fbug+label%3Apriority%2Fimportant-soon%2Cpriority%2Fimportant-longterm%2Cpriority%2Fcritical-urgent

AI: regularly check for these issues in your component/area
Bugs indicate health issues. Are new features touching areas with open bugs, and should we accept those features? Be careful about accepting changes in fragile areas; we have a duty to our users.
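
The linked search ORs the three priority labels together (commas inside a single label: qualifier mean OR on GitHub). A sketch of a per-label breakdown, to see where the weight actually is, could look like this:

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/url"
)

// countOpen returns the total_count of a GitHub issue search.
func countOpen(query string) (int, error) {
	resp, err := http.Get("https://api.github.com/search/issues?q=" + url.QueryEscape(query))
	if err != nil {
		return 0, err
	}
	defer resp.Body.Close()

	var res struct {
		TotalCount int `json:"total_count"`
	}
	if err := json.NewDecoder(resp.Body).Decode(&res); err != nil {
		return 0, err
	}
	return res.TotalCount, nil
}

func main() {
	// Count each priority label separately instead of ORing them.
	for _, p := range []string{
		"priority/important-soon",
		"priority/important-longterm",
		"priority/critical-urgent",
	} {
		q := fmt.Sprintf("repo:kubernetes/kubernetes is:issue is:open label:kind/bug label:%s", p)
		n, err := countOpen(q)
		if err != nil {
			panic(err)
		}
		fmt.Printf("%-30s %d open bugs\n", p, n)
	}
}
```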

test flakes (~leading indicator)
AI: capture these in kind/flake bugs with details
Hopefully making use of the SIG-focused triage board that lets you filter for a specific SIG. We rely heavily on tests; if the tests are not giving a great signal, then we don't have a reliable floor to know whether new changes are destabilizing an area.
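
Filing the kind/flake bug itself can also be scripted. A sketch using GitHub's create-issue REST endpoint follows; OWNER/REPO, the title, and the body text are hypothetical placeholders, and a token with write access is required:

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
	"os"
)

func main() {
	// Hypothetical values: point this at your own repo and fill in real
	// details (failing test name, triage links, failure rate).
	payload := map[string]any{
		"title":  "[flake] TestFoo fails intermittently on pull-kubernetes-e2e",
		"body":   "Failure rate, triage links, and sample failure output go here.",
		"labels": []string{"kind/flake"},
	}
	b, _ := json.Marshal(payload)

	req, err := http.NewRequest("POST",
		"https://api.github.com/repos/OWNER/REPO/issues", bytes.NewReader(b))
	if err != nil {
		panic(err)
	}
	// Creating issues requires authentication; token from the environment.
	req.Header.Set("Authorization", "Bearer "+os.Getenv("GITHUB_TOKEN"))
	req.Header.Set("Accept", "application/vnd.github+json")

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	fmt.Println("status:", resp.Status)
}
```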


"known fragile" areas missing test coverage
AI: capture these in priority/important-* bugs with details
When you fix a regression, insist on a test that checks for the specific regression (see the sketch below). If we want our areas to remain healthier, we should also do a mini "post-mortem" on each regression and find out how we can prevent it. Multiple regressions in the same area are a loud signal that the area is fragile; it might mean we're missing a category/class of testing. How do we ensure an area has a good foundation so that we can accept new features in it? After a regression, we should open a long-term issue to identify what the gap was.
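
To make "insist on a test" concrete, here is a minimal sketch of such a pinning test (all names hypothetical): it encodes the exact failure mode the regression exposed, so the bug cannot quietly return.

```go
package frobber_test

import "testing"

// frob stands in for the code under test (hypothetical name). The
// regression: a refactor indexed items[0] unconditionally, panicking on
// empty input; the guard below is the fix.
func frob(items []string) int {
	if len(items) == 0 {
		return 0
	}
	return len(items[0])
}

// TestFrobEmptyInput pins the exact failure mode from the (hypothetical)
// regression. If the guard is ever removed again, this test fails
// immediately instead of users discovering the panic after upgrading.
func TestFrobEmptyInput(t *testing.T) {
	if got := frob(nil); got != 0 {
		t.Fatalf("frob(nil) = %d, want 0", got)
	}
}
```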