From YouTube: Using health of a component/area to increase reliability and guide work

Description

Context:
Discussion on KEP for improving reliability: https://github.com/kubernetes/enhancements/pull/3139#issuecomment-1095771101
Mar. 17th community meeting:
Notes: https://docs.google.com/document/d/1VQDIAB0OqiSjIHI8AWMvSdceWhnz56jNpZrLs6o7NJY/edit#bookmark=id.45wmiyb70mnb
Recording: https://www.youtube.com/watch?v=m1nNW7gnbU0&t=26m55s


Health indicators we already have (and how to improve them)
kind/regression bugs (https://github.com/kubernetes/kubernetes/issues?q=label%3Akind%2Fregression)
AI: label issues/PRs related to regressions in your area
These represent issues about things that used to work and have stopped working. We are starting to look at PRs against release branches to see whether they fix regressions or long-standing bugs. It doesn't matter how awesome new features are if there are regressions in the release that keep users from upgrading.
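
For anyone who wants to track this count over time (e.g. for a periodic report), here is a minimal Go sketch that runs the same search as the link above through GitHub's public issue-search API. Error handling is kept deliberately simple; unauthenticated requests work but are heavily rate-limited.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/url"
)

// searchResult maps just the fields we need from GitHub's issue-search response.
type searchResult struct {
	TotalCount int `json:"total_count"`
	Items      []struct {
		Number  int    `json:"number"`
		Title   string `json:"title"`
		HTMLURL string `json:"html_url"`
	} `json:"items"`
}

func main() {
	// Same query as the kind/regression link above, via the search API.
	q := url.QueryEscape("repo:kubernetes/kubernetes is:issue is:open label:kind/regression")
	resp, err := http.Get("https://api.github.com/search/issues?q=" + q)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	var res searchResult
	if err := json.NewDecoder(resp.Body).Decode(&res); err != nil {
		panic(err)
	}
	fmt.Printf("%d open regressions\n", res.TotalCount)
	for _, it := range res.Items {
		fmt.Printf("  #%d %s (%s)\n", it.Number, it.Title, it.HTMLURL)
	}
}
```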


long-standing + priority/important-* bugs (~trailing indicator)
https://github.com/kubernetes/kubernetes/issues?q=is:open+label%3Akind%2Fbug+label%3Apriority%2Fimportant-soon%2Cpriority%2Fimportant-longterm%2Cpriority%2Fcritical-urgent

AI: regularly check for these issues in your component/area
Bugs indicate health issues. Are new features touching areas with open bugs, and should we accept those features? Be careful about accepting changes in fragile areas; we have a duty to our users.
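
The linked search ORs the three priority labels together (commas inside a single label: qualifier mean OR on GitHub). A sketch of a per-label breakdown, to see where the weight actually is, could look like this:

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/url"
)

// countOpen returns the total_count of a GitHub issue search.
func countOpen(query string) (int, error) {
	resp, err := http.Get("https://api.github.com/search/issues?q=" + url.QueryEscape(query))
	if err != nil {
		return 0, err
	}
	defer resp.Body.Close()

	var res struct {
		TotalCount int `json:"total_count"`
	}
	if err := json.NewDecoder(resp.Body).Decode(&res); err != nil {
		return 0, err
	}
	return res.TotalCount, nil
}

func main() {
	// Count each priority label separately instead of ORing them.
	for _, p := range []string{
		"priority/important-soon",
		"priority/important-longterm",
		"priority/critical-urgent",
	} {
		q := fmt.Sprintf("repo:kubernetes/kubernetes is:issue is:open label:kind/bug label:%s", p)
		n, err := countOpen(q)
		if err != nil {
			panic(err)
		}
		fmt.Printf("%-30s %d open bugs\n", p, n)
	}
}
```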

test flakes (~leading indicator)
AI: capture these in kind/flake bugs with details
Hopefully making use of the SIG-focused triage board that lets you filter for a specific SIG. We rely heavily on tests; if the tests are not giving a great signal, then we don't have a reliable floor to know whether new changes are destabilizing an area.
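
Filing the kind/flake bug itself can also be scripted. A sketch using GitHub's create-issue REST endpoint follows; OWNER/REPO, the title, and the body text are hypothetical placeholders, and a token with write access is required:

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
	"os"
)

func main() {
	// Hypothetical values: point this at your own repo and fill in real
	// details (failing test name, triage links, failure rate).
	payload := map[string]any{
		"title":  "[flake] TestFoo fails intermittently on pull-kubernetes-e2e",
		"body":   "Failure rate, triage links, and sample failure output go here.",
		"labels": []string{"kind/flake"},
	}
	b, _ := json.Marshal(payload)

	req, err := http.NewRequest("POST",
		"https://api.github.com/repos/OWNER/REPO/issues", bytes.NewReader(b))
	if err != nil {
		panic(err)
	}
	// Creating issues requires authentication; token from the environment.
	req.Header.Set("Authorization", "Bearer "+os.Getenv("GITHUB_TOKEN"))
	req.Header.Set("Accept", "application/vnd.github+json")

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	fmt.Println("status:", resp.Status)
}
```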


"known fragile" areas missing test coverage
AI: capture these in priority/important-* bugs with details
When you fix a regression, insist on a test that checks for the specific regression (see the sketch below). If we want our areas to remain healthier, we should also do a mini "post-mortem" on each regression and find out how we can prevent it. Multiple regressions in the same area are a loud signal that the area is fragile; it might mean we're missing a category/class of testing. How do we ensure an area has a good foundation so that we can accept new features in it? After a regression, we should open a long-term issue to identify what the gap was.
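
To make "insist on a test" concrete, here is a minimal sketch of such a pinning test (all names hypothetical): it encodes the exact failure mode the regression exposed, so the bug cannot quietly return.

```go
package frobber_test

import "testing"

// frob stands in for the code under test (hypothetical name). The
// regression: a refactor indexed items[0] unconditionally, panicking on
// empty input; the guard below is the fix.
func frob(items []string) int {
	if len(items) == 0 {
		return 0
	}
	return len(items[0])
}

// TestFrobEmptyInput pins the exact failure mode from the (hypothetical)
// regression. If the guard is ever removed again, this test fails
// immediately instead of users discovering the panic after upgrading.
func TestFrobEmptyInput(t *testing.T) {
	if got := frob(nil); got != 0 {
		t.Fatalf("frob(nil) = %d, want 0", got)
	}
}
```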