Kubernetes / Reliability & Testing Resources


These are all the meetings we have in "Reliability & Testing" (part of the organization "Kubernetes"). Click into an individual meeting page to watch the recording and search or read the transcript.

15 Apr 2022

Context:
Discussion on KEP for improving reliability: https://github.com/kubernetes/enhancements/pull/3139#issuecomment-1095771101
Mar. 17th community meeting:
Notes: https://docs.google.com/document/d/1VQDIAB0OqiSjIHI8AWMvSdceWhnz56jNpZrLs6o7NJY/edit#bookmark=id.45wmiyb70mnb
Recording: https://www.youtube.com/watch?v=m1nNW7gnbU0&t=26m55s


Health indicators we already have (and how to improve them)
kind/regression bugs (https://github.com/kubernetes/kubernetes/issues?q=label%3Akind%2Fregression)
AI: label issues/PRs related to regressions in your area
These represent issues about things that used to work and then stopped working. We are starting to look at PRs against release branches to see whether they are fixing regressions or long-standing bugs. It doesn't matter how awesome new features are if regressions in the release keep users from upgrading.


long-standing + priority/important-* bugs (~trailing indicator)
https://github.com/kubernetes/kubernetes/issues?q=is:open+label%3Akind%2Fbug+label%3Apriority%2Fimportant-soon%2Cpriority%2Fimportant-longterm%2Cpriority%2Fcritical-urgent

AI: regularly check for these issues in your component/area
Bugs indicate health issues. If new features touch areas with known bugs, we should ask whether we should accept those features at all, and be careful about accepting changes in fragile areas. We have a duty to our users.
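The two label searches linked above can also be generated programmatically, e.g. for a periodic SIG health check. A minimal sketch (string construction only, no network calls; it assumes GitHub's issue-search syntax, where comma-separated values in a single `label:` qualifier mean logical OR, as in the URL above):

```python
# Sketch: build the GitHub issue-search queries referenced above, so a SIG
# can re-run them on a schedule. Pure string construction; no API calls.

def regression_query(repo: str = "kubernetes/kubernetes") -> str:
    """Open issues/PRs labeled kind/regression (things that stopped working)."""
    return f"repo:{repo} is:open label:kind/regression"

def long_standing_bug_query(repo: str = "kubernetes/kubernetes") -> str:
    """Open bugs carrying any of the priority/important-* or critical labels.
    Commas inside one label: qualifier are treated as OR by GitHub search."""
    priorities = ",".join([
        "priority/important-soon",
        "priority/important-longterm",
        "priority/critical-urgent",
    ])
    return f"repo:{repo} is:open label:kind/bug label:{priorities}"

print(regression_query())
print(long_standing_bug_query())
```

Either string can be pasted into the GitHub issue search box, or passed as the `q` parameter to the search API.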

test flakes (~leading indicator)
AI: capture these in kind/flake bugs with details
Hopefully we can make use of the SIG-focused triage board, which lets you filter for a specific SIG. We rely heavily on tests; if the tests are not giving a great signal, then we don't have a reliable floor to know whether new work is destabilizing an area.
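The "leading indicator" idea can be illustrated with a toy sketch (the data shape and test names here are invented for illustration, not from any Kubernetes tooling): a test that both passes and fails across recent runs of the same code is a flake candidate, whereas a consistently failing test is more likely a real bug.

```python
# Hypothetical sketch: flag tests with mixed pass/fail results across recent
# CI runs as flake candidates -- the "leading indicator" discussed above.
from collections import defaultdict

def flake_candidates(runs):
    """runs: iterable of (test_name, passed: bool) from recent CI runs.
    Returns names that have at least one pass AND at least one failure."""
    outcomes = defaultdict(set)
    for name, passed in runs:
        outcomes[name].add(passed)
    return sorted(name for name, seen in outcomes.items() if seen == {True, False})

runs = [
    ("TestVolumeAttach", True),
    ("TestVolumeAttach", False),  # mixed results -> flake candidate
    ("TestPodStartup", True),
    ("TestPodStartup", True),     # consistently passing -> healthy
    ("TestNodeDrain", False),
    ("TestNodeDrain", False),     # consistently failing -> likely a real bug
]
print(flake_candidates(runs))  # -> ['TestVolumeAttach']
```

Capturing each candidate in a kind/flake issue with the failing runs attached gives reviewers the details the AI above asks for.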


"known fragile" areas missing test coverage
AI: capture these in priority/important-* bugs with details
When you fix a regression, insist on a test that checks for that specific regression. If we want our areas to remain healthy, we should also do a mini "post-mortem" on each regression and find out how we can prevent it. Multiple regressions in the same area are a loud signal that the area is fragile, and might mean we're missing a category/class of testing. How do we ensure an area has a good foundation so we can accept new features in that area? After a regression, we should file a long-term issue to identify what the gap was.
  • 1 participant
  • 18 minutes

12 Mar 2021

Dan Mangum and Rob Kielty are back for the second episode of Flake Finder Fridays. In this episode they walk through how to run Kubernetes e2e tests locally, as well as how the tests are packaged and run in CI environments.
  • 2 participants
  • 52 minutes

5 Feb 2021

Rob Kielty and Dan Mangum kick off Flake Finder Fridays, a new Kubernetes community livestream where we explore building, testing, CI and all other aspects of delivering Kubernetes artifacts to end users in a consistent and reliable manner. In this first episode, Rob and Dan are going to look at recent failures in a Kubernetes build job and chat a little bit about why it was failing, what tooling is used to build Kubernetes, and the infrastructure underlying all Kubernetes CI jobs.
  • 2 participants
  • 44 minutes

27 Nov 2019

This will be a live API review, going through a real PR and showing how it's done. It will cover API norms, less-well-known conventions, rationales, validation, defaulting, and other important API concepts.

This is an opportunity to learn how to make your API review PRs go through faster and easier, with fewer revisions. It's also a great way to see how to do API reviews, in order to start down the path of becoming an API reviewer yourself. Every SIG needs to have active API reviewers to make development smoother and faster, so why not you?

HOW TO PREPARE BEFORE THE WORKSHOP

In order to make the best use of our limited time, please prepare ahead of time.

Reading:

Being familiar with the API conventions and API changes documents would help you get the most out of this workshop.

Laptop and build environment:

A working kubernetes build/test environment is only required if you want to try out API tests and code generation on your own during the workshop. In that case, you should have a laptop capable of building and running basic Kubernetes binaries, with the following software installed:

Go 1.13 (This is a change from the originally advertised Go 1.12)
Docker
git
make
kubernetes/kubernetes GitHub repo
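A quick way to sanity-check the Go prerequisite before the workshop is to parse the output of `go version`. This is a small hedged helper (the sample line below is illustrative; run `go version` yourself to get the real one):

```python
# Sketch: check that an installed Go toolchain meets the workshop's
# Go 1.13 prerequisite by parsing a `go version` output line.
import re

def parse_go_version(output: str):
    """Extract (major, minor) from a `go version` line,
    e.g. 'go version go1.13.4 linux/amd64' -> (1, 13)."""
    m = re.search(r"go(\d+)\.(\d+)", output)
    if not m:
        raise ValueError(f"unrecognized go version output: {output!r}")
    return int(m.group(1)), int(m.group(2))

def meets_requirement(output: str, required=(1, 13)) -> bool:
    """Tuple comparison handles major/minor ordering correctly."""
    return parse_go_version(output) >= required

# Example against a sample line:
sample = "go version go1.13.4 linux/amd64"
print(parse_go_version(sample))       # -> (1, 13)
print(meets_requirement(sample))      # -> True
```

Checking this ahead of time avoids discovering mid-workshop that code generation fails on an older toolchain.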

Event link: https://events19.linuxfoundation.org/events/kubernetes-contributor-summit-north-america-2019/
Session link: https://sched.co/Vv6Y
  • 3 participants
  • 1:16 hours

26 Nov 2019

  • 2 participants
  • 17 minutes

19 Dec 2018

This was an unconference session and therefore has no proper description.
  • 6 participants
  • 48 minutes

17 Dec 2018

Daniel Smith projects his screen and reviews API-changing PRs while giving live commentary!

Presenter: Daniel Smith, Google
  • 7 participants
  • 51 minutes