Description
We would like to take this time to answer questions from our contributors regarding our Automation and CI: how to get things done, why things are the way they are, where things are headed, and how to help out. All is fair game. If the audience is lacking in suggestions, we'll have some slides and demos prepared based on FAQs we've encountered.
Presenters:
Aaron Crickenberger, Google
Christoph Blecker
Aaron: Okay everybody, I think we're gonna go ahead and get started, give some time for folks to roll in, but just kind of start with the most important part of this talk, where I get to tell you that my name is Aaron Crickenberger, and you are all here to hear me and Christoph talk about automation and CI. You may have heard of me from the Kubernetes steering committee. You may have heard of me as the co-founder of SIG Testing.
You know, today we manage 147 repos; like Tim Allclair was saying earlier today, close to 150. 64 of those are in the kubernetes org, 35 are in the kubernetes-sigs org, which we created just this year, 17 are in the kubernetes-incubator org, and then there are about 30 others in two other orgs for the Kubernetes clients and the Kubernetes CSI integration.
That's a lot of repos, that's a lot of different projects, and a lot of people trying to contribute to projects in different ways. And so, rather than try and figure out what the best message is to spread across all of those people, I thought maybe it's better to just kind of do this talk live. I'd like for this to be a very interactive session, where we try and address the needs of the people who were motivated enough to show up today.
So yeah, I want to welcome you to our talk, where the topics are made up and the points don't matter. Just a couple of ideas to seed the conversation: maybe you're here because you have no idea what you're doing on this project. Maybe this is your first KubeCon ever. Or maybe, even though you've been on the project for a couple of months or years, you still feel that way.
It could be that you're actually, like, super senior and you know exactly what you're doing, you just want help solving for x. Or maybe you're really motivated and have some ideas on how the automation and CI for this project could be improved, and you want to talk about roadmap. Or maybe you want to hear what our roadmap is for automation in the project. Maybe you want to talk about fejta-bot.
The contributor survey that was sent out a couple months ago had some specific questions around what automation you liked best, what automation you liked least, and areas of the project you thought could be improved, and fejta-bot was by far the most polarizing answer. We had an exact 50/50 split between people who really liked it and people who really disliked it. fejta-bot, for those of you who haven't had the pleasure of interacting with what is definitely a robot and not a human being, is the account that's responsible for flagging issues as stale if they haven't been touched in over ninety days. It's the bot that's responsible for closing issues if they haven't been touched in over 150 days. It's also the bot that's responsible for automatically spamming retest comments on your PR if it doesn't seem to be passing tests for whatever reason, but it has passed code review.
Christoph: There are a few things that we could go through: some tools that you may have interacted with, and some tools you might not know exist. One of the really useful ones: we have a code search tool that searches all our orgs and all our repos for a specific string of code, if you know the function that you're trying to find but don't know where exactly to find it. A very useful tool. We also have a number of tools around flake hunting and visualizing test failures.
Things like Velodrome and Gubernator. We have automation and processes around a lot of our GitHub management, as I mentioned earlier: things like requesting membership to an org, that invite actually going out, and managing and pruning people from the org. All of that kind of stuff is handled through automation; there isn't a human doing that anymore.
We also have tools like our PR dashboards that help contributors manage the flood of notifications from a repo, which for one like kubernetes/kubernetes can get overwhelming. We have workflow tools as well; if you've opened a PR, you will have interacted with things like Prow, a.k.a. the k8s-ci-robot, as well as Tide, which now handles automatic merging across all of our repositories.
Audience: Okay, and you know, with triage you just ask them a question, get no reply, and it's just sitting there, right? And many times you don't have time to close it, because you're not following each and every issue when you're talking about thousands of issues out there. So with the bot it really works really well; we have our issues somewhat under control. But, like you mentioned, it takes 150 days before an issue closes, right?
Christoph: We have automation that will take a GitHub search query, go leave a comment on each matching issue to do a thing, and then we have other pieces of automation that listen to those comments to mark and transition an issue through stale, rotten, and then closed. The time frames that we've selected right now are based around pinging authors and pinging assignees to say: hey, if this is actually still real, if you actually still need this issue, if people are looking at it, please mark it as fresh, because there's been no action. Right now the time frame that we've chosen is a quarter, so a release: the idea being that if an issue has not been touched at all during the course of a release, somebody should comment and say whether it is still needed. But all those time frames are adjustable, because right now the source that we're looking at is just a GitHub search query, and we can adjust the time frame on that.
Audience: [inaudible question]

Christoph: For that particular stage, if somebody leaves a comment but it's still marked as stale, like they comment to say, hey, is somebody working on this?, it'll still be marked as stale, but it resets the stale timer, which gives another thirty days from that point. So it does reset at that particular stage. And in the comment that fejta-bot leaves, it says: hey, if this is actually still needed, please say so in a comment, so that our bots can react and go, no, this isn't stale.
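Contributors drive these transitions with Prow comment commands such as /remove-lifecycle stale or /lifecycle frozen. As a rough illustration of the query-driven side, here is a minimal sketch of how a stale-marking pass could be wired up as a Prow periodic job, assuming the commenter tool from kubernetes/test-infra; the job name, image, entrypoint, and the exact flags and query are illustrative assumptions, not the production config:

    # Hypothetical periodic job: comment on issues that have gone quiet.
    periodics:
    - name: periodic-issue-triage-stale      # illustrative name
      interval: 1h
      spec:
        containers:
        - image: gcr.io/k8s-testimages/commenter:latest   # assumed image
          command: ["/commenter"]                          # assumed entrypoint
          args:
          # GitHub search for open issues with no lifecycle label yet...
          - --query=org:kubernetes is:open is:issue -label:lifecycle/stale -label:lifecycle/frozen
          # ...that have not been updated in 90 days (2160 hours).
          - --updated=2160h
          - --token=/etc/github/oauth
          # The comment itself carries the command other bots listen for.
          - --comment=Issues go stale after 90d of inactivity. /lifecycle stale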
Aaron: So this was the graph I was trying to find. This is DevStats, a project developed by the CNCF, which shows all of the open issues and PRs that were opened against the kubernetes/kubernetes repository. Can you guess when we turned on fejta-bot to close issues that were older than 150 days? Right.
So, one of the complaints I often hear with fejta-bot is: guys, I'm really sick of this; the bot flags my issue as stale because nothing's happened in ninety days, and then I remove the stale label, because I still care about this issue, and then 90 days later the bot adds the stale label back, and I'm really annoyed by this, so I remove it again, and then 90 days later... whoa, whoa, whoa, hang on, hang on: maybe the problem is that for 270 days nothing has happened on this issue.
Would you like to help out with this issue? Could you maybe help us understand why this issue is more important than the thousands of other issues that we have on the project? I know that's not a great answer; it's a difficult problem to solve, but ultimately I think it's a problem of finding the most efficient way for our contributors to find the right things to work on. It's not necessarily the bot's fault that we don't have enough people working on your particular problem right now. So, I mean, for real.

Audience: [inaudible question]
Aaron: So we can do this for things like bumping the version of the container that is used to run the end-to-end tests, or changing the testing libraries that we use to actually stand up the clusters, things of that nature. We do actually kind of lack the concept of a staging Prow cluster, so when we deploy Prow changes, we do it live.
We actually historically kind of did it on Friday afternoon, yeah. You can also thank fejta-bot for that, for real. We eventually evolved from a human named Erick Fejta having the tendency to deploy Prow on Friday afternoon whenever he was on call, to creating a Prow plugin that automatically opens up a pull request every morning, so that we can check that pull request, see what has changed in the test infrastructure, and decide if it is prudent for us to deploy that day.
Christoph: When we deal with things like making changes to live GitHub issues and GitHub PRs, we do have things in Prow like a dry-run client and a fake client, where we try to basically guess, based off the API specs, what GitHub is going to return to us and how we're going to interact with it. So there is a bunch of unit testing around those kinds of things with our fake client, as well as the dry run.
That said, in practice there are a number of things that we still haven't figured out and still haven't solved, at the rate of webhooks that we end up receiving and given how dependent we are on GitHub sending us details on when a PR was updated, when an issue was updated, and what has changed on that PR or issue. We get issues where webhooks will just completely go missing, and a certain comment action or a certain push action will just not appear to us. Or we'll get issues where the API spec says we should get a certain response back, but actually we'll get a slightly different response back from the v3 API. So a lot of it ends up coming down to trial and error and seeing what works at this particular scale, because every second we're receiving so many events from GitHub and trying to ingest them in Prow to take some action.
We have postsubmit jobs that run against the branch after a PR is merged, to do an action or to run a test immediately after that action has been taken. We have periodic jobs, CI jobs that run on a regular schedule to maintain a certain CI signal. All those different types in here are sorted primarily by SIG, but we also have some provider-specific ones to provide signal on different types of things.
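All three job types are declared in Prow's YAML configuration. A minimal sketch of the three shapes, with placeholder job names and images rather than real jobs:

    # Presubmits run against a PR before it merges.
    presubmits:
      kubernetes/kubernetes:
      - name: pull-example-unit
        always_run: true
        spec:
          containers:
          - image: example.com/builder:latest
            command: ["make", "test"]
    # Postsubmits run against the branch after a PR merges.
    postsubmits:
      kubernetes/kubernetes:
      - name: post-example-build
        spec:
          containers:
          - image: example.com/builder:latest
            command: ["make", "release"]
    # Periodics run on a schedule to maintain CI signal.
    periodics:
    - name: ci-example-e2e
      interval: 2h
      spec:
        containers:
        - image: example.com/e2e:latest
          command: ["./run-e2e.sh"]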
These are used both by the SIGs, to identify when they have issues, and very heavily by the release team, to provide signal before, during, and after a release on how the release is doing and whether it's stable. Any time they're looking at cutting a particular release or cutting a particular branch, they go back to this and use it as their main indicator of: is this release healthy, or are we going to have some issues with it?
This tool is also really helpful for identifying flakes, because you can see PRs... sorry, you can see tests and specific jobs over time, and whether they take longer, with these time series that are springing up; you can see how long a job is taking, and you can see outliers. So yeah, with the graph button you can do test duration in minutes, and that works for anything.
Yeah, there are a lot of interesting details that you can kind of pull out of some of these, and we run so many different test suites: we run upgrades, we run downgrades, and there are jobs in here to check out specific cloud providers. We also have jobs in here for the automation that we have around org memberships.
So, right now our org memberships are defined in YAML in a GitHub repo, and then a bot takes those, compares them against our current membership in GitHub, and goes and makes mutating changes. That job is a postsubmit CI job, so once a PR merges, that job will run, and we can see the results in Testgrid of each individual run of that task against GitHub.
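The shape of that configuration is roughly as follows; this is a minimal sketch loosely modeled on the kubernetes/org repo and the Peribolos reconciler from test-infra, with placeholder names:

    # Illustrative org definition; a reconciler diffs this against live
    # GitHub state and issues the invites and removals needed to converge.
    orgs:
      kubernetes:
        admins:              # org owners, kept deliberately small
        - example-admin
        members:             # everyone else with org membership
        - example-contributor
        - another-contributor
        teams:
          sig-testing:
            members:
            - example-contributor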
So we try and automate as much as we possibly can at the scale that we're trying to take actions at: even something as simple as somebody putting in a membership application, and being able to make sure that that membership gets processed, the person gets their invite to our GitHub org, and then gets whatever permission set they need.
At the scale that we're doing it, we need to be able to scale that out, and we also need to be able to deal with limitations in GitHub's permissions model, because it used to be that the only people who could do this particular task needed super-owner privileges over everything that we have, and we try to limit those as much as possible. So now, when we take a defined configuration in GitHub and have a bot do all the work for us, we know exactly what's happening.
Aaron: Yes, and so, if you want to do anything GitHub-related on this project, this is where you probably go: the kubernetes/org repo. We use that handy-dandy feature of GitHub where you can use predefined templates for different issues. Generally we expect, if you're coming to this repo, that you want general support, or you want to be added as an organization member so that you can do fun, cool stuff like be assigned issues and PRs. Wait, sorry, I mean actually applying LGTM to PRs: you want review privileges.
Basically, org membership is the thing we use as a proxy for: we trust you, and you have decided you also trust us and want to work on the project with us. One of the other ways we use this often is that, if you're not a member of the org, we apply a needs-ok-to-test label on your PR, because you could be some random person who's trying to submit a Bitcoin miner as a pull request so that you can run it on our fancy-dancy test infrastructure.
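A trusted org member lifts that gate by leaving a Prow comment command on the PR; the exchange below is an illustrative sketch:

    (PR arrives from a non-member; Prow labels it: needs-ok-to-test)
    reviewer comments:   /ok-to-test
    (Prow removes the label and triggers the presubmit jobs)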
Also, if you have some random third-party integration that you would really, really like to try out on just your repo, you end up getting punted to: well, a GitHub admin has been sent a notification about that. I really don't have a great way to reply back, other than stalking down the person who tried going through that flow, because we need to have a larger conversation about why you are adding that integration, just because Kubernetes is a super large open-source project and we take everybody's privacy seriously.
We find that there are many third-party integrations that require more privileges than perhaps they should, and we'd like to understand if one is going to cause any headaches for us down the road. So, for example, those of you who remember Reviewable: a couple of years ago it was used as an alternate code review path, instead of doing pull requests, and we found that it just wasn't really worth the hassle from an administration perspective. I also wanted to punt us back to the Testgrid question.
I seem to have already blown away the tab, but I want to make clear that anybody in the Kubernetes community can contribute their test results back to Testgrid through a PR process. This is here for you to use. So we have a conformance tab, for example, where different Kubernetes providers who want to demonstrate that their offering is certified as conformant can show their results, not just at the moment they opened up a PR to the CNCF repo for that. But, for example... boy, I hope this works. Look.
So, that said, I feel bad that Testgrid isn't open source. I definitely have a dream that I want to be able to stand up a Kubernetes cluster and have an opinionated CI stack right out of the box. Prow is super cool, but it doesn't really provide this sort of historical view of all of the test results. You do kind of get a view of what jobs Prow is running right now, along with some filter dropdowns that you can use to go by presubmit or periodic or postsubmit.
We want to open-source Testgrid. There are two parts to it. One of those is the front end that displays all the things. I know there are probably some of you who know a thing or two about CSS or JavaScript and would love to finally make Testgrid display the timestamps in local time instead of Google Standard Time... I mean, Pacific time.
There are some of you who might like to understand what all of these columns actually mean; you can't actually hover over them, for what it's worth. Here's another pro tip, maybe you haven't noticed: that stands for the commits, apparently, and that stands for the time. I'm sure this is all super tiny. There's also the backend component, which is the piece that is responsible for scraping all of the data out of Google Cloud Storage buckets and translating it into data that is more efficiently consumed by this UI.
We want to open-source both of those things, and we want to hear your use cases for why you agree it should be open sourced. What do you, as a potential customer of this offering, want? So if you want to use Testgrid, if you think it should be open source, please come talk to me. Or, I hate to do this to you, Michelle, but I will call out the fact that that is basically the Testgrid author right back there with her hands up. Yay.
The current maintainer of Testgrid. But much as I am affectionately referred to as 'the community' within my team, Michelle is 'the Testgrid' within our team, so all credit to her, for example, for this recent change where, if you know there's a job that you care about but you forget which stupid tab or dashboard it's in, you can just type in 'AWS', go see the conformance Gardener thing, hit enter, and it'll go.
Okay, lots of opinions on that one; just stop me from monologuing if I go on too long. Yes, I first see a need for each SIG to kind of own their tests. We do this partially by saying: here are the different groups of dashboards that correspond to different SIGs. So I sort of expect that for everything under sig-aws, for example, SIG AWS is responsible for making sure that every single job on each of these dashboards is solidly green, all the time.
Okay, you've got a couple of red ones. Sorry, let me try calling out a different SIG. Let's be fair and see what SIG GCP has. Okay, also a little bit red. This is the reality of the project today: almost every single SIG has a bunch of failing tests, and it's really impacting the velocity of the project and the health of the project. I would like to help SIGs change this.
One way, for the SIGs that I am aware of that actually have this handled, is that they send their test results to a Google Group, and they have an on-call person responsible for triaging. So as long as you actually have your tests green to begin with, you can then have somebody make sure that they stay green. That's one way of doing it. You also said the magic word, federation. I just want to hammer on the point that Testgrid gets all of its data by reading from Google Cloud buckets; it doesn't really matter how the data got into those buckets in the first place.
So while I love Prow, and I think it's awesome, and we run tens of thousands of jobs a day through Prow, we also support reading results that are put in there by, say, Travis CI or Circle CI or, if you are so inclined, Jenkins. As long as that bucket is publicly readable, we will read the data out of it and consume it.
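That federation story is just configuration. Here is a minimal sketch of a Testgrid entry that reads results a third party uploads to a public GCS bucket; the group, dashboard, and bucket names are illustrative:

    # Point a test group at any publicly readable GCS prefix laid out
    # in the expected results format, then surface it on a dashboard.
    test_groups:
    - name: example-provider-conformance
      gcs_prefix: example-bucket/logs/conformance-test
    dashboards:
    - name: conformance-example
      dashboard_tab:
      - name: example-provider
        test_group_name: example-provider-conformance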
So this is how the conformance stuff works, for example. Though I'm not sure they actually have any data in here, we do read from buckets that DigitalOcean and others can populate to prove that their offerings are conformant. So this is one way that we can encourage federation of periodic and postsubmit results. Then the other question becomes presubmit results. Generally, if you're running your own project, we trust you to take care of what's important to you from a presubmit perspective.
So, like the cluster-api-provider-aws project: I'm sure they've got their own presubmits that they care about, or maybe the kustomize project. I start to get a lot more finicky about what the requirements are to have a presubmit when it comes to blocking the kubernetes/kubernetes repository or an actual release. That's where we've tried to put together a set of documented policies and requirements for what it means to be release-blocking, and I anticipate we will do something similar for merge-blocking. So here.
Christoph: There are ways that we can, and have in other repos, and are going to be trying to tackle this as much as we can for kubernetes/kubernetes, where you can take different pieces and split them up into separate tests. That has the benefit that, number one, they can run in parallel: when we start testing on a particular commit, almost all the tests will start as soon as there are CI resources available to run them.
The other advantage, as far as dealing with flakes is concerned: if a particular test is flaking for some particular reason, you can retest just that specific test in Prow, as opposed to running one large CI job where, okay, it failed in this run due to a flake, okay, I'm going to hit retest, and it's going to take another hour and 20 minutes
for that particular job to run. The more we can split those pieces up, the better the experience contributors will have: if retesting after a flake ends up taking you 15 minutes as opposed to taking an hour, that makes a really big difference as far as the impact those flakes have on PR authors.
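That granularity comes from Prow's retest commands; the job name below is a placeholder:

    /test pull-example-integration    # re-run just the one flaked job
    /retest                           # re-run every failed presubmit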
Aaron: So what I clicked through here was an example of a Prow job configuration file. You may have actually read this; I forget who wrote what, but this is an example of a job that basically involves running this container. This should look an awful lot like a pod spec to those of you who have worked with pod YAML day in and day out: here's the image we want to run, and here's the command we want to run inside of that image.
Here's some information about the repo you're going to clone in order to run this job, and then here's a bunch of stuff that we want to use to understand when and how to schedule the job. We use these labels for things like automatically pre-populating service accounts or credentials to stand up a cluster in AWS, for example.
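Putting those pieces together, a sketch of such a job might look like the following; the job name, image, repo, and preset label are illustrative assumptions about how Prow's preset mechanism injects credentials:

    periodics:
    - name: ci-example-e2e-aws          # illustrative name
      interval: 6h
      labels:
        preset-aws-credential: "true"   # assumed preset; matches a preset
                                        # that mounts AWS credentials
      extra_refs:                        # the repo cloned to run the job
      - org: kubernetes
        repo: example-repo
        base_ref: master
      spec:
        containers:
        - image: example.com/e2e-runner:latest
          command: ["./run-e2e.sh"]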
This job is just a YAML file, and it lives inside of a repo, in a directory that's named after the repo it works against, and it has an OWNERS file, and that OWNERS file has these people listed as approvers. So if you want to add a job for the cluster-api-provider-aws repo in kubernetes-sigs, you don't have to talk to anybody from SIG Testing at all. You can talk to Justin, or Tiberius, or Chuck, or David Watson, right?
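An OWNERS file is itself a small YAML file; a minimal illustrative sketch with placeholder GitHub handles:

    # OWNERS file sitting next to the job configs for one repo
    approvers:
    - example-maintainer
    - another-maintainer
    reviewers:
    - example-contributor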
We have a similar pattern for the jobs that run against kubernetes/kubernetes, where we have directories named after each of the different SIGs. So sig-network, for example, has a bunch of different files; there's no real consistent naming scheme, but, for instance, here are jobs that stand up tests related to ingress.
Christoph: And we need things like this, because there's no way that SIG Testing as a group can go and be like: okay, we are the managers of all CI and all tests. We need help from the SIGs to be able to not only design the tests that are running and design the jobs, but respond to them when they fail, and fix them.
Aaron: Since we are getting close to time, we have about eight minutes left if there are any burning questions. But one thing I wanted to make sure I showed people, because I often feel like people aren't aware this stuff exists: this is the PR dashboard as offered by Gubernator. So if you, like me, refuse to look at your GitHub notifications, because you have like 800 of them, and you try declaring bankruptcy on that but end up getting another 300 the next day...
There are also 37 PRs that are in some way, shape, or form on Christoph's plate, and Christoph also happens to have two PRs outgoing. So these are PRs he needs to pay attention to from the perspective of: I authored it, I really want it to merge, what do I need to do? You're probably wondering what these things mean and how PRs end up here.
So there's a link at the bottom: 'needs attention' is based on a simple state machine, and because we love documentation, and we thought code is the best form of documentation, you can see we link directly to the state machine code itself. Essentially we try and say: if there's a PR out there, and it's assigned to you or your review is requested, and then one of these things happens, either a comment applies to it or somebody labels it as LGTM, it's going to go back to you.
We could probably stand to document that better. We could also probably stand to maybe have this based on GitHub queries that are specific to certain repos, or something; we've taken a stab at this, but not quite as thoroughly, with the PR dashboard from Prow. So, since we now live in a world where Prow and Tide are responsible for the testing and merging of every single PR across all 147 of our repos, you might have questions about what it takes for your PR to merge in any given repo, and this dashboard uses a GitHub query.
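Under the hood, what it takes to merge is expressed as Tide queries in Prow's configuration. A minimal illustrative sketch; the exact label sets vary per repo:

    tide:
      queries:
      - repos:
        - kubernetes/kubernetes
        labels:                # all must be present to merge
        - lgtm
        - approved
        missingLabels:         # none of these may be present
        - do-not-merge/work-in-progress
        - needs-ok-to-test
        - needs-rebase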
So, for example, we're looking at all PRs that are open that Christoph wrote, but we could, and I'll just call myself out here, also see what it looks like for all PRs that I wrote, and apparently I need to work on, like, passing tests. I also have a bunch of labels on this PR. One of them is good, because it's a required label. One of them is bad, because it shouldn't be there.
So I need to find some way to get rid of the work-in-progress label on my PR, which I can do, because I have titled this PR as work-in-progress; I don't want somebody to accidentally merge it. If I were to remove this WIP text from the PR title, that label would go away. So we feel like this is a more granular view of what it takes for your outgoing PRs, for example, but I have found that the Gubernator dashboard is by far the best
way of keeping up with your workload. So if you work on this project, and you have to deal with a lot of PRs day in and day out, and you've never seen this, I would highly encourage you to give it a shot. It's also a good way to just kind of stalk other people's workloads: so maybe you're waiting on Daniel Smith, also known as lavalamp... don't do that, find somebody else. Which is just another thing.
Christoph: It just comes down to the scale of what we're dealing with. We're talking about 147 repos, including the core Kubernetes repo, which currently has nearly 2,200 open issues and nearly a thousand open pull requests. That is way too much information for anybody to digest, even the people who are very involved, people who are top-level approvers, people who are involved in SIG Architecture.
Aaron: Okay, one last shiny thing I wanted to make sure I show people, because I'm not sure how many people are aware we have this. This is our dashboard that is based on BigQuery metrics, run against a publicly accessible BigQuery data set. So if you know what BigQuery is, or you know how to use it, you can absolutely query the same set of data that we do, to produce maybe a better dashboard for you.
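For a flavor of what those metrics look like, here is a minimal sketch of a metric definition with an embedded BigQuery query; the config shape, dataset, table, and field names are assumptions modeled on the public Gubernator build data, not a verbatim production config:

    # Illustrative metric: jobs with the most failures in the past week.
    metric: failures-per-job
    query: |
      SELECT job, COUNT(*) AS failures
      FROM `k8s-gubernator.build.all`    -- assumed public dataset/table
      WHERE result = 'FAILURE'           -- assumed result field values
        AND started > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
      GROUP BY job
      ORDER BY failures DESC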
On this dashboard I want to point out this table right here, where we show the flakiest PR jobs for the past week. So, for example, right now, unfortunately, the kops AWS job has been the job that's failed or flaked the most this week. A flake is something that passes or fails even though the commit hasn't changed; you've just run the same test.
This is a great place where, if you have time, and you want to help improve the health of the community, and you want to figure out the most important test for you to fix right now, like, the test that is impacting people the most: turns out it's this integration test called test terminal pod eviction. If you could fix that, you would make life really great for a lot of people. This is also a dashboard... Josh Berkus will personally send you a gift if you fix test terminal pod eviction. You've heard it here first.
It's also a dashboard I can use to, at some point, start coming after people whose jobs have been running continuously for over 400 days and continue to fail every single day. I don't think anybody's paying attention to these; I don't think we should be spending money generating these results. Finally, if you feel like your PR is running into a lot of flakes, or things seem flaky lately,
we have this graph that just shows, as a percentage, how many times a given PR job on kubernetes/kubernetes has failed versus passed. So we can see here, for example, that the kubernetes e2e GCE job had kind of a bad day right close to the end of November, and I've used this in the past too: for example, here the integration test job was awful around October, and then shortly thereafter we started having a problem with the kops AWS job. So this is just a good way to confirm your intuition that, yes, this problem isn't just affecting me, it is actually happening elsewhere as well. And if you want to find out how to improve this situation, now that you know what is flaking, come to another talk that is happening sometime during this contributor summit, where I will explain how you can hunt these down and fix them.
Christoph: I think we're at time, so I'll just throw up this one. This is where you can find us. I'm always around, watching the SIG Contributor Experience channels in Slack and our mailing list, and we have issues open in kubernetes/community, if you want to help out with improving the experience of contributors, not only to the kubernetes core repo, but across our entire organization.