From YouTube: Kubernetes SIG Testing 2017-11-14
Description
Meeting notes: https://docs.google.com/document/d/1z8MQpr_jTwhmjLMUaqQyBk1EYG_Y_3D4y4YdMJ7V1Kk/edit
A
All right, hi everybody. Today is Tuesday, November 14th. This is SIG Testing's weekly meeting. I am Aaron Crickenberger, and this is being publicly recorded and will be posted to YouTube eventually. You'll note I'm still behind on posting last week's recording; I should get to that later today, assuming I actually have it with me. Otherwise, it's going to have to wait a couple of weeks.
A
So I wanted to talk briefly about a proposal that I finally got some time on, that was initially wordsmithed by Jace, around how we define the criteria for release-blocking jobs and merge-blocking jobs. I've linked it in the agenda, but I'm going to go ahead and share my screen so you can all see what I'm talking about. Okay. So this proposal here: the TL;DR is that I'm trying to admit this is just a first pass, and I'm trying to put a stake in the ground for today.
A
This isn't intended to be the forever, engraved-in-stone thing, but we want enough actual numbers that we can sort of objectively identify when things are misbehaving, as well as when things are valid to be considered release- or merge-blocking, and we want to have some human review so we understand that everybody's acting in the collective best interest here. So I went based on metrics that I can easily glean by looking at Testgrid. I can pretty quickly figure out whether or not a job qualifies: by looking at the test duration-in-minutes graph on Testgrid, I can pretty easily see whether a job averages finishing a run in less than 60 minutes. I can figure out pretty quickly if that job runs at least every two hours. I can figure out if that job passes 90% of its runs in the past week, based on the summary tab. Whether or not the job is capable of passing three times in a row against the same commit is a little bit harder.
A
A job shows up as failing if it fails ten times in a row. So based on all these things, I can pretty quickly see whether or not a job is the ideal candidate for release blocking, or whether it's misbehaving. And then, if it's misbehaving, you know, I'm trying to go triage and poke humans to fix and resolve the situation, but eventually we're going to reach a point where maybe it's time to talk about removing it.
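As a rough illustration, the criteria above could be checked mechanically against per-run data pulled from Testgrid. The thresholds in this sketch come straight from the discussion (an average run under 60 minutes, a run at least every two hours, a 90% pass rate over the past week, and ten consecutive failures meaning the job shows as failing); the Run type and the fabricated example data are illustrative assumptions, not Testgrid's actual schema.

package main

import (
	"fmt"
	"time"
)

// Run is a simplified, hypothetical view of one job run.
type Run struct {
	Started  time.Time
	Duration time.Duration
	Passed   bool
}

// JobHealth summarizes a job against the proposed release-blocking criteria.
type JobHealth struct {
	AvgDuration      time.Duration // average run duration
	AvgInterval      time.Duration // average gap between run starts
	PassRate         float64       // fraction of runs that passed
	ConsecutiveFails int           // failures in a row at the end of the window
}

// evaluate expects runs ordered oldest to newest.
func evaluate(runs []Run) JobHealth {
	var h JobHealth
	if len(runs) == 0 {
		return h
	}
	var totalDur, totalGap time.Duration
	passed := 0
	for i, r := range runs {
		totalDur += r.Duration
		if r.Passed {
			passed++
			h.ConsecutiveFails = 0
		} else {
			h.ConsecutiveFails++
		}
		if i > 0 {
			totalGap += r.Started.Sub(runs[i-1].Started)
		}
	}
	h.AvgDuration = totalDur / time.Duration(len(runs))
	if len(runs) > 1 {
		h.AvgInterval = totalGap / time.Duration(len(runs)-1)
	}
	h.PassRate = float64(passed) / float64(len(runs))
	return h
}

func main() {
	// Fabricated stand-in for a week of results: a run every two hours,
	// 45 minutes each, with an occasional failure.
	var runs []Run
	start := time.Now().Add(-7 * 24 * time.Hour)
	for i := 0; i < 84; i++ {
		runs = append(runs, Run{
			Started:  start.Add(time.Duration(i) * 2 * time.Hour),
			Duration: 45 * time.Minute,
			Passed:   i%12 != 5,
		})
	}
	h := evaluate(runs)
	ok := h.AvgDuration < 60*time.Minute &&
		h.AvgInterval <= 2*time.Hour &&
		h.PassRate >= 0.9 &&
		h.ConsecutiveFails < 10
	fmt.Printf("%+v\nrelease-blocking candidate: %v\n", h, ok)
}

The "passes three times in a row against the same commit" check is left out of the sketch because, as noted above, it needs per-commit result data that is harder to pull out of the summary views.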
A
So to that end, I just opened up a pull request to talk about removing all the GKE-related jobs, because they have been failing for about two weeks now, and the evidence is just showing that, even though I've seen some people working on it, we still have all these jobs failing for two weeks. So if we want GKE to be a release blocker, we need to incentivize people to, you know, gather enough resources to actually work the problem. But I recognize this is an iterative process, so I want to make sure we get the right discussion from the right stakeholders. The key thing here is I want to make sure this pull request gets looked at by the owning SIG, which would be SIG GCP, and by SIG Release, since they kind of own the dashboard.

The criteria for merge-blocking tests are basically the same; it's just slightly different metrics. It seems like most of our presubmit jobs actually run in less than 40 minutes on average. Some of them spike up a little bit; verify runs a little high. And they flake about 20% of the time, regardless of which commit they're... sorry, not flake, fail. They just straight up fail about 80% of the time, and that's expected: it's pull requests, people submit code that will fail tests, because people are human. But this is all stuff that I can glean from Testgrid. Ideally, moving forward, you know, we want faster or, better, smaller tests, like Eric was talking about.
A
The other ideal thing, and I think I mentioned this during stand-up the other day (I guess I'll stop sharing here), is that it seems pretty likely that Testgrid is a good approximation that I can look at as a human, and that a lot of people in the release burndown have been trained to look at. But I can also start identifying, like, the flakiest tests within those jobs. And while we have a dashboard that does that for presubmits today, because we get a lot of traffic there, it could still be useful for, like, the CI-signal person to go say, hey, we noticed that these jobs are still flaking in the CI tests. So I just wanted to run all of that by this group to make sure that seems sane, and I linked the doc in the proposal.
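One way the flakiest tests within a job could be surfaced is by looking for tests whose verdict flips across runs of the same commit: the code didn't change, but the result did. This is only a minimal sketch of that idea; the TestResult type and the example data are assumptions, not the shape of the real Testgrid or CI data.

package main

import (
	"fmt"
	"sort"
)

// TestResult is a hypothetical per-test result from a single job run.
type TestResult struct {
	Commit string // commit under test
	Name   string // individual test name
	Passed bool
}

// flakiest returns test names ranked by how many commits saw that test
// both pass and fail, a simple proxy for flakiness.
func flakiest(results []TestResult) []string {
	type key struct{ commit, name string }
	sawPass := map[key]bool{}
	sawFail := map[key]bool{}
	for _, r := range results {
		k := key{r.Commit, r.Name}
		if r.Passed {
			sawPass[k] = true
		} else {
			sawFail[k] = true
		}
	}
	flips := map[string]int{}
	for k := range sawPass {
		if sawFail[k] {
			flips[k.name]++
		}
	}
	names := make([]string, 0, len(flips))
	for n := range flips {
		names = append(names, n)
	}
	// Most flake-prone first.
	sort.Slice(names, func(i, j int) bool { return flips[names[i]] > flips[names[j]] })
	return names
}

func main() {
	// Fabricated example: TestB flips on both commits, TestA is stable.
	results := []TestResult{
		{Commit: "abc123", Name: "TestA", Passed: true},
		{Commit: "abc123", Name: "TestB", Passed: true},
		{Commit: "abc123", Name: "TestB", Passed: false},
		{Commit: "def456", Name: "TestA", Passed: true},
		{Commit: "def456", Name: "TestB", Passed: false},
		{Commit: "def456", Name: "TestB", Passed: true},
	}
	fmt.Println(flakiest(results)) // [TestB]
}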
B
Okay. It might be something outside the scope of this SIG, but it's possibly something to throw at ContribEx, so maybe; I'm not sure if they're the right owner. I feel like we might need to do more to make sure that SIGs are aware that they need to have someone, or several someones, monitoring the test failures, and that that handle or team...
A
All those things: it's something I'm trying to do as loudly, and with as much visibility, as I can. I'm always wary of the fact that SIG GCP is new and nascent. That's why I went to their introductory meeting and gave them the heads-up about this. I know they're understaffed, so that's why I raised it during the community meeting, where, ideally, some Google product managers are sitting there publicly saying yes, we will commit resources to this, so I'm holding them accountable there.
A
The next step would be to have some mailing-list traffic on the kubernetes-dev mailing list, and I'm generally pinging the actual Slack channels as well. So I agree: I can't just, like, open pull requests and say I've done it. I've got to be as vocal about it as possible and recognize that we're all trying here. But I will definitely raise it at ContribEx tomorrow, because I've been kind of a big vocal opponent of the GitHub teams with all these prefixes on them and stuff. Each SIG needs a point of contact to get in touch with when tests fail, and so having a unified team across each SIG seems right, speaking for me as a human. And if we find that this works out, it's not too hard to then have, like, some script take the SIG owner out of, you know, wherever the test config data ultimately lands, and actually automatically do these notifications. I also...
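If the job config does end up carrying an owning SIG per job, the notification script mentioned here could be very small. This is a hypothetical sketch only: the JobConfig shape, the field names, the example job names, and the notify stand-in are all made up for illustration, since where the test config data ultimately lands is still open.

package main

import "fmt"

// JobConfig is a hypothetical shape for a job entry once it carries an
// owning SIG; the real format is whatever the config data ultimately lands as.
type JobConfig struct {
	Name string
	SIG  string // e.g. "sig-cluster-lifecycle"
}

// notify is a stand-in for whatever mechanism gets chosen: a mail to the
// SIG list, a Slack ping, a GitHub issue, and so on.
func notify(sig, message string) {
	fmt.Printf("[to %s] %s\n", sig, message)
}

func main() {
	jobs := []JobConfig{
		{Name: "ci-kubernetes-e2e-gce", SIG: "sig-gcp"},
		{Name: "ci-kubernetes-e2e-kops-aws", SIG: "sig-cluster-lifecycle"},
	}
	// Jobs currently showing as failing, e.g. scraped from Testgrid.
	failing := map[string]bool{"ci-kubernetes-e2e-kops-aws": true}

	for _, j := range jobs {
		if failing[j.Name] {
			notify(j.SIG, fmt.Sprintf("%s has been failing; please triage", j.Name))
		}
	}
}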
A
That's a separate thing I'm going to go try and chase down. And it turns out, in talking with folks, AWS doesn't actually own kops; SIG Cluster Lifecycle owns kops. So I'm going to be opening up another PR to ask everybody if they think that is correct. But like I said, my goal here is to try and have as many of these discussions in a pull-request-driven workflow as possible, because eventually, I think, anybody should be capable of nominating their job.
B
Release 1.9 is coming; that's Thursday. I have been looking at what we did for the past releases and talking to Zen, who did those, and I think mostly Zen is looking to reorganize the naming, finally, to be unified with what the jobs are, so stable1, stable2, and so on for each of the other releases, and something that will be in before then, which will...
A
That's so freakin' peaceful. Yeah, all the release dashboards have slightly different sets of jobs. I'm going to be dropping off the face of the earth Thursday and Friday, but you can tag me on issues related to that. I'm interested in this, as is the CI-signal person, because I was just chatting with Eric Chang, who was CI signal for the last release, and, like, I don't know where I should be looking for upgrade jobs. There are, like, a whole bunch of different dashboards that have the word upgrade on them, and some of them seem redundant. So, like, maybe we can just kind of hash out what we think those are supposed to be. I will totally take your guidance on what you think the standard should be to stamp stuff out, and if I need to, I'll help with that definition.
B
One thing to add on the blocking jobs: I didn't see for sure if this is in your doc, but I think at some point we might want to clarify a bit more when they're allowed to run. Because I think, for example, if we can get it working again, something like cross should probably block if it's triggered, but it doesn't necessarily run against every PR. It does run in CI. Okay, it's a much longer job, but, like, if you change the build, we want to make sure the build still works. Right now it kind of works on trust: it triggers, and people pay attention to it and try not to merge while breaking it, but it's not enforced yet. But I think in the future there are a few tests like that, where we have a pretty good idea of whether they need to run or not, and it's going to be slower, and we still might want to block on it. It's a lot slower, yeah.
A
That's where I wasn't... I mean, I question whether or not this gets mixed in with Tide and status contexts on GitHub. There seem to be some presubmits that aren't blocking, so I'm wondering if it makes sense to break up the dashboards into a presubmits-blocking dashboard and a presubmits-non-blocking dashboard.
A
I can probably PR that, maybe today. Yeah, and if I can't PR it, I'll just file an issue. You mentioned pull-kubernetes-cross; that's actually been failing for the past day. I wasn't sure if I should... I think...
A
So these next two are kind of related. SIG Release wants to know if we have any major migrations planned in the next weeks, and whether or not we should think about setting any kind of freeze for test-infra before we do that. Do we think we could... where do we feel we are on potentially getting Tide rolled out before code freeze? I don't...
A
I mean, so the thing I put in there, like I thought when this was raised last time: you know, I think we kind of agreed that just freezing in general out of paranoia is bad, because it hampers the productivity of this team. Small incremental changes are better than letting huge things build up. So it's more about evaluating what we're planning on doing against whether that could be seen as high-risk or destabilizing the queue. So I would just ask you, maybe, as, like, the test-infra guy: if you happen to catch wind of some migrations you're doing and you think, hey, that might be a little disruptive, maybe we should hold off on that until after code freeze, you know, elevate it. And these are the sorts of things that'll start coming up kind of daily during the release burndown period. Like, I think the biggest question for me was whether or not Tide was actually happening.
D
I mean, most of the plugins, besides the submit queue, are not going to be terribly disruptive, I don't think. It should be pretty easy to move, and since we have enough other repos, I've been able to canary them on test-infra and other, you know, other smaller repos. So it'd be pretty safe to do for at least some of the plugins. I also...
B
I think in general test-infra has very much gotten into a place where we're heavily trying things out, not against anything that affects k/k in particular, before we switch over. So I think we should be safe to do some migrations that have been canary-tested like that. But I don't think we'll be doing anything super disruptive, and we definitely won't be doing anything that hasn't been extensively canaried.
B
I do think a Tide migration would actually be relatively safe. I think the things that are going to break are just that it's fundamentally designed differently. Like, even now, batch merges are not that reliable; they're pretty easy to break. So, you know, if batch merging was acting up a bit, I think we'd just be told, oh yeah, we...
D
We're already running on two repos, I think, but I think the other one is Federation, which has, like, nothing in it right now. As soon as we have... what's that...
E
I was talking with him this morning, and it seems like they're a little bit smaller, yeah. But I mean, like, the other blockers for you guys, having the contexts be green, don't matter for us, for Origin, because, like, that's already happening for our repos. So as soon as we have a new one, I think we're happy to stress-test it on ours, which has, like, more throughput, I think.
B
We need... I think maybe we should just add a Prow plugin that allows people to clear those, because someone can come in right now, drop a slash-test on some random test, and then you're going to have the status for that and you can't get rid of it, right? Maybe we could allow, like, owners or something to clear those.
E
Didn't we have an experimental refresh plugin that was doing that? Like, run the tests again and ignore the stale stuff, something in that vein. I think either it's that, or it was reconciling stale contexts or something of the sort, but, you know, maybe that would be an appropriate place to put that code. I think it's still a work in progress from a PR this summer. Oh...
B
Stopping Tide from, like, checking the contexts, because...
B
Yeah, I'm not sure where exactly makes the most sense, but I think we could have a pretty simple thing somewhere where some component is allowed to wipe away contexts for jobs that, you know, maybe somebody wanted to see, or forgot to set skip-report for. I mean, maybe we don't need owners for this. We, you know, try to enforce not reporting to GitHub, but somebody's going to have a stray status, and that's going to block. If we could just wipe away that status quickly... I mean, today they...
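For what it's worth, the "wipe away a stray status" idea is mechanically simple against the GitHub API: statuses can't be deleted, only overwritten, so clearing one means posting a fresh terminal state over the stale context. The sketch below shows that using go-github; the repo, SHA, and context name are placeholders, and whether this belongs in a Prow plugin, in Tide, or somewhere else is exactly the open question here.

package main

import (
	"context"
	"log"
	"os"

	"github.com/google/go-github/github"
	"golang.org/x/oauth2"
)

func main() {
	ctx := context.Background()
	ts := oauth2.StaticTokenSource(&oauth2.Token{AccessToken: os.Getenv("GITHUB_TOKEN")})
	client := github.NewClient(oauth2.NewClient(ctx, ts))

	// Placeholders: the PR head SHA and the stale context somebody triggered
	// by hand and now cannot get rid of.
	owner, repo := "kubernetes", "kubernetes"
	sha := "0123456789abcdef0123456789abcdef01234567"
	staleContext := "pull-kubernetes-some-optional-job"

	// Statuses cannot be deleted, so "wiping" one means overwriting it with
	// a state that merge automation will treat as resolved.
	_, _, err := client.Repositories.CreateStatus(ctx, owner, repo, sha, &github.RepoStatus{
		State:       github.String("success"),
		Context:     github.String(staleContext),
		Description: github.String("context cleared; not required for merge"),
	})
	if err != nil {
		log.Fatalf("overwriting status %q: %v", staleContext, err)
	}
	log.Printf("cleared %s on %s/%s@%s", staleContext, owner, repo, sha)
}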
A
That might be valid. We're kind of over time here; I think, like, this is something we should hash out in an issue or a breakout or something. I would normally keep the meeting going, but I kind of have to run. I guess my other... the other thing that popped into my head about potentially breaking migrations would be the approval handler that we've been canarying on other repos. Is that something we would anticipate turning on for k/k, or are we going to stick with mungegithub?
D
I'm planning to turn that on for k/k. It's been working everywhere that I've deployed it so far; the only difference is that it's faster and less buggy. There were three repos that I wanted to apply it to, and I made PRs for that, and I've applied it to two of them today. I'll do the last one, and then once that's in, I'll do k/k later this week.
B
That would be my argument: the only reason we might still want to consider finishing hashing out this Tide stuff and making it happen is that the mungegithub toolchain is quite buggy and stops us from doing things like making cross blocking. It would be great to migrate to Tide, which is much smaller and easier to debug.
A
Yeah, I really want to see that happen, but I just want to make sure we're diligent about keeping things predictable, even if they're buggy. So I do like the idea that we've canaried the different handlers or plugins and then we migrate those over, because that means, you know, when somebody complains about a problem, we can't necessarily say oh, that's mungegithub, it's dead code, we're not fixing it; we actually have the power to fix it, because it's Prow. Anyway, this actually ran way longer than I thought it would.