From YouTube: Technical Oversight Committee 2021/01/08
Description
Istio's Technical Oversight Committee for January 8th, 2021.
Topics:
- Which versions of Kubernetes to support in Istio 1.9
- Current Test Flake status
- Update on docs working group
A: So, if you haven't already, please add any features you think need work this year to the Istio 2021 draft, together with the priority, the stages, the user benefit, and the contributors. That way, once we are ready in two to three weeks, we can have it reviewed with the TOC folks.
C: And see what we have. All right, I guess Jacob or Sam, do you want to talk about the upgrade working group charter?
E: Yeah, so we got approval from the Environments group as well as Test and Release. Josh wanted to talk to Neeraj a little bit more about that, so hopefully they've discussed it.
F: Yeah, okay, well, we'll sync up. And I think part of it also is that I want to know: is this a long-lived working group, or is it more like a faction that lives long enough for upgrades to get better and then just merges back into one of the existing working groups?
E: That is the expectation, but obviously, if there are other concerns, or if people think it is a worthwhile long-term endeavor, then I'm imagining we could keep it long-lived. Yeah.
F: Frankly, my other concern is just whether we have enough people working in this working group. That's what I wanted to talk to Neeraj about.
H: Yeah, I guess one question I have is on the 1.9 work item. It would be great if we could link to the issues, or maybe have an umbrella work item. For instance, Eric and I have been having discussions on how we can help as part of the 1.9 effort, so we try to link the issues Eric is going to help us with, from the scenarios that matter most to us. I noticed there are a lot of issues.
E: Okay, yeah, that is an action item that I can bring up and try to make a little bit more formal.
A: Jacob, will you talk about supported versions?

E: Yeah, so this hasn't been raised; unfortunately I didn't talk about this in Environments. But typically we bump up the supported Kubernetes versions: 1.8 supported 1.16 through 1.19, and 1.20 was released in December. So I'm just looking for verification that we intend to support 1.20 as well, i.e. 1.17 through 1.20 for 1.9.
C: On the topic as a whole, we've had questions about this in public forums. Mitch has given a presentation about this, so I guess there's a question.
C: Anyone else, feel free to correct me if I'm not stating this correctly, but if we stayed with n minus two, one of the supported Kubernetes releases would fall out of the window, which is probably not a good idea for the project.
C: Right, there are three questions. One: when do we start to qualify 1.20? Two: do we drop 1.16? We can treat those as independent questions. And then, what release would we consider to be the sweet spot, which is probably not 1.20? If we were going to do manual qualification, or other things that are not automated or repeatable, which release would we recommend for manual qualification, or even for day-to-day development work?
L: There is this minor issue where pod termination has a regression, so it takes super long for pods to terminate and our tests time out. We're working on getting that fixed before we adopt it in our CI. But fundamentally it's broken in Kubernetes.

C: So, John, I know in 1.20 there was a lot of work done in the Kubernetes community on dual support for IPv4 and IPv6. Does that impact us, make our life easier or harder? Do you know?
C: There are things which I don't think are that impactful for us either, or that don't really represent a change. So it sounds like getting the automation working is just the major thing here.
C: It's a good question. I don't necessarily know if we have to answer that now; we could choose to refine it later. I brought up the point of sweet-spot testing, where clearly, I think, we would like the long-running tests to run against every known supported release. When we were discussing the Kubernetes version matrix last time, we weren't sure whether increasing this support is necessarily going to help our cause, or maybe I'm misplacing my facts here. Can somebody remind me: will this help if we make it four going forward, and then maybe five going forward?
L: If there's nothing going on, it's super easy; it is literally zero time, other than the test runs, but that doesn't really impact anything. But then, if we decide at some point that we need some feature that's only supported in 1.17 plus, it could become extremely hard.
F: If we say n minus four, how hard is it to change that later if we run into something really difficult?
L: Maybe that's not reliable enough for users. I don't know.
C: I'm not sure if that really helps, right? That's...
C: That's the reason why Kubernetes went to n minus three: people can't afford to do more than one or two upgrades a year, and with that policy in place you're always going to be out of support with Istio for three months, or something of that sort. That's why Kubernetes extended their window.
G: There is also the other side of upgrade, which is someone trying to leapfrog several Istio versions, and now they're forced to upgrade their Kubernetes version, which they hadn't planned.
C: I think we'd probably all agree the Kubernetes upgrade and support window is less problematic than the Istio upgrade and support window.
C: And then there are the vendors, right. So if...
K: Yeah, I think I agree with John that it makes sense to make these decisions in an ad hoc manner: we have a minimum bar that we support for Kubernetes, and at each release we evaluate how much more than that minimum bar we can support. I think that's going to be great for our users, and it also keeps us from tying the project to a policy that might prevent us from moving forward.
C: Yeah, okay, so I think what we're saying is we will bump, but we should evaluate whether we can carry 1.16, or at least we should have a positive indication about why we shouldn't carry it.
I: Do we know where the current vendors or hosted Kubernetes providers are at? Are most of them at 1.16?
H: I think they fall out of support. We auto-upgrade for fixes, like from 1.16.15 to 1.16.16 or .17, but for the major ones they have to click the button to update, like major/minor.
C: All right, so we need an action item to go follow up and see what the vendor states are, or are likely to be, by the time we hit the release.
F: Actually, sorry, are we n minus three right now? Yeah, okay, sounds good in terms of having a cap.
C: Well, a reasonable way to define the cap is the minimum Kubernetes release supported by one of the major four or five vendors (IKS, OpenShift, AKS, EKS, GKE) and use that as the rubric, or maybe some other thing, but in general that's what we would look at by the time of the release. So we can make this decision late, because we're not going to use this for manual testing; we're only going to use it for any automation. That's...
J: It was end of life, I think, in September.
C: Okay, yeah, then we're still n minus two, until 1.9 becomes the tail of n minus three, one nine, sorry, 1.19, and then we can choose to be one or two more than that, based on what the vendors have.
O: I don't know if this was a comment from John, that we even test on 1.15. Can we take a more pragmatic approach: when it's becoming expensive to support a release, we drop it, and we just leave it at that. I mean, maybe we can say that what's officially supported is n minus three but, hey, it works on 1.15.
F: Okay, so what's the summary? It's n minus three at a minimum; sorry, supported n minus three at a minimum. We might go up to n minus four on a case-by-case basis, and we may qualify beyond that, but we wouldn't support it.
C: Yeah, okay, and when Kubernetes goes to n minus three, then we'll be n minus three, but that policy will stay the same.
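As a rough illustration of the "n minus k" window being discussed here, a minimal Go sketch follows; the function name, the flat minor-version numbering, and the example window size are assumptions for illustration, not actual Istio release tooling.

```go
package main

import "fmt"

// supportedKubernetesVersions sketches the "n minus k" policy: given the
// newest Kubernetes minor (e.g. 20 for 1.20) and a window size k, it lists
// the minors an Istio release would claim to support. Illustrative only.
func supportedKubernetesVersions(latestMinor, window int) []string {
	var out []string
	for m := latestMinor - window; m <= latestMinor; m++ {
		out = append(out, fmt.Sprintf("1.%d", m))
	}
	return out
}

func main() {
	// With 1.20 as the newest release and an n-minus-three window,
	// Istio 1.9 would cover 1.17 through 1.20.
	fmt.Println(supportedKubernetesVersions(20, 3)) // [1.17 1.18 1.19 1.20]
}
```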
C: I don't necessarily want to start trying to describe the rules, the policy. I think it's simpler to just say what the supported releases are, and say: if you care about these Kubernetes releases, if you care about older releases, well, here are the ones we've tested against, but these are no longer supported. We'll just find some way to describe the meaning without having to describe the rules that we used to generate it.
D: So we actually do already document, on our release cadence page, how often we release and how long releases are supported. It is woefully out of date, so we should probably update it. We could either get rid of it and just say "look at our supported releases page for the support details", or we might want to actually reword it to be up to date.
C: Yeah, I mean, right now the release cadence page even has an LTS section. Yep, okay, that's left over from before. Jacob, can you guys take on getting rid of the release cadence page and just rolling it into the supported releases page?

E: Sure, I can do that. Is this approved by the TOC?

C: I think so, yeah.
C: Okay, thanks for bringing that up. Test flake dashboard?
K: So there were questions about test flakes. The dashboard was only partially functional for several months there; it's back up to speed now. It's automatically updating; you can see flakes as of January 5th right now, and it updates about once a day. So, our flake rate: if we remove the date filter here and look back historically over all time, this chart shows the ratio of PRs which do not experience a flake to those that do, and in December 21% of all PRs experienced at least one test flake.
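As a minimal sketch of the metric just described (the share of PRs that hit at least one flake), assuming a simple per-PR record of whether any presubmit run failed and later passed on the same code; the data model here is an assumption, not the dashboard's actual schema.

```go
package main

import "fmt"

// prResult is a hypothetical per-PR record: Flaked is true if any test failed
// on one run of the PR and then passed on a retest of the same code.
type prResult struct {
	ID     int
	Flaked bool
}

// flakeRate returns the fraction of PRs that experienced at least one flake,
// mirroring the "PRs with a flake vs. PRs without" ratio shown on the chart.
func flakeRate(prs []prResult) float64 {
	if len(prs) == 0 {
		return 0
	}
	flaked := 0
	for _, pr := range prs {
		if pr.Flaked {
			flaked++
		}
	}
	return float64(flaked) / float64(len(prs))
}

func main() {
	prs := []prResult{{1, true}, {2, false}, {3, false}, {4, false}, {5, true}}
	fmt.Printf("flake rate: %.0f%%\n", flakeRate(prs)*100) // flake rate: 40%
}
```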
K: The goal that we had set as a project, way back in the code mob days, was five percent. We've never actually achieved that goal, but we did get pretty close in April, when we were at just seven and three quarters percent. So we do see a decent regression in flakiness, which is borne out, I think, by John Howard's experience and others' in the project.
K: Two themes I'll call out. First, we see a lot of the multi-cluster suites creating a decent amount of flakiness right now: this one here is not a multi-cluster suite, but this is a multi-cluster suite, and this is also a multi-cluster suite. The other theme is telemetry: here's the telemetry multi-cluster suite, and here's the telemetry non-multi-cluster suite. Those are the two biggest areas of change I see over the last few months in terms of increased flakiness, for anyone who is interested in identifying flakes and root-causing them.
K: This section up here gives you a list of recent flakes, the 20 most recent, and if you filter by test name you can get to the most recent flakes in your test suite. So it's a good launching point for troubleshooting and identifying root causes, and I hope to see us get back down into the five-to-seven-percent range soon.
A: So, one more question on that. Mitch, thank you for picking it up; I know it was down for a while, so thank you for working on it. The same thing Josh asked: what about the maintenance? I know there were plans of moving that to Test and Release (TnR).
K: So the idea was that I needed to get it back up and functioning before I could hand it off. We've still not really identified a future owner for it. I do have one volunteer from the community who heard about it in the Test and Release group this week, and he's coming up to speed on operations.
K: Ideally I'd like to have about three of us who are able to help contribute and maintain this, mostly for upkeep, although if there's interest in adding features, there are certainly a good number that could be added to make this more useful over time. So I am looking to TnR for a few more volunteers.
C: Getting back to Josh's question about how we get this into developers' faces as part of their normal flow, is there...
K: This system, as it is implemented today, is completely disconnected from Prow. It reads Prow's output on a periodic basis, currently once every 24 hours. The EngProd team had committed to integrating this with Prow, and then our EngProd representative left the team, and I believe the commitment left with him.
G: I think, though, before that, the person that experiences the flake is not actually the person, or the people, that should be fixing the flakes. For example, unfortunately telemetry has lots of flakes here, so I think it's really up to the people that maintain these subsystems to know about this dashboard and go there, not so much the people experiencing the flakes.
H: If I don't own those tests, I don't even want to look at them. And if one is a valid failure, I could actually look into the logs to see how my PR could have introduced it, but today I don't have that knowledge. What I do is retest and then see if the same tests come back, because why would I spend time looking at the logs when I can just spend one second to have the CI system retest everything?
L: And I think part of the problem is that you don't know if the test is flaky until you rerun it and it passes; otherwise it could just be that your PR is bad and making the test fail, and that's also what you're checking by retesting it. I'm not saying you shouldn't do this, and I think everyone, including myself, does, but that's part of the reason why we have flaky tests: someone submits a PR, they're introducing a new flaky aspect, it fails, they retest, it passes, and it merges. So it's complicated.
L: Ideally, we should open an issue that says "foo bar test is flaky" and try to find out who owns that test. Then, when you see a failure on your PR that you don't expect is yours, you should search the issues, and if that test has an open issue saying it's a known flaky test, it's probably fine to retest. If not, then you should look at whether your PR caused it, or maybe open an issue that this is flaky, so we can resolve it.
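A minimal sketch of that triage step, assuming the public GitHub search API and a "flaky" keyword in issue titles; the test name and the query shape are illustrative assumptions, not an Istio convention.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/url"
)

// findFlakeIssue searches open istio/istio issues mentioning the failing test
// name, so a contributor can check for a known flake before hitting retest.
func findFlakeIssue(testName string) error {
	q := url.QueryEscape("repo:istio/istio is:issue is:open flaky " + testName)
	resp, err := http.Get("https://api.github.com/search/issues?q=" + q)
	if err != nil {
		return err
	}
	defer resp.Body.Close()

	var result struct {
		TotalCount int `json:"total_count"`
		Items      []struct {
			Title   string `json:"title"`
			HTMLURL string `json:"html_url"`
		} `json:"items"`
	}
	if err := json.NewDecoder(resp.Body).Decode(&result); err != nil {
		return err
	}
	if result.TotalCount == 0 {
		fmt.Println("no known flake issue found; investigate or file one before retesting")
		return nil
	}
	for _, it := range result.Items {
		fmt.Printf("possible known flake: %s (%s)\n", it.Title, it.HTMLURL)
	}
	return nil
}

func main() {
	// "TestReachability" is a hypothetical test name used for illustration.
	_ = findFlakeIssue("TestReachability")
}
```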
L: Previously, when we had a bunch of flakes, it was because we had bad tests or bad infrastructure. Recently, a lot of these test failures are from real bugs in the code. All of these unit test failures, I think, are related to a bug, actually a collection of bugs, that we just fixed, and most of the multi-cluster ones are a bug in Envoy. So this is actually, in some ways...
K: I do also want to call out that oftentimes the root cause of a given set of flakiness does not necessarily belong to the team that owns the suite.
K: I know that Doug looked a little bit at the integration, the telemetry, failures, and he seemed to think that there was some sort of underlying infrastructure problem associated with them. So while we do rely on the working groups to at least begin troubleshooting and identifying a root cause, responsibility will probably pass off from one working group to another.
K: My guess on that is that it's simply noise in the signal.
K: One thing I do remember experiencing back in 2019, when we had a very big problem with flakes, is that running more tests concurrently tends to cause a higher rate of flakiness. I don't know if that's the case today, and I can't prove it, but given that this dashboard is updated through January 5th, and January 5th was the first workday of the new year for Googlers and the second for, I think, most other people, I think that's what we're seeing. We should see in another week.
L: That may have a role in it, but we made a bunch of changes to the tests. I'm very confident that the majority of the unit test flakes are fixed; I'm close to having it set up where it's just running the unit tests in a loop overnight, and so far it's going pretty well. Steven and Doug also made some big fixes, like the multi-cluster one: we used to run five clusters and now we run three, so there are three-fifths as many opportunities to flake, for example.
K: I would love to be able to get something that shows a day-by-day flakiness rate, so that people like John, who are actively working on flakiness of various suites, can see their results in more real time. The month-to-month granularity is not ideal, but again, we don't have any hours allocated to that in the coming months.
L: Okay, so if you go to the TOC channel, I posted a link. There is a spreadsheet that I periodically update, which looks at something that's a different metric but similar: the number of post-submit jobs that succeeded per PR, so we have per-merged-pull-request granularity. But it's measuring something different, because this dashboard only shows a pull request that's not merged, that fails and then passes, whereas the other case is just running the test once, and if it fails it's marked as failed. I think the other thing is just some education around how we file bugs, rather than ignoring failures and just rerunning the test until your PR passes. Is there anything else we can do now?
F: I mean, I'm thinking out loud here, but there's also this idea I think Louis was bringing up earlier, of trying to incorporate it into pull request reviews; that would be a great avenue to explore. I can imagine doing something like, sometimes I think GitHub does this, there are tools that, when GitHub tries to automatically suggest reviewers for a PR, they look at...
C: All kinds of things; Google has all kinds of properties internally. For the moment, rather than diluting the effort, I would rather just focus on deflaking as aggressively as we can. The other stuff is quite expensive to implement, and if we can stay under a very low percentage of flakes, it probably has almost no value.
C: Right, I understand this is not going to happen overnight, and I understand that there's a middle ground between these two things, but we want to get to the point, and we can obviously keep retesting in the meantime, where if you see a flake, the default behavior is: okay, I'm going to go and harass the person who owns the test.
C: Let's let the working group lead process kick in, and let's let them have an objective. That's the right forum. Taking ownership, as John mentioned, is the thing we need to work on first, and then, yeah, maybe once we've done that, then by all means we can start to have all kinds of behavioral...
K: So I do have one question about the objective itself. When we did the code mob back in 2019, we set an objective of five percent or fewer of PRs experiencing a test flake. That sounds really great, but we've never actually achieved it in the history of the project; since we started using Prow to run tests, which is as far back as this data runs, we've never had that low a level of flakiness.
K: Do we want to adjust that goal? The advantage of adjusting it, I think, is that we can say: hey, if we're above some arbitrary number, let's say 10 percent of PRs experiencing a flake, maybe we need to double down our efforts on minimizing flakiness, maybe we need a bit more of a "hey, working group leads, we should de-prioritize some other work so that we can get these flakes fixed", whereas if we're under ten percent, maybe that's acceptable.
C: Better. So having a goal for 1.9 to be under five percent of test flakes, using the working group leads meeting as the vehicle to hold ourselves accountable for that, and raising it to the TOC if there's an issue, seems perfectly reasonable now that we have visibility, because we didn't have good visibility before; John had visibility by other means.
I: Okay, looks like that's Doug's objective for this year; he's already taken care of ten of them.
C: I agree there are issues with distributed ownership; that's why we want to bring them into these forums, where the leads come together, or the TOC, so that we can wrangle those things. If there's a thorny flaky issue that requires multiple owners, then that's the right forum and we should decide it there, and this tool provides the right kind of insight for that.
N: I mean, it's not necessarily quicker to always ask me to fix a test, right? Oh yeah. So that's...
K: This system does not have any false positives; it does have false negatives. So we could write a system that says: hey, if we observed a flake in your tests that we've never observed before, or maybe just haven't observed in the last 10 days, maybe we need to block your PR until that flakiness is resolved.
L: It's not...

C: Well, the root-causing thing was just looking at PR history and test results; I'm just doing correlations.
C: So I suspect John's thing about... sorry, go ahead, I'll let you finish. Right, I mean, one thing that we do at Google is you can indicate that a test is sensitive or potentially flaky, because it's doing something big and nebulous, and so the test infrastructure just runs it a bunch.
F: John's joke, which I took as a real suggestion and loved, was to just remove retest. I wonder if a variant of that might actually work, which is that you can only retest so many times.
L: It's very hard, though. A lot of the multi-cluster flakes are due to major issues with multi-cluster that one person can't reasonably just fix, and the alternative would be turning off all the multi-cluster tests entirely. So we're just kind of stuck in a medium-term crappy state where the multi-cluster tests are kind of flaky. Okay.
C: Yep, we'll save the retest marketplace idea for later.