From YouTube: Kubernetes SIG Testing - 2020-12-01
A
Hey everybody, today is Tuesday, December 1st. This is the Kubernetes SIG Testing bi-weekly meeting. I am your host, Aaron of SIG Beard. This meeting adheres to the Kubernetes code of conduct, which basically means we're going to be our very best selves and be kind to each other.
B
It's just a thought that I think the two of us had, but I don't know if we've put in an action item or an issue for it, or if we want to just park this for now. I might try and dig it out.
A
Sure, I don't have it handy right now, but yeah, this is something we're pretty interested in. I feel like there's a lot of stuff in the test-infra repo that it would be helpful to try and separate out. So we've talked about the idea of, for example, trying to move prow into its own repo, or the other thought was moving config files into their own repo.
A
Some of those things, I think, are kind of tangled up with the way that prow auto-deploys itself, but I'm just trying to describe the challenge, and why it hasn't been done already. That's not to say it can't be; I think anybody here, if they were willing to, could draft a plan on how to do that migration, and we'd be supportive of moving forward with it.
A
Okay, so I'll put something in the notes to see if I can dig up that issue for you to take a look at, and maybe we can talk about it in terms of plans for next year, next time. Okay, the next item was looking for advice or thoughts or comments on modifying the e2e test binary. Is Wilson here?
C
Yes, present. Hey, so hi folks, I'm Wilson. I'm looking to add a flag to the e2e.test binary which would improve the user experience of overriding registries for e2e tests.
C
The idea is that today we have an environment variable, KUBE_TEST_REPO_LIST, to override image registries for people testing or running conformance tests in air-gapped environments and custom registries, and the way we can know how to build the config file for that is by looking at the code. I'd like to propose that there is a way for us to override it and provide a valid configuration without looking at the code.
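(For context, a rough sketch of the kind of override file being discussed. KUBE_TEST_REPO_LIST points at a file of registry names; the authoritative field names live in the e2e image manifest code in kubernetes/kubernetes, so the keys and the mirror hostname below are illustrative assumptions only.)

    # Sketch: generate a KUBE_TEST_REPO_LIST override file pointing every
    # registry at a local mirror. Field names are illustrative; check the
    # registry list struct under test/utils/image for the real ones.
    import yaml

    MIRROR = "registry.airgap.example.com:5000"  # hypothetical local registry

    repo_list = {
        "dockerLibraryRegistry": f"{MIRROR}/library",
        "e2eRegistry": f"{MIRROR}/kubernetes-e2e-test-images",
        "gcRegistry": f"{MIRROR}/google-containers",
        "sampleRegistry": f"{MIRROR}/google-samples",
    }

    with open("repo-list.yaml", "w") as f:
        yaml.safe_dump(repo_list, f)

    # then: export KUBE_TEST_REPO_LIST=$PWD/repo-list.yaml before running e2e.test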
C
Oh, the k8s.gcr.io case, yes. Yeah, so I think there is an effort, or a proposal, to move the registries that are being used for e2e testing to be consolidated in k8s.gcr.io. I linked the issue in my comment, which I think I can paste in the chat as well.
C
Sorry, yeah. So what we do is we have to pull down the images from Docker Hub, or any other registries that have the images being used for the e2e tests, re-push them to a registry that's typically local, just a registry that's accessible by the air-gapped cluster, and then configure the tests to use that local registry.
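(A minimal sketch of that mirroring step, assuming the docker CLI is available and the needed image list has already been dumped to a file, one reference per line; the mirror hostname is hypothetical.)

    import subprocess

    MIRROR = "registry.airgap.example.com:5000"  # hypothetical local registry

    def mirror(image: str) -> None:
        # e.g. k8s.gcr.io/e2e-test-images/busybox:1.29 -> <MIRROR>/e2e-test-images/busybox:1.29
        # (bare Docker Hub references like "busybox:1.29" would need extra handling)
        _, _, path = image.partition("/")
        target = f"{MIRROR}/{path}"
        subprocess.run(["docker", "pull", image], check=True)
        subprocess.run(["docker", "tag", image, target], check=True)
        subprocess.run(["docker", "push", target], check=True)

    with open("images.txt") as f:  # one image reference per line
        for line in f:
            if line.strip():
                mirror(line.strip())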
A
Yeah, sorry for taking so long to first respond on the issue. I started trying to investigate what we do with images today. Basically, I feel like folks from VMware got really interested in air-gap testing a number of years ago. I didn't review the particular piece of code that added this registry configuration to the e2e tests, but it seemed like that's likely where it came from, and I thought that was sort of sufficient.
A
So when it comes to starting to build scaffolding to smooth out that workflow, like I was trying to express in the... sorry, let me back up. I guess I had a couple of concerns. Number one: changing the e2e test binary so that it doesn't run tests. I wasn't sure if we had prior art for that, and we do: there's a version flag, and there's also a list-conformance-tests flag, which causes the binary to print stuff out and then not actually run tests.
A
So it's like, okay, we've got prior art. There's also one that lists all the images that are used, and I was really hoping we could just use the list of images, and that would be sufficient for you to do air-gapping, because then you know which images you have to put into your air-gapped registries.
A
But the problem is we have these weird variable names that correspond to those registries, and in my ideal world we'd just have a list of registries to search and replace, so, like, change k8s.gcr.io to my-awesome-local-registry or whatever. So that's kind of my...
A
My only nit is, like Wilson was saying, looking at the Docker library images that we depend on, we're going to want to move those to a different registry, and there are a couple of other gcr.io registries that we want to move to k8s.gcr.io, and so that makes me start thinking about, do we need to version this registry list? So I just want to be mindful of attempting to create any kind of API around the e2e test thing, right?
A
So if we think of it as just registries that you have to search and replace, that feels a lot easier. So I was trying to propose a way where we could probably support both the old variable names, which you have to inspect the code to know, and the registries, which you can just get by looking at the existing list of images, because as we add images for new tests you're going to have to look at that list of images anyway to figure out what you need to mirror.
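(A rough illustration of that search-and-replace idea, under the assumption that the image list has been dumped to a file; deriving the registry hosts from the list itself is the point being made, and the mirror name is hypothetical.)

    MIRROR = "my-awesome-local-registry.example.com"  # hypothetical

    def registries(images):
        # the first path component of each reference is the registry host
        return sorted({img.split("/", 1)[0] for img in images if "/" in img})

    def rewrite(image, mapping):
        host, _, rest = image.partition("/")
        return f"{mapping.get(host, host)}/{rest}" if rest else image

    images = [l.strip() for l in open("images.txt") if l.strip()]
    mapping = {host: MIRROR for host in registries(images)}  # e.g. k8s.gcr.io -> MIRROR
    print([rewrite(img, mapping) for img in images[:3]])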
A
It also sent me down this long rabbit hole. Apparently our storage e2e tests do some crazy stuff where they load in manifests and then dynamically patch the images in the manifests on the fly based on the registry names, so that was fun. Anyway, sorry, my TL;DR is I'm supportive of it. I dropped an issue in the comments describing why, and I see your follow-up questions, Wilson, so I'll respond there.
A
Okay, yeah, thanks for bringing that up. Okay, next up, I wanted to hand off to Grant. Grant, I assume you might need to share your screen.
F
If everyone can see this, it should just be this metrics issue. Over the Thanksgiving break I took all of, or a good portion of, the metrics in our metrics subdirectory, or the queries, and added them as data sources in Data Studio, and then just put together a little sample report for things like our commit consistency.
F
This is a little bit more human-readable, and I just wanted some people's input on the issue of what metrics they might want to see, what might be useful, and where this should live. It's embeddable, and I was thinking that maybe triage, like a new page added to triage, might be a good spot for it. I was hoping I could embed it in just the metrics readme itself, but it seems like GitHub doesn't support iframes inside the markdown, but it does automatically...
B
Yeah, I'd be keen to help you out on that if you wanted, but...
F
I'm not sure. I was assuming, since it's embeddable, that it can be available publicly without any access rights, but I can test it inside an incognito window or something.
B
I've been talking with Grant on this in the past day or two, and one of the questions I have is whether or not I could be given access to the BigQuery console in order to run queries, because that's something I might be interested in doing for other work and for CI-signal-related work.
A
So you can do that today, yeah. I did this with my personal account before I worked for Google: I just signed up for Google Cloud or something, but I never actually set up any billing information, so that I could use the free tier. You're able to query up to, I think it's up to a terabyte of data, from Gubernator, or from BigQuery, without having to pay anything.
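(A minimal sketch of that kind of free-tier query using the BigQuery Python client. The table name is the one the test-infra metrics queries point at as best I recall, and the column names are assumptions; the metrics directory in test-infra has the authoritative queries.)

    from google.cloud import bigquery

    # any GCP project you own works for running the query; you are billed (or
    # free-tiered) for the bytes scanned, not for owning the data
    client = bigquery.Client(project="my-personal-project")  # hypothetical project

    sql = """
    SELECT job,
           COUNTIF(result = 'SUCCESS') / COUNT(*) AS pass_rate,
           COUNT(*) AS runs
    FROM `k8s-gubernator.build.all`
    WHERE started > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
    GROUP BY job
    ORDER BY runs DESC
    LIMIT 20
    """
    for row in client.query(sql):
        print(row.job, round(row.pass_rate, 3), row.runs)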
B
Okay, yeah, okay, so alrighty. Right, I didn't even think about cost, but yeah. I might have tried to do that, but there could be user error there, so... but I was getting permission errors when I tried to run queries in the query editor.
A
We'd have to add you specifically to that project, so, like, I would like to see us migrate that data set to sort of the community-owned Google Cloud organization, where we've added tons of people to be able to do stuff.
A
I would really love for people to get curious and explore that data, because I think there's a lot of useful stuff there. This used to be available via a thing called Velodrome, which was basically a Grafana dashboard that did not query BigQuery directly, but queried something else that the BigQuery results were periodically dumped into.
A
So that's where my question of anonymous access came from, because we used to have anonymous access there. The other thought I have, I guess, would be the ability to either change time ranges on the metrics dashboard, or have certain dashboards that are updated more frequently, I guess, because the way I personally used Velodrome, I don't know if anybody else did, was...
A
I want to be able to surface the issues that should most urgently be dealt with, like top flakes and stuff, or tests that suddenly start... you know, their duration suddenly changes, or their failure rate suddenly changes, or something like that. Today people go to testgrid to look for that sort of stuff, which is updated like every 15 or 20 minutes, it kind of depends on how frequently the updater runs, and I would love to see the dashboards generated from those BigQuery metrics update at a similar cadence.
F
That's definitely possible. I think we can go down to even a 30-minute update cycle, and if I pipe through the finish times of jobs in the queries, then we can actually filter with a little calendar.
A
Cool, thank you for sharing that with us, Grant. I'm super excited to see all that data visible again.
A
So, somebody here has a question about what can be done to get image pushes happening again.
G
Whoever has the question, I have the answer: there's a pull request I've placed in chat. I think when we merge that, it should be okay for the most part. Recently Antonio submitted a pull request which basically split the single job into individual per-image jobs.
G
So it's a lot more atomic in that sense. Even if one image fails, the others should still function properly, which is great. And the PR that I linked, basically we need that because, in order for docker buildx to properly build multi-arch images, it has to call that register.sh with the persistent flag. It really has to do that, and we've included the qemu and binfmt binaries into the gcb-docker-gcloud image, which is being used by Google Cloud Build to build the test images.
G
So now, when you're actually using that image to build all the images, it has all the necessary dependencies and requirements to build all of them, including the Windows ones, including arm images, s390x, and so on and so forth. I've tried it multiple times, even on tabula rasa systems with nothing on them after reboots, so I had no lingering cpu flags set, and it worked for me.
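(A rough sketch of the two steps being described: registering the qemu binfmt handlers with the persistent flag, then driving docker buildx. The registration image, platform list, and image tag below are illustrative assumptions, not the exact commands from the PR.)

    import subprocess

    def run(*cmd):
        subprocess.run(list(cmd), check=True)

    # register qemu binfmt handlers persistently ("-p yes" is the persistent flag)
    run("docker", "run", "--rm", "--privileged",
        "multiarch/qemu-user-static", "--reset", "-p", "yes")

    # build and push a multi-arch image with buildx
    run("docker", "buildx", "create", "--use")
    run("docker", "buildx", "build",
        "--platform", "linux/amd64,linux/arm64,linux/s390x",
        "--tag", "example.com/staging/test-image:latest",  # hypothetical tag
        "--push", ".")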
G
If you really want to, you can test it yourself in the pull request. I also have a paste with that, and the first line is the command I use to build the image.
G
But after that, I think we'll have to trigger the image jobs individually, because before, it used to be for the entire folder of the kubernetes test images, and after this we'll have to either get someone to trigger those jobs or commit something to those images to get the post-submits automatically running afterwards.
A
Yes, I have a separate issue for migrating all of the images that are being built via that in GCB. I will paste it in chat. So, essentially, my short answer is: can we wait until the release is cut?
A
We passed test freeze, and so at the moment we should really be looking to either fix broken tests, fix regressions that have been introduced, or revert stuff. I appreciate that Rob is here on behalf of CI signal; I don't think we should be changing the images that are used for tests at this point.
A
The current release schedule that I'm aware of is that we plan on cutting the release Tuesday of next week, and we're starting to get to the point where people are going to want to see continuous passes against the same SHA of kubernetes, to make sure that it's stable and we're not seeing flakes and stuff. I'm trying to keep in mind that the scale tests we run take a while, and so, while in theory this shouldn't break anything, because it's just changing how stuff gets built...
H
I actually think a little differently. I think we should just not bump the images we use in tests, but I think we should get the fix for images being able to build in. I think we're going to wind up wanting to cherry-pick that back anyhow, and people have been, you know, sitting on their hands waiting for this.
H
I do think it makes sense, perhaps, to not change what images we run in tests this late. But if the release team is comfortable with it, I think we should ship the fixes for just being able to build them.
A
Right. So I also acknowledge that I'm just one voice here; I really think it should be the release team's call. So I think this is maybe a discussion best had in the sig-release slack channel to work out.
A
So
if
the
release
team's
cool
with
it
like
and
we're
all
cautious
about,
you
know
what
we
what
we
push
and
we
make
very
sure
we
revert
stuff
if
it's
broken
and
we
respect
when
the
release
team
decides,
they
really
don't
want
to
see
anything
else,
land
and
kubernetes
sure,
because
I
do
know
a
lot
of
stuff
we
want
to
do
here.
I
I was wondering, part of my question is, we had made some changes to an image that was required for a test of ours, and we had it where this image wasn't pushed, and I wasn't sure of the process, the normal process, for triggering it when it failed, and I got different suggestions depending on the group that I popped into. I did get a suggestion from testing-ops to create a PR that changed a file at the right level in order to trigger it, and that didn't feel quite like the right approach, and it was also mentioned by someone else that it maybe wasn't the right approach, and then I was also asked to trigger it by asking the infra on-call to manually trigger the job. So my question... I'm totally cool with waiting until test freeze.
I
This isn't a major thing; it just did block us from releasing more coverage this release, but I would love to know what the best practice is. As we start to add supporting test infrastructure to our images, when we need to bump them next release, was this just a one-off and we don't have to worry about it too much, or is there a process going forward where we would need to do something? I'm not sure what that is.
A
Let me think about this. So there's the prow configuration file that says, like, build from master or build from a release branch, and then, when prow turns that configuration into a ProwJob CRD, it resolves that to a hash, or sorry, a SHA, and then that is what is used to run the job. So if you want to re-trigger the job by clicking the rerun button in prow's UI, what that does is basically just say: hey, I see this CRD.
A
Could you please create another ProwJob for it? So it uses the same SHA; it doesn't try to refresh it, like, hey, this says it's using master, so I should go and use head of master. That also depends upon how long that ProwJob CRD stays around in that kubernetes cluster, which I think times out within 48 hours at the moment; it's sort of dependent on how much traffic prow is getting and stuff, and different prow installations have it tuned differently.
A
So I'm open to ideas on how to make that process better. We could change the lifecycle of the CRD, but that kind of requires tuning based on the performance and traffic that the cluster handles.
A
But I think we as a community would really appreciate that ability, because we are getting a lot more dependent upon post-submits, and there's kind of the broader question of making sure post-submits are reliable and flake-free. We've already got fejta-bot set up to go and just spam retest on PRs that have flaky jobs, and it just keeps rerunning the pre-submits until they pass.
I
One thing that I had difficulty connecting the dots with on post-submits is where to find them, sort of, and who has permission to see the retry, and also how, because finding the particular post-submit that failed and seeing the job logs wasn't super obvious to us. That may be a documentation thing, or maybe we're looking in the wrong area, but post-submit failures weren't obvious, and definitely when you're talking about prow, or our friend's bot, coming back and checking to retry the pre-submits.
A
Yeah, so I'm going to search for stuff while we talk, but I just want to comment on that real quick. So there is an open issue somewhere in test-infra that talks about, could we make post-submit reporting better? You know how, when you update a job config and you merge that PR, you'll see a comment from prow that's like: hey, I deployed these YAML files, which equate to that job.
A
What we have right now is that post-submits will post their status to the commit status context; the catch is they post it to the merge commit.
A
So if I open up a PR that changes an image and then I merge that, the merge commit is what will eventually have a little green check mark or a little red X, depending on whether the post-submit passed.
J
Oh sorry, one additional comment on this topic of post-submits, and actually also periodic reporting, because they have the same problem. What we've been doing in OpenShift is we've been using a Slack reporter that will basically, for a given set of jobs, report to a Slack channel of ours if they fail, because for successes we don't really care; there's nothing to do.
A
We also have the jobs set up to email alerts to an email address. If you take a look at the job config, I think it's kubernetes sig-testing-alerts that gets email alerts for any of the test image jobs that fail, and so you could get alerts that way too, instead of Slack.
G
I think I suggested some time ago that we could maybe add some commands to rebuild a certain image, and that would allow a group of people to use that command to manually trigger the rebuild for an image.
A
Yeah, so I don't know. I personally would be less interested in solving this problem specifically for image-building jobs, and I try to look at it at the abstraction level of, how do I trigger a periodic or a post-submit on demand? I guess I don't have an idea off the top of my head there, but I'm not sure adding a /build command would make a lot of sense.
A
We
already
have
the
trigger
plug-in,
which
is
what
you
can
use
to
slash
test
whatever,
and
it's
a
pre-submit
job
name.
I
guess
we
could.
We
could
talk
about
looking
into
just
adding
the
ability
to
trigger
post,
submits
or
ci
jobs
from
that,
but
I'm
sure
we'd
want
to
take
a
good
long.
Look
at
how
we
authorize
that
stuff.
A
I
don't
know
alvaro
how
does
openshift
handle
like
letting
people
trigger
periodics
and
post
submits
arbitrarily.
J
Yeah, the same, also via this auth group setting you can have on the jobs. But the issue we also have is that often people, especially for post-submits, want the latest version, because it's not only the SHA, it's actually also the job configuration itself; because of that, a change to it also won't get picked up.
A
Yeah, so I think there are a lot of great ideas here. If you want to open up issues so we can keep track of them, I think this is something anybody could work on, and that would be appreciated by everybody. Oh, that would be super cool. I'll put together a ticket, at least for discussion, on some approaches for on-demand periodic and post-submit jobs, because it does impact a lot of the areas that we spend a lot of time in.
A
Okay, Sean, I think I saw you pop up. Did you want to talk about pre-submits in testgrid?
K
All right, one of my... isn't working, I'll deal with that later. Yes, I want to talk about pre-submits in testgrid.
K
There is a centralized issue for putting pre-submits in testgrid for almost all repos and modules, and, with the exception of a short allow-list, this is enforced with our testgrid repo-specific testing. Some users of testgrid like to have their pre-submits up, and some think it's noisy and flaky and they don't want it. Aaron commented on this thread kind of explaining why we would want to have pre-submits visible on testgrid.
A
So, not to put you on the spot, Rob, but how does CI signal keep track of how a pre-submit is behaving?
A
The verify job failed on this PR; how do you answer the question, is this failing on a bunch of PRs? Did it get way more flaky all of a sudden?
B
Well, in terms of our focus on CI signal, I'm not really looking at... well, I'm looking at verify on master, which is generally pretty good; it hasn't come across my desk in a long time, and that's the only one that I'm looking at, plus master-blocking, and then if I look at master-informing...
A
Right, but my point is those are all periodic jobs; they run on whatever is in master or in a release branch.
B
Do we keep track of the merge-blocking jobs? Not really, no, because our focus is master-blocking and master-informing, those two dashboards.
B
The expectation is that we're dealing with what happens after the merge, in terms of monitoring the signal for the release. Does that make sense?
A
Okay, yeah. So who keeps track of whether the merge-blocking jobs are passing or failing?
H
I mean, do you mean like failure rate or something? It's a little hard; I've tried to monitor that in the past. It's a little hard to poke around for, like, PRs that just don't build, and sometimes people are spamming PRs that don't build, and then the failure rate goes through the roof and doesn't really mean anything. So the heat map on prow.k8s.io is pretty helpful, because you can visually filter out really short runs, but I don't think there's any hard tool for how you would monitor the health.
H
So I look at the pass rate versus some of the other pre-submits that are similar, as a control, and I look at whether I can visually see a cluster of failures.
H
That is, over some minimum run time, and maybe spot-check the quick failures to make sure that they are just build fails. And I do this just manually; it varies. Sometimes, when things have been pretty stable, I'm not checking as often, but if we've had a rougher patch, like we have in the past, it might be something I do daily or more than daily. Kind of like CI signal, but just for what I'm working on.
A
Yeah, I'll just share my personal experience, but I recognize I'm just one person. I feel like there are a few heroes who are in charge of kind of noticing that certain pre-submits started failing or started flaking significantly more in kubernetes, and I think the general process I have seen is that people who review a lot of PRs start to notice that the same job is failing across multiple PRs, and so they start to ask the question, hey, has anybody filed an issue that this job is failing or not? It can take a couple of hours before we as a community notice that one of the merge-blocking jobs is failing or flaking a lot, and then it usually falls to Ben, or sometimes me, or Dan from CI signal, or often Liggitt, to go figure out...
A
...what introduced the regression or what broke things. Oftentimes there are questions like, hey, did something in the test-infra change, was there some environmental issue that caused this job to start failing? Generally in those scenarios you want to start looking at, okay, can we pinpoint when this job, over time, started to fail or started to look a lot worse? And so, for jobs that do report to testgrid...
A
I can just go to testgrid and take a look at that view, but for jobs that don't, I have to know that I can also go to prow.k8s.io, search for the pre-submit job name, click onto one of those runs, then click on job history, and then start to scroll.
H
We have one for blocking kubernetes pre-submits. I found it's not very useful, for two reasons. One, on testgrid it's harder to filter out the noise of, oh, this is just a failed build from code that doesn't build; you're going to need to click through to the logs anyhow, I guess, because you can't see the overall time on the main grid view. And the other one is that testgrid...
H
...just doesn't perform super well when you have a ton of data, so a lot of the pre-submits have too much running. I've even run into problems with that, where we've taken down the front end from trying to load some of the pre-submit dashboard pages.
B
Yeah, not to derail the discussion, we can park this if we want, but one of the things I'd like to see through the CI infrastructure is to make the distinction between build and test a bit more clear, so that we would have some shot at maybe pruning that data and making those distinctions, so that we're not hammering the front end, as it were.
A
So, just to finish my thought there, my two cents: it's all good, I don't want to be the only one speaking, I really do appreciate it, I just want to... So my principal concern is that we force all post-submits and periodics to report to testgrid, but we don't do that for pre-submits.
A
So
at
the
moment,
people
need
to
know
that
there's
one
way
like
they
can
go
to
test
grade
for
all
their
post
submits
and
periodics,
but
they
have
to
go
this
other
way
or
some
pre-submits,
depending
on
whether
or
not
they've
been
added
to
that
period.
A
That
was
just
my
thing.
It's
like
I!
This
is
from
years
ago.
I
added
this
test
because
I
was
like
well.
I
don't
want
to
solve
this
problem
for
all
like.
I
don't
want
to
be
the
guy
who
goes
and
changes
everybody
else's
jobs
for
them,
so
instead
I'll
create
like
an
exemption
list
of
like
well.
H
I found it really hard as well to get testgrid alerting working well for pre-submits, because of the fact that you're going to see passes and fails just from PR noise. But I do think that something like Grant's work on the metrics dashboard is where we might be able to say, okay, let's have people other than the heroes looking at how reliable our pre-submits are, and have actual metrics instead of trying to squint at a job or something.
K
Yeah, I can't really recommend setting up alerts for pre-submits in general, which is, I guess, kind of where my confusion about adding them to testgrid came from. But adding them to testgrid doesn't necessarily require an alert; it doesn't set up an alert for you, it doesn't set up stale alerting or any of those kinds of things that wouldn't necessarily make sense in a pre-submit context. It just kind of displays them in the testgrid format for up to 15 days back, or, I guess, until it runs out of memory.
K
So from that perspective, I can see kind of a ContribEx reason for adding them: if you add the pre-submits, and have a test that requires them to be added to testgrid, then you can just say, oh, every prow job is on testgrid, they're all there.
K
So you can go to testgrid and look at a history; there's another place you can look to also get a job's run history. So maybe this is also a question for ContribEx as well as testing, because I think that's kind of the crux of it.
H
And not all jobs are going to flake, and for jobs that don't, it's not actually significantly more useful than just the native display, where you can search for the job name. Probably; actually, I don't know. Oh, I do also want to add a quick comment: in a perfect world, I'd actually have alerting enabled for pre-submits, to me personally, because I would like to know, and that's a little bit less... to just be like...
H
But if that was a thing, I would absolutely use it, and I'm trying to use it as is. Not surprisingly, it's not very helpful. Tuning up the number of failures helps a bit to filter out the noise, because you can have an alert for when a job just starts failing continuously, which does happen occasionally, like with an infrastructure outage or something. But you need to turn it up pretty high, so it takes a while to notice.
A
I want to be cognizant of the fact that we're five minutes over time, and I appreciate everybody spending the extra time with us. I don't know if we actually came to a resolution on that last thing; like I said, people feel like it's cluttering things up, and there is a cost to it.
A
Anyway, thank you all for showing up.
A
I just want to throw out there the idea that we'll probably be meeting again in two weeks, but then I think the next meeting after that will likely fall really close to the new year, when I suspect a number of people will be unavailable, so we might want to consider canceling that. So I was thinking either the next meeting, or the first meeting that we have in the new year, we could kind of get together and talk about what our goals are for the year.
A
What are some of the bigger boulders we want to move forward? Just a thought, and I'll send this out to the mailing list too. So, thank you all for your time. I hope you all have...