From YouTube: Kubernetes SIG Testing - 2020-09-22
A: The only thing I had on the agenda to discuss today was to review where we're at with the Kubernetes CI policy work that we talked about a while ago, and then we don't really have much else. So I'm happy to talk about whatever folks want to talk about.
B: …

A: Or shut it down early, if we have nothing else to discuss. So I guess I'm going to try and share my screen. Let's see if I can do that: share an application... I will share that.
Okay, so apparently I cannot see video while I'm doing this; I'm toying around with using the Chrome app so that I can use this on my work computer. But this is the project board for all of the CI policy things. I'm going to see if I can find my way to the umbrella issue, to remind everybody what we're talking about and why.

There was a "Policies to Improve Kubernetes CI" proposal that we discussed here earlier, and it basically came about because things seemed really bad during the 1.19 code freeze. It seemed like that was a combination of resource exhaustion and also just tests being really flaky in general, which the resource exhaustion was making worse.
As a result, this was one of the reasons that the v1.19 code freeze was left in place all the way until 1.19 was released, and it is the reason that the 1.20 milestone restriction was also left in place: so that the gigantic backlog of PRs that were ready to merge into 1.20 could actually merge piece by piece, with people adding PRs into Tide's pool and ensuring that they could all merge.
So we as a community moved to making sure that all of the merge-blocking and release-blocking jobs for kubernetes declared their resource limits and their resource requests explicitly.
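A Prow job's pod spec is a regular Kubernetes pod spec, so the declaration in question is the standard requests/limits stanza on the job's container. A minimal sketch, with an illustrative job name and made-up values rather than any real job's config:

```yaml
presubmits:
  kubernetes/kubernetes:
  - name: pull-kubernetes-example      # illustrative name, not a real job
    spec:
      containers:
      - image: gcr.io/k8s-testimages/kubekins-e2e:latest
        resources:
          requests:        # what the scheduler reserves for the job
            cpu: "4"
            memory: 8Gi
          limits:          # hard cap so one job cannot starve its node
            cpu: "4"
            memory: 8Gi
```

Without requests, the scheduler treats a job's pod as nearly free when packing nodes, which is part of how the resource exhaustion happened.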
Those are both still in progress. The reason they're still in progress, as far as I know, is that the bazel build-and-test jobs and the build jobs have not been migrated over. The reason for this is... I really wish I had this link somewhere. Maybe I do. I do not; I'm going to find it. But basically, these jobs write to a Google Cloud Storage bucket called kubernetes-release or kubernetes-release-dev, and that bucket lives in a project called...
Here we go: the bucket lives in a project called google-containers. That project has a restriction that only google.com accounts can write to this bucket, and part of what we've been doing as we move the jobs over is moving them to a community-owned cluster that lives in the kubernetes.io Google Cloud organization. So I've gotten as far as proposing that we use alternate buckets. Boy, there's really just no comment here that cleanly describes this.
Nope. Anyway, basically release engineering needs to help us out here, and they have not had the time, because they were heads-down on 1.19. Now that we're early in the 1.20 cycle, I would like to see us push for this.
This is going to require a slow migration of jobs from one bucket to the other. It may involve setting up a canary job that pushes to the community-owned buckets in parallel, and then gradually we can update job definitions to consume from the new community-owned bucket instead of the old Google-owned bucket.
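A hedged sketch of what that canary arrangement could look like; the flag, the second job, and the community bucket path here are illustrative assumptions, not real job definitions:

```yaml
periodics:
- name: ci-kubernetes-build            # existing job, unchanged: still stages
  spec:                                # artifacts to the Google-owned bucket
    containers:
    - args:
      - --stage=gs://kubernetes-release-dev/ci
- name: ci-kubernetes-build-canary     # hypothetical canary publishing the
  spec:                                # same artifacts to a community bucket
    containers:
    - args:
      - --stage=gs://k8s-release-dev/ci
```

Once the canary looks healthy, consuming jobs can be pointed at the community bucket one at a time, and the Google-owned path retired.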
So, as far as I know, that's the only thing that's really holding up all of the build jobs. We were also experiencing some fun in migrating the...
So the situation with merge-blocking jobs is much better. We have many of them migrated over. The ones that are not... let's see. Well, this is a lot that's actually been migrated over. This gRPC job, if I remember correctly, is an example of why it's difficult to mess with a job if it's already kind of flaky and failing: we basically found that the job intermittently fails or times out for whatever reason.
So if you're just looking at... right, this is all of the runs of the job over time; higher up is later, lower down is earlier. It's kind of tough for me to tell when something changed related to this job, and whether things got better. I mean, okay, there's a big block of red down here, so maybe if I go earlier... okay, there was a lot of failure, but I think that was related to a job configuration issue.
The other complications are related to bazel. Bazel in the google.com cluster uses a feature called RBE, or Remote Build Execution, where it ships the actual bazel tasks and such off to some other pool of machines.
The way we're simulating similar resource usage on the build cluster is to give this job seven CPUs, which is a lot of CPU. All this job is doing is running unit tests twice... well, this part was three times, and this part is twice. This was also what pushed us to notice that, back here, in these earlier runs, the job was running unit tests three times, and if it passed one of those three times, it considered the test passed. When we stopped doing that and instead required the tests to pass three times in a row, we uncovered that our unit tests are also flaky, just like our end-to-end tests.
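For reference, those two behaviors correspond to two different Bazel flags; a sketch of the distinction, inside an illustrative (not literal) job definition:

```yaml
presubmits:
  kubernetes/kubernetes:
  - name: pull-kubernetes-bazel-test   # illustrative name
    spec:
      containers:
      - command:
        - bazel
        - test
        # Old behavior: retry a failing test up to 3 times and report it as
        # passed if any single attempt passes, which hides flaky tests:
        #   --flaky_test_attempts=3
        # New behavior: run every test 3 times and require all runs to pass,
        # which surfaces the flakes instead:
        - --runs_per_test=3
        - //...
```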
We've made great progress on deflaking that, but things are still a little flakier than before, because we've actually uncovered that our unit tests are flaky. So this is a similar situation where it's tough to tell whether things were appropriately green beforehand, and whether we're making the situation worse or better, or not having that much of an impact. But that's a good thing, and it's the same story with the bazel build jobs, with the Google Cloud Storage bucket issue and RBE.
So those top three were considered the most urgent and important things to get done. Then there's a bunch of other stuff that I feel like we could get community help with, if anybody is interested. We kind of decided that once we had taken care of the permanently critical jobs, we were... and I am not actually sure how I'm going to get out of screen share here. This is great. Let me stop sharing just so I can see your faces for a second. That's better.
I feel weird talking at what looks like just a bunch of windows. So, that was taking care of the critical jobs. The next part is to ask everybody: hey, since you're getting the benefit of using community resources to run CI, please make sure your job actually passes, and please make sure you're actually declaring your resources. And so what we need help with is identifying and implementing a policy that says: hey, if your job has been failing past some arbitrary threshold...
Let's say: if your job has been failing for four weeks in a row, you probably shouldn't be running that job; that's probably a waste of community resources. Luckily, we do have queries to identify how long jobs have been continuously failing.
So we could use those queries to identify the list of jobs we could go after as humans. The trouble is, we may not necessarily know who to contact to say: hey, your job is going to get deleted. And so there's another issue about that.
I'm pretty sure that if somebody was handy with YAML or jq or yq or something, they could come up with a report that makes a best guess at assigning everything, and then we could probably blast out a notification to kubernetes-dev and tell people: look, your SIG is getting this job; if you don't want it, or you don't agree with that, please comment on this PR.
We already have tests in place to enforce that jobs on the release-master-blocking dashboard have contact info, so it would just be a matter of changing those tests to look at everything. And I believe that issue is tagged as help wanted on our board.
C: …

A: I mean, when we were back in August or July and things were really bad, I would have been more inclined to be draconian about this and say everything's getting shut down, and then let's see who comes screaming out of the woodwork to claim their jobs. I think things are not quite as critical as that. So, in order to be a little more collaborative or cooperative, I feel like it's better to give people a heads-up and say: look, we're making these changes...
But I also kind of feel like recovering things from deletion is what source control is for, and as people are searching around trying to find what jobs are doing what, it's better to just delete jobs if we're not actively using them. But I agree with your point that if people don't claim their jobs past a certain deadline, we're just going to delete them.
C: …

A: The failing-jobs thing, I think we wrote that up into three parts. The first part is to take a guess at who owns which jobs, and then to look at the really egregiously, permanently failing jobs. I think everybody's eyes popped out of their heads when I talked about how we had a job that's been failing for 700 days in a row, or something like that.
So there is an issue, if anybody wants to go take a look at the list of jobs that are failing and delete the really permanently failing ones. That sounds great. We should give a shout to kubernetes-dev to give people a chance to claim these and say: no, please don't delete that. And then we'll know who owns it.
But yes, I am a fan of: let's stop the bleeding. It's pretty obvious that there are some jobs that are wasting resources and will never be paid attention to, but there are a lot of other jobs where maybe it's less clear whether they're necessary or not. I think that having humans sweep through and do this is great and all, but I think the policy is really only going to be adhered to if there's automated enforcement of it.
So the closer we could get to having a bot do this, the better: with the contact info encoded in the job, and since the bot sort of knows how to query which jobs have been permanently failing, it can put two and two together, notify that email address, and say: hey, caution, if the job has been failing for more than 100 days, or more than four weeks, I'm going to open a PR to delete it.
C: …

A: Emails from Prow? At the moment we do not have Prow itself set up to send out emails. We do have... there is a Slack reporter, which I believe can be configured to send out alerts to different channels based on the repo that those jobs are configured for.
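A hedged sketch of that per-repo wiring in Prow's config; the field names are from memory of the Slack reporter's config format, and the org/repo key, channel, and template are placeholders:

```yaml
slack_reporter_configs:
  "kubernetes/kubernetes":        # placeholder org/repo key
    channel: testing-ops          # placeholder channel name
    job_states_to_report:         # only ping on terminal bad states
    - failure
    - error
    job_types_to_report:
    - periodic
    - postsubmit
    report_template: "Job {{.Spec.Job}} ended in state {{.Status.State}}: {{.Status.URL}}"
```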
So the idea is we would lean on TestGrid sending out... oh, no, I got it, I see what you're saying. Yes, okay, that is very correct: having a bot do this means we need to make sure the bot could actually send an email. Now I see what you're saying; that's very fair.
C: …

A: Yeah, that's really fair. I'm wary of assigning tasks to individuals, just because they may or may not be present. I think, though, we could try using GitHub teams: I think every single SIG is required to have a GitHub team that's named after the SIG, and so we can make an educated guess based on which SIG-named dashboard a test belongs to.
I think that's the intent of leaning on TestGrid to do this sort of stuff. And, you know, TestGrid also provides the capability to send out emails if a job has been continuously failing for more than some threshold; that can be configured on a per-job basis.
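In kubernetes/test-infra, those TestGrid settings ride along on the job definition itself as annotations. A minimal sketch, with a placeholder job name and address:

```yaml
periodics:
- name: ci-example-job                        # placeholder name
  annotations:
    testgrid-dashboards: sig-example          # dashboard(s) that show this job
    testgrid-alert-email: sig-example-alerts@example.com
    testgrid-num-failures-to-alert: "3"       # email after 3 straight failures
  spec:
    containers:
    - image: gcr.io/k8s-testimages/kubekins-e2e:latest
```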
The other thing, for what it's worth, that personally bugs me: we all agreed that this was really important and bad, and now there's been discussion in release about lifting the 1.20 milestone restriction, because I guess things are better now. But I really wish we had metrics or graphs or something that showed us, you know, the tests are this flaky, or CI is this bad, and I'm not sure that we have that at the moment. So we have another issue open for trying to identify some metrics we could use, and I'm really open to any and all brainstorming ideas people have on this. I'll go back to talking to windows for a second to share some of the stuff I've got.
I tried to lay out how we were experiencing pain, different ways that we could maybe try indicating that with metrics, and theories on what's causing that pain and whether we could use metrics to show it. Let's see, what other stuff... here's a list of all the possible data sources I am aware of that you can read straight from, or grab metrics from, or something like that.
So, if you're a member of the k8s-infra-prow-viewers group (it's in a YAML file in the kubernetes/k8s.io repo), you are all welcome to pull yourselves in, and you can see this dashboard here. I'll sign in as my personal account just to prove it: this account doesn't even have anything Google; I don't pay any money or anything on this account.
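Pulling yourself in is a PR against the groups file in kubernetes/k8s.io; a sketch of roughly what an entry there looks like (structure from memory; the member address is a placeholder):

```yaml
groups:
- email-id: k8s-infra-prow-viewers@kubernetes.io
  name: k8s-infra-prow-viewers
  description: read-only access to prow build cluster dashboards
  members:
  - you@example.com        # add yourself here in a PR
```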
I just have an email address, and I can still view things, if the dashboard ever renders. I've been trying to take a guess at what sorts of things are helpful for us to understand: you know, is our build cluster healthy, or is it overly resource-constrained?
These top things here show when I/O is getting throttled, either due to CPU usage or network usage. The problem is these all look really noisy and peaky, and none of them are related to actual bad behavior. To show bad behavior, I might have to go a month back.
You can see that the cluster grows elastically, because we have autoscaling set up for it. Here you can see the size of the node pool scaling up or down. We get really peaky: we've gotten as high as having 120 nodes available, all the way down to having something like 27.
Something around 11 a.m. Pacific seems to be a pretty high-traffic time. You can also see it in terms of the different pods that are present in the cluster: jobs all get scheduled as pods in the test-pods namespace, so here are the total number of pods in the cluster.
You can see (the text is probably really small) that kubernetes/kubernetes #94196 has had 37 jobs in the one-hour interval that this graph is sampling over. So somebody probably hit retest a lot, or pushed a lot of commits repeatedly, or something like that. And so, if things start to tip over, or we feel like there's excessive resource consumption, we may be able to find whether there are culprit PRs, or whether there are just a lot of PRs all happening at once.
C: The jobs, when you say somebody pushed commits repeatedly... as far as I know, Prow cancels some jobs when you push again within...
A: It does. It counts the cancelled jobs. That's why, if I go down...
Going more granular, now we're looking at a five-minute interval. There was a period of time where, yeah, supposedly this counted 36 pods. When we start to get down to that super-granular level, I'm less aware of what this really means: whether there were 36 pods running simultaneously, or 36 containers all running concurrently, or whether there were 36 pod resources visible to the cluster in some shape or form. This doesn't necessarily...
It shows... yeah, it just shows all pods. I'm wondering, let's see, can I filter by pod status? Not really. So it could be that I'm seeing a lot of cancelled pods, because Prow will abort pull request jobs if a new commit is pushed, and these sort of just get deleted over time, eventually, by the sinker component of Prow.
The other handy thing on the dashboard is just breaking it down by job type, so we can see whether maybe a configuration has changed for a particular job, or maybe something about that job's resource consumption is causing it to behave oddly. It's basically the scalability jobs, the 100-node pull request jobs; they tend to hang around for a while because they take basically the longest.
I think so. This is one dashboard we can use. My concern is that it's a lot of graphs, and I'm trying to sit here and read the tea leaves in front of y'all. I'm not confident enough in any one of these things to say: let's try and set up alerts on this. I feel like this is more just giving humans something else to look at, which they should; I'm glad we have another way to look at it.
It's got three things to group by, sorry. So this dashboard here shows the Prow job resources that Plank is acting on, and you can group by three different dimensions: in this case the dashboard is set up to group by the Prow job type, the state of the Prow job, and which cluster the Prow job is scheduled to. Right now I've got it set to only look at Prow jobs that hang off of the kubernetes repo.
So, as a result, I can see that these were the before times. This is when everything was really bad: this is when we were running way too many jobs, period, probably, and certainly way too many jobs for the resources we had available for them. Just from the sheer shape of it, you can see that the number of presubmit jobs is way, way larger than anything else.
We discovered during the bad times that Prow jobs that couldn't work due to resource issues ended up in the error state. So if I look at the Prow jobs by state and click on the error graph, I can see that, yes, we had a really bad time during code freeze, and yes, things have gotten substantially better since we switched Prow jobs to explicitly declaring their resources and moved them over to a cluster that scales appropriately based on those resources.
So here I think I can say things have gotten a lot better. In terms of measuring how far we have to go: this is not broken down by job, so this is all pull request jobs and all periodic jobs. This green line is the number of jobs that run on the google.com default build cluster, and this yellow line is the number of jobs that run on the community-owned k8s-infra Prow build cluster.
You can see that over time we've added a lot more jobs, to the point where we're actually handling more jobs on the community cluster than we are on the google.com build cluster, which is great. But I feel like we're not really going to be fully done until we've migrated this green line all the way down to zero. Anyway, those are the things I have for metrics at the moment. Something else I've been noodling on personally:
You all may remember Velodrome, which was sort of a Grafana instance driven by an InfluxDB instance that we stuffed query results from BigQuery into. I had created merge-blocking and release-blocking job health dashboards there, and we ended up having to kill off Velodrome for security-related reasons, and I haven't had the bandwidth to go back and really work on that.
I've started iterating on just standing up the latest and greatest Grafana on my own system and then setting up a BigQuery data source, to talk to BigQuery directly, to see if I can sort of recreate that. If I get far enough with this, we could investigate whether we could replace Velodrome with it, and see if the security story is any better.
I just haven't quite gotten as far as figuring out whether workload identity works with this, or whether I need to upstream some changes to Grafana to get that added. But I'm showing the job data over the last six months.
And so again we can sort of see that the number of jobs triggered for merge-blocking jobs climbed up substantially over time and then fell down. This is both PR traffic going down and us deciding there were certain jobs that we no longer wanted to execute.
Trying to look at the daily failure rate of jobs looks super noisy, because some jobs go from being perfectly fine to failing 100% of the day, since they don't run that often. But you can sort of see that the failure rate climbed up for all of these jobs, and it's gotten a little more tightly coupled down below. Aside from jobs periodically peaking up and failing 100% of the time, I would characterize these as jobs that aren't triggered that often, so the variability is really high.
I've also been trying to take a look at the duration of the jobs, seeing if that's improved over time as we've migrated jobs over. I'm trying to scroll here. Just looking at the 99th, 75th, and 50th percentile durations, things are kind of getting better. The top of this graph is two hours; nothing should take more than two hours to run, especially if it's a pull request job, and our times have improved. I wanted to call out that the verify job has gotten a lot better.
I forget what his actual name is, but hasheddan, when he was doing CI signal for 1.19, managed to improve the verify job's execution time, and that really shows here, which is really cool. Especially if you start to look at the 75th percentile or the 99th percentile, it's all gotten a lot better ever since he made that change. So I feel like metrics like this might be really helpful. This is me just playing around with other panels. So that's all I've got on that.
You can see that jobs are flaky here, but I can't necessarily see how flaky they are. I could kind of go with these numbers, but then I can't necessarily see how they're changing over time. So it may be difficult to know, when we did some change at time X, what effect we saw on the flakiness, and whether we did the right thing versus the wrong thing.
B: …

C: On another topic: a while back I was talking about how we have a Prow installer in Kyma, and I think Eric saw it again when someone was asking on, I believe, the Prow Slack about how people are installing Prow. I just want to see if there's interest in us pushing this upstream.
The gist of this is basically getting Prow up and running without you touching GCP service accounts yourself. If you follow the normal installation instructions, it's "apply this YAML, create a service account here, apply this, create a service account there"; the installer takes care of that for you, if you give it one service account. So you are literally down to one service account, and the rest is created for you: it creates buckets for you and everything, with the right permissions, and gets your Prow running.
I just wonder (I posted a link in the chat as well) if this is something that anyone is interested in us pushing upstream. We might have to spend some more work on the code itself, but once it's upstream, I guess we'd also have to think about adding the S3 stuff to it, for people that are interested in that.
A: I can't necessarily speak for Eric; he's more on the team that works on Prow. But I'm pretty sure they have a goal of making Prow as easy to install as possible, so I'm sure they would be interested in collaborating with you on this. I don't necessarily think it needs to be one hundred percent perfect before you get it upstream; getting it upstream and iterating on it would probably be acceptable.
The thing that comes to my mind is there is something called tackle, I think.

C: …

A: …
C: Just FYI, we don't use any Bazel, which hopefully makes it easier to understand and install for you as well. So I guess people have to try it out and see if the code is okay. But I don't know if there's already an issue for this, like "make Prow installation easier", somewhere where I could start discussions on it. I'm also not sure what Eric thinks of it, because he saw it, but I'm not sure if he really looked at it.
A: Well, I'll point the team at that, because I would expect Cole to have an opinion as well. There's also just the fact that Google's going through one of its wonderfully internal-focused periods right now, where some Googlers may be less available than others, for just a few weeks more.
But I think this aligns with their goals, if we can get past the lack of Bazel or whatever, because that sounds fine to me. I think my other two questions, from the perspective of things I know Prow does lately, are whether this configures Workload Identity and sets Prow up to use it, and then also whether this uses the sort of automated webhook management feature that Prow has.
C: I think we're not on those yet. It's also a while old, because we set up our test Prow cluster... so we have one Prow cluster and one test Prow cluster that we run stuff with and test our upgrades for our main production cluster. This was, I think, a May or April initiative.
A: In case you can't remember the webhook rotation: what I'm talking about is that somebody accidentally exposed Prow's HMAC token, so we needed to rotate it, and we discovered we needed to rotate it basically everywhere. So we're trying to get to more of a point where you can have Prow automatically install a webhook on a GitHub org or a GitHub repo, and it can also automatically generate new tokens and rotate those out, without you having to worry about that.
B: …

C: …

A: Okay, I think getting that included would probably help a lot. We may still be in a situation where there are parts of Prow that can't function with Workload Identity at the moment. I know we're trying, if at all possible, to stop having a service account key stored in the cluster as a secret, and instead rely entirely on Workload Identity, but we may not be there yet.
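For reference, on GKE the Workload Identity setup that replaces a stored key is an annotation binding a Kubernetes service account to a GCP service account; pods running as that service account then get GCP credentials without any key secret. A minimal sketch with placeholder names:

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: prow-component        # placeholder Kubernetes service account
  namespace: default
  annotations:
    # binds this KSA to a (placeholder) GCP service account
    iam.gke.io/gcp-service-account: prow-component@example-project.iam.gserviceaccount.com
```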
So I would say Cole would have more information about the webhook HMAC rotation thing. But I think it would be really beneficial if we could all decide on the best one-click way to install Prow. We may not be able to; I don't know, there might be differences of opinion on how to best manage Prow going forward, and maybe some of those decisions affect how you choose to install Prow from the get-go. But I think it would really benefit us.
C: For us, it was mainly the need to set up test clusters easily. Whenever I want to try something with Prow, I can use this and have a cluster up and running in, I guess, 10-15 minutes, just pointing it at GCP, and then tear it down next week if I want to try something else out. I guess that was the main motivation for us, but it's definitely also for new users; I don't know how many people out there are actually running their own Prow instance.
B: Cole definitely has more knowledge on it. I know they're trying to do something like a declarative in-cluster version of that deployer, so you might be able to work with that.
A: Well, that sounds really cool. Thank you for sharing that.
All right, I guess that's it for our Tuesday. Thank you, everybody, for showing up. I hope you have a great rest of your Tuesday, and until then, see you later.