From YouTube: Kubernetes SIG Testing - 2020-08-11
A: Hi everybody, today is August 11th and you are watching or attending the Kubernetes SIG Testing bi-weekly meeting. I am your host, Aaron.
A: We've got a couple of demos to show, and then I wanted to have a follow-up about the Kubernetes CI policy discussion we had last week. So first up: is Howard Zhang here, or is anybody here able to speak to having the Prow pod utility images on ARM?
C: I was building some a while back for the Prow images themselves, maybe not the pod utility ones, but for Raspberry Pi.
A: Okay, this was something we didn't have the opportunity to get to last meeting, and I think it was handled on Slack offline. But if not, we can come back to it if they show up to the meeting today. Next, there's an entry here — I don't know who this is from — about Triage Party and the Kubernetes PR dashboard.
D: Yeah, that was me. We don't have to cover it if we have other things, but I had talked to the Triage Party folks, and Ben, just about maybe making it a drop-in replacement for the PR dashboard, since Gubernator is getting older and older and less used.
A: That would be really cool. I know Arno has been involved in setting up a Triage Party instance for the release team. Arno and George, do either of you have anything to say about how Triage Party is going for y'all right now?
B: I can also describe that.
B: Yeah, so I spoke to Thomas last night. I basically did an audit of what the Triage Party setups for minikube and Skaffold are monitoring and how they display information. Basically, you can set different time limits for what you want to watch — PRs, comments, etc. — so it actually helps you organize the overall bucket of work into those categories so that you can take action on PRs, issues, and so on, and then you can also have tabs for specific types of issues.
B: So if there are, say, tools where you want to just monitor the issues on a specific tool, you could probably set up a tab to do that for the release team. That work is basically to figure out how we can prioritize the work, not just in terms of urgency; we're also looking at impact — what is the highest-value work that we could possibly do — and that's a discussion that we have to have.
B: The tabs enable you to do that, so you can watch different themes in one place. We haven't used it yet in meetings, because we have a lot of things in the backlog to sort through, but once that is in a reasonable place, I'm basically dropping a roadmap for how we would use Triage Party, and then what types of labels we would need to use it as best we can, including those impact labels.
A: So maybe Grant knows more about this, but the question I have is whether it would be possible to tailor that dashboard to an individual. To speak briefly to what Gubernator's PR dashboard does right now: I can go to gubernator.k8s.io/pr — I can hold myself accountable here and do this live, I guess — and it'll show me what PRs are on deck for me to review. So if I go there...
A: That's me, spiffxp. I have apparently 18 pull requests that need my attention, and it does this with a little bit of a state machine that mostly looks at PRs, the labels that are applied to them, and who has most recently commented on them. The thing is, it's personalized to me. This isn't me looking at SIG Testing's review queue; this is looking at my individual PR workload, and this seems easier for me to manage than GitHub's.
A: So it's unclear to me whether Triage Party supports that kind of workflow, and I don't know how many people use this — I kind of live or die by this PR dashboard, for what it's worth. One of the reasons I think Grant's been looking at it is because the rest of Gubernator we basically don't touch anymore, and so I guess it's kind of trying to figure out if we want to rewrite this or if we can use Triage Party.
D: I just synced up with Thomas about this and I created an issue, and he said they were planning to try to do something user-centric, and I was hoping to get to that work. So I was going to put a PR up against Triage Party to add OAuth with GitHub and have this user-centric dashboard, and then I was also talking to him, on Ben's recommendation, about multi-tenancy.
D: So we don't have to have multiple Triage Party instances per team. We could have a single instance with multi-repo and multiple tenants. It'd just be easier to have everything in one place and be able to see SIG Release, SIG Testing, and so on, each with their own Triage Party within the same instance.
A: That could definitely be useful. Okay — augmenting Triage Party sounds like a really neat idea, and maybe a useful way for us to align our efforts.
A: Next up on the agenda: Shane presented a design doc to us a little while back about a secret rotation thing he's been working on to help us manage Prow secrets, and Shane wanted to present a demo today. Shane, are you available?
E: Okay, so I'll get started. Hello everyone, I'm Shane, and today I'm going to talk about my project: integrating Kubernetes Secrets with Google Cloud Secret Manager. I'm going to include a live demo in this presentation, so I'm just going to start it right now, because it's going to take some time to run. Okay. So, let's start with the background. We utilize Kubernetes to facilitate the CI process, and within these processes, sensitive data is needed, stored as Kubernetes Secrets, and then consumed by whichever pods need it.
E: These secrets need to be securely managed, and also rotated. What we propose here is to incorporate Google Cloud Secret Manager and have it work as an independent secret-managing platform with a separate lifecycle from any given Kubernetes cluster. So, in the case that the cluster dies for any reason, these secrets can still be restored; and in the case that a secret gets stolen at any point, it will be invalidated after a while, if the secret is being rotated.
E: This is the basic architecture of the project. It is an integration of Google Cloud Secret Manager and Kubernetes Secrets. After being provisioned, a secret will be managed by Secret Manager, and then a secret rotator will be responsible for rotating this secret. Another component, the secret sync controller, will be continuously syncing from Secret Manager to Kubernetes, so that within the same cluster, a pod can consume the latest version and the currently valid key.
E: Suppose, for example, that it's the currently valid service account key, if this is applied to service account keys. So the first part is the secret rotator, and it performs two actions. The first is refresh, and it happens when the last version is too old — that is, it was created too long ago. Then it tries to refresh this and create a new version of the secret, and after it has done so, it will try to invalidate all of the out-of-date versions of the secret.
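To make those two actions concrete, here is a minimal sketch of that refresh-and-invalidate cycle in Go. This is not the project's actual code; the type and function names, and the newest-first ordering of versions, are assumptions for illustration only.

```go
package rotator

import "time"

// Version is one version of a managed secret. Hypothetical type,
// standing in for a Secret Manager secret version plus its metadata.
type Version struct {
	ID        string
	CreatedAt time.Time
}

// Rotator refreshes a secret when its newest version is older than
// RefreshInterval, then invalidates superseded versions once the
// grace period has elapsed.
type Rotator struct {
	RefreshInterval time.Duration // the demo used 2m30s
	GracePeriod     time.Duration // the demo used 1m

	CreateVersion     func() (Version, error) // provision a new key
	InvalidateVersion func(id string) error   // delete/deactivate a key
}

// RotateOnce performs a single pass: refresh if needed, then sweep
// stale versions. versions must be sorted newest-first; the still
// active versions are returned.
func (r *Rotator) RotateOnce(versions []Version, now time.Time) ([]Version, error) {
	// Refresh: the newest version was created too long ago (or none exists).
	if len(versions) == 0 || now.Sub(versions[0].CreatedAt) > r.RefreshInterval {
		v, err := r.CreateVersion()
		if err != nil {
			return versions, err
		}
		versions = append([]Version{v}, versions...)
	}
	// Invalidate: older versions coexist with the newest one until the
	// grace period, measured from the newest version's creation, is over.
	latest := versions[0]
	kept := []Version{latest}
	for _, v := range versions[1:] {
		if now.Sub(latest.CreatedAt) > r.GracePeriod {
			if err := r.InvalidateVersion(v.ID); err != nil {
				return versions, err
			}
			continue
		}
		kept = append(kept, v)
	}
	return kept, nil
}
```

Run periodically, this reproduces the behavior seen later in the demo: key three and key four coexist for the one-minute grace period, then key three is deleted.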
E: One thing to note here is that we utilize the metadata of the Secret Manager secrets to help the secret rotator. This is the information the rotator needs to rotate a service account key: the first two fields are the names of the project and the service account for the service account key, and the last field contains the service account key IDs for every active version.
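As a sketch, that metadata might look like the following Go struct. The field names are illustrative guesses at the three fields described, not the project's actual schema.

```go
package rotator

// RotatedKeyMetadata is the bookkeeping stored alongside the Secret
// Manager secret itself, so a restarted rotator can resume where it
// left off. Field names are illustrative, not the actual schema.
type RotatedKeyMetadata struct {
	// Project is the GCP project that owns the service account.
	Project string `json:"project"`
	// ServiceAccount is the service account whose keys are rotated.
	ServiceAccount string `json:"serviceAccount"`
	// ActiveKeyIDs lists the service account key IDs of every
	// version that has not yet been invalidated.
	ActiveKeyIDs []string `json:"activeKeyIDs"`
}
```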
E: Okay. The second part is the sync controller, and what it does is pretty simple: it just continuously syncs from the source secret to the destination secret. The source secret is a Secret Manager secret, which is likely also being rotated, and the destination secret is the desired Kubernetes secret. We're syncing at the key-value-pair level of a Kubernetes secret; so, for example, we're syncing this Secret Manager secret to the value corresponding to this key of this Kubernetes secret.
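The loop described here reduces to "read the source, compare, write the destination key if different". A minimal sketch, again with hypothetical adapter functions rather than the project's real API:

```go
package secretsync

import "bytes"

// SyncOnce copies the latest value of a Secret Manager secret into one
// key of a Kubernetes Secret, if it differs. The three adapters are
// hypothetical stand-ins for the Secret Manager and Kubernetes clients.
func SyncOnce(
	readSource func() ([]byte, error), // latest Secret Manager version
	readDest func(key string) ([]byte, error),
	writeDest func(key string, value []byte) error,
	key string,
) error {
	src, err := readSource()
	if err != nil {
		return err
	}
	dst, err := readDest(key)
	if err != nil {
		return err
	}
	if bytes.Equal(src, dst) {
		return nil // already in sync, nothing to do
	}
	return writeDest(key, src)
}
```

Running this continuously is what lets a pod mounting the Kubernetes secret pick up each rotated version shortly after it is created.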
E: Okay, let's talk about the setup for this live demo. Today we use ConfigMaps in Kubernetes to specify the configurations for both the secret rotator and the secret sync controller.
E: In this example, the secret rotator is rotating this secret-one, which is a service account key for this service account, and it has a refresh interval of two minutes and 30 seconds with a grace period of one minute, and the sync controller is continuously syncing.
E: What the mock consumer does is continuously watch the mounted Kubernetes secret — the service account key, in this case — and whenever an update is observed, it makes a copy of the current service account key into another directory, so that it always has a full collection of the entire history of this secret. So it will have every version of the service account key in that other directory.
E: Then it continuously pings all versions of the service account key and checks whether each is valid at the current time. Okay, so let's see the results. First, we get the pod that is running the mock consumer — it's under namespace 8 — and we get the log for it.
E: Okay, let's start from here. At this point in time there is only one valid key — the third key, key three — and it is the only active service account key. Then, after a while, the rotator tries to refresh this. What it does is first create a new version of this key, which is key four.
E: Now the two keys are both active; they're coexisting, which corresponds to the grace period of key three. When the grace period for key three is over — which happens within a minute here — key three gets invalidated, or in this case deleted: that service account key is deleted, leaving only key four as the currently active service account key, and the rotation just keeps going and going.
E: One more thing we can demonstrate here: if we stop the secret rotator now — so if the secret rotator is terminated for whatever reason — we can see the current status of the rotated secret. The information needed by the rotator is stored in the metadata of the secret, so that whenever the rotator is restarted after being terminated, it can continue.
E: It can continue to rotate the secret seamlessly. So we can just restart the secret rotator, and because it's past the refresh interval, it will create a fresh service account key and come back to deactivate the old version four, picking up where it left off when it was terminated.
E: That is all for the demo. To sum up: we have Google Cloud Secret Manager securely managing the secrets that will be used in Kubernetes; we now support rotation for GCP service account keys, and with secret rotation we can mitigate the harm in case any secret is stolen at any point in time; and by synchronizing from Secret Manager to Kubernetes, the pods in Kubernetes can consume these rotated secrets and get the latest and currently active versions of each secret.
E: Okay. Special thanks to my host Aaron and the Kubernetes EngProd team. This is the repo for this project. Thank you all for your attention; I'm happy to take any questions and input.
A: Oh, Shane, I'll work with you offline to get a link posted in the meeting notes, and we'll post a link in the chat or on Slack here.
A: Yeah, cool. Okay, Shane, do you mind... all right, there we go. Next up, we have Michael Colbert here to talk about rewriting triage in Go. I think we do... yeah, there you are, masked up. I like it.
H: I hope you can all hear me. If you can't, just let me know and I'll take off the mask. Okay, good.
H: Okay, so I presented a few months ago regarding rewriting triage in Go.
H: Because this is meant to be an in-place port, like a drop-in replacement, the best outcome is that you can't tell the difference between them.
H: Ideally. So on the left-hand side you can see the Python triage; it's currently running at its regular link, which is actually go.k8s.io/triage. On the right-hand side it's the same URL, but instead of triage you just change it to triage-go. I actually should be porting this over — so that instead of having this under triage-go, it's just under triage — by end of day today.
H: Essentially, you can see that the numbers of test failures in each instance are coming out the same. There are a few that don't exactly agree — like here it says 4078 and the old one says 4082 — but overall that's just going to be because of some time windows that are slightly different between the Python and Go versions. If you scroll down a little bit, you should see that they start to match up pretty much exactly.
H: Yeah, there you go. And one thing that we did add in the new version is that filtering clusters by SIG now works; the old one doesn't really do anything.
H: If you select, for example, api-machinery here, you only get clusters that are mostly related to API Machinery, based on the name of the test — as you can see, most of these happen to have sig-api-machinery. In addition, we did update these clusters over here — these are SIGs over here — before, we had some outdated SIGs.
H: That's it in terms of the UI. In terms of the time taken for the jobs, you can see here that the old ones were taking around 50 minutes to an hour, and now they're taking much less, around half an hour. If we actually want to compare them, we'll see that the average over the last — whatever this is — 20 jobs or so is 51 minutes for the old one and 31 minutes for the new one, which is a great improvement.
H: In addition, it's hard to compare the graphs because they're scaled differently, but you can see the high points over here: the longest ones take 75 minutes for the old version, more or less. And then the longest new ones — these are failed runs, but for the normal ones the longest it takes is 47 minutes. So even the longest jobs are taking a significantly smaller amount of time. I think that basically concludes what I have to show. Any questions?
H: I believe we are ready to migrate. I'm working on a PR to migrate right now, and that should be done by end of day.
A: Great. And as I understand it, this is the optimization we got just from rewriting in Go; there are further optimizations we may be able to make to update the triage results even more frequently.
H: That is correct. There are some that are easy and should provide meaningful performance gains, and some that are a bit harder and more tenuous in terms of how much of a performance gain they will bring.
G: One other really important thing that didn't come up here is that the previous implementation was minimally commented, very dense, and undocumented — the README said TODO. There is now a wonderful, detailed README outlining everything, and fairly well-commented code. Thanks for that.
H: There are no speed changes in the UI. I didn't touch the UI, mainly because it was supposed to be a drop-in replacement. In general, I don't think that the UI is particularly slow — you know, it's not ideal, not particularly snappy, but I don't think it's particularly slow. If you have, like, any specific issues...
H: It already is pretty efficient; the person who wrote this before me went to a lot of pains to actually make it pretty efficient. You can see, like over here, there's not so much data, and then as you get to the bottom it loads more. So it does implement lazy loading, and it's in a compressed form so that the data gets transferred more quickly.
A: Well, cool! Thank you. I don't know about everybody else here, but personally, when I'm trying to figure out whether I fixed something or not, this is the number one tool that I use, and so it was frustrating to me when I had to wait, you know, one to three hours to see if the PR fixed it.
G: This is also the way I find flakes, meaningfully. Testgrid is great for telling me if a job is passing or something, but this is really useful for telling me that some specific failure mode is popping up in our jobs, especially across jobs. Having that latency meant it was difficult to know if changes that we suspected might introduce issues — like upgrading a container runtime or something — were actually introducing issues.
H: Yeah, sorry — there shouldn't be anything noticeable.
A: All right, thanks for your time, Michael. Okay, next up, let's talk about Kubernetes CI policy.
A: Laurie, I feel like you've been doing a lot of organizing of stuff. Did you want to walk people through what's been set up in terms of organization?
B: Sure. I don't think I can share the project board without hosting, but maybe you can pull it up. It's fine if you want to just do that — it'll be easier. Give me a moment... sure.
A: That sounds fine to me. I mean, I think for some of the work that we're going through right now, it has a pretty fixed workflow.
A: There may be other things where there are pull requests that we feel are relevant to this, that we want to toss on this board to get wider attention, and maybe they'll become relevant at that time. But if we're not using them now, we may as well get rid of them.
B: Right — not right now, but what Aaron's saying is in the future they might, so we can keep them for a while. All right, so let's just wait and see what happens there. And then the monitoring column is something Rob suggested — that's like our parking lot to see how the jobs are performing. So my question there is: what might be the timeline for taking action to move those to the next step? What do you need to do and find out? How much longer do you need to monitor?
A: I feel like we've probably had sufficient time. These are sort of questions I maybe have for the broader group: we still haven't actually defined metrics that are measurable, that say success, yes or no. I have some things linked in the talk that I could make as suggestions, but my general sense is: if we make the changes and the CI signal folks feel like everything's about as stable as it has been, maybe even more so...
I: Yeah, I definitely agree with that. Part of the point of introducing these resource limits and requests is that when we do experience flakes or failures, we're able to pinpoint why they're happening. So theoretically, as the tests that run as part of these jobs change over time, the limits and requests could need to change as well. As far as actually adding these, I think we are done, and monitoring, in the broad sense, is just going to continue to happen over time.
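For context, the limits and requests under discussion are the standard Kubernetes resource requirements on each job's pod. A minimal sketch using the real k8s.io/api types; the specific CPU and memory quantities here are made-up examples, not values set on any actual job.

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
)

func main() {
	// Setting requests equal to limits is what gives a pod the
	// Guaranteed QoS class mentioned later in the meeting, which
	// makes scheduling and autoscaling behavior far more predictable.
	reqs := corev1.ResourceRequirements{
		Requests: corev1.ResourceList{
			corev1.ResourceCPU:    resource.MustParse("2"),
			corev1.ResourceMemory: resource.MustParse("4Gi"),
		},
		Limits: corev1.ResourceList{
			corev1.ResourceCPU:    resource.MustParse("2"),
			corev1.ResourceMemory: resource.MustParse("4Gi"),
		},
	}
	fmt.Println(reqs.Requests.Cpu(), reqs.Requests.Memory())
}
```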
A: I'm okay if the people who are currently assigned to these issues go through and say, yeah, I've been watching it for a couple of days and it looks okay. I had planned on sweeping through these at some point myself, but my schedule's been pretty booked. If people are waiting for some kind of signal to say whether we should call these done now, I think we've reached that point.
J: Yeah, I think we can work through the list and start closing them off, and if anything needed to be reopened off the back of this change, they could be reopened. The intention there was just to shorten the in-progress column — just to show that the work had been done and that we were just keeping an eye on it.
A: My impression is many of the "make critical jobs Guaranteed pod quality of service" issues can probably be moved over to the monitoring column. As far as I'm aware, almost every single one of these has had a pull request merged, so it may just be a matter of clicking through and seeing if the associated pull request has been merged. Yeah, seems like it, so I'll move this one over, but I think people who are assigned to these issues can totally move them over — or Laurie can.
A: If you want to sort of groom the board that way, that could also be helpful.
A: Sure. I would probably suggest other members of the CI signal team can keep track of this on his behalf.
G: Yeah, I think if they have a PR merged, we move them to monitoring, but there are some where someone has opened a PR and we haven't reviewed it yet. Most of them are moving; a few are stalled on a review or the author being out. As you mentioned, I know I have a few stalled just because I've mostly been load-shedding reviews and sorting that out, and didn't actually get many reviews done for a couple of days.
B: All right, cool. And then in terms of the metrics, Aaron: what's your plan? What would you like your next step to be for getting those?
A: So this is the umbrella issue that resulted from our discussions two weeks ago, which tries to represent what we feel are the three most critical pieces of action that need to be taken in order to mitigate this situation. Then we need to figure out how to define what success looks like and what our metrics are. I've had a difficult time creating issues to represent these, because I can't describe them as "this is the metric you should go measure and implement"; it's more like we need to go do a bunch of discovery.
A: There might be a lot of false starts — that's at least what I've been encountering personally. And then these were the three other, lower-priority things that we discussed doing afterwards: things like mandating that every single prow job has contact info, then removing perma-failing jobs, starting with the really egregious ones, and then figuring out, based on our experience there, how to create a policy to continue doing that going forward.
A: So let's see, back to the jobs. As far as where we're at with migrating existing jobs: this is a spreadsheet that should be accessible to Kubernetes staff and SIG Testing members. It's generated by a little report thing that I have — it's very manual, I kind of just threw it together and presented it to the group here. I'm filtering on every job that's in the sig-release blocking dashboard and also has pod QoS guaranteed set to true.
A: I'm picking up a couple of things that also happen to have the word "blocking" in them but aren't guaranteed, just because, like, sig-release's prototype master doesn't seem like an appropriate dashboard, nor release blocking. But this is just a sign that, yes — this would be my metric to measure whether we have done the things we said we would do.
A: Yes, we have. It'll take me a little bit more clicking around to do this for merge blocking; let me see if I can do that real quick.
A: But the intent is that this is more of a reporting mechanism we could use to figure out what our work is, and then there are tests that basically enforce this stuff — the tests start informative and then flip to blocking. Let's see here... and then kubernetes/kubernetes.
A: I don't even know why those are there. Ostensibly, these should be all of the pre-submits that run against the kubernetes/kubernetes repo and are also merge blocking. By merge blocking, I mean they either always run and are considered blocking, or they run if certain files change and they're blocking. So again, not sure why kubeflow is showing up here, but the gist is: I think there are still some jobs that we need to set resource limits for, and then hopefully we'll be done with merge blocking.
A: So if I filter down on the number of errors over the last seven days... actually, I'll back this up to the last 90 days. I went through and bumped the persistence on the Prometheus instance that this queries against — it was previously tuned to about four weeks — so that we'll be able to see things going forward.
A: We could take the number of jobs in an error state as some kind of signal that something bad is happening, and we could take the reduction of this number down to zero as a signal that we think we've stopped the badness. This corresponds roughly to when we were experiencing a lot of pain, and this corresponds roughly to when we have not been experiencing a lot of pain. You can also see it roughly corresponds to the volume of pre-submit jobs, which has also lessened over time.
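As a sketch of how that error-state count could be pulled programmatically: this assumes the Prometheus instance in question exposes Prow's prowjobs gauge with a state label; the metric name and server address here are assumptions, not details confirmed in the meeting.

```go
package main

import (
	"context"
	"fmt"
	"time"

	"github.com/prometheus/client_golang/api"
	promv1 "github.com/prometheus/client_golang/api/prometheus/v1"
)

func main() {
	// Address is a placeholder; point it at the actual Prometheus instance.
	client, err := api.NewClient(api.Config{Address: "http://prometheus.example.com"})
	if err != nil {
		panic(err)
	}
	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	// Count prow jobs currently in the "error" state (assumed metric).
	result, warnings, err := promv1.NewAPI(client).Query(ctx,
		`sum(prowjobs{state="error"})`, time.Now())
	if err != nil {
		panic(err)
	}
	if len(warnings) > 0 {
		fmt.Println("warnings:", warnings)
	}
	fmt.Println("jobs in error state:", result)
}
```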
A: I wasn't sure if Daniel or Rob — either of you — sort of have a pulse check on the general stability of the jobs lately. How do you check that?
I: Yeah, so there's not a lot of stuff exposed in Spyglass about why something failed. In fact, until recently it would say a job failed even if it errored out — like with a pod scheduling timeout or something like that; that's been updated. One of the things that makes it hard to understand what's happening is that if a job has any sort of error — a pod scheduling timeout or anything of that nature — we don't get any of the podinfo.json, so that's not being published.
I: So it's pretty much impossible to get anything other than "there's a pod scheduling timeout"; you don't know the underlying reason for it. I'm working on adding some of that, so hopefully when folks see — especially on PRs — that something happened, they can go and it'll kind of hit them in the face whether it was a resource issue or something actually wrong with the change that they're submitting. Hopefully that will help to some extent.
B: I was just going to suggest that maybe some of you want to get together and hammer out the metrics that you think are worth taking a look at, or formalizing into "how we're doing". You'd also mentioned that there's some discovery involved — so maybe figure out what there is to discover, what the checklist of things to look into would be, and then maybe from that checklist we could break down the work a bit.
G: Just want to caution: I think a successful end state, where we've actually gotten everything happy, is relatively easy to measure and won't need much time to define, but trying to define how well individual things are working is going to be difficult. There are a lot of things that are hard to control for, like the load on CI as 1.20 development opens up, for example, or, you know, whether a job is consuming more than it needs, or...
G: Yeah, I think we can define pretty easily what it looks like when we've reached a healthy state and things are running well. I tossed out a metric around this previously that I think is good enough, just around the number of error-state jobs we have. I think we can lose a lot of time trying to define more granular metrics and probably not get super far with it. But it might be useful to look at trying to find the right trade-off in bin-packing tightly, and how we make that maintainable long-term, as in only requesting what we actually need.
I: Yeah, some of that can be kind of difficult without SIG involvement, because you have to understand exactly what a test is doing to understand whether it's consuming more than it should, or whether the resource limits and requests are where they should be. For instance, I know that a few of us have been looking at the verify test that was using wildly more memory than it needed, and previously the response to that was to bump the memory request higher, right?
I: "Oh, it needs more memory — let's do that." Actually investigating whether it should need that much memory is a different thing entirely. As we're shifting more responsibility, or motivation, to SIG owners, I don't want to also throw this in there, because that's a fairly significant ask, but in the long term I think it would be a reasonable and more sustainable path to have SIG owners be responsible for the amount of resources the tests that they run consume.
B: Okay, so there are two different approaches here — maybe you can combine them. There's the discovery side, and then there's also "we already know what we have to measure". So maybe the four of you would just want to hammer out what you want to do with metrics. Would that be possible, like in a Google doc or on Slack? What's your approach here?
J: Would you agree that it's tricky? On the face of it, when we stand in front of Testgrid, we know where we're at in terms of green and red, and what you're asking for is something that I think we might have to spend a few hours thinking about: can we create some definitions?
A: So my suggestion was going to be: I feel as though the forcing function for us will be moving on the next two tasks. One is migrating the remainder of the release blocking jobs to a publicly visible cluster in k8s-infra — this one here. I really wish I could share this out to community members so they could help with this for me, but unfortunately most of it involves google.com-specific stuff, so I've been working on chipping away at the things that I can.
A: Again, I kind of feel like that's something community members can't really actively participate in, but there we are with that. I'm also just trying to get prototype equivalents of these jobs running in community-visible infrastructure, so the community can see what kind of resources are applied.
A: The bigger list is going to be moving over all of the merge blocking jobs, or all of the pre-submits. There is at least an order of magnitude, if not two orders of magnitude, more traffic and resource consumption by these pre-submits than by the release blocking jobs, so we're really going to start stressing, or exercising, whatever capacity-related issues we might have with the k8s-infra build clusters.
A: The number of GCE projects has now been increased to 120-something in the last week, and this is to anticipate the load. If I look at the number of GCE projects in use by jobs on the google.com build cluster, it hovers somewhere under 100 and occasionally peaks around the 100 mark, so given what we already have as our load, we're trying to anticipate.
A: What's not as visible to the community is not quite as provable. So it's my intention to start teeing up individual tasks to migrate all of these over and tag them as help-wanted, and I could really use help from Rob and Daniel and whoever else has been involved so far in describing what kind of process we would go through to verify that the jobs look good. It may be a little herky-jerky.
A: For all I know, we're wildly underestimating capacity and we'll have to stop until we get more quota. But working on this will give us significantly more insight into whether the resource limits we have set are working appropriately. And then, because we'll start seeing the amount of money all of this costs, that can start to move us closer to the world where we talk about individual SIGs getting budgets and where they want to allocate their budgets, and the community may be able to find metrics that show how certain jobs may be over-allocating resources.
A: You know, like, it turns out they may be using 20% of the resources that they're asking for most of the time. So that's...
B: Okay, so I guess —
A: And I feel like the work that I have done to migrate things over for policy number two has given community members sufficient visibility to implement parts of policy number one. Like, Daniel's been able to go check out the verify job and see what the metrics look like for memory consumption and stuff. I feel like it's time that we give that same level of visibility to the merge blocking jobs.
A: In terms of me being a blocking factor, I feel like it's more important that I tee all this stuff up. And — we're over time, for what it's worth — it also just makes me uncomfortable, as a data-driven individual, that I don't have data showing that what we are doing is actually having an effect.
A: I can kind of see in some ways that it is having an effect — I can see autoscaling working now, based on the resource limits we've set for release blocking jobs — but I cannot verifiably point to a graph that is the smoking gun for the pain we were experiencing before, then point to a similar graph now and show how it is significantly better as a result of what we've done. That would be the holy grail, right? Okay.
A: So I'll try to maybe start an umbrella issue for that discussion, and we're at time. Anybody have anything else?
A: Okay, thank you everybody for your time. I appreciate you all showing up, and have a happy Tuesday.