From YouTube: Kubernetes SIG Testing - 2021-11-02
Description
https://bit.ly/k8s-sig-testing-notes
NB: a portion of the meeting was edited out to avoid disclosing the specific vulnerabilities that were discussed in the meeting
A
Hi everybody, today is Tuesday, November 2nd. You are at the Kubernetes SIG Testing bi-weekly meeting. I am today's host, chair of SIG Testing, Aaron Crickenberger, also known as spiffxp at all the places. This meeting follows the Kubernetes code of conduct, which means we're going to be our very best selves to each other. It's also being recorded so that it can be posted publicly to YouTube shortly.
A
Agenda: I've got a quick heads-up, and then we've got some suggestions from Chao and Arnaud to talk about.
A
Is there anything we should be paying specific attention to in the next two weeks? Stuff that typically comes up around this time: any major test changes, or anything that attempts to change a bunch of container images. We've made great progress in migrating to community-hosted images.
A
The other thing we try to keep an eye on, that's not necessarily code changes for Kubernetes, is things like workflow changes for the project.
A
So if there are any major, potentially disruptive workflow changes or Prow changes, we may want to consider whether we should try landing those now, or pause on them, or adopt a little bit more of a careful migration and rollout strategy.
A
That's a really great question; I'm gonna look that up real quick. I think it's at kubernetes/sig-release. Typically, I think, code freeze these days lasts until the release actually happens. So code freeze will happen Tuesday, November 16th, then test freeze. Test-only changes after code freeze: if your plan is to land your feature and then land your tests, don't do that, that's not good; but it's for things like conformance tests, or tests that have been hanging out waiting for approval for a while.
A
And then my impression is it's kind of a lull until sometime next year. I think I had seen some traffic on Kubernetes issues about maybe using this time for any large-scale breaking changes or messy refactoring that we might want to land.
B
Yeah, and now Claudiu is in the meeting. He has some PRs to implement pre-pulling of images in the e2e tests, and I think that's interesting, but I don't know if all of you are aware of them, and I don't know if he will need more help to land this before code freeze.
D
That's a good point. Basically, the idea with pre-pulling images: I already saw that for node tests the images are being pre-pulled, and basically I'm proposing something similar for regular e2e tests.
D
This is because we have been seeing some flakes in the Windows CI: some tests expected the pods to start up in one minute, but if those tests somehow end up being the first ones to run, it might take longer than one minute to pull and unpack the Windows images; they tend to be a bit bigger than Linux images. So that pull request; I'm gonna link it... I think, yeah, Antonio already linked it, thank you. It basically introduces an optional argument; by default it's false.
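The flake mode described here can be sketched with a toy model; the durations below are hypothetical illustrations, not numbers measured from the actual CI:

```python
# Toy model of the flake: the first test scheduled onto a fresh Windows node
# pays the image pull + unpack cost inside the pod-startup timeout window.
STARTUP_TIMEOUT_S = 60  # the one-minute startup expectation some tests assume

def pod_startup_seconds(pull_s, start_s, image_cached):
    """Seconds until the pod is running; the pull is skipped when cached."""
    return (0 if image_cached else pull_s) + start_s

# Hypothetical numbers: large Windows images can take minutes to pull/unpack.
cold = pod_startup_seconds(pull_s=150, start_s=5, image_cached=False)
warm = pod_startup_seconds(pull_s=150, start_s=5, image_cached=True)

print(cold, cold > STARTUP_TIMEOUT_S)  # 155 True: first test to run flakes
print(warm, warm > STARTUP_TIMEOUT_S)  # 5 False: pre-pulled image is fine
```

Pre-pulling simply moves the pull cost out of the timed window, which is why it only matters for whichever test happens to run first on a node.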
D
And I do have another pull request which might be useful for us. From what I observed, if a test fails, we don't really see anything that happened in the pod, whether there were any issues there, especially if they are networking issues. A lot of the services being started up, like the agnhost-based services, log every request, which is extremely useful, but we don't see that in the case of failed tests.
D
So if we could actually see that information, we might get some more in-depth view as to what happened for that failure. Did that request even reach the pod or not? Did it fail with some inter-pod communication?
D
We don't know, because we don't see any of the pod logs in that scenario. So I will also send a pull request for that. In that one, by default we get the logs from the top five pods in that test. In most cases that's more than enough, and in other scenarios, in which the test spawns too many pods, more than five, labels could be added to the most important pods that should be logged, basically.
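A minimal sketch of the selection rule being described; the function name and the label key are assumptions for illustration, not the actual PR's API: collect logs from at most five pods by default, but always include pods explicitly labeled as important.

```python
DEFAULT_LOG_LIMIT = 5  # default number of pods whose logs get collected

def pods_to_log(pods, limit=DEFAULT_LOG_LIMIT, label="logs-on-failure"):
    """Pick which pods' logs to dump after a failed test.

    `pods` is a list of dicts with "name" and "labels" keys. Pods carrying
    the marker label are always included; the rest fill up to `limit`.
    """
    marked = [p for p in pods if label in p.get("labels", {})]
    rest = [p for p in pods if p not in marked]
    return [p["name"] for p in marked + rest[: max(0, limit - len(marked))]]

pods = [{"name": f"pod-{i}", "labels": {}} for i in range(8)]
pods[6]["labels"]["logs-on-failure"] = "true"  # mark the important pod
print(pods_to_log(pods))  # pod-6 plus the first four unmarked pods
```

With fewer than five pods the limit never kicks in, so everything gets logged; the label only matters for tests that spawn many pods.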
A
No, I think this totally makes sense. I feel like we could just turn it on and sort of see what it does. The only impact I can think of is an increase in the size of the artifacts that we store, or maybe there's gonna be some flakiness in actually retrieving the logs for these pods; it could be that the pods end up going away before we're able to retrieve the logs from them.
A
So the one I wanted some folks' eyes on was an update to the approve plugin. We had this person come by SIG Testing a while back and sort of walk us through a demo of adding granular approval.

A
Let's see: it would empower more people, I think, to approve files. It's targeted at getting rid of the root-owners, or root-approvers, issue for kubernetes/kubernetes: the fact that there are very few people in the root approvers group, and we kind of don't want to add more people to it; but for those people who are over-privileged, we want them to be able to say "I am approving just this specific part of it," and we can also allow people to approve just specific paths.
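The granular approval idea can be sketched as a path-coverage check; the data shapes and helper below are illustrative, not the plugin's actual implementation:

```python
from fnmatch import fnmatch

def unapproved_files(changed_files, approvals):
    """Return the changed files not covered by any granted approval.

    `approvals` maps an approver to the path globs they approved,
    e.g. {"alice": ["test/e2e/*"]}: alice signs off on just that part
    of the PR instead of needing root-approver rights for all of it.
    Note fnmatch's "*" matches across "/" here, unlike shell globbing.
    """
    patterns = [pat for pats in approvals.values() for pat in pats]
    return [f for f in changed_files
            if not any(fnmatch(f, pat) for pat in patterns)]

changed = ["test/e2e/windows/density.go", "cmd/kubelet/app/server.go"]
print(unapproved_files(changed, {"alice": ["test/e2e/*"]}))
# the kubelet file still needs someone to approve the cmd/ part
```

The PR merges only once this list is empty, so several narrowly scoped approvers can jointly replace one root approver.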
A
Cases
here,
but
the
idea
is
this
was
discussed
in
brexit
testing
a
while
ago
and
then
I
think,
was
iterated
on
and
initially
it
had
been
done
as
a
separate,
approved
plug-in.
The
functionality
has
all
since
been
gated
behind
a
flag.
A
Freeze
and
stuff,
like
I
think,
if
we
wanted
this
deployed
kind
of
now,
ish
or
this
week-ish
would
be
the
time
I
would
want
to
see
it
so
that
we
would
have
enough
time
to
revert
it
as
we
start
to
see
sort
of
an
oncoming
flood
of
code,
reviews
and
tr
pushes
and
people
trying
to
get
their
stuff
landed
before
code.
Freeze,
like
I
approve
of
the
idea
in
principle,
but
I
need
people
who
have
a
little
better
granular.
C
I have some thoughts on this. First of all, it's the first I'm seeing of this, which is a little unfortunate, because I kind of had some opinions about the approve plugin earlier, namely that we should completely rewrite it rather than continuing to add more functionality to it. It is very confusing, and huge amounts of the code are vestigial and, you know, could be refactored into things that make far more sense. And I'm a little alarmed that this PR is 6,000 lines of changes.
A
Yeah
I
mean-
I
don't
know
I
I
I
think
I
want
I
want
some
pushback
like
I
have
this
impulse
of
this
has
been
sitting
out
since
march
and
I
think
alvaro
took
a
look
at
it
back
in
march,
but
I
don't
know
how
much
we've
actually
engaged
with
this
since
then,
and
yes,
because
of
the
large
implications,
I'm
wary
of
having
this
land
so
close
to
code
freeze,
so
I
think
maybe
alvaro
or
cole,
if
you
guys
could
sort
of
what
I'm
trying
to
figure
out
is.
A
Is there value in doing that? Because if I were this contributor, I would find it frustrating that, after nine-plus months, the feedback from the people they've been trying to get attention from is "actually, we've got to redo the whole thing entirely."
C
I
wonder,
if
that's
you
know,
I
I'm
not
sure
how
we
should
handle
that
I
guess
like
I,
I
alvarez
looked
at
this
a
little
more
than
I
have
so
maybe
he
maybe
you
know
like
how
much
the
other
two
commits
are
like
I'm
just
looking
at
the
two
commits
that
are
supposed
to
be
the
you
know,
the
changes,
not
the
copy,
and
it
still
looks
very
massive.
C
Do
you
know
how
much
that
is
actually
changes
and
how
much
of
it
is
stuff
that
should
have
been
put
in
that
first
commit.
A
Yeah,
I
I
think,
like
I'd,
have
to
go
back
and
dig
through
the
meeting.
That's
the
recording
to
know
for
sure,
like
I
do
recall
this
person
showing
up
and
actually
like
walking
us
through
a
demo
of
the
proposed
functionality.
I
think
maybe
the
part
of
the
process
here
that
was
lost
is
how
we
typically
go
through
a
design
dock
or
a
cap,
or
something
to
really
walk
through
the
implementation
and
the
implications
of
this,
and
I
feel
like
if
I
didn't.
A
Up
at
the
meeting,
this
pr
looks
an
awful
lot
like
dropping
a
multi-thousand
line.
Pr
as
the
first
step
of
engagement,
which
typically
doesn't
go
very
well,
so
it
could
be
what's
missing.
Here
is
like
a
document
that
sort
of
lays
out
the
design,
decisions
and
choices
that
were
made,
and
it
could
be
that
breaking
up
the
commits
to
make
them
more
digestible.
A
You know, making the regex stuff better in a different plugin is a thing.
A
The
only
other
thing
I
can
think
of
about
the
big
change
prior
to
code.
Freeze,
since
our
note
is
here,
it's
a
quick
kicks
in
for
hat
on
the
scalability
jobs.
We've
migrated
over
the
five
thousand
node
performance
test
over
the
k-10.
We
haven't
migrated
over
the
5
000
to
correct
this
job.
Yet
I
was
wondering
if
there
was
anything
specifically
that
was
holding
us
up
there.
G
Not
really,
I
think
I
want
to
leave
that
decision
to
see
scalability,
because
I
talked
to
one
of
the
leads
and
what
he
told
me
is.
They
need
some
kind
of
bandwidth
to
do
the
migration
and
babysit
the
job
in
case
there's
some
failure.
So
I
did
everything
to
prepare.
The
migration
now
is
just
an
approval
from
them.
A
Okay,
that's
everything
that
I
could
think
of
without
my
head
for
groceries
and
thank
you.
How
do
you
for
antonio
for
bringing
some
other
things
to
the
table
with
that?
I'm
going
to
hand
over
to
chow
to
talk
about
github,
dependable.
H
Thank you, Aaron. So, as I asked about a couple of days ago, we found out that we can turn on GitHub Dependabot alerts without modifying the code base, so I turned it on, and I was not prepared for the fact that we have 17 alerts. I'm looking at the page right now; I can copy-paste the link into the meeting notes.
H
So
do
you
guys
mind
if
I
share,
or
I
think
aaron
you
can-
you
should
have
permission
to
open
the
link
yeah.
Let
me
let
me
allow
you
to
share
all
right.
A
Okay,
that
moves.
C
We're
talking
about
migrating,
the
cage
brow
instance
to
the
working
group,
infra
instance.
Is
that
correct
that
to.
C
I
would
imagine
that
the
next
steps
would
be
to
migrate.
You
know
sets
of
jobs
or
individual
jobs
at
a
time.
Is
that
kind
of
the
approach.
A
So
then,
so
the
thinking
here
is
we've
already
migrated
a
bunch
of
community
jobs
over
to
a
community
of
build
cluster.
Let's
just
pretend
for
a
second:
I
could
wave
a
magic
wand
and
proud.kate's
dot.
Io
suddenly
doesn't
run
into
kate's
crowd
project
and
instead
runs
in
a
multi-tenant
app
cluster.
A
It's
going
to
attempt
to
schedule
jobs
to
google
and
it's
going
to
attempt
to
some
of
those
jobs
that
it's
running
are
jobs
that
have
nothing
to
do
with
the
kubernetes
project.
So
if
I
look.
A
Jobs
perspective
like
we
got
to
decide
whether
we
should
continue
to
kick
off
non-kubernetes
projects
from
crowdcase.io.
A
Basically, the last time I thought about this deeply, I had concerns that we needed to make sure of: if we were to stand up, like, we have a Prow service cluster running over in k8s-infra, stood up for prow.k8s.io, what would we have to do to flip domain names, or whatever, such that that became prow.k8s.io?
A
Infrastructure-wise, I feel like we have to decide whether we would want that Prow instance still capable of scheduling to the google.com-owned build cluster, or whether we would say no, it can't schedule to that cluster; everything has to run in the community-owned build cluster. And that leads us to the decision tree on which jobs should be migrated over or not. Then there are the Google Cloud Storage resources, like k8s-testgrid and kubernetes-jenkins, that currently live in google.com.
A
And then the thing I don't even have listed here, I'm just realizing out loud, is all of the workflow around continuously deploying prow.k8s.io and stuff. I don't know; it all relies on the use of a k8s-prow project to host all the images, which is also google.com-owned.
A
It's
unclear
to
me
whether
we
need
to
add
privileges
for
the
community
owned,
for
instance,
to
write
to
that
sorry.
At
this
point
I
feel
just
like
I'm
rambling,
but
I
agree
with
darno
that,
like
a
migration
plan
needs
to
be
developed
and
I
keep
losing
context
and
not
having
enough
time
to
actually
like
get
the
context
page
back
in
and
focus
on
it
and
come
to
it
in
sort
of
a
rigorous,
step-by-step
thing.
I
But there are a few things that jumped out when we talked about this before. The budget is going to be the big one, especially once we start scheduling and putting logs on community-owned stuff, because I'm pretty sure our runway will just explode; we won't be able to cover it for the year, right?
A
Basically,
I
kind
of
that-
that's
maybe
almost
an
orthogonal
thing
to
me
at
this
point
I
feel
like
we
already
lost
our
runway
with
the
artifact
hosting
costs,
but
yeah,
let's
say
ballpark.
It
doubles
our
ci
costs
if
we,
if
that's
assuming
moving
over
proud
on
case.io,
also
means
moving
over
all
of
the
other
jobs
that
it
currently
runs
right
for
reference.
I
don't,
I
think
it's
ballpark
like
300
and
something
jobs.
C
So
these
1700
remaining
jobs,
though
I
doubt
that
many
of
them
are
for
just
google
purposes
right,
like
most
of
the
like
very
few
of
those
are
going
to
be
things
that
are,
you
know,
shouldn't
be
running
in
that
cluster
right
yeah.
A
lot
of
it
is
like
kubernetes.
A
Ziggs
repos,
like
cluster
api,
all
the
cluster
api
repos,
all
the
cloud
provider
repos
like
there's
a
lot
of
valid
stuff
that
hasn't
been
migrated,
I
sort
of
in
terms
of
if
we
go
back
to
your
suggestion.
Cole,
like
think
about
just
the
jobs
that
we
should
migrate.
Thinking
that,
like
running
jobs,
is
the
majority
of
the
sea
ice,
not
necessarily
the
control
plane
or
the
service
cluster.
Then
yeah.
We
migrated
over
the
jobs
that
are
principally
aimed
at
release,
blocking
and
merge
blocking
jobs
for
kubernetes
kubernetes.
A
We haven't really opened the floodgates to jobs for all of the other hundred-and-something repos that make up the Kubernetes project. Part of that's just from a billing perspective: it's really difficult to segregate out billing on a per-job basis, so we wouldn't be able to wave a magic wand and say, "Well, SIG Cluster Lifecycle, you get a budget of this many thousands of dollars, and it's up to you to make sure that you don't run too many jobs and blow your budget."
A
Off on a tangent here: the motivation was to see if there's a way we can migrate just the service cluster, because the idea here is: great, we're spending the community's money on jobs that run in the community-owned build cluster, and that's cool, but I feel like people are less incentivized to contribute to and help support Prow, because they can't touch the Prow control plane; they can't see the same logs that all the on-call people can see.
A
This
is
all
kind
of
in
service,
of
allowing
more
community
members
to
be
able
to
join
an
on-call
rotation.
That's
basically
best
effort
and
to
be
able
to
help
out,
and
so
the
fact
that
proud
runs
in
a
google.com
only
project
is.
A
Like, the security stuff I'm thinking of is: I have a service cluster outside of google.com that has, you know, the config to be able to schedule to a build cluster, which (you all can correct me if I'm wrong, but last time I went through this) required setting up basically an admin-level user.
A
So
a
gke
cluster
outside
of
google.com,
more
or
less,
has
full
control
of
the
gke
cluster
inside
of
google.com.
So
it
could
schedule
whatever
it
wanted
inside
that
gke
cluster,
you
know
kind
of
depends
on
like
what
workload
identity
bindings
would
be
present
inside
of
the
google.com
cluster.
How
far
something
would
be
able
to
get
from
there
as
far
as
what
it
could
or
could
not
access.
E
I
was
saying
I
thought
that
was
actually
like
a
technical
blocker,
the
like,
like
being
able
to
grant
permission.
A
Would there be a dance we could do where, let's say, we leave all the jobs there, and they still schedule to a google.com-owned Kubernetes cluster, or a google.com cluster that runs largely Kubernetes jobs but some non-Kubernetes jobs, and we were to swap domain names? So now we have prow.k8s.io, that's the Kubernetes thing; it's running the Kubernetes release-blocking jobs, and we encourage people to migrate more of their jobs to run on it; and then what's left is, like, Google's Prow instance.
C
Having
both
temporarily
sounds
good
to
me,
I
would
say
that
we
probably
shouldn't
switch
the
name,
though,
until
we
are
able
to
deprecate,
because
I
think
that'll
just
cause
more
conflicts
and
issues
right
if
we
can
keep
the
name
consistent,
at
least
until
we're
ready
to
make
that
the
other
source
of
truth.
That
might
be
easier.
A
Well,
I
keep
feeling
like
when
I
walk
away
from
these
discussions.
It's
like
well,
I
guess
I
should
go
write
a
doc,
but
I
feel
like
arnold
is
holding
me
accountable.
The
fact
that
I
keep
not
doing
that
so
I
can
take
another
crack
at
this,
but
I
think
I'm
gonna
definitely
get
cole
and
ciao
to
take
a
look
at
it
this
time
and
see.
If
I
don't
know,
maybe
we
could
tee.
A
The main thing we haven't discussed here is workflow-related stuff, when it comes to the continuous deployment of Prow and its images and all that; I feel like that needs a little bit of thought as well.
C
Yeah,
I
think
we
have
pretty
clear
paths
forward
on
all
of
that.
I
agree
that
we
need
to
flesh
it
out,
but
yeah.
I
think
that
all
of
that
there's
no
no
technical
barriers
there
that
I'm
aware
of
okay.
A
All right, well, unless anybody's got anything for our last minute, I'm gonna call it. Okay, thanks everybody, have a happy Tuesday, and I'll see you all in two weeks.