From YouTube: SIG Testing Meeting for 20220211
Description
SIG Testing Bi-Weekly Meeting for 20220211
A: Hi everybody, today is Tuesday, November 2nd. You are at the Kubernetes SIG Testing bi-weekly meeting. I am today's host, chair of SIG Testing, Aaron Crickenberger, also known as spiffxp at all the places. This meeting follows the Kubernetes Code of Conduct, which means we're going to be our very best selves to each other. It's also being recorded so that it can be posted publicly to YouTube shortly.
Agenda: I've got a quick heads-up, and then we've got some suggestions from Chao and Arnaud to talk about.
Let me know if there's anything we should be paying specific attention to in the next two weeks. Stuff that typically comes up around this time is any major test changes, or anything that attempts to change a bunch of container images; we've made great progress in migrating to community-hosted images.
The other thing we try to keep an eye on, that's not necessarily code changes for Kubernetes, is things like workflow changes for the project.
So if there are any major, potentially disruptive workflow changes or Prow changes, we may want to consider whether we should try landing those now, or pause on them, or adopt a little bit more of a careful migration and rollout strategy.
That's a really great question; I'm gonna look that up real quick. I think it's in kubernetes/sig-release. Typically, I think, code freeze these days lasts until the release actually happens. So code freeze will happen Tuesday, November 16th, then test freeze. Test freeze is for test-only changes; if your plan is to land your feature and then land your tests, don't do that, that's not good, but for things like conformance tests, or things that have been hanging out waiting for approval for a while, test freeze is the deadline.
And then my impression is it's kind of a lull until sometime next year. I think I had seen some traffic on kubernetes issues about maybe using this time for any large-scale breaking changes or messy refactoring that we might want to land.
D: Yeah, and now Claudiu is in the meeting. He has a PR to implement pre-pulling of images for the e2e tests, and I think that's interesting, but I don't know if all of you are aware of the PR, and I don't know if he will need more help to land this before code freeze.
E: That's a good point. Basically, the idea with pre-pulling images: I already saw that for node tests the images are being pre-pulled, and basically I'm proposing something similar for regular e2e tests.
This is because we have been seeing some flakes in the Windows CI, because some tests expected the pods to start up in one minute, but if those tests somehow end up being the first ones to run, it might take longer than one minute to pull and unpack the Windows images; they tend to be a bit bigger than Linux images. So, that pull request, I'm gonna link it... sure, I think, yeah, Antonio already linked it, thank you. That basically introduces an optional argument; by default it's false.
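For illustration only (this is not Claudiu's actual PR; the flag name and image below are placeholders), a minimal Go sketch of the general idea: create a short-lived pod per node that references the test images, so the images are already cached before the e2e tests that depend on them start.

```go
// Hypothetical sketch: opt-in pre-pulling of test images onto every node.
package main

import (
	"context"
	"flag"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	kubeconfig := flag.String("kubeconfig", "", "path to kubeconfig")
	// Mirrors the optional argument described above: off by default.
	prepull := flag.Bool("prepull-images", false, "pre-pull test images before running tests")
	flag.Parse()
	if !*prepull {
		return
	}

	cfg, err := clientcmd.BuildConfigFromFlags("", *kubeconfig)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// Placeholder for the (larger) Windows test images discussed in the meeting.
	images := []string{"registry.k8s.io/e2e-test-images/agnhost:2.39"}

	nodes, err := client.CoreV1().Nodes().List(context.TODO(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	for _, node := range nodes.Items {
		for i, img := range images {
			pod := &corev1.Pod{
				ObjectMeta: metav1.ObjectMeta{
					Name:      fmt.Sprintf("prepull-%d-%s", i, node.Name),
					Namespace: "default",
				},
				Spec: corev1.PodSpec{
					NodeName:      node.Name,
					RestartPolicy: corev1.RestartPolicyNever,
					Containers: []corev1.Container{{
						Name:            "prepull",
						Image:           img,
						ImagePullPolicy: corev1.PullIfNotPresent,
						// agnhost's pause subcommand just sleeps; once the pod is
						// Running, the image is cached on that node.
						Command: []string{"/agnhost", "pause"},
					}},
				},
			}
			if _, err := client.CoreV1().Pods("default").Create(context.TODO(), pod, metav1.CreateOptions{}); err != nil {
				panic(err)
			}
		}
	}
	// A real implementation would wait for these pods to be Running and clean
	// them up before the test suite starts; this sketch only shows the pull.
}
```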
E: And I do have another request which might be useful for us. From what I observed, it's regarding the following: if a test fails, we don't really see anything that happened in the pod, whether there were any issues there, especially if they are networking issues. A lot of the services that are being started up log every request, which is extremely useful, but we don't see that in the case of failed tests.
So if we could actually see that information, we might get a more in-depth view as to what happened for that failure. Did that request even reach the pod or not? Did it fail with some internal pod communication? We don't know, because we don't see any of the pod logs in that scenario. So I would also send a pull request for that as well. In that one, by default, we get the logs from the first five pods in that test. In most cases that's more than enough, and in other scenarios, in which the test spawns too many pods (more than five), labels could be added to the most important pods that should be logged, basically.
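As a rough sketch of what that kind of on-failure log dump could look like (again, not the actual PR; the label name below is an assumption), using client-go:

```go
// Hypothetical sketch: on test failure, dump logs of up to five pods in the
// test namespace, plus any pod explicitly labeled as important.
package e2edebug

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

const maxPodsToLog = 5

// dumpPodLogs prints container logs so they end up in the CI artifacts.
func dumpPodLogs(ctx context.Context, client kubernetes.Interface, namespace string) error {
	pods, err := client.CoreV1().Pods(namespace).List(ctx, metav1.ListOptions{})
	if err != nil {
		return err
	}
	logged := 0
	for _, pod := range pods.Items {
		important := pod.Labels["e2e-log-on-failure"] == "true" // hypothetical label
		if logged >= maxPodsToLog && !important {
			continue
		}
		for _, c := range pod.Spec.Containers {
			raw, err := client.CoreV1().Pods(namespace).
				GetLogs(pod.Name, &corev1.PodLogOptions{Container: c.Name}).
				Do(ctx).Raw()
			if err != nil {
				fmt.Printf("could not get logs for %s/%s: %v\n", pod.Name, c.Name, err)
				continue
			}
			fmt.Printf("--- logs for pod %s container %s ---\n%s\n", pod.Name, c.Name, raw)
		}
		logged++
	}
	return nil
}
```

The cap of five plus an explicit opt-in label mirrors the behavior described above: cheap by default, with an escape hatch for tests that create many pods.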
A: No, I think this totally makes sense. I feel like we could just turn it on and sort of see what it does. The only impact I can think of is an increase in the size of the artifacts that we store, or maybe there's gonna be some flakiness in actually retrieving the logs for these pods; it could be the pods end up going away before we're able to retrieve the logs from them.
A: So the one I wanted some folks' eyes on was an update to the approve plugin. We had this person come by SIG Testing a while back and sort of walk us through a demo of adding granular approvals; let's see, to empower more people, I think, to approve files. It's targeted at getting rid of the root OWNERS, or root approvers, issue for kubernetes/kubernetes: the fact that there are very few people in root approvers, and we kind of don't want to add more people to root approvers, but for those people who are over-privileged, we do want them to be able to say "I am approving just this specific part of it," and we can also allow people to approve just specific paths.
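To illustrate the idea of path-scoped approvals (this is not the approve plugin's actual implementation, just a sketch of the concept), the check boils down to whether every changed file is covered by an approval from someone allowed to approve that file's path:

```go
// Hypothetical illustration of path-scoped approval coverage.
package approvesketch

import "strings"

// approversByPath maps a directory prefix to the people allowed to approve files under it.
type approversByPath map[string][]string

// covered reports whether every changed file has at least one approval from
// someone allowed to approve that file's path.
func covered(changedFiles []string, owners approversByPath, approvals map[string]bool) bool {
	for _, f := range changedFiles {
		ok := false
		for prefix, people := range owners {
			if !strings.HasPrefix(f, prefix) {
				continue
			}
			for _, p := range people {
				if approvals[p] {
					ok = true
				}
			}
		}
		if !ok {
			return false
		}
	}
	return true
}
```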
A
Cases
here,
but
the
idea
is
this
was
discussed
in
brexit
testing
a
while
ago
and
then
I
think,
was
iterated
on
and
initially
it
had
been
done
as
a
separate,
approved
plug-in.
The
functionality
has
all
since
been
gated
behind
a
flag.
A
Freeze
and
stuff,
like
I
think,
if
we
wanted
this
deployed
kind
of
now,
ish
or
this
week-ish
would
be
the
time
I
would
want
to
see
it
so
that
we
would
have
enough
time
to
revert
it
as
we
start
to
see
sort
of
an
oncoming
flood
of
code,
reviews
and
tr
pushes
and
people
trying
to
get
their
stuff
landed
before
code.
Freeze,
like
I
approve
of
the
idea
in
principle,
but
I
need
people
who
have
a
little
better
granular.
C: I have some thoughts on this. First of all, it's the first I'm seeing of this, which is a little unfortunate, because I kind of had some opinions about the approve plugin earlier, namely that we should completely rewrite it rather than continuing to add more functionality to it. It is very confusing, huge amounts of the code are vestigial and, you know, could be refactored into things that make far more sense, and I'm a little alarmed that this PR is 6,000 lines of changes.
A: Yeah, I mean, I don't know. I think I want some pushback, because I have this impulse of: this has been sitting out since March, and I think Alvaro took a look at it back in March, but I don't know how much we've actually engaged with this since then. And yes, because of the large implications, I'm wary of having this land so close to code freeze. So I think, maybe, Alvaro or Cole, if you guys could sort of... what I'm trying to figure out is:
Is there value in doing that? Because, if I were this contributor, I would find it frustrating that, after nine-plus months (or however many months), the feedback from the people they've been trying to get attention from is "actually, we've got to redo the whole thing entirely."
C: I wonder if that's... you know, I'm not sure how we should handle that, I guess. Alvaro has looked at this a little more than I have, so maybe he knows, like, how much the other two commits are; I'm just looking at the two commits that are supposed to be the, you know, the changes, not the copy, and it still looks very massive. Do you know how much of that is actually changes, and how much of it is stuff that should have been put in that first commit?
A: Yeah, I think I'd have to go back and dig through the meeting recording to know for sure. I do recall this person showing up and actually walking us through a demo of the proposed functionality. I think maybe the part of the process here that was lost is how we typically go through a design doc or a KEP or something to really walk through the implementation and the implications of this, and I feel like, if I hadn't shown up at that meeting, this PR looks an awful lot like dropping a multi-thousand-line PR as the first step of engagement, which typically doesn't go very well. So it could be that what's missing here is a document that sort of lays out the design decisions and choices that were made, and it could be that breaking up the commits would make them more digestible.
A: You know, making the regex stuff better in a different plugin is a thing.
A: The only other thing I can think of about big changes prior to code freeze, since Arnaud is here, is a quick check-in on the scalability jobs. We've migrated the 5,000-node performance test over to k8s-infra; we haven't migrated over the 5,000-node correctness job yet. I was wondering if there was anything specifically that was holding us up there.
H: Not really. I think I want to leave that decision to SIG Scalability, because I talked to one of the leads, and what he told me is they need some kind of bandwidth to do the migration and babysit the job in case there's some failure. So I did everything to prepare the migration; now it's just an approval from them.
A: Okay, that's everything that I could think of off the top of my head. Thank you, Claudiu and Antonio, for bringing some other things to the table. With that, I'm going to hand over to Chao to talk about GitHub Dependabot.
I: Yeah, thank you, Aaron. So, as I asked about a couple of days ago, we found out that we can turn on GitHub Dependabot alerts without modifying the code base, so I turned it on, and I was not prepared for us to have 17 alerts. I'm looking at the page right now; I can copy-paste the link into the meeting notes.
So, do you guys mind if I share? Or, I think, Aaron, you should have permission to open the link.

A: Yeah, let me allow you to share. All right.
I: So here are the critical-severity and high-severity ones. I guess we can ignore the moderate-severity ones until we have resolved the top ones. The question was: should we worry about these?
Some of them, I guess, we can just easily fix, like using a patched version.
A: I don't know. My stance is: if there's any possible chance this could be used by Prow, I wouldn't want attackers escalating privileges on our CI system.
F: I think we just need some general dependency updates. The Python stuff should probably be pretty unblocked; the Go stuff might still be pending on finishing the Bazel transition.
A: Let me come at this with a question. I want us to facilitate collaborative discussion on researching these. We have the option through the UI to dismiss them, but I'd like to have sort of documented decision-making to assist us in that: either we've decided, yes, we want to upgrade the dependency, or we've confirmed that, you know, go mod graph or go mod why or whatever explains that this doesn't affect us in a critical area.
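As one way of producing that documentation (a hypothetical helper, not existing tooling), a small Go program could run go mod why for each flagged module and capture the output to paste into an issue or doc:

```go
// Hypothetical helper: record why (or whether) flagged modules are needed.
package main

import (
	"fmt"
	"os"
	"os/exec"
)

func main() {
	// Module paths would come from the Dependabot alerts page,
	// passed here as command-line arguments.
	for _, mod := range os.Args[1:] {
		out, err := exec.Command("go", "mod", "why", "-m", mod).CombinedOutput()
		if err != nil {
			fmt.Printf("go mod why -m %s failed: %v\n", mod, err)
		}
		fmt.Printf("=== %s ===\n%s\n", mod, out)
	}
}
```

If go mod why reports that the main module does not need a flagged module, that is exactly the kind of evidence that would justify dismissing the alert.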
I think there's a part of me that still wants that discussion to be at least a little bit embargoable: something like using a Google Doc and restricting view access of the doc to leads for now, and we can add in collaborators who are interested in helping out.
F: I think most repos just use this as a hint that it's time to update your dependencies, and nothing heavier than that.
A: Okay. I'm also super cool with opening up issues saying we should bump these dependencies, not necessarily explaining why per se.
That's probably important. I feel like just taking a pass and seeing which of these are trivial to bump, without too much headache, and then seeing how that reduces the size of this list, would be a good first step; then we can sort of discuss those that remain and are trickier.
Do we want to move forward on those, dismiss them, whatever. And, Chao, I think you've got stuff ongoing to make bumping these a little bit easier via make instead of via Bazel, so the future where we bump at least our Go dependencies is coming, and we don't necessarily have to solve for that right now.
F: On those ones, I agree. I think the biggest problem with those is that I'm pretty sure, right now, the way those work in CI is they're pre-installed into the CI image, and then we just run tests against that non-hermetically. So I don't know if we can have confidence in the bump PR without someone confirming that they checked it locally.
A: Okay, that moves us on.

C: We're talking about migrating the k8s Prow instance to the wg-k8s-infra instance, is that correct?
Yeah, I would imagine that the next steps would be to migrate, you know, sets of jobs or individual jobs at a time. Is that kind of the approach?
A: So the thinking here is: we've already migrated a bunch of community jobs over to a community-owned build cluster. Let's just pretend for a second I could wave a magic wand and prow.k8s.io suddenly doesn't run in the k8s-prow project and instead runs in a multi-tenant Prow cluster.
So, if I look at it just from a jobs perspective, you've got to decide whether we should continue to kick off non-Kubernetes projects from prow.k8s.io; you know, there's a Google Cloud Platform multi-cluster project, and I think maybe some of the Bazel things still run off of prow.k8s.io, so just finish migrating that stuff over to Google's open-source Prow instance.
The last time I thought about it deeply, I had concerns that we needed to make sure of the following: we have a Prow service cluster running over in k8s-infra, it's called k8s-infra-prow.k8s.io, and what would we have to do to flip domain names, or whatever, such that that became prow.k8s.io? So, infrastructure-wise, I feel like we have to decide whether we would want that Prow instance still capable of scheduling to the google.com-owned build cluster, or whether we would say nope,
A
It
can't
schedule
to
that
cluster.
Everything
has
to
run
in
the
community,
and
that
leads
us
to
the
decision
tree
on
which
jobs
should
be
migrated
over
or
should
not
be
then
there's
the
google
cloud
storage
resources
like
kate's,
test
grid
and
kubernetes
jenkins
that
currently
live
in
google.com.
There's that, and then the thing I don't even have listed here, I'm just realizing out loud, is all of the workflow around continuously bumping stuff. I don't know; it all relies on the use of the k8s-prow project to host all the images, which is also google.com.
A
It's
unclear
to
me
whether
we
need
to
add
privileges
for
the
community
owned,
for
instance,
to
write
to
that
sorry.
At
this
point
I
feel
just
like
I'm
rambling,
but
I
agree
with
our
know
that,
like
a
migration
plan
needs
to
be
developed
and
I
keep
losing
context
and
not
having
enough
time
to
actually
like
get
the
context
page
back
in
and
focus.
A
To
it
in
sort
of
a
rigorous,
step-by-step
thing,
I
keep
like
trying
to
drop
the
things
that
I'm
aware
of
based
on
where
we
are
today,
but
I
would
love
some
help
in
actually
putting
a
plan
together.
B: There are a few things that jumped out when we talked about this before. The budget is going to be the big one, especially once we start scheduling and putting logs on community-owned stuff, because I'm pretty sure our runway will just explode; we won't be able to cover it for the year, right?
A: Basically, that's maybe almost an orthogonal thing to me at this point; I feel like we already lost our runway with the artifact hosting costs. But yeah, let's say, ballpark, it doubles our CI costs, and that's assuming moving over prow.k8s.io also means moving over all of the other jobs that it currently runs, right? For reference, I think it's ballpark like 300-and-something jobs running over in the community cluster, and 1,700-and-something still running in the google.com cluster. But traffic-wise,
a lot of it is kubernetes-sigs repos, like all the Cluster API repos, all the cloud provider repos; there's a lot of valid stuff that hasn't been migrated. So, in terms of going back to your suggestion, Cole, of thinking about just the jobs that we should migrate, and given that running jobs is the majority of the CI spend, not necessarily the control plane or the service cluster, then, yeah.
Part of that's just from a billing perspective: it's really difficult to segregate out billing on a per-job basis, so we wouldn't be able to wave a magic wand and say, like, "SIG Cluster Lifecycle, you get a budget of this many thousands of dollars, and it's up to you to make sure that you run, you know..."
A: Anyway, I feel like we're getting off on a tangent here. The motivation was to see if there's a way we can migrate just the service cluster, because the idea here is: great, we're spending the community's money on jobs that run in the community-owned build cluster, and that's cool, but I feel like people are less incentivized to contribute to and help support Prow, because they can't touch the Prow control plane, and they can't see the same logs that all the on-call people can see.
A: The security stuff I'm thinking of is: I have a Prow service cluster outside of google.com that has, you know, the config to be able to schedule to a build cluster, which, you all can correct me if I'm wrong, but the last time I went through this, required setting up basically an admin-level user.
A
So
the
gke
cluster,
outside
of
google.com,
more
or
less,
has
full
control
of
the
gke
cluster
inside
of
google.com.
So
it
can
schedule
whatever
it
wanted
inside
that
gke
cluster,
you
know
kind
of
depends
on
like
what
workload
identity
bindings
would
be
present
inside
of
the
google.com
cluster.
How
far
something
would
be
able
to
get
from
there
as
far
as
what
it
could
or
could
not
access.
F
I
was
saying
I
thought
that
was
actually
like
a
technical
blocker,
the
like,
like
being
able
to
grant
permission.
C: Having both temporarily sounds good to me. I would say that we probably shouldn't switch the name, though, until we are able to deprecate, because I think that'll just cause more conflicts and issues, right? If we can keep the name consistent, at least until we're ready to make the other one the source of truth, that might be easier.
A: Well, I keep feeling like, when I walk away from these discussions, it's like, "well, I guess I should go write a doc," but I feel like Arnaud is holding me accountable for the fact that I keep not doing that. So I can take another crack at this, but I think I'm definitely gonna get Cole and Chao to take a look at it this time and see. If, I don't know, maybe we could...
C: Yeah, I think we have pretty clear paths forward on all of that. I agree that we need to flesh it out, but yeah, I think for all of that there are no technical barriers that I'm aware of.
A: Okay. All right, well, unless anybody's got anything for our last minute, I'm going to call it. Okay, thanks everybody, have a happy Tuesday, and I'll see you all in two weeks.