From YouTube: k8s-infra-team's Bi-Weekly Meeting for 20200304
B: So we did a deployment of the auditor to the k8s-artifacts-prod project, and it is working, but there is a slight bug where it doesn't show the actual details of unapproved images, so I need to fix that. We did a test with a dummy image to trigger that path on purpose, and it did detect that the image was not verified, but it just didn't tell you the actual details of the Pub/Sub message, which is pretty bad, so I need to fix that today.
B: The auditor also ran fine for two hours, but the job was killed because Prow said it took too long; a few seconds after the second hour it was killed. It will run again today, because we have a daily job that runs every day as a sanity check, so that will kick in. It kicked in yesterday at 1:00-something p.m., so I presume it will kick off again at the same time today, in about five hours. In that window of time I would like to fix that bug in the auditor and also probably deploy it, so that we'll see some better logs. For what it's worth, during the two-hour run yesterday it just worked as intended: it verified all of the thousands of images that did get promoted, and said each one of these is verified because I see it in the promoter manifest repo.
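The unverified-image path described above can be sketched as follows. This is a minimal illustration only: the function and message-field names are assumptions, and the real cip-auditor is written in Go, not Python. The point is the reported bug: on the failure path, the full Pub/Sub message should be logged, not just a generic line.

```python
import json

# Hypothetical sketch of the auditor's verification path. Field names
# ("digest") and function names are assumptions, not the real auditor code.

def handle_pubsub_message(message: dict, promoted_digests: set) -> str:
    """Check a registry change notification against the promoter manifests."""
    if message.get("digest") in promoted_digests:
        return f"VERIFIED: {message['digest']} found in promoter manifest repo"
    # The bug discussed above: on the unverified path, include the *full*
    # raw Pub/Sub message in the log so the offending change can be traced.
    return "UNVERIFIED image; raw Pub/Sub message: " + json.dumps(
        message, sort_keys=True
    )
```

The dummy-image test in the meeting exercised the second branch; with this change the log would carry every field of the triggering message.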
B: There's definitely a phase 2, because I mentioned last meeting, or the one before that, that I have an intern coming in, so I'll have to tackle some of these long-standing issues that we have. There's plenty of work; I just can't think of it right now, off the top of my head.
A: Right, okay. So from last week we have only one topic, which is the migration to Terraform, and unfortunately there is not a lot of progress. I started working on the port, which was just standing, and I feel like we need a little bit more discussion of the architectural design and all those kinds of changes, so I will spend some time during the next week tackling this issue. Let's jump into the open discussions. I actually have two topics, because I started the process of moving the Prow project.
A: Great. Because, as I said at our last meeting, I will try to focus more on the documentation during the next period of time, so I will spend some time digging into this and improving our documentation a little bit. Be prepared for me to poke you guys a little bit more about this. Another topic from me is the audit process, because I have a question: I want to improve the documentation for that particular topic. To start: what is currently the process of the auditing?
C: It's mostly me, but not exclusively, and I would love to see us audit more. But we need to make the audit script not log the sorts of things that are typically changing, right? IP addresses, items in a bucket, those sorts of things are probably not good things to audit; the existence of a cluster is a good thing to audit. Then you get into things like the number of nodes in that cluster.
C: Well, the cluster is autoscaled, so that can change. If the auditor ran, say, hourly, would it be triggering a pull request every hour while the autoscaling is happening? That seems like a bad idea. So the audit script is pretty dumb right now. What we would really want to do is look at the results we're getting from each datum and filter for the stable information, and I've not tackled that.
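The filtering idea just described can be sketched in a few lines. This is illustrative only: the field names are assumptions, and the real audit tooling in the k8s.io repo is a set of shell scripts around gcloud, not Python. The shape of the idea is to diff only the stable view of each snapshot, so routine churn (autoscaling, IPs, bucket contents) never opens a pull request.

```python
# Assumed volatile field names; the real gcloud output keys may differ.
VOLATILE_KEYS = {"currentNodeCount", "ipAddress", "bucketContents",
                 "creationTimestamp"}

def stable_view(datum: dict) -> dict:
    """Drop fields that change on their own (autoscaling, IPs, bucket items)."""
    return {k: v for k, v in datum.items() if k not in VOLATILE_KEYS}

def needs_pr(previous: dict, current: dict) -> bool:
    """Propose a change only when the stable parts of a snapshot differ."""
    return stable_view(previous) != stable_view(current)
```

Under this scheme an hourly run would stay quiet through a scale-up but still flag, say, a cluster appearing or disappearing.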
C: Okay, right, so we need to do a first-time audit. We know there's a bunch of permissions and things that are wrong today, just because we've been evolving and iterating on the process. So, for example, we have an open bug: anybody who creates a project, whoever runs ensure-staging-storage, gets added as an owner of the project that they just created, right?
C: So we need to actually fix the tooling to say: after you've created the project and assigned correct ownership to the admins group, you actually remove yourself from it, right? We know that bug exists. So when we look at the audit dump (it's not a log, it's a dump), we will see, you know, Linus has some projects, and I have some projects, and you have some projects, and Christophe has some projects, and so we need.
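The fix described here could look something like the sketch below. It is hedged in every particular: the admins group name is assumed, the IAM policy shape merely mimics GCP's policy JSON, and the real tooling is a Bash script driving gcloud. The logic is simply: once the admins group holds the owner role, strip the human creator from that binding.

```python
# Assumed group name for illustration; the real group may be named differently.
ADMINS_GROUP = "group:k8s-infra-prod-admins@kubernetes.io"

def remove_creator(policy: dict, creator: str) -> dict:
    """Drop the project creator from roles/owner once the admins group is in.

    `policy` follows the general shape of a GCP IAM policy:
    {"bindings": [{"role": ..., "members": [...]}, ...]}.
    """
    for binding in policy["bindings"]:
        if binding["role"] == "roles/owner" and ADMINS_GROUP in binding["members"]:
            binding["members"] = [m for m in binding["members"] if m != creator]
    return policy
```

Run as the last step of project provisioning, this would prevent the "everyone owns the projects they happened to create" situation the audit dump would otherwise surface.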
C: So that's the only reason that the development cluster exists, so we'd love to get that out. And then there's the old Google-only cluster, which still has two things running in it: node-perf and velodrome. I have asked those two groups individually to figure out how to extricate themselves from the cluster, and that's it; after that we can actually sort of declare victory that our utility cluster is up and running.
C: Now we can think about: should we have a cron job that runs the audit? I actually am very interested in that, but to my earlier point, I also want to shift attention towards the harder problems. Once we get the GCR thing done, I really want to see the continuous integration, and the Boskos, and all the stuff that EngProd and SIG Testing have done, moved into community hands. That's the next big goal, and maybe scalability after that.
D: Kind of unclear how we can do that without causing some downtime, but we have been moving to the model where we support different teams by having them add their own build cluster. And so, just as a sanity check with this group: does it make sense for us to look in the direction of provisioning extra clusters for Prow, versus having Prow attempt to use the aaa cluster? My thinking being that Prow jobs tend to be kind of noisy neighbors, and so it was unclear to me that it made sense to use the aaa cluster.
C: Oh, I would love to try to use the same cluster, especially for the non-sensitive stuff, because honestly we should be able to support that. And if we can't, then we should go back to the developer side of our jobs and say, what the heck, guys, let's make a better product. So I would love to try. That comes with the caveat that I actually have no idea how Prow really runs, so we will look to the folks who know that as the experts, sure.
D: Good to be here, yeah. I wish I had more of a concrete plan to tell you about right now. I believe we're in a place where Bash is still used to manage a lot of that stuff, and we're looking to see if we can transition to the use of Terraform, hopefully.
D: So I think when you're running a job and it needs to stand up an e2e cluster, the way it does it right now is: it gets a GCP project from Boskos and then schedules to that. So we would need a Boskos that's set up in the CNCF project, and we'd need a pool of projects that that Boskos instance could then hand out to jobs that want to run end-to-end tests, so they could stand up their clusters in a Boskos-provisioned or -managed project that lives in the CNCF space.
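The Boskos pattern described above amounts to a leasing pool. The toy model below is only a sketch of that idea: project names are made up, and the real Boskos is a Go service with an HTTP acquire/release API, not an in-process class.

```python
class ProjectPool:
    """Toy model of a Boskos-style pool of pre-created GCP projects."""

    def __init__(self, projects):
        self.free = list(projects)   # projects ready to hand out
        self.busy = set()            # projects currently leased to jobs

    def acquire(self):
        """Lease a free project to an e2e job; None if the pool is exhausted."""
        if not self.free:
            return None
        project = self.free.pop()
        self.busy.add(project)
        return project

    def release(self, project):
        """Job finished: return the project for cleanup and reuse."""
        self.busy.discard(project)
        self.free.append(project)
```

Standing this up in CNCF space is then mostly a matter of pre-creating the project pool and pointing the jobs' acquire step at the new instance.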
D: So a way we could maybe crawl before we walk there is to consider, yes, maybe moving over some of the trusted jobs, or maybe moving over jobs that don't require standing up an entire Kubernetes cluster to run e2e tests. We do have a number of jobs that just run as a pod, you know: we run a shell script, or build something, or things of that nature, and it would probably be easier to look at moving those.
A: This is also one of the PRs or issues; I don't remember, I feel like there are so many of those created. I will ask him and discuss it with him at the meeting next Monday, so by that time I will have a better understanding of the problem, or some better knowledge about it. I hope it will be fixed during that meeting.