From YouTube: Kubernetes SIG K8s Infra - 20230607
A
So before we start, do we have anyone new on this call? One, two... some quick introductions, no obligation to do it.
C
I'm new.
A
Welcome, yeah, welcome to the gang. We really appreciate what Peter Martin is doing for us, so happy to see more people.
A
Okay, welcome also, folks. Please don't forget to put yourselves in the attendee list, because I see the number of hands is not matching the number of attendees listed. So thank you so much, that's all. Okay, do we have anyone else?
A
How am I gonna do this? Let me introduce the last one.
A
But also, yeah, the situation, also yeah. We brought it back down to kind of a normal thing, so hopefully there should be enough.
G
To see the change in the artifact registry after we fixed the S3 buckets.
A
Yeah, I think we made a mention of that in the last meeting, that we have a graph showing hugely decreasing costs just for...
A
No, I think, yeah. This view is better, yeah. So, yeah, the registry is still the highest cost, because we don't cover all the regions, and prow is slowly growing.
A
If I... let me put that, so yeah. Let me do it this way.
I
Yeah, I think if we look at the cost updates that I did at the bottom, it tells about the same story. I think, the way it runs now, we should be at around 2 million at the end of the year; there doesn't seem to be a massive change happening now. It's slowly creeping down on the one side, and AWS...
I
So AWS is already at 59; two or three weeks ago, or two or three meetings ago, it was around about 45, and now it's almost 60, so AWS is nicely growing to match. And also the credits are looking good, with 150,000 left of the current budget.
B
I have a quick question: do we have any estimation, or are we getting ready to go live with an AWS plan for reserved instances, because we have persistent expected usage on that matter?
B
There are also some solutions for this on the GCP side; for AWS there are Marketplace reservations and Marketplace utilization with automated third-party solutions. So do we want to check that out?
A
We don't currently... I don't want to say we have a full understanding; it's more that we don't yet capture the entire infrastructure we plan to have, and I always say, for example, 2024, because we still have things we need to balance between GCP and AWS. So until we get that right, there are no real plans to do capacity planning and cost optimization right now, and our budget is fine to accept any kind of experiment. Basically, we're getting things ready.
K
I think you can. So one thing you need to do is label the buckets that you want to project costs for, and then you should be able to understand which set of labels is costing what.
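A minimal sketch of that bucket-labeling idea, assuming the google-cloud-storage Python client; the project name, bucket name, and label keys here are hypothetical. GCS bucket labels flow into the billing export, so spend can then be grouped per label.

```python
from google.cloud import storage

# Hypothetical project and bucket names, for illustration only.
client = storage.Client(project="k8s-infra-example-project")
bucket = client.get_bucket("k8s-infra-example-artifacts")

# Merge in cost-attribution labels; existing labels are preserved.
labels = dict(bucket.labels or {})
labels.update({"owner": "sig-k8s-infra", "purpose": "artifacts"})
bucket.labels = labels
bucket.patch()  # persist the label change on the bucket

print(f"labels on {bucket.name}: {bucket.labels}")
```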
A
So two meetings ago we had the conversation with Ben, Justin, and James about basically how we can add more images to graduate from k8s.gcr.io to the new registry, and I think we were in agreement on getting that information first, before we move to any conversation. So it was considered as an action item. I'm not sure if we still want to do that.
J
And I think the AI was on me, right? And if so, I haven't yet had time. Okay, but I can still do it. If we don't want to do it, then that's fine, but I did the analysis before, so I assume that's why the AI, I think, fell on to me.
L
Thanks, thanks Justin. Let's do one more planning cycle just to get the next set of 10 images and see if we want to do it at all, right? Like, I'm not saying that we should do it; I just want to see what will show up in that bucket, and then we can decide whether to do it or not. And at this point I don't want to go increase the workload on all the Googlers that are behind the scenes, you know, handling k8s.gcr.io.
L
Unless we really think that, you know, we would make a dent in something or the other, right?
G
Without looking at individual images, just looking at the overall bandwidth graphs, we have dropped it off so much. You can also just see that the k8s-artifacts-prod project has dropped: instead of being like 60-plus percent, it's less than half of our spend, even though it still includes the new Artifact Registries. So I've just been keeping an eye on those overall metrics, and my current recommendation is that we let that traffic age out.
L
So, just so you all know, I cleaned out the last image that was there in the Docker Hub registry. I don't know if you knew about those things, so I cleaned them out today. There were three tags on the pause image, and if you go there, it's empty now. I got access to it from Tim Hockin and went and cleaned that out just an hour or so ago.
L
Anyway, sorry, let's go on.
A
Cool, okay, we can jump into the open discussions, right?
A
Yeah, okay, so let's first start with me: there's a draft of an article basically announcing that dl.k8s.io will use the CDN to serve binaries. So I put the link to the pull request that is open, and I invite people to do some kind of review. There's a preview if you want to read the current state of the article. We hope to get that merged by the end of the week, so everybody, feel free to drop comments.
A
Is it really a problem to have that cleaned up and back to normal? Because I think it's not really changing anything. We kept that for like three months, traffic didn't go up, and we still have like 60, 55 percent of traffic captured by the new registry, so I don't think it's worth maintaining that banner at all.
K
Right, sounds good. Also, has Google mapped out the redirects, or are we still doing a select set of images?
A
Yeah, I think that goes back to the question; that basically comes back to the request by teams about getting a new data point to see if it's worthwhile adding another image to the redirect. So I think we have a conversation to have about this as well.
A
Okay, okay, so I'm done about the blog post. Next is Patrick. Let me put it up; do you want to present something, or do you just want to talk about it?
F
You can just open those links that I've put here. Maybe let's start with the first one. So, yeah, I just want to briefly discuss the GitOps solution that we introduced to the EKS prow build cluster.
F
So here at the bottom there is a GitOps section, and it describes how to create customizations and how you can basically deploy things to our clusters without having direct access, basically via GitOps. I don't want to get into the details; you can read it on your own and you can reach out to me later. Yeah, it's the GitOps section; it describes how to install it on an empty cluster and also how to extend what we already have.
F
To be honest, there was no... I mean, it was chosen subjectively; it was chosen because it subjectively feels like a simpler solution than, for instance, Argo. But if we have any strong arguments, I guess we can simply migrate; it shouldn't really be that hard.
F
It's not that much automation. So, yeah, we have some dashboards on Grafana so that you can track whether the thing that you added has been successfully deployed. What's missing right now, and what we still need to improve, is to present somehow what type of error happens if something goes wrong; but this could probably be achieved through Loki connected to Grafana, with some logs from the specific components that Flux has, or maybe notifications. This is something that we can work on.
F
To be honest, I don't really know if we already have some notifications or if we leverage them. If it's easy to get, let's say, a Slack token, I don't know, we can consider that; it's not a must-have. We can also configure Loki, as I said before. So this is it; if you want to discuss any alternatives, I'm open.
F
If you have any comments or questions, we can discuss them quickly right now, or you can reach out to me later. So, yeah, that would be it from my side.
K
Yeah, I have a question for you. Can you just go to the markdown document in the second tab?
K
Can you split this document so that there's a section, the bootstrap section, right? That should be something that you run one time on a cluster when you create it, and you don't really need to do it again. And then there should be a section for people who are writing new stuff for the cluster, about what they need to look at.
K
So I don't need to worry about deploying Flux, but I do need...
F
Yeah, I mean, that's a good point. We still haven't added the monitoring section to the documentation. So, yeah, we will prepare some developer-dedicated, let's say, README, so that someone who's not interested in the administrative part of EKS could leverage that. That's a good idea.
L
Yeah, so basically persona-based, right? Like, have some personas, and most of the people won't need to know how to install things.
L
I mean, this is good. Let's go with what you have and see how it works over a period of time, right? Only when we start getting into maintenance mode will we know the actual problems: hey, somebody's not there, so we don't know what is happening, you know, things like that. Right? We should be able to, for people who are watching...
L
It should be able to tell what it's doing reasonably well, and when it goes bad we should all get to know, one way or another, right? So some alarms, monitoring, that kind of stuff would be good.
D
Thanks, James, Patrick, this is good work. It models a lot of the common patterns I've seen with Flux used elsewhere, and also here in...
L
Right, one simple pattern for the errors is probably going to be to just email people when the GitOps operation itself fails; just send out an email or reopen a GitHub issue like we do in publishing-bot, one of those two. You know, simple, it works.
M
Yeah, do you want me to share the screen, or do you want to go through the board?
M
No? Okay, yeah, sure. So, first thing: in the past week I have worked on adding a new dashboard that's going to allow monitoring the resource consumption of prow jobs. This is one of the key blockers that we have for migrating jobs from the default, Google-internal Kubernetes cluster to the EKS prow build cluster, because in the EKS prow build cluster you need to specify resource requests and limits for each prow job, and in the default cluster most of the jobs running there don't have that. And the idea...
M
This is some work that we did internally at Kubermatic a few years ago, actually, when we migrated to prow: to have a dashboard, like this one, that's going to show the memory usage and CPU usage for pods that are basically running prow jobs, and it shows the status of that and how it works. Basically, there are multiple dashboards; you can see information at the organization level or repo level, or something like that.
M
But this builds dashboard is the most important one, and as you can see on the screen, it presents, for a job that you select, the memory usage and the CPU usage. I think memory usage is the most straightforward thing: it shows how much memory is needed. CPU is a little bit more complicated because of Go and GOMAXPROCS, so Go is trying to use more CPU than it can get.
M
Then some throttling comes into effect, and it's hard to estimate exactly how much CPU you're going to need. That's going to be based more on trial and error: you're going to try to reduce CPU, you're going to look at the graph to see whether throttling is happening too often, and you are eventually going to see if the job duration increased, if the flakiness increased, and stuff like that.
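The dashboard itself isn't captured in the transcript; as a rough sketch of the kind of query that could back it, assuming cAdvisor container metrics are scraped by an in-cluster Prometheus and that a prow job's pods can be matched by a name pattern (both the endpoint and the pattern below are assumptions), peak working-set memory per pod could be pulled like this:

```python
import requests

# Hypothetical Prometheus endpoint and pod-name pattern for one prow job.
PROM = "http://prometheus.monitoring.svc:9090/api/v1/query"
JOB_POD_RE = "pull-kubernetes-unit.*"  # placeholder job/pod pattern

# Peak working-set memory per pod over the last 24 hours.
query = (
    "max_over_time(container_memory_working_set_bytes"
    '{pod=~"' + JOB_POD_RE + '", container!=""}[1d])'
)

resp = requests.get(PROM, params={"query": query}, timeout=30)
resp.raise_for_status()
for sample in resp.json()["data"]["result"]:
    pod = sample["metric"].get("pod", "?")
    peak_mib = float(sample["value"][1]) / 2**20
    print(f"{pod}: {peak_mib:.0f} MiB peak")
```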
G
Looking at memory usage, like that specific example: usually we're constrained on scheduling CPU, so reducing the memory allocation for a job doesn't help us schedule more workloads, and it does increase the odds that a pod gets killed or something, and things will change their behavior based on how much memory is available to them, performance-wise. So I would keep that in mind in the future when looking at this dashboard. I think for most of them we want to dial in a reasonable CPU limit, and the amount of memory we allocate should probably be based on the shape of the underlying machine, the ratio of CPU to memory, because CPU is almost always our most constrained resource. It's possible we'll find that that changes, but I've looked at a lot of CI job workloads and I don't think we're going to find that, and it's probably fine and expected that we're over-allocating memory to pods.
M
Yeah, I agree with Ben. That's why I said I think resource optimization for existing jobs shouldn't be too big a priority. The main driver for this is to help facilitate migration for jobs that don't have any resource requests and limits, so that folks have at least some basis to figure out what's going on and how they should adjust their jobs.
M
Yeah, so a jobs migration update. As some of you might know, we had an outage for the Google-internal prow build clusters, so jobs were not getting started, and because of the U.S. holidays that weekend, when this happened, a lot of folks were trying to find a solution. One of the solutions was to migrate jobs to the EKS prow build cluster, because we have more coverage.
M
In terms of support, like folks who can ship a fix if something is wrong. And we actually had some major repos migrate as much as possible, and this includes test-infra; a lot of the test-infra jobs are migrated. The ones that are not migrated are those that require access to the Google infra, like to push images; those are still in place. There are other projects, like Kueue, I think it's called, and the Cluster API provider for AWS, and probably some other projects that migrated. I think everything has been working for quite a while; so far, no major issues. Like, there were no major issues; some smaller issues about resource limits and requests, but that was fixed; that was mainly for CAPA.
L
I had a question, not necessarily for Marco, but thanks to Marco for providing that context on, you know, the problem that happened and how we were able to mitigate some of it. But the question is: how many clusters do we have, where are they running, and who has access? I went looking for it and I couldn't find it in a single location.
A
The first one is the default, owned by Google; the last one is owned by Google; those in the middle are community-owned.
L
Right, we need to put this information somewhere, for sure, right? Yeah, yeah. The other one that we were talking about that day was, you know, for prow job definitions, the cluster is not mentioned when the cluster is the default. Then we were talking about, hey...
L
Can we put, you know, cluster: default there to indicate it, so that we can have a count of how many jobs are running where, and how many are actually running in the default cluster, using grep, kind of thing, right? Or write some scripts to go around figuring out how many jobs are there and where they are running.
L
You know, how many of them can we move from here to there, that sort of thing we can start doing, right? Like, we can go tell SIGs, saying: hey, you have this many jobs running in this cluster, can you move these to that cluster, that kind of information. Go ahead, Ben.
G
I don't think we actually want to centralize tracking that; I think that's going to be a big headache. I think we want to put out a good PSA outlining for people why it's valuable to them and to us to migrate, and what sort of jobs make sense, and just let them migrate what they can, and then we can evaluate what's left. And I believe you already found that we can...
G
We have the config loading; prow has a package that's aware of loading the config properly, that does all the defaulting and everything, and we don't want to add the field to every... yeah.
G
But we can also get the data without, like, enforcing that people add this as boilerplate to every job. It's noise to actually have it on disk, and if it doesn't have the cluster field, then we know what it is, and, you know, it's more reliable to actually load the config with the package.
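A rough sketch of the kind of counting script mentioned above, assuming a checkout with the usual test-infra job config layout (config/jobs/**.yaml containing presubmits, postsubmits, and periodics); jobs without an explicit cluster field are counted as "default". As noted, prow's own config loader remains the authoritative way to do this, since it applies all the defaulting.

```python
import pathlib
from collections import Counter

import yaml

# Rough approximation: walk the job configs and count jobs per cluster.
# The config/jobs path and section names follow the usual test-infra layout.
counts = Counter()
for path in pathlib.Path("config/jobs").rglob("*.yaml"):
    doc = yaml.safe_load(path.read_text())
    if not isinstance(doc, dict):
        continue
    for section in ("presubmits", "postsubmits"):
        # presubmits/postsubmits map a repo to a list of job definitions.
        for jobs in (doc.get(section) or {}).values():
            for job in jobs:
                counts[job.get("cluster", "default")] += 1
    # periodics are a flat list of job definitions.
    for job in doc.get("periodics") or []:
        counts[job.get("cluster", "default")] += 1

for cluster, n in counts.most_common():
    print(f"{cluster}: {n}")
```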
A
That's fair, yeah. I think right now we need to basically push the SIGs to start thinking about the migration; I think that's the one thing, and to provide documentation about it. The one thing I think would be interesting is to see a document that guides the migration, where we literally say to people: you should specify the cluster field, because it's explicit, we know where it's running. From that migration guide we then push the SIGs to do the migration, because we'll be talking about improving prow.
L
And the thing that I've also noticed is: unless there is an emergency, people don't move, right? So, yeah, we can't get folks to do anything unless there is a real urgency. Exhibit A: that is what I was doing last weekend, which was to send emails to, you know... actually, not emails, opening up GitHub issues for all the SIGs, saying: hey, here are your 10 jobs that have been failing for the last year or so, right?
L
So I was sending those kinds of GitHub issues, and, you know, we had two CI jobs which had been failing on our side for three months too, and nobody looks at them, because there is no urgency, right? So we've got to create the urgency somehow, and I don't know how to do that, and I'm looking for ideas.
G
I mean, this is the kind of task that shards out really well. We've done some similar ones recently, where I was like, hey, we should be using E2 nodes for builds, and contributors just sharded out that work and PR'd all the repos and updated them. The config has so many jobs that trying to centralize keeping track of all of them... I mean, SIG Testing hasn't been able to keep on top of that, and we're not even responsible for all the rest of the infrastructure.
G
The rest of us can review and approve those PRs, and you don't even have to have access to any of this or be an expert in any of it. For most jobs it will work fine; you're literally just going to add one field and say cluster: eks-prow-build, done. Just add that field to the job, send that PR, and then just be around to check.
G
I think if we can even get a fraction of the project to pick this up, it will go pretty well and we'll move a lot of jobs, and then we can start looking at what's left. And we don't actually have to get people to personally do this for their own jobs; we'll ask them to, but it's okay if someone else picks it up.
M
One thing to mention about it: we had a discussion in some channels, and the idea that we had is to come up with a sort of data spreadsheet or whatever that's going to include a list of jobs that make sense to migrate. This is something that I also had a chance to discuss with hippie, and dims also came up with some spreadsheet.
M
The idea is that, based on stuff like labels, the environment variables, the job specs that you're seeing, and stuff like that, we can relatively reliably find out which jobs are a good candidate to be moved. For example, if you see a job that's using GCP, I don't know, using the preset for GCP credentials, that's a sign that you don't want to migrate it.
M
There are also jobs that might make sense to migrate, like ones using the e2e tests for AWS, for example, but this is something that has to be coordinated with us, the maintainers of the EKS prow build cluster, because we need to keep an eye on stuff like Boskos pools, to make sure that we have enough accounts and a bunch of other stuff.
M
So the idea was to try to come up with a spreadsheet, try to do some data magic on it, and figure out what jobs might make sense to migrate, export that, and maybe create some phased approach: like, let's first migrate jobs that are easy, that don't use anything and just run some script, make lint, make build, whatsoever; then go to e2e jobs that are possible to migrate, and stuff like that.
L
And the other thing is, if we have to do it, the best time to do it is now, I guess. Otherwise, once we go deeper into the 1.28 release cycle, we don't want to break things, because then people are not getting good CI signal, sort of thing. So that's the other thing, right? If we have to start doing it, we have to start doing it now, early.
G
You just change the cluster field; I don't think we need to build a spreadsheet or a dashboard or anything. I think once we've gotten people to work on this for a bit, we will have scaled down the problem a lot, in terms of how many jobs are remaining on the default cluster, and we probably won't even need to build anything. We can just look at the prow.k8s.io dashboard, filtering for the default cluster, and look at what's still there. And I don't even think we need a really detailed guide.
G
I mean, we basically outlined it in the chairs and tech leads meeting. We could probably just go transcribe what Marco and I said in that last meeting and send it out. Okay.
A
Yeah, sorry, folks, we have 10 minutes left and I think we still have like four things to talk about. So I want to basically ask you to have this conversation afterwards, or in private DMs, whatever you want; we need to move forward in this meeting. So, last question, quickly: can we commit to work on that until the end of the year, to move the 2,000 prow jobs?
A
Good, so let's move forward; we can follow up on this on Slack. Next is, and I might mispronounce that, from John Oyan.
C
Yeah, both are actually correct, thanks. And I will return the favor and probably mispronounce your name, Arno.
C
Perfect, thanks. So from my side, just a super quick one: one PSA and one open-ended question. The PSA is that I made an effort to kind of bring parity between the Grafana instances for the build clusters. I took a look at the 10 dashboards in the GCP cluster and I managed to successfully port three out of the ten. The other seven had either no data or were just not relevant for what we have deployed in the EKS cluster. So I think that is reasonably good.
C
The open-ended question: the security team, or the security response team, from AWS has reached out saying that potentially we are sharing too much, read-only, on the Grafana for EKS; mainly, I think, they mentioned the node exporter. They haven't found anything sensitive, but they also said that potentially we may want to keep that, even for read-only access, restricted to authorized and authenticated users.
C
At the moment the Grafana doesn't really have the ability to authorize or authenticate yourself; it's just anonymous users. But there is a thought that we can potentially split the dashboards into two Grafana organizations. The public one will have only the super-safe ones, whatever we feel very confident can be exposed to everybody, and there will then be a second organization which will be hidden, only for those who are authenticated.
C
Writing is disabled, but the security response team mentioned that even the read-only access to the node exporter is something that triggered an alert, and they mentioned that, while they don't have any concrete problems with it and they couldn't find any concrete piece of information that is too sensitive, they would prefer us to have that hidden as well. And having some dashboards hidden is, as far as I understand Grafana, best done by having multiple organizations, Grafana organizations.
A
Okay, yeah, I think we should just follow the security response team's recommendation right now. Let's just remove any dashboard that's supposed to expose sensitive information, because I think right now we only expose CPU, memory resources, and private IP addresses. So if they feel like it's too sensitive, just remove the dashboard.
E
So Arno reached out to me wanting to get something going with Okta; it turns out we do have a contact with Okta. The idea here was to start granting access to AWS and GCP using GitHub SSO via Okta. We have a free trial account, and we are working with Okta directly to make that not a free trial account, but the free trial account is going to have a lot more of their knobs enabled than a normal account. So that's moving.
K
Just before we move on from that: that's good to hear that we've found an IdP provider. So does that mean we're parking Azure AD and we're going to use this instead?
A
Yeah, we can do that; we can basically sync the GitHub username to Okta, and from that we make sure there's authentication, and allow, from Okta, access to any platform we want; authorization is then handled by each cloud platform we want to integrate with. That's the plan. We don't need a specific policy regarding MFA, like what Azure is proposing; we just need authentication. So I think we can start with Okta until we get Azure and improve things later. But Okta is a good start for me, yeah.
M
For OBS, the core implementation is done; we discussed it in a meeting earlier, and everything is merged. The first release for which we are going to have packages built on OBS is going to be 1.28.0-alpha.2, and that's probably going to happen tomorrow, so if you're interested, keep an eye on that. We are still going to see how comms and all that is going to look, but for now just know that this is going to happen.
L
Yeah, Marco, quick thing: is there a GitHub issue for planning of the sizing for the OBS network traffic? Let's open that up, please, because, you know, we mentioned it a couple of times, but we haven't made any progress on that.
M
Yeah, we will have to figure that out, yeah.