From YouTube: Kubernetes SIG K8s Infra - 20230524
A: That got fixed; now we see costs going down. We're now, yeah, back to what we were expecting, except that I don't see anything very specific or... okay, I see a cost increase in compute.
A: Related to the build cluster? I'm not sure; there are many possibilities, maybe jobs moving from Google to our own.
D: Sorry, finish; I was going to say something, but it can wait.
A: Okay, for the month by month, we're not far along in the month yet, but I think we're still in the same channel as last month. So I think that's pretty good.
A: The one thing I would like to do in the future is to delete the accounts we don't use at all, because I see a lot of accounts that are not used. But I'm also curious about that.
A: There's, like, the deletion; the deletion takes 90 days, so we can trigger it. And there's no way to disable a service like there is on GCP, so I'm like, yeah. We now have an entire organization, and we now have multiple Boskos accounts. We should probably start to delete them. I also see it as a benefit, because we can now force people to migrate to the community infrastructure.
E: Like, for example, the CI stuff that wasn't using Boskos yet: some of it was referencing, like, a secret with an account credential that points at some account, but we don't want to keep doing that anyhow.
C: We want to cycle through, take five of them or so, and just try to do a round of deletion, just to run through the process, so we can have some confidence in deleting larger swaths and know how long it'll take to go through the deletion process.
A: So mostly, when you delete an account, it's not fully deleted until 90 days. So basically the account will be suspended before it disappears, before the account is gone. Recovering is mostly opening a case with support, so I'm not really worried about the deletion process.
A: I'm not aware of a way to do that; you'd have to ping the AWS folks, maybe they're more expert on that, to figure out what is inside this account. But in any case, I'm more interested in job migration to the community-owned accounts. That matters more to me than figuring out what's inside the account, because if you migrated any job right now, that would give us some insight about exactly what's needed to make those jobs fully run.
A: Okay. Oh, Chief is not here; I had a question for him.
D: We have a dedicated monitoring stack for that cluster. It is running in the cluster, it is based on Prometheus, and it is not exposed in any way at all. To access it, you need to have access to the cluster, to do something like a kubectl port-forward, get access to the Grafana service, and then access it from the browser. But that's not ideal, and it is definitely not accessible by the community. So if, say, some jobs start running into problems, they don't have any tool to investigate what's going on; to, for example, figure out if there are any problems with resource limits and requests and stuff like that in general, especially now that, for example, we foresee resource usage doubling. Now, the idea, generally speaking, is that we expose that stack, and I just brought this up for discussion with the folks that are working on it.
D: That's Patrick and Jan, who is not on the call today. But generally speaking, Jan created a PR that basically creates a LoadBalancer Service for that Grafana setup, and that is going to expose it. But we need to figure out a domain, like how we are going to access it, and I thought about using something like monitoring.prow.eks.k8s.io.
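A minimal sketch of such a LoadBalancer Service, with placeholder names; the namespace, selector, and ports here are assumptions, not the contents of the actual PR:

  apiVersion: v1
  kind: Service
  metadata:
    name: grafana-external      # placeholder name
    namespace: monitoring       # assumed namespace
  spec:
    type: LoadBalancer          # on EKS this provisions an AWS load balancer
    selector:
      app.kubernetes.io/name: grafana
    ports:
    - name: http
      port: 80
      targetPort: 3000          # Grafana's default container port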
D: But then we need to figure out how to manage that domain, and how to manage certificates. Managing the domain is not that problematic: we can add an entry to the DNS configuration and so on, for example, and then octoDNS can reconcile that DNS configuration. But the problematic part is how to handle certificates, because I don't think we have something similar to what we have with GKE, where you, for example, create a ManagedCertificate resource so that it gets a certificate for you.
D: That doesn't exist for AWS, right? And what I found is this ACM, I think it is called like that, and it allows you to request a certificate for your domain and basically attach it to an AWS load balancer, whatever type of load balancer you are using, like Elastic Load Balancer or whatsoever. But the issue is: how should we handle that? Because there is a DNS verification process that requires some additional records. In that case, we should probably manage ACM with Terraform, I believe, and there's that too.
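For context, ACM's DNS verification works by asking you to publish a CNAME record that it generates per certificate request; the record has roughly this shape (the underscore-hash labels and target here are invented for illustration, ACM generates the real ones):

  _3c9f0a1b2d4e.monitoring.prow.eks:
    type: CNAME
    value: _6a7b8c9d0e1f.acm-validations.aws.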
D: Go ahead.

E: We have automation based on octoDNS that lets you control DNS with YAML. I'm one of the owners. You just, like, send a PR, and when it's approved, the robot applies it automatically. And I think we want to stick to that, because we also have some tooling to help make sure that those changes work safely: like, we roll out to a canary zone first and have some tests and things to make sure that it all actually rolls out properly.
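As a sketch, adding a name through that workflow would be a small PR against the zone YAML, something like the following (octoDNS record format; the record name and the ELB hostname are placeholders):

  monitoring.prow.eks:
    ttl: 300
    type: CNAME
    value: a1b2c3d4e5f6.elb.us-east-2.amazonaws.com.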
E: It's pretty mature, and if you need to do things like certificate provisioning, I mean, usually we would do something like an ACME challenge and delegate just the, like, ACME subdomain to some external service. That's what we're doing for, like, GCLBs going forward; before that, we would, like, create an NLB, get an IP, and then add the IP to the DNS. For this specific entry, we shouldn't need to grant external systems direct access.
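The delegation described here would be one more record in the same zone config: only the challenge label is forwarded, so the external service can answer DNS-01 challenges without any access to the zone itself (the responder hostname below is a placeholder):

  _acme-challenge.monitoring.prow.eks:
    type: CNAME
    value: monitoring-prow-eks.challenge-responder.example.net.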
D: Yeah, that part for sure. I mean, I understand the DNS part; it's pretty easy, at least from what I found out: just create a PR, and that is reconciled. That part is easy. It is just figuring out certificates, because I don't believe we can use the same approach that we use on GKE with managed certificates. So we have to use the AWS approach, and that is going to be a little bit difficult.
E: We previously used cert-manager within clusters, for what it's worth, and, yeah, I really hope we can standardize on ACME challenges. It's just, like, a good standard, and it allows us to delegate things like this without handing over DNS API keys or anything like that, yeah.
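For reference, the in-cluster pattern mentioned here is roughly a cert-manager Certificate pointing at an ACME issuer; a sketch, with the issuer name, namespace, and DNS name all assumed:

  apiVersion: cert-manager.io/v1
  kind: Certificate
  metadata:
    name: monitoring-tls
    namespace: monitoring
  spec:
    secretName: monitoring-tls    # the signed cert and key land in this Secret
    dnsNames:
    - monitoring.prow.eks.k8s.io
    issuerRef:
      name: letsencrypt-prod      # hypothetical ACME ClusterIssuer
      kind: ClusterIssuer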
C: To be clear, Ben's suggestion is that we can point the DNS at a domain that could be managed anywhere, as far as the DNS-01 challenges and whatnot, to provide those certificates.
E: Not necessarily.

G: But sorry, just a quick question about that, if I may: do we want to perform a DNS-based verification, or a different kind of verification for the certs?
G: The only thing in the setup for the certs, essentially, would be just the values in the record that you need to set up in your DNS or IPAM console, and that would be the most separated approach. But maybe, and I'm sorry for joining late, maybe I misunderstood the requirement; that's why I asked in the chat.
G: It is only needed for the initial verification, but I think for ongoing renewals you need to keep that record for the cert. But again, what is the price of keeping that record up there? For any records that you set, or, more precisely, any certificates that you request: if you want multiple certificates, the same record will be used for verification.
E: We have an existing one. Like I said, this looks like the ACME challenges that we're using; it's a different name than the ACME standard, but it's the same idea: use a CNAME subdomain for verification purposes, and that way you don't have to worry about it, you're just forwarding it to their DNS challenge responder. We can just leave those entries up; it will mean that renewals happen, and it's totally decoupled from the load balancer lifecycle. This is best practice going forward.
D: Let me give an example to check, maybe. Let's say you have monitoring.prow.k8s.io, for example; the name is not important. This is a CNAME already pointing to the ELB, because the ELB is basically a hostname, it is not an IP address, so we can't use an A record. And then you need to add the same name as a CNAME to support cert verification, right? The problem is that that's going to conflict, no?
E: No, cert verification is on a subdomain, with a special name for the verification; there's an underscore-prefixed token, and, oh...
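To make the no-conflict point concrete: the host record and the verification record live at different names, so both can coexist (values are placeholders, reusing the earlier sketches):

  monitoring.prow.eks:                     # the host itself
    type: CNAME
    value: a1b2c3d4e5f6.elb.us-east-2.amazonaws.com.
  _acme-challenge.monitoring.prow.eks:     # separate verification name
    type: CNAME
    value: monitoring-prow-eks.challenge-responder.example.net.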
E: This is already the workflow we're using for all of our other certs going forward, so it should be fine. Okay.
E: Yeah, it looks to me like you actually don't even, I mean, you don't need to... you can create the challenge independently of that, because you're just forwarding it to the resolver domain, but...
G: You can use the output of the verification, I think, for the record, as a reference for it. But since these two systems are separate, that would be a semi-manual step. Nevertheless, when importing a certificate you can state these parameters, or, you know, the data sources and intervals, right; you can state whether you want it to be available, so it won't, you know, fail if it's handling or getting a cert that has not been verified yet. But again, that's... that's...
A: That's the only item in the open discussion. Is there anything else people want to talk about?
A: Okay, I have, like, one, I mean, two questions, one for Marco and EP: are we good with the scalability test account? Sorry, is it now onboarded, or is there something else we need to do?

D: Sorry, can you repeat that?

A: I was asking about the scalability account; I want to make sure it is onboarded in Boskos.
D: Yeah, I'm not sure anyone reached out to me about onboarding those accounts; I guess I don't even have access to that account. So we can probably follow up tomorrow, if you're around, and figure out adding them to the EKS prow cluster, so that we create credentials and create a resource for that in Boskos. That can be done; we can do that tomorrow, if you're around.
A: Yeah. Evie, can you take care of that? Because I'm going to be on PTO again soon, so my bandwidth is kind of limited. I'm asking this because James pinged me last week about this one, to make sure we make progress on this; go ahead, move it forward. Now, to my second question.
D: Yeah, okay, let's figure that out; we will see how we can handle it, but yeah, I think it's not a problem.
E: Some follow-up with the sig-scalability folks, to get that working properly against clusters that aren't kube-up. We have a job, a demo job, for this, but we want to make sure that it's using Boskos, that it actually starts running the scale tests instead of standard e2es, and I think the log dump script needs porting.
E: That last part is probably the trickiest. The rest is, you know: look at existing jobs and change the config options to select the right tests and to use Boskos instead of a fixed account.
E: Right, the last part... I mean, sig-scalability has a separate script that they own right now. We won't go into too much detail here, but basically, when you're doing scale testing, instead of fetching logs from the nodes to the prow job and uploading them, you really want to run something that uploads from the CI node, or from the cluster nodes, straight to the output, because you avoid a lot of extra data transfer, instead of having to pull it all back into the job pod, with a massive, like, you know, thousand nodes' worth of system logs, and then uploading those to the job storage.
E: We have it set up so that when we run the kube-up jobs, we run a script that, like, SSHes to the nodes, grabs the system logs, and pushes them up from the remote machines. That's not supported on other types of clusters currently; we should port that to work with kOps clusters.
E: That's going to be a little bit of a blocker for really big clusters. It is a pretty major performance hack for CI, okay, and we need the logs to, like, you know, debug the cluster.
E: I think that's, I mean, anybody could hack on that, but I think we were hoping for sig-scalability to work on that, because it is only used by scalability jobs. For other jobs it isn't that big of a deal to just pull the logs down to where we're running the tests and then stick them in the artifacts folder, like any other output from the tests, and that's more portable. But for scale tests, it becomes performance-sensitive enough that we're going to want to employ a workaround similar to that one. And right now those scripts are in test-infra and only work on kube-up GCE clusters, because they depend on SSHing to the nodes and having access to upload with gsutil.
E: They're currently going to the existing job results bucket. I guess it would probably be okay if they went somewhere else; I think this is a follow-up concern once we get the tests running, but it will need to happen.
E: So: make sure the job is using the Boskos pool, just change that config; do any follow-up that needs to happen with the kubetest implementation; and then make sure that the test flags get switched to run the scale tests instead of normal cluster e2es. Those are the next steps.
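A rough sketch of the kind of job config change being described; the job name, image, and every flag below are placeholders, since the exact kubetest wiring is the open follow-up:

  periodics:
  - name: ci-kubernetes-e2e-kops-aws-scale   # hypothetical job name
    interval: 24h
    decorate: true
    spec:
      containers:
      - image: gcr.io/k8s-staging-test-infra/kubekins-e2e:latest
        command: [runner.sh]
        args:
        - --provider=aws
        - --boskos-resource-type=aws-account  # placeholder: pooled account, not a fixed one
        - --focus=scalability                 # placeholder: scale tests, not standard e2e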
A
Do
we
that
we
async
trying
to
figure
out
a
job
consider
to
basically
ensure
we
have
we
use
the
proper
AWS
account
subscribers?
We.
E: So we're dumping, for example, the kubelet logs from each of the nodes. We want to dump them, because people need to be able to come back after the CI has failed and understand, like, what's wrong with the cluster; if we try to filter it, we won't know in the future. We can look at, you know, storing logs within the cloud that they were uploaded to. I don't think that's a huge priority right now; it's mostly just that it's slow to do extra copies, as opposed to, like, the cost, I mean. Similarly, we don't cap the log sizes right now; we're just serving them directly out of a bucket.
E: The real cost issue is just having to pull them all into the job and then upload them from there, versus just directly uploading them from where they are. When you have, like, 5,000 nodes, that starts to become a pretty significant bottleneck to completing the run.
G: What retention policy or object lifecycle do we have? If we can't diminish or reduce the data transfer costs, can we maybe reduce the contents of the bucket?
E: We have one; I think it's 90 days currently, and there's some follow-up to, like, tune that sort of thing, but I don't think we need to block on that for this. And same thing: if we had to run it one of the other ways, where we just pull logs down from the nodes and upload them, that's fine to start, while we're running smaller-scale ones. But if we want to seriously switch to using kOps instead of the GCE shell scripts, so that we can have a tool that works for both clouds, we'll need more than that.
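For reference, a 90-day retention policy like the one mentioned is the kind of thing expressed as a bucket lifecycle rule; a sketch in GCS lifecycle form, with illustrative values (gsutil accepts this as JSON):

  rule:
  - action:
      type: Delete
    condition:
      age: 90     # days since object creation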
E: We will need to have something like the scale-test log dumper that works ideally on kOps AWS and kOps GCE. And that will have to be a special thing: we can't just use the Kubernetes API or something; we need to actually get onto each of the nodes and grab all the system logs, even if the nodes are in a bad state. So you need to do something like... on GCP, we use gcloud compute ssh; we'll want to do something like that.
E: I don't think... I think we'll want to make it support both. I don't think we want to jump to diagnostics; usually we do that by, like, using Kubernetes API features, but we kind of fundamentally need to reach the nodes. We also need to know, like, what the node names are and that kind of thing. It'll probably need to be kOps- plus AWS- and GCP-specific, and if we ever have a third cloud, we can extend it.
E: Again, we only need this for scale tests, so we probably don't need to be running scale tests on, like, every cloud, and we don't have the resources to do that today. But we do want to get out of only running scale tests using the GCE bash scripts. We're starting just with kOps AWS, but something that I'm hoping to press for later is to switch the GCE ones: we'd still have to use kOps GCE, so we can very closely match and get away from those single-cloud shell scripts. But for scale testing, the scale-test log dumper is kind of unique.
G: Understood. Now I'm just thinking about the options, more in terms of sending logs. If you're running on AWS, you can use the AWS tools, like a CloudWatch agent, and could maybe add some files there during the run and only activate it for that process. But I'm not sure there's, like, a direct-to-S3 option, unless you're taking the bash approach. Maybe that's just why I wanted to talk about retention, yes.
E: So, very briefly: Kubernetes CI basically has a standard thing where there's a directory that is specially handled, and when you run some tests they can dump things into there, and they get persisted into a storage bucket that is surfaced through the CI UI.
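Concretely, for decorated Prow jobs that directory is exposed to the test container through the ARTIFACTS environment variable; a sketch, treating the path as the usual pod-utilities default rather than a guarantee:

  # Anything the test writes under $ARTIFACTS is uploaded to the
  # results bucket and surfaced in the CI UI.
  spec:
    containers:
    - name: test
      env:
      - name: ARTIFACTS
        value: /logs/artifacts   # default used by Prow's pod utilities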
E: Things like that. So we'll want to continue to store things there, to support all that tooling and, you know, the other users in the project that are expecting logs to be there; and we have some tooling built on top to do things like, you know, viewing the files and whatnot. It's not great, but it's what everyone is used to, so we'll probably just want to, you know, get them copied in there.
A
Sorry
I
can't
see
if
they
unlock
a
Android,
so.