From YouTube: Kubernetes SIG K8s Infra - 20230426
Description
A
Welcome everyone, we are in the SIG K8s Infra meeting. It is April 26th. Just a reminder that this meeting is under the code of conduct and will later be published publicly on YouTube.
A
Thank you. Before we start, do we have any new people in the meeting who are interested in introducing themselves?
C
Hey everyone. I was at KubeCon, went to the SIG meetings, and met Mario, who said to join the SIG. Nice to meet everyone.
A
Welcome.
A
Okay,
so
next
is
billing
dealing
with
product.
What
do
we
have
here?
Is
anyone
seeing
that
correctly
because
I'm,
like
you
know,
no,
is
it
too
big,
too
small?
It
is
okay,
okay,.
A
Yeah, for the GCP infrastructure: for so long we were always beyond 250K monthly, and for the first time, April will be the first month where I will not receive an alert that I'm over budget for the month. By May 1 we will be spending something around 150K monthly, and that's a good trend.
A
I think I said that in the last meeting, but I want to say it again: yes, it's time to add more infra, but we will not do that now. For the last three months we were spending too much. I didn't check Ben's comment on Slack that we're spending too much. Oh, Ben is here, so let me finish. Okay, but the one thing I noticed is, if I check the...
A
So it's not really a problem; it's more a consequence of the current architecture. We have, for example, one region serving five countries in Europe, because we don't deploy to every GCP region in Europe. We might fix that later, but it's a consequence. Starting next month we will start adding more regions in GCP, so we might see that go down.
G
A couple of comments. One, we are actually even slightly higher than we should be at the moment, because we still have the 1.23 CI that hasn't gotten cleaned up. So release is running an additional CI version beyond what we would normally have. Sometimes we have an extra release, but right now we have, like, an extra extra release, so GCP costs are actually pretty elevated at the moment, and this should trend even better in the future.
G
However, on the opposite end, where we're talking about spinning up more infra: we want to do that on Amazon. We actually still have a very large amount of infrastructure that hasn't moved out of Google. Some of that will be taken over running inside the google.com organization instead of kubernetes.io, so billed directly to Google; billing-wise that isn't a problem, my understanding is.
G
We can continue to run that, but it continues to be a liability for the project that we have these ancient projects from the beginning of Kubernetes, predating even the Cloud Native Computing Foundation, that no one knows much about and that we don't have good visibility into. We want to replace those with things running under the community. A lot of that spend is dl.k8s.io, so the Fastly contract should cover that, but we will need to move, like, you know...
G
We still have a very large amount of CI, things like that, and some of it we can move to Amazon, some to Google. So we have a good buffer and we should be in good shape for the kubernetes.io GCP bill this year, but we should not jump on running more on GCP this year, so we have some buffer going into next year.
A
Yeah, let's move down to our services on AWS. We are fine. This is the credit we received in January. For people new to the call: we got 3 million from a donation from AWS that we're supposed to use this year; we got the first part, 250K, in January. Currently we are not using that much except for some CI, because we have to migrate some AWS accounts from the CNCF organization to the community organization.
A
Yes, because I did an experiment using some EC2 instances. I deleted all of them; that's why you see a drop.
A
So if you see this, it's normal: there's a drop in cost because I did some cleanup on some of these accounts. Also, it's possible that tests running might have caused some change. We are not really tracking a lot of who exactly is using which infrastructure right now.
A
To
see
yeah
yeah
regarding
your
costs,
do
we
do?
We
want
to
consider
currently
existence
cause
really
high,
because
I
don't
think
it's
I
to
have
to
spend
4k
on
ekx
per
month,
I
mean
with
a
3
million
budget
with
a
2
million
budget
per
year.
Do
we
want
to
consider
eks
spending
4K
per
month
is,
is
a
high
cost.
H
Well, to be honest, still, to me: I know that we have plenty of budget, but 20K per month for a 20-node cluster is still a bit high. So we want to take a look at that, but it is not a priority right now.
H
So I don't know for sure. We will probably have enough credits, but it might be a problem if we add more workloads, especially since we don't know how much workload we're going to add. You can definitely say that this is something that has a lot of room for savings, because we didn't really pay much attention to cost optimization. We wanted to get it running as soon as possible, but we will see about that.
D
What surprises me is the price for EKS, because I don't know what is actually included in those 4K, but I thought it costs around, like, 10 cents per hour of running the control plane, so I don't know why the spend is so high on the EKS service.
A
Well,
because
we
have,
we
have
CI
tests
trigger
creating
eks
clusters,
cluster
API
provider
for
AWS.
If
I
remember
correctly,
they
basically
have
like
this
bootstrapping
eks
cluster.
That's
why
we
have
and
that's
I
think
that's
the
control
plane.
H
And also, to add something, there is stuff like, for example, the nodes that we are using in the EKS clusters: they are like one dollar per hour or a bit more, so that is 20-something dollars per hour for 20 nodes. And also we have EBS stuff, like provisioned IOPS and things like that, that we don't really use; that is also a significant cost that we can probably just drop. And another issue that we have is networking: NAT gateways and such tend to cost a lot.
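The numbers in this exchange can be sanity-checked with some back-of-envelope math. This sketch assumes the figures quoted above (roughly $1.00/hour per node, 20 nodes, and the standard $0.10/hour EKS control-plane fee); the exact instance pricing is an assumption, not billing data.

```python
# Rough monthly cost sketch for the EKS setup discussed above.
# Assumed inputs: 20 nodes at ~$1.00/hour, one EKS control plane
# at $0.10/hour, ~730 hours in a month.

HOURS_PER_MONTH = 730

def monthly_eks_estimate(nodes: int, node_hourly: float,
                         control_plane_hourly: float = 0.10) -> dict:
    """Return a per-component monthly cost breakdown in dollars."""
    node_cost = nodes * node_hourly * HOURS_PER_MONTH
    cp_cost = control_plane_hourly * HOURS_PER_MONTH
    return {
        "nodes": round(node_cost, 2),
        "control_plane": round(cp_cost, 2),
        "total": round(node_cost + cp_cost, 2),
    }

estimate = monthly_eks_estimate(nodes=20, node_hourly=1.00)
print(estimate)  # nodes dominate at ~$14,600/month; the control plane is only ~$73
```

This matches the point made in the discussion: the EKS control-plane fee is a rounding error, and the spend comes from the worker nodes (plus EBS and NAT gateways, which this sketch does not model).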
A
I think that's it. Now, I think the AWS API can help us here: we can set up some policies preventing bootstrapping new resources, to make sure we don't overspend. But before we do all of that, I think we can start moving some CI and see what exactly is happening, because currently we just have, yeah, the CAPA tests plus some other tests, like minikube and a couple of others, that we currently run.
A
Now,
if
you
move
some
CI
from
gcp
to
address,
we
might
see
cost
increase
and
that's
going
to
help
us
establish
the
Baseline
of
cost
per
month,
and
from
that
we
can
basically
say
what's
the
three
short
monthly
for
Enterprise
I
mean
before
we
jump
on
cost
optimization
I.
We
want
to
establish
a
trend
in
terms
of
cost,
because,
right
now
it's
like
it's
really
random.
G
It does seem like something to keep an eye on, though, given that we should be at a fairly small scale, and that is generally how we maintain, you know, cost controls: we just talk about the bill here and keep an eye on things, same as on GCP. Do we have autoscaling on this cluster? If we don't, then I feel like that would be the thing to invest in.
G
It seems like we don't have a very good answer for that on AWS currently.
A
Yeah
we
we
talk
about
this
many
times
like
about
reporting
on
the
AWS,
so
I
think
JFR
has
a
plan
for
that.
I'll
I
will
even
drive
this
and
try
to
come
up
with
some
reporting
around
how
we
can
break
down
costs
between
database
account
everywhere.
Services
and
possible
group
of
resources
like
how
we
can
identify
I
would
say
resource
used
by
specific
seagulls
approaching.
G
In theory, we have some support from people at Amazon. Can we talk to them about this? Because, I mean, surely there has to be something. There has to be some way to label, like, sub-accounts or something and display that. I hope we don't actually have to build something for that, that I can just pull up the cloud billing report in the cloud console and see this.
A
Yeah, that's what I'm saying: I know Jeefy is driving this and is supposed to talk to some AWS folks about it. Now, do we want to make that a priority for 1.28, or do we want to just make sure we keep poking the AWS folks about this?
G
I mean, I feel like if there's anything we need to change about how we spin up infrastructure, for example if things need to be in different sub-accounts or something like that, we need to get a handle on that before we spin up too much stuff, because eventually this is going to be a problem. We need to be able to actually identify where the spend is going.
A
It just takes some time to implement the current policy around this, but there's an issue, some more about tagging, and we started doing some of that. We have introduced a few tags to make that happen; we just now need to tag everything during account creation, and also retroactively, which is not done, granted.
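As an illustration of the kind of breakdown this tagging enables, here is a minimal sketch. The `sig` tag key, the record layout, and all the resource names and dollar amounts are hypothetical, not the project's actual tagging scheme; the point is only that untagged spend stays visible as its own bucket.

```python
# Minimal sketch of cost attribution via resource tags, assuming a
# hypothetical "sig" tag applied at account/resource creation time.
from collections import defaultdict

def spend_by_tag(records: list, tag_key: str = "sig") -> dict:
    """Sum cost records per value of tag_key; untagged spend is
    bucketed under "untagged" so gaps in tagging stay visible."""
    totals = defaultdict(float)
    for rec in records:
        owner = rec.get("tags", {}).get(tag_key, "untagged")
        totals[owner] += rec["cost"]
    return dict(totals)

# Hypothetical billing records, purely illustrative:
records = [
    {"resource": "eks-cluster", "cost": 4000.0, "tags": {"sig": "k8s-infra"}},
    {"resource": "nat-gateway", "cost": 900.0, "tags": {"sig": "k8s-infra"}},
    {"resource": "ci-capa", "cost": 1200.0, "tags": {"sig": "cluster-lifecycle"}},
    {"resource": "old-bucket", "cost": 300.0, "tags": {}},
]
print(spend_by_tag(records))
# {'k8s-infra': 4900.0, 'cluster-lifecycle': 1200.0, 'untagged': 300.0}
```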
A
I think there's a way to basically have those as a template. Also, regarding that: I never pushed for it, I never really wanted to do advanced usage of Cost Explorer. It is possible to have a breakdown per account, then...
A
Well, I think it's about which information you want to get from all this data, really.
G
Well, we want to identify which parts of the project are using a reasonable amount of spend or not. So, like, I'm not surprised that running the registry is a little bit expensive, and it looks in line with reasonable costs. But if k8s-infra-prow suddenly cost like 10x, then I would want to know, and it doesn't really matter which service flags that this is the part of the project we need to go look at. Okay.
G
Did we have a discussion about the outage? It feels like that might be a good topic, if people have time. Oh.
K
Right, so Google Cloud has a status page, and the information in there is pretty useless with regards to global systems, right. One thing that Google, for example, has as the answer is that they mitigated problems with the Google Cloud load balancers. That is the thing that's given us the biggest grief right now for this particular project.
K
There are some other fun problems going on with the global components. So, did you hear anything about the Google Cloud load balancers, what mitigation is being done to them so that they can actually work in Europe?
G
I think there are, like, distinct things. You couldn't configure it at all, briefly; that is mitigated. The thing that we're running into is that we have it configured, but somewhere in there, traffic that would previously have been routed to the Paris backend still isn't being routed properly. I'm actually not completely clear.
G
What I'm seeing is that we're getting a 503 from the backend, which makes me think it's still trying to route there. But I think it may be that the control plane is back but the data plane isn't back for Paris, because I think the public GCLB is like a broad edge that's anycast. So I think what we're encountering as a project is: the edge that is affected is still broken for us in that area.
G
So even if you're not on GCP, because the traffic is still coming into Paris and the Paris data center is having problems, that's not getting updated to route somewhere else. For the traffic in other regions it is updated now, and from what I can tell, if you hit any other GCP region, it won't get routed to the Paris region. Even though we're not running there anymore, it'll get routed into one of the other regions now.
G
So the problem is the traffic that's actually trying to enter through that data center, as best I can tell. And so, to Eddie's question, that is my question: is there anything else we can do to make sure that that traffic would have failed over? Because, generally speaking, this is what happens: we have a GCLB configured, and with the GCLB we have a scope for each region that we're operating in, and network endpoint groups that then point to a Cloud Run service in the same region.
G
So when you hit the GCLB, you're going to enter it through, like, the nearest Google location, and then it should get routed by our routing map, and our routing map says, you know: these are the regions we're in, route to a service in that region. So even if a region isn't automatically removed, we can do what we did today and remove the region.
G
We had problems like that earlier, because the control plane had an outage when this region had its outage. That's fixed; it didn't persist very long. What we've seen is that there are still some transient errors, and it seems that the data plane is still not updating in that area. I don't know if there's anything more we can do about that as an end user.
K
Yeah, but Google didn't quite make it clear that there were control plane failures. When you read this outage at first glance, you'd think, oh, Google just disconnected the region and they fixed it outright. But that's not exactly what happened: global components that shouldn't be failing were failing for very long periods of time.
J
I think also, like, Ben, we should emphasize: you are doing a good job of speaking as a member of the community that happens to have, like, you know, access to more stuff, but you are not revealing inside information, and you're not speaking as a Googler here. These are the things that we have observed in registry.k8s.io, right.
J
This is not like the GCLB update, and I don't think we have any particular insight into what's going on exactly there beyond the status page.
G
I can point out that there are multiple different status entries and they affect different things. The one for the control plane for GCLB specifically is that last link, and that one is considered resolved; you can see that.
G
I'm sorry, I've lost track. I'll dig for that.
K
All right, because that's the first thing people ask us, right: all right, Google had an outage, we lost the region. It's happened before; it happened last year, London was gone for a while. But you'd expect the control plane to be buggy for a few hours and then for things to be resolved.
G
I mean, I also can't do a whole lot about that. I'm just a software engineer there; I'm largely participating here as a community member, which, you know, Google is staffing, but like the rest of the group I'm still largely on the user side when we're talking here. I'm going to be talking to some people that have more knowledge of GCP networking about what our options are, and taking advantage of that.
G
And, I mean, yeah, in the case where the data center is partially underwater, yeah, that's some problems.
G
I really hope that we can do more to route around it, and as an end user I'm not thrilled about that, but I'm also still trying to confirm if there's anything that we aren't doing in the configuration. Like, I don't see that we have any health checks, for example; it looks like that might be a thing.
K
Serverless NEGs never had health checks, from day one, so you were never able to remove a dodgy Cloud Run service from serving traffic. Well...
G
...do that. Then we can look at what we can do, besides the bug, as a user: we could probably operate something ourselves that auto-updated it, like removed it from the group or something, and we've manually removed the one. So my bigger concern is that we've removed the group but I still see errors, and I've attempted to investigate that.
G
Don't
have
a
better
answer
at
the
moment,
just
my
my
theory
from
what
I'm,
observing
and
I'm
going
to
be
asking
someone
who
actually
has
more
expertise
on
one
of
our
when
a
gk's
networking
teams
to
lend
some
time
to
this.
We're
originally
going
to
be
talking
about
the
cloud
provider,
removal
and
how
we
can
help
get
that
testing
running,
but
we've
talked
to
them.
G
So we're going to bump that, and we're going to discuss mainly, like, what else we can do here. You all know them, most of you probably: Antonio works on GKE networking now, and from the GKE side GCLB is one of their things, so I'm going to be picking Antonio's brain a bit about GCLB and what else we can do there.
G
This
thing
has
no
SLA,
and
this
is
a
good
reminder
to
to
mirror
if,
if
uptime
in
a
particular
region
is
really
important
to
you,
otherwise
you
can
pull
from
another
region
and
we're
fine.
As
far
as
I
could
tell
between,
like
whatever
provider
you
use.
Proximity
to
the
to
the
gcp
region
is
is
a
problem
because
we
have
a
global
load.
Balancer
I
think
it's
not
super
reasonable
for
us
to
try
to
like
run
multiple
Cloud
looking
answers
or
something
and
something
like
this
can
happen.
G
We can have an outage for a region; we'll do our best to react to it. But if you need higher uptime guarantees than that: as a global service, it is still largely available.
G
And that's working as intended. The other thing I'd like to do is roll out to more regions, but I think we're blocked on concerns with the image promoter. So I think there's still more to do to make the image promoter scale better, so we can spin up more regions.
G
I've been talking to John Duffel about that. I think we're maybe still not at a point where we'd be super comfortable with it, but we should revisit, and if not, there are a few more things we can do there. There's been some improvement to cosign recently. John had to quit working on this, but we have a couple more pointers from John.
G
Yeah, I'll definitely ask along those lines: you know, we'd like to know more about what happened and what we could do there, and I'll be talking to you about, you know, what else we can do to configure the load balancer better, or to route around this, that sort of thing. One other thing we have to consider is that we probably don't want, like, super excessive failover. As an extreme example, if we stopped serving traffic out of US East on Amazon, our bill would go through the roof.
G
If
there's
an
S3
outage
in
U.S
east,
we
probably
just
need
to
tank
it
and
not
try
to
Route
around
it
that
like
we,
we
really
want
to
be
serving
that
sort
of
thing
in
region
and
that's
more
or
less
What's.
Happening
Here
is
that
we
have
20
GC
20
gcp
regions
currently,
and
one
of
them
is
impacted
and
it
in
the
rest
of
the
traffic
is
fine.
K
I
mean
it's
not
quite
that
all
right
you've,
just
not
input
yeah.
Well,
the
odd
fat
registry
isn't
available
by
the
way
which
is
One
impact.
The
other
problem
is
you're
going
to
pull
from
S3
because
nobody
can
access
Google
cloud
services
in
France
at.
G
All
right,
but
what
I'm
saying
is
only
only
the
traffic
that
went
to
that
out
of
the
20
places
that
we
route
through
only
the
traffic
that's
hitting
now
when
it's
affected.
If
you
are
running
in
another
Europe
region
on
any
of
any
provider,
then
and
you're
getting
routed
to
another
region,
then
you're
not
impacting
yeah.
B
Say that again. So, yeah, basically, I mean, we still state everywhere that if you are really, really reliant on everything, then you should have the registry duplicated somewhere in your own infrastructure. And, I mean, the downtime that we had, or the downtime that we have, is basically fairly reduced now, to one region. Yes, it's still bad, but I don't think that we should stress too much that there was an outage.
G
Yeah, we just had confirmation from Marco earlier that if you're running in another region, if you move to another region in the EU on AWS, then you'll see what you would see on GCP: you'll get routed into a different region, and you'll be fine. It's the subset of traffic that's close enough that, out of the 20 GCP regions, it gets routed to the Paris region, whether or not it's on GCP.
G
And if we expand to more regions, that would be another way to improve our, you know, failure domains. But I think it's mostly working relatively as intended right now. Even if we do expand failover, we have to be careful with that, because the whole intention here is that you are served out of the closest region, and that's good for users globally.
G
That way we don't have a sudden spike or cost increase for one of the regions. And, you know, if one out of the 20 regions going down is a problem for you, then you're a really good candidate to host a mirror.
I
Okay, are there any other questions regarding the outage?
I
Okay, we can wrap it up here then. Thanks, everyone. I'm sure we'll be following up in Slack and all this stuff. See you around.