From YouTube: 2023-03-02 Kubernetes SIG Scalability Meeting
Agenda and meeting notes - https://docs.google.com/document/d/1hEpf25qifVWztaeZPFmjNiJvPo-5JX1z0LSvvVY5G2g/edit?usp=sharing

A: This is the meeting of the second of March 2023, and, as Wojtek said, we have one topic to discuss, regarding cutting the costs of scalability tests.
C: Sure, yeah. My topic is quite fast, so I think let's start with mine, because I think it's pretty urgent, and we should definitely still have time for your questions.
C: Basically, as a project (I mean Kubernetes as a whole) we are over budget for our infrastructure costs. So we are looking for savings everywhere we can, and scalability tests are maybe not the most obvious, but the second most obvious place to look for them.
C: The first and biggest cost is the official release artifacts and things related to that, and that has been taken care of by others. But scalability tests are the second biggest cost, and we are looking for things we can do to cut it. One thing has already happened here: we already disabled the 100-node presubmits; they are now only optional.
C: The reasoning behind that was that they were generating roughly half of all the costs from scalability-related tests, and we really had trouble finding any case in the last year or so where they actually uncovered something and prevented a merge that would have been causing failures. So given that the ROI here is fairly low and we really need to cut the costs, we decided (and I already did that: I already tagged this PR, and I think it merged earlier today) that the 100-node presubmits are only optional now, so more or less they won't be running. We are looking for more; I did some other minor things too, but I think we also need to adjust the 5k-node tests a little, and adjusting here unfortunately means decreasing.
C: We are only running those once a day. To give you some numbers (I can't remember if I remember them exactly): all the scalability-related things are costing the project as a whole roughly six hundred thousand dollars per year-ish. The 5k-node tests are almost half of that, the 100-node presubmits are another almost half, and the rest are peanuts.
C: Basically, one half we pretty much cut today, the second half we need to reduce significantly, and the rest doesn't matter that much, really.
C: We believe that even if something bad merges, we will be able to relatively quickly determine what it was and hopefully revert it, or fix it, or whatever.
D: Okay, and what is the frequency for 5k? Are you reducing it to once per day, or is once per day what we already have today?
C: Once per day is already what we have, and we need to reduce it. There are some discussions happening there; I'm trying to push for...
A: So maybe, based on that, I'm wondering if we are planning any kind of investment towards kubemark? As you know, this was always kind of the approach to save money, and maybe kubemark would be something we could invest in, in order to have more frequent runs for 5k, for example.
C: It definitely can be useful, but yeah, I don't think I will have capacity for anything like that.
C: Some kubemarks are still running. I disabled the 5k-node kubemark because it was kind of duplicated with the real 5k-node tests. We still have the 500-node kubemark running pretty much all the time, so something is running, but it's definitely not uncovering all the issues that real clusters are uncovering.
B: Sorry to interrupt. I am a contributor at kOps, and today I was part of one of these discussions, and my understanding is that the budget that is actually over is the Google GCE part. Any chance you could use AWS for some of these tests? I know that we have quite a lot of credits there too.
D: Yeah, I think it's definitely a valid question, and recently I guess there is some budget; I'm not sure about all the numbers and details, but for the CI testing part I think there is an account we have. I guess we would need to set up these 5k-node tests on kOps. Today, though, the way we set up all the scripts and that infrastructure, I believe it's specific to GCP.
D: I believe we also tweak quite a few things when creating these clusters. Yeah, let me take this as an action item back to my team and see if we can begin with, for example, the 100-node or the 500-node tests, and maybe we can do something with a reduced frequency: alternating runs on GCP and AWS, or something like that, to take some pressure off these 5k GCP ones.
D: Yeah, so let me start a thread, maybe on the sig-scalability channel, and try to poke a few folks.
B: Arnaud was saying that it's kind of getting to zero for this period, at least, so they were trying to figure out how to reduce costs in k8s.gcr.io, which people are still using and which is consuming quite a lot of credits. So it's really...
C: To add to that: I think the current numbers are that we still have a lot of budget for this year, but with the speed we are spending it now, we will get over budget probably around the end of October or something like that. So we would have zero money for running anything in the last two months.
D: Yeah, so Nathan is here today from the EKS team, and I think he wants to discuss a specific issue: this bug fix with watches, I guess. Hey Nathan, do you want to...? Actually, let me first link the issue I'm mentioning.
E: Yeah, I think there's a fix that was merged yesterday. I just wanted to talk about whether we could backport it to older versions. When I was working on reproducing it, it looks like it starts in 1.22.
E: Yeah, I think the first case we saw was in November, and since then I think we've had close to 10 escalations on it, with most of those being sev-2s, or like getting paged, just because the cluster can get into an unrecoverable state. And even if the customer deletes the CRD, a lot of the time they don't have control to trigger a restart, I guess.
C: Yeah, so I think the additional context about the user escalations is useful here, so it would help. Currently it's causing some test flakes, so I think we need to address that first; but assuming that gets fixed at some point, it would be good, with the additional context, to write it down somewhere, in some issue or whatever. I think backporting it might make more sense than I thought, yeah.
E: Yeah, I think there's a Harbinger notice open on our side, so we can share something externally on those, yeah. And at least EKS still supports 1.22, so we'll just carry the patch internally for that.
A: Okay, so do we have anything else?
A: Go ahead.

B: Okay, I have a use case with agents running on each node (a DaemonSet) collecting network information and enriching it with Kubernetes information.
B: At the moment, all the tools that I could get started on exploring are doing the Kubernetes enrichment on each agent by listing all the pods and then continuing to watch. Generally, I think those tools were tested at smaller scale, but in our case I think the scale would be more towards thousands of nodes.
B: Starting those agents, like 2,000 agents, and asking the API server for the list of pods (which is not so small either) would not really work very well. Now, I saw that Calico, let's say, went with Typha as a proxy, and Cilium created their own smaller objects with only the information they need. But I wanted to know, as a general guideline: what do you think about this kind of use case? What's the recommended approach?
C: Yeah, so certainly watching all pods from every node is not a good idea; it's not going to scale. And independently from that, even if we ignore the initial list, it's a bad idea anyway.
C
So
so
that
that's
like
the
the
first
point
regarding
regarding
the
what
we
can
do,
I
think
I
I,
don't
know
the
use
case.
So
it's
hard
to
hard
to
say
but
like
even
what
kalika
is
doing,
is
also
highly
problematic.
So
the
creating
the
object
itself
like
the
psyllium
endpoint,
is.
C
C
Batch
multiple
single,
multiple,
smaller
objects
into
single
change.
Radio
also
include
also
at
the
same
time
reducing
further
they
they
they
their
size
and
I.
Guess
martial.
You
can
probably
talk
more
about
it,
but
and
at
the
same
time
like
introducing
something
so
that,
like
not
every
potentially
not
every
single
change
is
being
sent,
and
even
that
is
causing
problems,
so
that
is
that
helps
a
lot,
but
even
that
is
problematic
in
in
some
scenarios,
especially
in
scenarios
where
we
want
to
have
clusters
with
like
higher
posture.
C
So
there
are
some
more
things
that
still
involves
are
are
thinking
about
which
I'm
not
100
up
to
date
with,
but
I,
don't
think
we
have
like
a
good
solution
at
this
point
like
some
solutions,
obviously
scalier,
horizontal
scalier
controlled
in
horizontally
further
and
instead
of
learning
like
five
or
three
or
five
API
server
and
25
or
100
or
whatever,
and
at
some
point
that
will
work.
But
that's
not
super
satisfying
solution,
so
I
think
I'm.
Sorry,
I,
I
think
that.
C: I think we should basically look into the use case and see what you can do to avoid having all the information in every single agent, because in many cases, including many sub-cases of the Cilium one, that isn't strictly needed.
C
And
it's
just
like
it
often
boils
down
to
to
the
fact
that
it's
just
simpler
to
do.
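
[Editor's note: one common way to avoid the "every agent lists and watches every pod" pattern discussed above is to scope each agent's watch to its own node with a field selector, the way the kubelet does. A minimal client-go sketch, assuming a DaemonSet that injects the node name via the downward API; the NODE_NAME variable is illustrative, not something mentioned in the meeting:]

```go
package main

import (
	"context"
	"fmt"
	"os"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// Assumed to be injected via the downward API (fieldRef: spec.nodeName).
	nodeName := os.Getenv("NODE_NAME")

	// The field selector is evaluated server-side, so each agent only
	// receives events for pods scheduled on its own node instead of the
	// whole cluster.
	w, err := client.CoreV1().Pods(metav1.NamespaceAll).Watch(context.Background(),
		metav1.ListOptions{FieldSelector: "spec.nodeName=" + nodeName})
	if err != nil {
		panic(err)
	}
	defer w.Stop()

	for ev := range w.ResultChan() {
		fmt.Printf("%s %T\n", ev.Type, ev.Object)
	}
}
```

[A real agent would normally put this behind an informer with reconnect handling; the point of the sketch is only that a per-node field selector moves the filtering to the API server, so neither the initial list nor the watch stream grows with cluster size.]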
B: Well, the use case is more or less, let's say, a debugging tool that looks at what pods are doing: connections, maybe errors between various pods. In theory one could do this enrichment at a later time, so separate it completely: just gather the raw data from the pods using eBPF or something, and then enrich it later.
B: That's, I would say, the simplest approach that I could think of. It's just that generally it was preferred to do it in each agent, so that you can send the enriched data to the data store complete, and you don't have to join tables to get the right info.
B: Would it help if, let's say (I don't know), the controllers, the Deployments or ReplicaSets, were listed and watched instead? Or would even those, at this scale, be too much?
C: Probably. I'm trying to remember what we are storing in the Deployment status; I think we are also storing the number of ready pods, which is obviously visibly better than watching all the pods. I can still imagine use cases where it might be problematic, but it's definitely some mitigation, for sure.
C: Yeah, and it highly depends on the pod churn itself. If the cluster is fairly static, then it may work even with pods; but if you create a lot of pods at the same time, or schedule, or delete, or whatever, then the more you do of that, the more problematic it will be.
A: Also, maybe one more comment, on how you deploy this DaemonSet. As you said, there is the first step, which is listing, let's say, all the Deployments and then watching them. So you should also be careful with the rollout of your DaemonSet, because if you create those pods too fast, then they will all start hitting the API server at the same time as well.
A: Well, I would not fully agree, because what can happen is that you are doing two calls, right: a list and then a watch, and the watch might never happen if you are overwhelming the APF.
B: Okay. I saw that there are some KEPs about streaming, like, instead of list-then-watch, starting to stream some sort of events. Is this planned for 1.27, or...?
C: Yeah, the core server-side implementation of that has just merged today. Oh yes, this is definitely planned for 1.27. The client-side part is still not merged, but I hope it will be; it still has like two weeks or so to get merged, and it's not that far off, so I hope it will get merged.
A: So, actually, from the client-side point of view, do you have to do something in order to enable this feature? I'm guessing yes, right? Because it's a totally different API.
C: So it's not a totally different API, but you need to handle it slightly differently, right? Again, it's the watch; it's just a watch with a different parameter, but you need to handle it slightly differently. So yes, you need to do that. The client-side part isn't huge and isn't super complicated; I can probably link the work-in-progress PR here.
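
[Editor's note: the feature being discussed is the streaming "watch list" mechanism (KEP-3157), where the initial state is delivered over the watch itself instead of a separate paginated LIST. A sketch of what the request looks like at the client-go level, assuming the ListOptions shape that eventually shipped with the 1.27 alpha (WatchList feature gate); since the client-side PR was still open at the time of the meeting, treat this as the later, merged form rather than whatever was linked in the call:]

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/watch"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// "Just a watch with a different parameter": sendInitialEvents asks the
	// server to stream the current state as ADDED events first, followed by
	// a BOOKMARK that delimits the end of the initial state.
	sendInitialEvents := true
	w, err := client.CoreV1().Pods("default").Watch(context.Background(), metav1.ListOptions{
		SendInitialEvents:    &sendInitialEvents,
		AllowWatchBookmarks:  true, // required for this mode
		ResourceVersionMatch: metav1.ResourceVersionMatchNotOlderThan,
	})
	if err != nil {
		panic(err)
	}
	defer w.Stop()

	for ev := range w.ResultChan() {
		if ev.Type == watch.Bookmark {
			fmt.Println("initial state received; regular watch events follow")
			continue
		}
		fmt.Printf("%s %T\n", ev.Type, ev.Object)
	}
}
```

[This is presumably what C means by handling it "slightly differently": the client treats the first stretch of ADDED events as the list and uses the bookmark as the cut-over point, which the reflector/informer machinery is intended to handle transparently once the feature is enabled.]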
D: So while the streaming is happening for the objects, is that occupying an APF seat, or how...? I'm thinking, because today this happens with a list followed by a watch: initially, for most reflectors, it is a list followed by a watch, right? And for lists, if a lot of clients are making these list calls, we're able to use APF to kind of throttle some of them, especially on some clusters where there are a lot of listers. For some customers where we had to do this, we for example introduced some APF rules to throttle lists.
D: So if we are going to do this, do we have to do something equivalent for watches, or is that going to have any side effects?
C: So the way APF works here is that only the initial-events part is actually handled by APF, which boils down exactly to what we want: it's the part that plays the role of the list, actually. So it may require a little bit of tuning, also on the APF side itself, and I guess it may not happen for alpha; it's something we should probably play with a little before enabling it by default. But conceptually it fits the model; that was the consideration. We may need to do a little bit of work to make it work exactly as we want, and that may require tuning some APF rules as well, but it should be doable, at least without significant work.
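
[Editor's note: the "APF rules to throttle lists" D mentions are API Priority and Fairness objects: a FlowSchema that matches the offending traffic and points at a PriorityLevelConfiguration with few concurrency shares. A hedged sketch using the Go API types; the names, namespace, and service account here are illustrative, not taken from the meeting:]

```go
package main

import (
	"fmt"

	flowcontrol "k8s.io/api/flowcontrol/v1beta3"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// A FlowSchema that routes LIST calls on pods from a hypothetical node-agent
// service account into a dedicated priority level, so a thundering herd of
// listers queues there instead of starving everything else.
var throttleAgentLists = flowcontrol.FlowSchema{
	ObjectMeta: metav1.ObjectMeta{Name: "throttle-agent-lists"},
	Spec: flowcontrol.FlowSchemaSpec{
		// References a PriorityLevelConfiguration (not shown) configured
		// with only a small number of concurrency shares.
		PriorityLevelConfiguration: flowcontrol.PriorityLevelConfigurationReference{
			Name: "agent-lists",
		},
		MatchingPrecedence: 1000,
		Rules: []flowcontrol.PolicyRulesWithSubjects{{
			Subjects: []flowcontrol.Subject{{
				Kind: flowcontrol.SubjectKindServiceAccount,
				ServiceAccount: &flowcontrol.ServiceAccountSubject{
					Namespace: "monitoring", // illustrative
					Name:      "node-agent", // illustrative
				},
			}},
			ResourceRules: []flowcontrol.ResourcePolicyRule{{
				Verbs:      []string{"list"},
				APIGroups:  []string{""},
				Resources:  []string{"pods"},
				Namespaces: []string{flowcontrol.NamespaceEvery},
			}},
		}},
	},
}

func main() {
	fmt.Println(throttleAgentLists.Name)
}
```

[Per C's point above, with the streaming watch list the initial-events phase is the part APF accounts for, so a rule like this (with "watch" added to the verbs) would presumably remain the knob for throttling those initial syncs as well.]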
B: Thank you very much for all your ideas and for your help.