From YouTube: 2021-05-27 Kubernetes SIG Scalability Meeting
Description
Agenda and meeting notes - https://docs.google.com/document/d/1hEpf25qifVWztaeZPFmjNiJvPo-5JX1z0LSvvVY5G2g/edit?usp=sharing_eil&ts=5d1e2a5b
A
Okay, hello everyone, this is the SIG Scalability meeting, May 27th, 2021.
B
Yeah, I have a few questions on large-scale cluster configurations. Recently I'm running a load test, like a 10k-node cluster with 400k plus, and because we use very powerful hardware we're trying to understand any parameter tuning done for clusters at this scale, especially on the API server side.
A
So, do you know that the API server itself is your throughput bottleneck, and not the other components?
B
I used more clients, and I noticed that with, I think, three API servers, the QPS is around, I think, six thousand per second. Even if I add more clients, I don't see the throughput go higher, so I thought there are probably some issues on the API server side. I don't use API Priority and Fairness, and I changed the max in-flight mutating requests to a higher number, but it currently stays at that level.
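For reference, the knobs mentioned in this turn correspond to kube-apiserver flags; the values below are placeholders for illustration, not recommendations for a cluster of this size:

```shell
# Illustrative kube-apiserver settings for the knobs discussed above.
# Values are placeholders, not tuning advice for a 10k-node cluster.
kube-apiserver \
  --max-requests-inflight=3000 \
  --max-mutating-requests-inflight=1000 \
  --enable-priority-and-fairness=false
```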
B
Right, so I'm thinking whether there are any other things we can do. Most of my requests are CRUD requests, so I think it has almost hit the limit, yeah.
B
Seems okay, but the slow query and pending proposal numbers go high, and the overall end-to-end latency increased as well. I think because most of my requests are writes, like PUT, PATCH, and create, I think etcd could be a bottleneck as well, and we noticed that etcd 3.4 actually has worse performance than 3.3.
B
We
submit
few
patches
and
on
the
concurrent
buffer
optimization
after
that,
I
think
it
close
to
the
three
to
three
performance.
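The symptoms mentioned above (slow queries, pending proposals) show up on etcd's standard Prometheus metrics endpoint; the metric names are etcd's own, while the address and certificate paths below are assumptions for your environment:

```shell
# Spot-check the etcd symptoms discussed above via its /metrics endpoint.
# Endpoint address and TLS file paths are placeholders.
curl -s --cacert ca.crt --cert client.crt --key client.key \
  https://127.0.0.1:2379/metrics |
  grep -E 'etcd_server_proposals_pending|etcd_server_slow_apply_total'
```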
B
So
we
are
not
sure
like.
What's
the
next
steps,
yeah.
A
Yeah,
because
I
I'm
surprised
it
shouldn't
be
the
limit
like
there
are
a
couple
of
things
where,
like
api
server
could
be
limiting.
One
thing
that
comes
to
my
mind
is
like
the
throughput
of
watch
cache
in
particular.
B
Yeah, we haven't tried like five API servers yet, so that's on my to-do list; I can give it a try, yeah. So, the watch cache: could you explain the watch cache limit?
A
I
don't
know
the
exact
numbers
from
the
top
of
my
head,
but
the
problem
is
that
there
are
certain
things
that
are
happening
in
serial.
I
mean
we
are
dispatching
particular
events
in
serial,
so
we
need
to
finish
dispatching
the
previous
previous
event
before
we
start
dispatching
the
next
event.
A
Right
I
mean
not
the
whole
dispatching,
but
like
not
the
whole
dispatching
in
a
sense
like
that,
all
the
logic,
but
there
are
certain
smaller
things
like
putting
it
into
cues
that
are
or
to
channels
that
are
later
than
being
processed
by
like
per
watch,
go
routines,
so
it
used
to
be
significant
problem
in
the
past.
We
did
a
couple
optimizations,
I
don't
know
two
years
ago
or
so
so
it
wasn't.
A
It
wasn't
our
immediate
bottleneck,
but
I
don't
think
we
are
especially
if
you
are
trying
to
achieve
higher
throughput.
I
I
don't
think
we
are.
We
are
really
close
from
from
the
throughput
from
the
current
limit
the
current
throughput
limit,
so
especially
if
you,
if
you
start
observing
things
like
but
startup
time
becoming
longer
and
stuff
like
that,
that
would
be
my
first
guess
that,
like
watch
cash,
throughput
is,
is
becoming
a
a
bottleneck
here.
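For context, the watch cache discussed here is sizeable via kube-apiserver flags; the sizes below are placeholders for illustration:

```shell
# Watch cache sizing knobs on kube-apiserver (sizes are placeholders).
# --watch-cache-sizes overrides the default for specific resources.
kube-apiserver \
  --default-watch-cache-size=200 \
  --watch-cache-sizes=pods#1000,nodes#1000
```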
B
Okay, okay, sounds good, yeah. Let me try some low-hanging fruit first, like adding more API servers, and I can report the status back in the next meeting.
B
Another question: I saw there was some discussion on the compression between the API server and etcd. I checked the code; we use the protobuf encoder and just Marshal to convert the object to bytes, and I'm trying to see if we can use some compression algorithm to optimize the size we send to etcd.
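As a rough illustration of the trade-off being proposed, one can compare the raw versus gzip'd size of a repetitive JSON payload standing in for a serialized object; the payload here is synthetic, and real savings depend on the actual objects:

```shell
# Compare raw vs gzip'd size of a synthetic, repetitive JSON payload.
# Real savings depend on actual objects; compression also costs CPU.
yes '{"kind":"Pod","metadata":{"labels":{"app":"web"}}}' | head -n 200 > /tmp/obj.json
raw=$(wc -c < /tmp/obj.json)
gz=$(gzip -c /tmp/obj.json | wc -c)
echo "raw=${raw} bytes, gzip=${gz} bytes"
```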
B
Do
you
think
that's
a
like
a
reasonable
authorization.
A
Well,
I
think
it
depends
where
exactly
the
bottlenecks
are,
because
compression
also
isn't
for
free,
so
yeah
I
I
mean
it
probably
depends
it
may
like
it.
It
depends
really
on
on,
like
what
workloads
do
you
have
and
stuff
like
that,
like
how
your
usage
patterns
look
like,
it
may
potentially
be
worth
trying.
B
Okay,
yeah,
so
one
thing
another
similar
issue
is,
I
remember,
like
I
think,
two
years
ago,
there's
a
patch
to
support
the
gzip
between
the
client
and
api
server,
and
at
that
moment
I
think
we
only
have
we
only
allowed
the
compressions
on,
I
think
payload
over
16
kb.
As
I
remember,
do
you
think
it's
better
to
twin
that
number.
At
this
moment
we
can,
since
we
have
like
powerful
machines.
Probably
like
resource
is
not
a
limitation.
B
We
can
change
the
number
to
lower
lower
number
and
see
if,
like
small
requests,
can
benefit
from
that.
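One rough way to observe the client-to-API-server compression being discussed is to compare response sizes with and without an `Accept-Encoding: gzip` header through `kubectl proxy`; whether gzip actually applies depends on the server's compression feature gate and the minimum payload size mentioned above, and the port below is an arbitrary choice:

```shell
# Compare downloaded size of a list request with and without gzip.
# Requires a reachable cluster; whether the body is compressed depends on
# the server's response-compression settings and size threshold.
kubectl proxy --port=8001 & PROXY_PID=$!
sleep 2
echo "with gzip:"
curl -s -H 'Accept-Encoding: gzip' -o /dev/null -w '%{size_download}\n' \
  http://127.0.0.1:8001/api/v1/pods
echo "without gzip:"
curl -s -o /dev/null -w '%{size_download}\n' http://127.0.0.1:8001/api/v1/pods
kill "$PROXY_PID"
```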
A
Yeah
I
mean
you
can
always
try.
We,
we
didn't
really
see
significant
benefits
when
trying
that,
for
I
mean
it
was
using
way
more
resources,
I
mean
cpu
mostly
and
like
the
the
gains
weren't
significant,
so
we
didn't
do
that,
but
yeah.
As
I
said,
it
really
depends
like
on
your
usage
patterns.
So,
okay.