From YouTube: 2020-12-10 Kubernetes SIG Scalability Meeting
Description
Agenda and meeting notes - https://docs.google.com/document/d/1hEpf25qifVWztaeZPFmjNiJvPo-5JX1z0LSvvVY5G2g/edit?ts=5d1e2a5b
A: Hey guys, yeah, I can introduce myself. I've joined a few meetings, I know. My name is Alex; I work at Apple, and Apple has been building larger and larger Kubernetes clusters over the last year, and scalability is now one of the main problems. Initially there were problems even growing clusters to 2,000 nodes; now we have 4,000 nodes and are thinking of building clusters of as much as 6,000 or 8,000 nodes, and eventually 10,000 or so. So yeah, a lot of pain points there. Internally, we started using the ClusterLoader tool to run performance tests, and we have plans to extend the existing test coverage to cover our internal implementations and hopefully contribute upstream as well with all our findings, if there's interest.

That's great. So what do you mean by your internal implementations, and ClusterLoader support in that area?
A: So we do have internal implementations for kube-proxy, for example. We have our own implementation for the network; we have things such as sticky IPs. We also use containerd, for example, instead of CRI-O as the container runtime, and things like that. But I think the network part is the biggest chunk there, I guess.

Cool, nice. Obviously all contributions are welcome.
A: So yeah, there's a part that I think we already test, for example containerd: our periodic tests in Kubernetes are already using containerd. But obviously, if you have some custom kube-proxy, you probably want to gather some metrics from it. This is something that ClusterLoader can support, and if you want to contribute in that area, that's great.
A: Do you have any questions for us, anything we can help you with?

I'm sure I will as we go on; I'm still researching. I actually pivoted to the scalability team just a few weeks ago, or I guess maybe a month at this point. We have a number of people who have been working on it for about a year now, so I'm sure I'll have more and more questions. Right now I'm just a fly on the wall, listening in and checking on what's going on upstream.

Okay, perfect. If there's anything we can help you with, any questions, anything we can explain, or maybe something you want to share with us, then feel free to just talk. All right.
So yeah, to return the intro: I'm Matt, from SIG Scalability. Wojtek is also here; he's our TL, and he's involved in the Kubernetes reliability work group.
B: Sure. So I guess my main question is: previously, some scalability or load tests that were being run internally have been discussed, and that piqued my interest in what the things are that SIG Scalability people look at to determine the success of those tests — what are the types of failure modes that would actually start to surface once you try to deploy a large cluster?
B: And so right now I have pretty reasonable use cases where just a couple hundred nodes is fine, but in the future we have plans to migrate much larger deployments to Kubernetes. So I gathered some statistics, based on the suggestion of, I believe, Matt, about what a hypothetical workload could be for this particular scenario.

So basically we have a kind of sinusoidal workload when it comes to scheduling of pods, just due to working hours and people doing deploys and things like that. At peak — or, I'm sorry, as an average; it does actually get higher than this, but as a reasonable peak or average — we're seeing roughly 30 to 45 pods being scheduled a second, or we would be in this particular environment, across roughly 40,000 machines, with roughly around half a million containers or pods running in this context. So yeah.
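A quick back-of-the-envelope reading of those numbers, as a sketch (the node, pod, and churn figures are the rough estimates quoted above; the derived values are only illustrative):

```python
# Rough arithmetic on the workload described above (estimates, not measurements).
nodes = 40_000
pods = 500_000
churn_pods_per_sec = 45            # upper end of the quoted 30-45 pods/s average

pods_per_node = pods / nodes                                  # ~12.5 pods per node
hours_to_churn_all_pods = pods / churn_pods_per_sec / 3600    # ~3.1 hours at 45 pods/s

print(f"average pods per node: {pods_per_node:.1f}")
print(f"hours to reschedule every pod at {churn_pods_per_sec}/s: {hours_to_churn_all_pods:.1f}")
```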
B: So I'm really wondering what types of things we should be looking for as we start to grow these clusters, in terms of failure modes. One particular concern I had was with kube-proxy. With the number of pods as they scale, to my understanding, kube-proxy is constantly polling the API server to find out what the topology is and then sets up iptables rules locally to facilitate that. But if the API server is degraded or unresponsive or something like that, hypothetically I'm thinking that maybe those rules could drift and end up pointing to addresses that are no longer valid, or something like that.

Also, by default — I've been testing on EKS, but by default there are only a couple of kube-dns containers that get scheduled, and things like the metrics server as well; these things are not deployed in a way that they're going to scale automatically. So yeah, I'm really just wondering if you can kind of walk me through, or just give me some pointers on, what we should be looking for in terms of degradation within the cluster, as you said.
A: Great questions. I would say they boil down to the core of what we do in SIG Scalability: basically how we test scalability and how we ensure that Kubernetes scales. I think it starts with the definition — there is a link, a few meetings below in the notes, to the presentation we gave at KubeCon last year. It basically summarizes our approach to how we scale-test and how to define that a cluster scales, so that the cluster works okay under high load.

As you said, putting scalability into a few words is complicated, because there are a lot of different dimensions, and these dimensions interact with each other. So basically, the framework we have — the idea — is to find some safe space, taking into account all of these dimensions, such that as long as you are within it, your cluster is happy. The TL;DR here is that we approximate this envelope by a set of limits.
A: So basically we say: your cluster should work as long as the number of nodes is less than something, the number of pods is less than something, the number of secrets, configmaps, etc., etc. And this is what we use in our continuous scale tests in Kubernetes, to make sure there are no regressions and Kubernetes keeps scaling to those limits. But now the question is: what does it mean that the cluster is healthy? I think that's what you were referring to — how can we say that, for this configuration, the workload we test is still okay, and when does it start to degrade? Obviously if, for example, at some point the API server goes down, then we're above our limits, right; that's beyond what Kubernetes can support. But what if there's just a slight degradation of performance? The approach we use is, I would say, pretty standard: we define scalability SLIs and SLOs, and basically we use those to say that as long as all the SLOs are satisfied, the cluster is happy — the cluster is okay, right?
A: The most important one is API call latency, and we have a few others. For example, one is related to what you said about kube-proxy — for example, overloading the API server and then kube-proxy lagging. For such cases we have the in-cluster network programming latency SLO, which should more or less detect that.
A: That's basically what our tests do, in very few words: we load the cluster up to some limits and then we check that all the SLOs are satisfied. Not all of them are ready — some are work in progress — but I would say we have an implementation for almost all of them, in some state, supported in ClusterLoader. But in general, that's not a question you can answer in a very precise way.
A: It's kind of an art — sometimes more art than science — to tell whether a given configuration scales or not. Obviously you can run tests, but doing it more generically, being able to tell whether this particular configuration will work or not, is sometimes hard. So I think testing is probably the only way.
A: Right, so yeah, totally. Wojtek may correct me here or add something, because he can expand on this area, but in our tests, I think we are, as of now, creating pods at a rate of 50 per second. We are also experimenting with a 100 per second rate, but we ran into some issues.
A: So basically, the scheduler is not a problem here; it's not a bottleneck. The only issue is that in the default Kubernetes configuration the scheduler has a hard limit on QPS — client-side QPS — and what it's set to depends on your deployment, obviously. We run the tests for Kubernetes on GCE, and the default there is, I think, 20 or something like that, and we bump it to a hundred. But you can basically bump this client-side QPS and then the scheduler is not a bottleneck. Obviously it depends what kind of pods you're running: if you're using some more sophisticated scheduling features, like pod affinity and anti-affinity, then there will probably be some drop. But if you're just scheduling some basic deployments of pods, then usually the scheduler is not a bottleneck. The bottleneck — in our tests, at least the biggest bottleneck — is the load generated by the kube-proxies, i.e., the watches on the services, because the pods we are creating, at least some of them, are part of services. This usually ends up overloading the API server, but we have some solutions to that.
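A minimal sketch of the arithmetic behind that client-side QPS limit, assuming roughly one Bind API call per scheduled pod (the 20 and 100 values are the default and the override mentioned above):

```python
# The scheduler issues at least one Bind API call per scheduled pod, so its
# client-side QPS limit is an upper bound on sustained scheduling throughput.
def max_sustained_pods_per_second(client_qps: float, api_calls_per_pod: float = 1.0) -> float:
    return client_qps / api_calls_per_pod

print(max_sustained_pods_per_second(20))    # ~20 pods/s with the default client-side QPS
print(max_sustained_pods_per_second(100))   # ~100 pods/s after bumping the limit
```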
A: I can dig into that later. The other thing that basically held us back — because recently we were experimenting with speeding up the tests — is that we ended up, I think, at 50; we didn't go to 100, because we noticed we had a problem with events. There were too many events overloading etcd. So yeah, we actually have had some ideas; I can link you to some discussions about the reason.
A: Basically, what we usually see is increasing API call latency, so the SLO — yeah, there's a link; anyway, I can share the links later so I don't waste time. But basically the SLO for, for example, simple GET calls is that the call should be below one second, and when etcd gets overloaded this usually spikes, right, because that's a consequence.
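A minimal sketch of that style of SLO check, assuming the 99th percentile of simple GET call latencies must stay under one second (the sample latencies are made up):

```python
# Toy SLO check: is the 99th-percentile latency of a bucket of API calls under the threshold?
def percentile(samples, p):
    s = sorted(samples)
    idx = min(len(s) - 1, round(p / 100 * (len(s) - 1)))
    return s[idx]

def slo_satisfied(latencies_s, threshold_s=1.0, p=99):
    return percentile(latencies_s, p) <= threshold_s

get_latencies = [0.02, 0.03, 0.05, 0.04, 0.90, 0.07]   # illustrative GET latencies in seconds
print(slo_satisfied(get_latencies))                     # True: p99 is below 1 s
```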
A: That's how we detect it. Although, in what I would say is the most dire case, if etcd is super overloaded it can even take down the whole VM with the master and, for example, kill the API server, so the cluster becomes completely unavailable — but that's the easy-to-detect case, right. Very often it's just that etcd is overloaded but still works somehow, and we see that in the API call latency, usually. So that's how we detect it; that's the failure mode here, okay.
A: So yeah, the issue is that it's a problem of the Endpoints API, which backs the cluster-IP services, right. With EndpointSlices this is better, and since 1.19 we have EndpointSlices enabled by default, okay. That's why we were able to push this limit, because before that we were using, I think, a 20 pods per second rate, mostly because of that. Or maybe —
B: So if, let's say, I'm using my own service discovery —

A: Even then, it doesn't give you unlimited scalability, because even with watch we have issues with the Endpoints API, right. This is because you have this quadratic factor there: if you have a service of size n pods, then the Endpoints object will have size n, and if you try to do a rolling update of it, then basically n times you will be sending this object of size n, right. And —
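A rough sketch of that quadratic factor, comparing a rolling update of one big Endpoints object against 100-entry EndpointSlices; the per-entry size and watcher count are illustrative assumptions, not measured values:

```python
# Traffic sent to watchers during a rolling update of a single Service.
n_pods = 1_000           # endpoints behind the Service
entry_bytes = 100        # assumed wire size of one endpoint entry
watchers = 5_000         # e.g. one kube-proxy per node
slice_size = 100         # max endpoints per EndpointSlice

# Monolithic Endpoints: every one of the n updates re-sends the whole n-entry object.
endpoints_bytes = n_pods * (n_pods * entry_bytes) * watchers
# EndpointSlices: every update re-sends only the affected ~100-entry slice.
slices_bytes = n_pods * (slice_size * entry_bytes) * watchers

print(f"Endpoints: ~{endpoints_bytes/1e9:.0f} GB, EndpointSlices: ~{slices_bytes/1e9:.0f} GB")
```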
A: Okay. So that's why what is important is that you need to throttle on the producer side — basically on the side of the controller that is generating the data that is later being sent over watch, right. Here we have efforts like API priority and fairness coming, and it will basically provide some overload protection, we can call it that, which should help in these cases, because you can then configure the API server to make sure, for example, that we are not creating too many objects that result in too many watch events being sent. Yeah, it's basically coming, and we have plans to extend it.
C: If you have your own networking solution, you can mark individual services, or a group of services, or all services, or whatever, as something that shouldn't be followed — something that kube-proxy doesn't have to watch — and then kube-proxy is not programming iptables for those services, which also reduces the load. So if you have something else and you don't need the particular service to be resolvable via service IPs within the cluster, then it's possible. I think it's called —
A: So basically, you specify "None" for the cluster IP, so they don't have a cluster IP, and then kube-proxy doesn't watch them. And there is another option: you can put a label on the service, something like a custom proxy name, and then kube-proxy also won't be watching them.
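A minimal sketch of that first option (a Service with no cluster IP, i.e. a headless Service) using the official Kubernetes Python client; the names and selector are hypothetical:

```python
from kubernetes import client, config

config.load_kube_config()

svc = client.V1Service(
    api_version="v1",
    kind="Service",
    metadata=client.V1ObjectMeta(name="my-headless-svc"),
    spec=client.V1ServiceSpec(
        cluster_ip="None",                       # no virtual IP, so kube-proxy has nothing to program
        selector={"app": "my-app"},              # hypothetical selector
        ports=[client.V1ServicePort(port=8080)],
    ),
)
client.CoreV1Api().create_namespaced_service(namespace="default", body=svc)
```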
B: So what about kube-dns? Is that something that could potentially be auto-scaled? Can we run — is it stateless or not, or is it so lightweight that I really shouldn't worry about it?
A: Yeah, you can scale it up horizontally, and we do auto-scale it. It may cause trouble sometimes, though, because if you scale it too much horizontally, then you end up with, for example, thousands of kube-dns pods, and this causes problems because you have a cluster-IP service on top of that, right. So — yeah, okay.
A: Quickly: the idea is that instead of having one huge Endpoints object, we partition this object into multiple slices — that's what we call them here. So, for example, if you have a service composed of a thousand pods, instead of one thousand-pod Endpoints object you will have ten EndpointSlices of a hundred pods each, okay.
A: For network bandwidth and things like that, and etcd or something? Yeah, yeah, exactly — because then, when you update a single pod, you only update the slice it's in. And there's a hard limit on the slice size, right, I think.
C: Exactly. The EndpointSlice API itself is, I think, 1.16 or maybe 1.17, but the controller is beta in 1.18, and 1.19 is where kube-proxy support — for Linux in particular — went in and, thanks to that work, is enabled by default, because kube-proxy is what is generating this load. I was chasing this down for someone else, for some particular thing, some time ago.
B: Yeah, so I had a question that's in a slightly different area, I guess. If we have a very large cluster and we want to start scraping logs for containers, is that something that's only generating load on the kubelet hosts and kubelet APIs and should scale horizontally, or —?
A: What you can do is basically have some DaemonSet with a node-level agent that does that, right — it scrapes all the container logs. You can write to some location, and then this agent can read from there and push the logs somewhere, wherever you need to.

Okay, so that's definitely a more scalable approach. Okay, that's good.
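A minimal sketch of that DaemonSet-based, node-level log agent pattern with the Kubernetes Python client; the agent image and paths are illustrative assumptions:

```python
from kubernetes import client, config

config.load_kube_config()

# Node-level agent that reads container logs from the host and ships them elsewhere,
# instead of pulling logs through the kubelet/API server.
agent = client.V1Container(
    name="log-agent",
    image="example.com/log-agent:latest",       # hypothetical agent image
    volume_mounts=[client.V1VolumeMount(name="varlog", mount_path="/var/log", read_only=True)],
)
ds = client.V1DaemonSet(
    api_version="apps/v1",
    kind="DaemonSet",
    metadata=client.V1ObjectMeta(name="log-agent", namespace="kube-system"),
    spec=client.V1DaemonSetSpec(
        selector=client.V1LabelSelector(match_labels={"app": "log-agent"}),
        template=client.V1PodTemplateSpec(
            metadata=client.V1ObjectMeta(labels={"app": "log-agent"}),
            spec=client.V1PodSpec(
                containers=[agent],
                volumes=[client.V1Volume(
                    name="varlog",
                    host_path=client.V1HostPathVolumeSource(path="/var/log"),
                )],
            ),
        ),
    ),
)
client.AppsV1Api().create_namespaced_daemon_set(namespace="kube-system", body=ds)
```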
A: If I may chime in: another issue that we came across as we grew our clusters is that the default configurations for the control plane components and the kubelets did not work quite well for larger clusters. So I was wondering if it was documented somewhere — basically recommended or tested configurations for clusters of a particular size.

I'm not sure whether it's documented anywhere — correct me if I'm wrong — but...
A: You can take a look at what we do in our continuous tests, because there we test, continuously, every day, clusters of 5,000 nodes. So obviously we need to tweak some parameters, and I can link to that — it's under sig-scalability. We use Prow to run these tests; if you are familiar with Prow, it should be more or less easy for you to read, and the cluster configuration is usually there.
A: That's true, that's also true — for example, the in-flight requests, right: there's this if/else logic, "if the number of nodes is larger than X, then use this setting," and stuff like that. But yeah, I wonder whether there is a way... actually, there is a way: you could just take a look at our tests.
A: These are ClusterLoader2 tests at 5,000-node scale, and we dump all the logs from the masters and the nodes. From the master logs you should be able to, for example, get the flags that were used to start kube-apiserver and the other controllers.
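A sketch of that "if the cluster is larger than X, use this setting" pattern for the kube-apiserver in-flight request limits; the flag names are the real kube-apiserver flags, but the thresholds and values below are hypothetical — the tested values should come from the test configs and dumped master logs mentioned above:

```python
# Hypothetical node-count-based tuning of apiserver request concurrency.
def apiserver_inflight_flags(num_nodes: int) -> dict:
    if num_nodes > 3000:
        read_inflight, mutating_inflight = 1600, 800
    elif num_nodes > 500:
        read_inflight, mutating_inflight = 800, 400
    else:
        read_inflight, mutating_inflight = 400, 200     # assumed small-cluster values
    return {
        "--max-requests-inflight": read_inflight,
        "--max-mutating-requests-inflight": mutating_inflight,
    }

print(apiserver_inflight_flags(5000))
```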
A: All right, we have one minute left.
B
No,
I
was
just
saying
that
they
answered
all
my
all
my
questions
and
gave
great
things
for
me
to
look
into,
and
I
really
appreciate
it.
D: Yeah, I just wanted to echo the same sentiment and come out of lurking mode just to say hey and introduce myself. I'm Elena Washington; I work at Gusto, and I just wanted to know if this recording will be available anywhere, or if there's a list of recordings, or if that's not shared.
A: We are recording for a reason; they should be available somewhere, but I have no idea where. I know that, like two years ago, they were automatically uploaded to YouTube, and I think the last time I checked they were there. So I will take an action item on me to check where they are being stored and share that with you.

Awesome.
A: I will announce this on Slack, but we will probably cancel the next meeting because it falls during the Christmas break, so see you next year. All right, thank you.