From YouTube: 2021-03-04 Kubernetes SIG Scalability Meeting
Description
Agenda and meeting notes - https://docs.google.com/document/d/1h...
A
I can probably give one update regarding one of the things we were doing: the KEP for efficient watch resumption, which is graduating and is now code complete for beta. For those who are not familiar with it, let me paste the link here if you are interested in reading. I will briefly summarize it in a minute; I'm just looking for the link.
A
Yeah, so basically the idea behind it, or rather the motivation for it, is this: when we are doing a rolling upgrade, for example of the masters, or of the control plane, the way the watch cache currently works is that the resource version it serves (there is a separate watch cache per resource type) only changes when one of the resources of that particular type changes, or when we are initializing the cache. Which means that if we have a resource type where none of the objects is changing, the served resource version stays where it was. That basically means that every single watch for that resource type has to be re-initialized with a list, which is a pretty expensive operation.
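As a rough illustration of what that relist costs, here is a minimal client-go sketch of a consumer resuming a watch. The function name, the choice of pods, and the simplification that the 410 surfaces as a request error (it can also arrive as an error event on the stream) are assumptions of this sketch, not anything stated in the meeting.

```go
package example

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// resumeWatch tries to resume a pod watch from lastRV. If the server can no
// longer serve that resource version ("410 Gone" / resource version too old),
// the client must fall back to a full LIST of the type and re-watch from the
// fresh resource version; that relist is the expensive step.
func resumeWatch(ctx context.Context, cs kubernetes.Interface, lastRV string) error {
	w, err := cs.CoreV1().Pods(metav1.NamespaceAll).Watch(ctx,
		metav1.ListOptions{ResourceVersion: lastRV})
	if apierrors.IsResourceExpired(err) || apierrors.IsGone(err) {
		// Fallback: relist everything, then watch from the fresh RV.
		list, lerr := cs.CoreV1().Pods(metav1.NamespaceAll).List(ctx, metav1.ListOptions{})
		if lerr != nil {
			return lerr
		}
		fmt.Printf("relisted %d pods to recover rv=%s\n", len(list.Items), list.ResourceVersion)
		w, err = cs.CoreV1().Pods(metav1.NamespaceAll).Watch(ctx,
			metav1.ListOptions{ResourceVersion: list.ResourceVersion})
	}
	if err != nil {
		return err
	}
	defer w.Stop()
	for ev := range w.ResultChan() {
		if pod, ok := ev.Object.(*corev1.Pod); ok {
			fmt.Println(ev.Type, pod.Namespace+"/"+pod.Name)
		}
	}
	return nil
}
```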
A
I'm sorry, I'm probably not doing a super great job of explaining this, because it's pretty tricky to explain. If you want more details, I'm happy to answer any more specific questions.
A
If not, I think it's pretty well written in the KEP, so you can find the problem described there in more detail, with exact examples of what happens. The way we are solving it is that we are taking advantage of the progress notify feature in etcd. When initiating a watch against etcd, you can specify something that is called progress notify, and etcd then sends a notification, or bookmark event, over the watch channel every N seconds, where N is configurable in etcd. Which means that we are able to keep updating the watch cache.
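For reference, this is roughly what the progress-notify option looks like from an etcd v3 client. The endpoint and key prefix below are placeholders, and the notification interval itself (the speaker's "N seconds") is configured on the etcd server side, not by the client.

```go
package main

import (
	"context"
	"fmt"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"localhost:2379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		panic(err)
	}
	defer cli.Close()

	// WithProgressNotify asks etcd to periodically send an empty watch
	// response carrying only the current revision, even if no key under
	// the prefix changed.
	ch := cli.Watch(context.Background(), "/registry/pods/",
		clientv3.WithPrefix(), clientv3.WithProgressNotify())
	for resp := range ch {
		if resp.IsProgressNotify() {
			// No events: just a fresh revision that a cache layer can
			// adopt as its new resource version.
			fmt.Println("progress notify at revision", resp.Header.Revision)
			continue
		}
		for _, ev := range resp.Events {
			fmt.Printf("%s %s rev=%d\n", ev.Type, ev.Kv.Key, ev.Kv.ModRevision)
		}
	}
}
```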
A
We are currently configuring it to five seconds. We were testing with intervals as low as half a second and that also worked well, so it seems we have quite a lot of headroom if it's really needed. Basically, it allows us to keep updating the resource version served from the watch cache, so during a rolling upgrade of the control plane we are not forcing every single watch to relist. Well, not every single one, but every one that is for a resource type that didn't have any change over that period. So yeah, that is going to beta in 1.21.
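The client-facing counterpart of the same idea is the watch bookmark. A minimal sketch (assuming an existing clientset; the function and namespace are illustrative) of how a Kubernetes client opts in to bookmark events and records the fresh resource version they carry, so a later re-watch can avoid the relist:

```go
package example

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/watch"
	"k8s.io/client-go/kubernetes"
)

// trackRV watches pods and records the resource version from bookmark
// events, so a later re-watch can start from a recent point instead of
// relisting.
func trackRV(ctx context.Context, cs kubernetes.Interface) (string, error) {
	w, err := cs.CoreV1().Pods("default").Watch(ctx, metav1.ListOptions{
		AllowWatchBookmarks: true, // opt in to bookmark events
	})
	if err != nil {
		return "", err
	}
	defer w.Stop()

	lastRV := ""
	for ev := range w.ResultChan() {
		if ev.Type == watch.Bookmark {
			// A bookmark carries no object change, only a fresh resourceVersion.
			lastRV = ev.Object.(*corev1.Pod).ResourceVersion
			continue
		}
		fmt.Println(ev.Type) // handle Added/Modified/Deleted as usual
	}
	return lastRV, nil
}
```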
A
If you have any questions about that, I'm happy to try to explain it a little bit better, but yeah, I think that's mostly what I had from my side.
C
That's very interesting. Maybe we can read the KEP and then, if I have any questions, I can probably ask you to follow up in the next meeting.
D
You said it will be beta in 1.21?
A
Yeah, so that's mostly what I had. If you have any other questions, not necessarily related to this, it's probably a good time to ask them.
E
Yeah, so I had a question: the watch cache that you mentioned, is it something that is stored on every node, or is it only part of the control plane?
A
The watch cache is part of the API server; it's a layer in the API server. In theory, at least, it's optional: you can disable the watch cache, but you can't really have very large clusters if you disable it, at least for some of the critical resources, the most frequently changing ones, or, well, maybe not necessarily the most frequently changing, but the ones that are heavily watched.
A
So, for example, if you disable the watch cache for pods, you won't be able to scale the cluster significantly, because the way we implement watch without the watch cache layer is that every watch hits etcd; it's basically being proxied through to etcd. And given that etcd doesn't understand our data model (labels, fields and stuff like that), when the kubelet is watching its own pods in etcd, that translates to watching all the pods in the system and then deserializing every single one.
A
Sorry, let me take a step back. Basically, if you have N kubelets, so N nodes in the cluster, it means there are N watches watching every single pod in the cluster. So whenever some pod changes, or is created or deleted or whatever, this event is sent to every single one of those watches, serialized and filtered, even though it will probably be delivered to only one of them, because the pod is running on only a single node. The serialization, filtering and sending to all of those watches from etcd is basically super expensive. So the watch cache was a critical step to scale to, I think, even 250 nodes, when we were doing that five years ago or so, probably more than four.
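A hedged sketch of the kubelet-style watch just described, using a client-go field selector; the node name and function are made up. With the watch cache enabled, the API server evaluates the selector once per event; without it, each such watch degenerates into an etcd watch on all pods that must be deserialized and filtered once per node.

```go
package example

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/fields"
	"k8s.io/client-go/kubernetes"
)

// watchOwnPods is roughly what the kubelet does: watch only the pods bound
// to one node, expressed as a field selector that the API server's watch
// cache can evaluate per event.
func watchOwnPods(ctx context.Context, cs kubernetes.Interface, nodeName string) error {
	sel := fields.OneTermEqualSelector("spec.nodeName", nodeName).String()
	w, err := cs.CoreV1().Pods(metav1.NamespaceAll).Watch(ctx, metav1.ListOptions{
		FieldSelector: sel,
	})
	if err != nil {
		return err
	}
	defer w.Stop()
	for ev := range w.ResultChan() {
		pod, ok := ev.Object.(*corev1.Pod)
		if !ok {
			continue // e.g. an error Status object; ignored in this sketch
		}
		fmt.Println(ev.Type, pod.Namespace+"/"+pod.Name)
	}
	return nil
}
```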
E
Yeah, it does. Is there a resource that I can read about how the watch machinery works?
A
That's a good question: there is a design doc for the watch cache which might be a little bit outdated, but I think it basically reflects the idea. It significantly predates the KEP era, but I can probably try to look for it.
A
I'm criticizing myself here, but yeah, it kind of explains it.
D
If not, we can probably... oh, sorry, a question: do you know if SIG Scalability or SIG Architecture plan to do some scale-out work for Kubernetes, or does any SIG project aim to solve that problem? Like scaling out Kubernetes: the API server, or the storage layer, or the schedulers, etc.
A
I'm not aware of any horizontal scaling of any of the controller components, like the scheduler or anything like that. That is not happening, because we haven't yet really heard about significant demand, and it would be a significant complication. As for API servers, you can technically scale them horizontally; the only thing that doesn't scale horizontally is the watch cache, and there are some discussions about how we could redesign the watch cache to make it a little bit more memory efficient. Though I don't think anyone has a desire to scale it to state sizes larger than maybe 10 gigabytes or something like that, so with big enough control plane machines it more or less works now. And regarding etcd, it's probably the same story: I don't think we have strong needs here that would justify the complexity.
D
Okay, got it, yeah. Could you help find the issues for the watch cache implementation?
A
So, it's not something that is super high priority, but I think Clayton recently mentioned to me that they would like to explore it a bit more deeply.
B
Hey, hi. So this is a question I had asked you earlier, and it's in our public docs; I wasn't sure why this was written about endpoint slices, and I thought maybe you might have an idea. In our EndpointSlices public docs, they say that it might be possible for the same endpoint, at one point in time, to be in more than one endpoint slice.
A
That's if you want to do a reshuffle. Let's say you had a huge service with, I don't know, 20 endpoint slices, each fully packed (let's say the default is a hundred endpoints per slice, if I remember correctly), and then you scale down the service and you end up with just a single endpoint in every endpoint slice.
A
We want to do a reshuffling, because you basically have just 20 endpoints, to condense them into a single object, since there are not too many of them. And to be able to do that, you need to have a moment where you don't remove an endpoint from one slice before you add it to another one, so that it never ends up not included anywhere. So basically we first include it in another slice and then delete the old one, which means there are potentially short periods of time when this can happen. (A sketch of that ordering follows.)
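A minimal sketch of that create-before-delete ordering against the discovery.k8s.io/v1beta1 API (current at the time of the meeting). The real endpointslice controller logic is considerably more involved; the helper below, including its handling of ports and address types, is purely illustrative.

```go
package example

import (
	"context"

	discoveryv1beta1 "k8s.io/api/discovery/v1beta1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// condenseSlices first creates one condensed slice holding all remaining
// endpoints, and only then deletes the old, mostly-empty slices. Between
// the two steps an endpoint is intentionally present in two slices at
// once, which is why consumers of EndpointSlices must deduplicate.
func condenseSlices(ctx context.Context, cs kubernetes.Interface, ns, svc string,
	old []discoveryv1beta1.EndpointSlice) error {
	merged := discoveryv1beta1.EndpointSlice{
		ObjectMeta: metav1.ObjectMeta{
			GenerateName: svc + "-",
			Labels:       map[string]string{discoveryv1beta1.LabelServiceName: svc},
		},
		// Assumed IPv4 here; real code would carry over the address type
		// and ports from the old slices.
		AddressType: discoveryv1beta1.AddressTypeIPv4,
	}
	for _, s := range old {
		merged.Endpoints = append(merged.Endpoints, s.Endpoints...)
	}
	// Step 1: make the condensed slice visible first.
	if _, err := cs.DiscoveryV1beta1().EndpointSlices(ns).Create(
		ctx, &merged, metav1.CreateOptions{}); err != nil {
		return err
	}
	// Step 2: remove the old slices; duplicates disappear once this finishes.
	for _, s := range old {
		if err := cs.DiscoveryV1beta1().EndpointSlices(ns).Delete(
			ctx, s.Name, metav1.DeleteOptions{}); err != nil {
			return err
		}
	}
	return nil
}
```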
F
Guys, I have a question regarding the API server, about the operations it does against the etcd cluster. When we create and update, we actually use transactions. We did some performance evaluation recently, and we found that transaction performance is actually much, much slower than single operations. For example, for put: if you go through a transaction, the performance is, I think, probably half of that of the plain put operations.
F
So
have
you
guys
ever
thought
about,
like
the
you,
like,
you
know,
push
the
active
site
so
that
they
add
some
atomic
operations,
such
as
like
create
update
and
like
delay
operation
so
that
we
don't
have
to
use
transaction.
We
can
just
use
these
atomic
like
the
operation.
A
I don't think we ever had a real need to do that, but I think that's probably an interesting thing to... well, I'm wondering if it really deserves a KEP; it might not really deserve a KEP by itself. I think it's purely an API machinery thing, so I would recommend reaching out to the API machinery folks and discussing it with them. I would need to think a little bit more about the consequences, about what exactly would change, but yeah.
F
Exactly. So, the idea is extending the etcd API so that when we do an update, we use some argument to specify that we are actually doing an update and that we want a particular previous version to be there when we do it. But because we are going through the common transaction interface...
F
If
you
use
like
right
operation,
it
first
needs
to
have
a
re
like
read
transaction
and
then
only
when
they
find
that
you
have
like
the
these
write
up
operations
and
it
will
or
close
the
read
transaction
and
open
another
write
transaction
which
is
exclusive,
and
this
will.
This
is
like
taking
a
lot
of
time.
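For context, a guarded write of the kind being discussed looks roughly like this with the etcd v3 client: a transaction whose condition compares the key's ModRevision against the revision the writer last saw. The key, value and revision below are made up; the questioner's point is that this conditional path is heavier than a bare Put.

```go
package main

import (
	"context"
	"fmt"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

// guardedUpdate succeeds only if the key is still at the revision the
// writer last read, which is the optimistic-concurrency pattern the
// API server uses for object updates.
func guardedUpdate(ctx context.Context, cli *clientv3.Client, key, val string, expectedRev int64) error {
	resp, err := cli.Txn(ctx).
		If(clientv3.Compare(clientv3.ModRevision(key), "=", expectedRev)).
		Then(clientv3.OpPut(key, val)).
		Else(clientv3.OpGet(key)). // fetch current state to report a conflict
		Commit()
	if err != nil {
		return err
	}
	if !resp.Succeeded {
		return fmt.Errorf("conflict: %s changed since revision %d", key, expectedRev)
	}
	return nil
}

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"localhost:2379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		panic(err)
	}
	defer cli.Close()
	_ = guardedUpdate(context.Background(), cli, "/registry/foo", "v2", 42)
}
```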
A
Okay, yeah, I've never really profiled those etcd parts, so maybe, I don't know, but yeah, that's definitely an idea that is worth considering. I would reach out to the API machinery folks and get their opinion about that.
A
If not, then thank you for today, and see you in...