From YouTube: 2021-01-21 Kubernetes SIG Scalability Meeting
Agenda and meeting notes - https://docs.google.com/document/d/1hEpf25qifVWztaeZPFmjNiJvPo-5JX1z0LSvvVY5G2g/edit?ts=5d1e2a5b
A: Okay, I believe we are recording, so welcome everyone to the SIG Scalability meeting today, January 21st.
C: Hi, hi Matt, hi Wojtek. My name is Swathi, I work for Red Hat, and I just had a couple of questions and wanted to chat with you about a use case that we had.
A: Okay, so before we do that, just a few announcements: we were finally able to figure out all the access issues to our SIG Scalability leads group, and also figured out how to publish the recordings to YouTube.
A: All right, so do you want to give us a quick overview of the tests you plan to run?
C: A quick introduction of the use case that we care about: it's topology-aware scheduling. There's a well-known issue that happens because the topology manager tries to align your resources, but the scheduler doesn't have visibility into resources at the NUMA-node level. Because of that, the scheduler sees nodes in the same manner regardless of the resources available per NUMA node and places the workloads, and then the topology manager, in case it is configured with a policy like single-numa-node, ends up with a topology affinity error. So that's essentially the problem that we are trying to solve. We have a few KEPs, and we have implementations as well, in-tree as well as out-of-tree. The first component is node feature discovery: it exposes CRDs per node to expose the capability and the resources at a per-NUMA-node level, so the scheduler has that visibility while it makes its scheduling decision.
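The gap described here can be sketched in a few lines of Go. This is purely illustrative (none of these functions are Kubernetes APIs): the scheduler-side check sees only the node's total free CPUs, while a single-numa-node topology policy requires the request to fit inside one NUMA node.

```go
package main

import "fmt"

// fitsNodeTotal mimics the scheduler's view: it only sees the sum of free
// CPUs across all NUMA nodes of a node.
func fitsNodeTotal(numaFreeCPUs []int, requested int) bool {
	total := 0
	for _, free := range numaFreeCPUs {
		total += free
	}
	return requested <= total
}

// fitsSingleNUMANode mimics the kubelet topology manager under a
// single-numa-node policy: the request must fit within one NUMA node.
func fitsSingleNUMANode(numaFreeCPUs []int, requested int) bool {
	for _, free := range numaFreeCPUs {
		if requested <= free {
			return true
		}
	}
	return false
}

func main() {
	numaFreeCPUs := []int{4, 4} // two NUMA nodes with 4 free CPUs each
	requested := 6              // a pod requesting 6 exclusive CPUs

	// The scheduler would place the pod (4+4 = 8 >= 6), but neither NUMA
	// node alone has 6 free CPUs, so the kubelet rejects the pod with a
	// topology affinity error.
	fmt.Println(fitsNodeTotal(numaFreeCPUs, requested))      // true
	fmt.Println(fitsSingleNUMANode(numaFreeCPUs, requested)) // false
}
```

A pod requesting 6 exclusive CPUs fits the node total and gets scheduled, but no single NUMA node can satisfy it: exactly the mismatch that produces the topology affinity error.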
C: When we went to SIG Architecture, they recommended that, first of all, we need an API review, and that in order to go ahead with this proposal we have to prove the solution and the design that we're proposing at a large scale. Initially they mentioned that maybe we can show that there's no regression at 5000-node scale. So, being new to the scale side of things, I wanted to get your opinion and see if there are some pointers from your side.
A: So, as I wrote on Slack, I think a good starting point might be our presubmit that's available for any Kubernetes PR. It runs 100 nodes. The only difficulty here is that we need to adjust it to enable all these topology-related features and also deploy your changes. So I would like to understand better what kind of changes we need to make in the cluster, in our test setup.
C: So I would say that would be the first requirement, a prerequisite for us. Then the other things would be deploying node feature discovery, which is the component responsible for examining the nodes and creating the CRDs, and then deploying the scheduler plugin. So we have...
B: Sorry, sorry, just to interrupt with a quick question: can we somehow fake that, to really prove the scalability part without having the end-to-end setup? My point here is that whatever is happening on the node itself, purely within the node, more or less seems not to affect scale. What affects scale is whatever is cluster-scoped: for example, every API call that is made by components running on the node.
C: So we did think about it, and probably that's an area we need to explore a bit further. But one of the things that we would like to test is running streams of pods and seeing, when they're requesting certain resources, how it works out: scenarios like how segmentation happens, and scenarios where there's a resource crunch; how does the scheduler plugin react in those scenarios? Obviously API load and the like is part of it.

I think we can achieve faking the resources through that, but we also want an understanding of how the cluster behaves and how these components are able to deal with the different, maybe edge, cases.

Probably for that we would like to test at scale, and I think the expectation from SIG Architecture is also that you showcase it in a real environment. They did mention simulators, if we can simulate it; I'm not sure whether that is almost equivalent to your point about faking the API components.
B: ...a smaller scale, but generating a bunch of, what do you call them, corner cases and load and stuff like that on a single node, and exercising what is happening there and whether it works and so on; and then the other part, which is more faked, at the large scale. Maybe what I'm saying doesn't make sense; I don't think I understand your proposal deeply enough to be able to say right now whether it makes sense and what other interactions you can expect at scale. So I'm definitely happy to be proved wrong here, that what I'm saying doesn't make any sense, but I think we should consider that.
C: Yeah, that's not a bad idea: having two sets of cases, one dealing with a smaller scale where we exercise all the edge cases, and then a larger scale where we just focus on showcasing that there's no regression with this new feature. That's not a bad idea. If there's anything I can do to maybe help you understand this, you can refer to the Slack thread and have a look. Yeah, we have a couple of KEPs I can drop.
C: Yeah, so I delivered a presentation to SIG Node; I can send the YouTube link of that, and I think that would give you enough. And I'll send you the link to the KEPs that we are proposing. All this while that we've been having these conversations, our proposal has slightly changed: now we are considering an out-of-tree scheduler plugin; initially it was just in-tree. So those minor things have changed, but overall, design-wise, that...
B: Yeah, I think we shouldn't point you to the presubmits but rather to the periodic jobs, because they are more... well, maybe, yeah; I mean, we don't have a five-thousand-node presubmit per se, and they are all kind of similar, so maybe that comment doesn't make much sense, but...
A: Yes, some links here. The problem is that the parts you will probably be most interested in are spread out: the setup of the cluster lives in one place, the config is actually in a different place, the load test lives in a different place, and all the kube-up stuff, which configures nodes and the master, also lives in a different place. So there are a lot of places, but everything should be doable in general, if you want to modify one part or another. So the question is just: what do you really need, compared to what we do right now?
A: I'll drop something later, to not waste time now. I wanted to mention just one thing: if it happens that we cannot really do that on GCE VMs, then I think it would be worth asking whether Kubemark can be somehow adapted for this test. And I just wanted to mention that there's a big space between end-to-end tests and benchmarks, I think, yeah.
A: So we started with end-to-end, and Wojtek mentioned simulating something, and I think you understand that we can just simulate the control plane, just the API server, which is probably not enough for you. But Kubemark, I'm sure you've heard of it, is a framework that we own, and it basically allows you to run end-to-end tests where the control plane is a regular control plane but the nodes are faked; we have the concept of a hollow node, a fake node. And the question is whether we can adapt Kubemark to basically simulate the behavior we want from the nodes and then test, because you have a real scheduler there, so I believe you can test what you need there. That would be awesome.
D: Yeah, hey, this is Abu. So I've been working on a PR to simplify the timeout path of the request in the API server that was mentioned here. Let me just copy the... yeah. Yeah, I just pasted it.
D: Yes, so if you have any feedback in terms of whether it has an impact on scalability, or any other areas... Basically, what happens today is that we have a fixed 60-second timeout in the timeout filter, and when the request enters the REST filter, we basically create another context with the specified timeout. What I'm trying to do here is add a new filter in the filter chain that creates the context with the user-specified timeout at the very beginning; and if a user doesn't specify any timeout, then you basically create a context of 60 seconds and go from there. That's pretty much it, in essence. But yeah, if you have any feedback, that would be really great.
A: So let me understand this better. Currently the situation is that if a user doesn't specify a timeout in the request, it is defaulted to 60 seconds, yes?
D: Yes; if the user doesn't specify a timeout... well, the timeout filter always uses a fixed 60-second timeout, no matter what the user specifies.
D: Okay, and when the request, basically the handler, enters the REST filter, I think we have code that creates a context based on the user-specified timeout, because the context acts like a tree. What I'm doing in the PR is basically doing that up front in the filter chain: as soon as the request enters the API server, we set up a deadline for the request, and that context is used throughout the handler chains and the REST handlers, so that it's more correct in that way.
A: I see. Yeah, I think I understand the correctness part, and that it's basically refactoring the calls to make them cleaner, but I'm worried about the user-facing changes here. So, yeah...
D: That's the change in behavior today. And I don't know: today, if any user specifies a malformed timeout, I guess we still serve the request with a 200, but with this PR we'll be sending 400s.
B: So, my other question. I'm generally supportive of that; the only thing that we may want to solve together with it (I didn't look into it, so maybe you already did, but just let me mention it) is that we are currently not cancelling any operations that are sent to etcd. We are setting the timeout for the etcd calls completely independently from anything, or we are not setting them at all, or something like that.

So in theory I can imagine that, with your PR, let's say that as a user I'm sending all my operations with, say, a one-second timeout, something very short, and a bunch of operations are taking more than a second because the cluster is super overloaded, something bad is happening. Then it's technically possible that we will exceed the in-flight limit, or whatever we have with priority and fairness; I can't remember the name, the concurrency limit or whatever it's called. It's possible that we can exceed it because, even though in the API server itself it works correctly, since we are cancelling the operation and so on, on the etcd layer we are not cancelling them, and they are still happening; they are still in flight.

So I guess it's a little bit independent from that PR, because we have this problem even now: if a list operation, for example, is taking more than a minute (or, I think, delete-collection is an even better example), the call to the API server is ending, but the underlying etcd call is still happening.
D: The call to etcd is already wired up with the context. So with this PR merged, the context now will have an actual deadline, and the call will fail if the context is properly used throughout the layers. Okay, so...
D: Based on what I've seen, I think it already solves it, but this is my follow-up task: to look at the storage layer deeply and see if the context is being wired properly.

B: I see; no, that makes perfect sense to me. If that's the case, then we've already solved the problem; if it's not the case, we just need to wire up the context, and that should be, I think, good to go.
A: I see, okay. Yes, so sorry for asking again, but I would like to understand how it works currently. So, once again, a question: before your change, we have this timeout filter, right, that sets the timeout to 60 seconds; so what happens with the user-provided timeout in the request?
D: Right. So previously, the timeout in the timeout filter was always 60 seconds, no matter what, and the context was created at the timeout filter. Now the timeout filter does not create any context; it just uses the context provided by the request-deadline filter, and that is always set to the right value: if there is a specified 10-second timeout, it would be 10 seconds; if the user doesn't specify anything, it would be 60 seconds, the default that we plumb through the command line.
A: Okay, so let me just ask a follow-up question. If we now have two timeouts, the 60-second one and the user-specified one, how do they interact with each other? Because I'm lost here, to be honest. So let's say I'm issuing a GET request with the timeout set to five seconds, right? So what really happens in that case now, without this change?
A: Okay, so I think just the last question from me: what does the timeout filter do? You said it's creating a goroutine, I think, there.
D: Yes; so what the timeout filter does is spin up a goroutine where it executes the rest of the handler chain, right, and then it basically waits up to 60 seconds to see if any response has been sent. Okay.
D: Please, please, if you have some time, review it; we need as many eyes as possible, to make sure that this doesn't have any other ripple effects.
A: Cool, yeah, totally, I will take a look, and I encourage others to also take a look.
B: Great. I think we've wanted to do that for a very long time, and no one had really sat down to it so far, so this is great to see.
D: Yeah, I think centralizing the timeout logic in a filter makes wiring up the context for the REST layer and for the admission-control plugin layer very, very easy; we don't have to worry about it. So I think there will be follow-up PRs that we can open after this one gets merged, appropriately.