From YouTube: SIG Node Sidecar WG 2023-09-19
Description
Meeting notes and agenda: https://docs.google.com/document/d/1E1guvFJ5KBQIGcjCrQqFywU9_cBQHRtHvjuqcVbCXvU/edit#heading=h.m8xoiv5t6qma
GMT20230919-160418_Recording_1742x1120.mp4
B
Hello, it's September 19, 2023. This is a sidecar working group meeting. Welcome, everybody. Let's get an agenda going. You go first.
B
We can suggest it; I don't see a reason why not, so yeah, let's make it happen. Just yesterday I discussed with [inaudible] about tracking issues. We have similar problems for the In-Place update KEP: we have so many issues that it's really hard to query them all. Okay, so we'll work on that.
B
Good, maybe the label will be different. Let me ask who can do it, and I...
D
So this is kind of an issue that Paco had created, and it's related to the ability of allocating CPUs exclusively at a container level. Currently, the CPU manager, if it's enabled with its static policy, looks at the quality of service of your pod, and only if you specifically indicated, through an integer CPU request, does it ensure that CPUs are exclusively allocated.
D
So I think what Paco was requesting was that we have the ability for a container to be observed, or evaluated, independently of the overall quality of service of the pod, and I had some potential solutions for a way of achieving that. So I left a comment there, and if you scroll a little bit more I think you should see it. Yeah, that one; I think you just scrolled past it. No, I'll share the link.
D
The one before this. Yeah, that one. So this basically talks about a new kubelet flag, something like a CPU manager scope, and this is very similar to what we have with the topology manager: we have a topology manager scope, and this would allow us to evaluate a container independently, kind of independent of its QoS class. So that was the main idea. In addition to that, we would need some changes in the resource requirements.
D
spec itself, which you'd see in the pod spec, and you can see the example below, which I have there. So we can introduce something like a resource constraint, and then we can say that for a specific resource we have this constraint. I want to have this as a very generic solution, and not just specific to init containers, because I've seen a lot of other use cases for this. So I'm just here to discuss that and get a sense of how everyone feels about it.
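The shape being described can be sketched as a pod spec. This is a reconstruction of the idea only: the `resourceConstraints` field and its values are hypothetical, taken from this discussion, not an existing Kubernetes API.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: constraint-demo
spec:
  containers:
  - name: worker
    resources:
      requests:
        cpu: "2"
      limits:
        cpu: "2"
      # Hypothetical field from the proposal: resources listed here must be
      # allocated exclusively to this container, regardless of pod QoS.
      resourceConstraints:
      - resource: cpu
        constraint: exclusive
  - name: logging-sidecar
    resources:
      requests:
        cpu: 500m
```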
B
I'm thinking about it, so I still need some time to absorb the suggestion. I know in the topology manager the scope may be either pod or container, but if it's container, then you only look at the container level. You cannot say: these two containers you look at together, and the other containers you look at separately, right?
D
Yeah. So in this case, the way I envision this is: if we have the CPU manager scope as pod, we would default to the current behavior, and that would allow us to get the backward compatibility that we obviously need; the container option is going to give us the new feature, the capability that we want. So we look at each container independently.
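As a rough sketch, the proposed flag would sit next to the existing topology manager scope in the kubelet configuration; `cpuManagerScope` here is hypothetical, while the other fields exist today.

```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
cpuManagerPolicy: static        # existing: static policy enables exclusive CPUs
topologyManagerScope: container # existing analogue of the proposed flag
cpuManagerScope: container      # hypothetical: "pod" keeps today's behavior,
                                # "container" evaluates each container on its own
```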
D
The only problem, which is something that I've highlighted at the end of this comment, is that we want to ensure that the container, or rather the pod in general, has guarantees around exclusive allocations. Now this container is requesting exclusive allocation, but because it's not part of a pod that is guaranteed, it could be evicted by a pod that is guaranteed. So we want to give these kinds of pods more priority. So one way of handling that...
B
Okay, and is it a scenario where all the main containers want to share CPU and the sidecar is working on some shared pool of CPUs, or is that not a scenario?
D
I think at this early stage I haven't thought about a separate shared pool, if that's what you're asking about. The way I see it is: we have a shared pool, and then we have exclusive CPUs. So a container that's requesting exclusive CPUs would be getting exclusive CPUs, and everything else would be in the shared pool. So there's no separation of the shared pool.
B
Can I ask for all the main containers to share the same CPUs? Because I see the sidecar is a little bit different: often sidecars don't need access to the same memory or the same CPU or the same devices, because they simply provide some networking or logging functionality that doesn't really need to be there. It only needs the same file system and the same network as the other containers; it doesn't need to know about the resources of the other containers.
D
But that's precisely what containers that don't have an exclusive requirement get, so this is exactly for normal containers; we have the same requirements. So imagine a pod that has two containers: one that is running a busy-loop DPDK application, and another container that is just doing some sort of logging. Essentially, for the second container, being on the shared pool is fine; it doesn't really care.
B
Yeah, I think this feature will be very valuable, the way you explain it. Are there any other questions?
B
Is anybody else going to ask a question? But yeah, I mean, our main motivation for sidecars: there are many reasons why you need sidecars, and Istio is actually moving to an eBPF approach, and we have different stories there. But what we see more and more is AI/ML workloads; they want to be resource efficient, and they need all those resource managers because of device access, and they need logging and monitoring and so on. That's why sidecars became more and more interesting to the community.
B
From that perspective, I think this is an important feature, and if it lands in 1.29 it will be great. Now the question is: I see two APIs. You want to introduce two APIs, right? The first API is a kubelet flag, and the second API is the resource constraints.
D
This is going to be applicable to any container that specifies this resource constraint, and it would depend on how the node has been configured. That's very intentional, because I've seen in the code that, in a lot of places, when we are evaluating containers, the first thing that we do is append normal containers to init containers and then evaluate them, and I just want to make sure that we continue to respect that.
B
Yeah, it makes sense. It may be challenging for 1.29; anything related to resource utilization can be challenging. So can you explain again what the resource constraints do?
D
So currently, the CPU manager looks at the quality of service of a pod. For example, if you look at the first container here, it has an integer CPU request, but the overall quality of service of this particular pod is Burstable, so the CPU manager would actually not even allocate exclusive CPUs for the first container. That's essentially the problem that we are trying to solve. What we are trying to do is make this requirement explicit, as opposed to earlier, when it was implicit.
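A minimal sketch of the kind of pod being described (reconstructed; the issue's actual example may differ): the first container requests integer CPUs, but a sibling container without limits makes the pod's QoS class Burstable, so the static policy gives the first container no exclusive CPUs.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: burstable-demo
spec:
  containers:
  - name: dpdk-app            # integer CPU request, wants pinned CPUs
    resources:
      requests:
        cpu: "4"
      limits:
        cpu: "4"
  - name: log-forwarder       # no limits -> pod QoS becomes Burstable
    resources:
      requests:
        cpu: 100m
```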
D
Only if you have a Guaranteed quality of service pod, and integer CPU requests within the container, will it be allocated exclusive CPUs. So it's like you have to go read the CPU manager documentation to figure out how to actually request exclusive CPUs. With this, we would be making the request more explicit.
B
So in the swap KEP we wanted to make sure that customers can disable swap, because swap, first, has security problems, but it also has some performance issues.
B
So what we said is: if the pod is Guaranteed, we will disable swap for that pod. But we also said that if a container is guaranteed, and guaranteed here means that the memory limit equals the memory request, then we will also disable swap. So we didn't put an explicit API there; we just said that, implicitly, we will look at your container and decide per container whether it is guaranteed or not. But...
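The per-container rule described above, sketched as a pod spec (illustrative only): swap handling is decided from the existing memory requests and limits, with no new field.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: swap-demo
spec:
  containers:
  - name: no-swap             # memory request == limit -> swap disabled
    resources:
      requests:
        memory: 512Mi
      limits:
        memory: 512Mi
  - name: may-swap            # request < limit -> container may use swap
    resources:
      requests:
        memory: 256Mi
      limits:
        memory: 512Mi
```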
D
I think swap is still in a better position than we are with the CPU manager, because we don't evaluate a container based on the quality of service of the container itself. With swap, at a container level you are able to disable swap, but here we are stuck with what we get as a package, the pod as a whole.
D
So I did think about it a little bit. The mistakes that you could make are, for example: if in the first phase we are just doing this for CPUs, the API still allows you to specify something like memory with the resource constraints, but that's not supported, so we would need a validating admission hook to make sure that the spec is validated. The other aspect that we would need to handle at the kubelet level is a scenario where, by default...
D
You have pod scope, which is the current default behavior, and then all of a sudden a pod comes in where resource constraints are specified. In that case, the kubelet would be rejecting that pod at admission time, because this particular node is not going to be able to support this configuration. So those are the two main scenarios that I've thought about, especially in relation to inconsistent configuration or behavior. But if there's anything else that you can think of... yeah.
D
Yeah, so I think the way I see it is that the resource constraints corresponding to a resource would have to be identical to what has been specified in the resource requests and limits, because if that's not the case, from a scheduling point of view we are not going to be doing the right thing. So that's going to be important as well; that is another constraint that we would perhaps have to handle in the validation webhook.
A
Excuse me, it might be in this realm, like I said. Yeah, I'm admittedly not great at it, but the "exclusive" value just seems like it's not really even needed: if the resource is in the list, if the resource is listed, then that resource is exclusive, and then you can't... oh.
D
Yeah, so I think one of the reasons I went this way was because in the API documentation they kind of tell you not to have boolean fields. And the other reason was that, down the line, we might want to add a "shared" field to this, and we might want to split resources between exclusive and shared. So having this laid out explicitly makes a stronger case for the capability.
D
I think what I'm going to do is study it. I remember you had shared a doc on Slack, the planning document, with a separate section, so I've added this to that, and we can maybe write up the KEP in this cycle and see how it goes. The priority I leave for the SIG to decide, but we can maybe start working in this direction. If it doesn't get into 1.29, that's okay; we can work full throttle in 1.30. But...
B
Yeah, that will definitely be helpful and, as I said, it will be very interesting for sidecar customers. Okay, yeah, I have more questions, but they are too specific, so yeah.
D
Yes, feel free to leave comments on the issue itself, because I'm currently thinking about how to go toward the next step, and I'm trying to make sure that I have all the edge cases tackled, like some of the discussions that we already had: consistency, values not specified in a proper way, handling misconfigurations, and things like that. All of those I'm already thinking about, but if you have specific questions, that's great; I'm actually looking for specific input.
B
A big area to investigate is whether, if customers already have NRI plugins, we aren't breaking anything, because a split-brain situation, where some decisions are made in the runtime and some decisions are made in the kubelet, is not ideal. They may not know about the new way everything will be done.
D
Yeah, yeah, I was thinking about NRI plugins as well, and perhaps even the interaction of NRI plugins with DRA, because DRA provides us the user interface, and we can use NRI as a mechanism for implementing the actual CPU allocation. But one of the challenges with that, based on discussions, is that I've learned that when people want to use NRI plugins, they're kind of taking control of the resource allocation completely, and you would potentially be disabling the CPU manager and all these resource managers.
D
I think that's how I would see it. Otherwise we would need some sort of separation of resources: some resources that are handled by Kubernetes and some that are given to some other component. It doesn't have to be that way; it could be, you know, a CPU plugin that I chose to implement using, say, the device plugin API, which is something that the Nokia CPU Pooler has done. So those are the kinds of semantics where you separate resources in general within a node, but I don't think we are in the shape, or have the ability, to do that.
B
Makes sense. Okay, so yeah, let's continue the discussion in a bigger forum. Okay, for this meeting: you have two PRs, yeah?
A
I was just putting them on a list; there are two more PRs that need reviews, or at least, if we're going to go with that label, label them so we can sort of identify these.
A
So, slightly different: there was a comment from jpbetz on that one, and the difference is that this holds the SIGTERM. Basically, if the preStop never finishes for a sidecar, then you won't get the SIGTERM, so it sort of behaves like normal containers do if your preStop lasts forever. Whereas if your preStop terminates early, then your SIGTERM won't come until the main containers are done; or the preStop could last too long but still have time.
A
I think what Mathias had written up was that you'll get the SIGTERM regardless of the status of your preStop, whether it was still running or had exited. Whereas this sort of works the way that normal containers do, which is: if your preStop runs long, your SIGTERM is delayed until it's done.
E
Okay, I think it's easier to fix the documentation than to fix the code.
B
Yeah, I will modify it, but are we breaking a scenario? Right, so... no, we don't. So if I want to implement Istio, I want to start the preStop and wait for the main container to complete, and then, when the main container completes, I exit the preStop, like via SIGTERM; I want to exit immediately. So I want to wait while the main containers are still running, but then, once they complete, I want to exit immediately.
A
Yeah, so the way you would do it is that you basically exit upon SIGTERM. PreStop lets you do whatever you want to do ahead of it, sort of get ready to shut down, and then SIGTERM is the actual shutdown signal.
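That ordering can be sketched with a standard lifecycle hook (the image name and cleanup command are placeholders): the preStop hook runs first, and SIGTERM is only delivered once it returns, or when the grace period expires.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: prestop-demo
spec:
  terminationGracePeriodSeconds: 30
  containers:
  - name: sidecar-proxy
    image: example.com/proxy:latest   # placeholder image
    lifecycle:
      preStop:
        exec:
          # Best-effort cleanup; returning promptly lets SIGTERM arrive
          # without waiting out the whole grace period.
          command: ["sh", "-c", "/cleanup.sh || true"]
```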
B
Yeah, but if I need to do something in preStop, make some kind of best-effort cleanup, then I cannot do it right in this scenario, because if I do it for too long, then I will miss the SIGTERM, and that will force the pod to wait for the entirety of the graceful termination period, yeah.
E
Yeah, yeah, that's not ideal, and if we want to implement it the way we documented, that's complex.
B
Yeah, let me reply back here and we can just iterate. I didn't want to lose this scenario: I want to make sure that we can exit as fast as possible, but allow as much cleanup as you want. So you need to be able to do as much cleanup as you want, but then you should be able to know when to exit, and exit immediately.
E
And avoid misbehaving preStops, which could be possible.
B
Yeah, so we still may end up in this situation where we just miss the SIGTERM by a little bit because of the preStop, and then wait for the entirety of the graceful termination.
B
Okay, let's discuss it in the PR, in this documentation PR, and then adjust the implementation correspondingly.
A
I guess it's like a logging sidecar. I would assume that in most implementations of a logging service you don't have a separate logging binary, or a separate execution that you run, to actually go sync stuff off the disk or sync it somewhere else; basically, you're using the preStop to signal the primary binary that's running to go do something special. But yeah, it could work either way.