From YouTube: Kubernetes SIG Node 20230829
Agenda and notes: https://docs.google.com/document/d/1Ne57gvidMEWXR70OxxnRkYquAoMpt56o75oZtg-OeBg/edit#heading=h.adoto8roitwq
Recording: GMT20230829-170534_Recording_640x360.mp4
A
Hello, hello! It's August 29, 2023, and this is the SIG Node weekly meeting; welcome, everybody. We don't have big attendance. I think it's the last week of summer, before many kids go back to school, so that may be the reason, and Google Next is also going on, so many Googlers are out. Even so, we still have a very large agenda, so let's get right into it. First up is Canon.
B
Yeah, so about a month ago I brought to this agenda the takeover of the Pod-ready-to-start-containers KEP for beta, and I have a PR to update the KEP for beta. I was hoping to just get a reviewer/approver on that, if possible. I don't think we are going to talk about 1.29 in this meeting yet, but I was hoping to get this on track for 1.29; I just don't know when that's happening quite yet.
A
Yeah, we will start planning maybe next week, when there will be more people on the meeting, or discuss it in the meantime. So yeah, it's time to bring it up, and if you think it's ready for 1.29, that's great. I also see you posted a couple of scenarios where it is being used. Yeah.
B
You had a suggestion to update some use cases that I found useful, so I added those, and then I updated the KEP with some of the details since 1.25, just to make it clearer what progress that KEP has made.
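For readers following along: the KEP discussed here adds a kubelet-owned pod condition, renamed from PodHasNetwork to PodReadyToStartContainers in 1.28 and gated by the PodReadyToStartContainersCondition feature gate at the time of this meeting. A minimal client-go sketch for inspecting it; the pod name, namespace, and kubeconfig path are assumptions for illustration:

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Assumes a kubeconfig at the default location (~/.kube/config).
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	clientset, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}

	// "my-pod" and "default" are placeholders.
	pod, err := clientset.CoreV1().Pods("default").Get(context.TODO(), "my-pod", metav1.GetOptions{})
	if err != nil {
		panic(err)
	}
	for _, cond := range pod.Status.Conditions {
		// Set by the kubelet once the sandbox is created and networking is
		// configured, i.e. the pod is ready to start containers.
		if string(cond.Type) == "PodReadyToStartContainers" {
			fmt.Printf("%s=%s (reason: %q)\n", cond.Type, cond.Status, cond.Reason)
		}
	}
}
```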
C
Hi, actually Karthik has not joined the call, and I will be asking a few questions on behalf of Karthik. We actually wanted to schedule a meeting to discuss the dynamic node resize KEP further. Would it be appropriate to have this meeting scheduled sometime early next week, provided that SIG Node has no agenda or plans which might hamper that schedule? I mean specifically for this KEP.
C
Oh, we would just like to schedule a separate meeting to discuss this KEP further. That is the expectation we have: so that we could get an idea together and continue to work on this KEP to get it rolling.
A
Sure. What Peter did with the GC working group, like image GC: he sent out, what's it called, the scheduling thingy... Peter, remind me... a Doodle, okay, yeah. So you can set up a Doodle, suggest some times, and whoever is interested can join. I think the early feedback is: we already have the in-place pod vertical autoscaling feature in place, in alpha. The thing about that feature is that it enables dynamism in how pods are allocated, and this KEP is another piece of dynamism. The feedback a couple of releases back was that once in-place vertical autoscaling is implemented, it will open the door for more dynamism, because we will already have a way to adjust things on the node side. The problem is that in-place vertical autoscaling still has a lot of debt: we need to refactor some of the logic for how allocation happens, and this debt still hasn't been paid. So we still have this feature in alpha, and we still have outstanding refactoring to do, and that will complicate the implementation of dynamic node resize, because without landing the refactoring for one KEP, introducing another KEP will be quite complicated and may not be reasonable. You'll get a lot of pushback on the code front as well; just for you to be aware of.
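For context on the alpha mechanism being referenced: with the InPlacePodVerticalScaling feature gate enabled (alpha since 1.27), a running pod's container resources can be changed by patching the pod spec. A hedged client-go sketch; the pod name, namespace, container name, and CPU value are placeholders:

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	clientset, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}

	// With the alpha gate on, mutating spec.containers[].resources on a live
	// pod requests an in-place resize instead of being rejected as immutable.
	patch := []byte(`{"spec":{"containers":[{"name":"app","resources":{"requests":{"cpu":"750m"}}}]}}`)
	pod, err := clientset.CoreV1().Pods("default").Patch(context.TODO(), "my-pod",
		types.StrategicMergePatchType, patch, metav1.PatchOptions{})
	if err != nil {
		panic(err)
	}
	fmt.Println("resize requested for", pod.Name)
}
```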
C
Yeah, I totally agree with that point of view, so we are okay with that. We will just keep the discussion in progress, and probably once there is stability and some conclusion on the in-place pod autoscaler, then we would proceed to work on dynamic node resize, if that's okay.
A
Oh yeah, absolutely, and thank you for pushing it forward; I think the feature is required by many. And I think it's worth another shout-out, since we're talking about in-place vertical scaling: if anybody is interested in moving it forward, please step up. I think Vinay is not going to move it forward anytime soon, from what I heard.
A
Okay, thank you for bringing it up. Next is a big topic: an API for pod readiness information.
D
Yes, hi. So I would like to propose, let's say, a new API for the kubelet. We've been working to answer some questions around this, and we would like to propose an API that serves information about the local pods' readiness. We would like this API to be exposed by the kubelet, because the kubelet owns a lot of the conditions regarding pod state.
D
The kubelet is watching the status of the pods it is running, and we think this API would be a benefit: it would decouple pod readiness from the control plane. Also, adding new watches to the kube API is a huge scalability issue, so if some workload only needs to watch the pods running on its own node, it is a real benefit to be able to read them directly from the kubelet.
D
There are already some APIs in the kubelet; for example, there is the pod resources endpoint, which serves information about the local pods owned by the kubelet. That's why I think we should expose a new one alongside it. We also know there are some issues with the kubelet, for example restarts, so we need to take care of that, and first of all we think this API should return the actual state known by the kubelet.
D
When implementing this API, we need to make sure that by the time it is ready to serve data, the kubelet knows the state of all the endpoints, because we know from the pod resources endpoint that there has been an issue where the kubelet serves information while it does not yet know which state the pods are in; and by default, if the kubelet doesn't know the information about a pod, it marks it as not ready. So we know we need to cover this issue, as I said at the beginning.
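A minimal sketch of that guard, under stated assumptions: the server type, its name, and the cachePrimed flag are hypothetical paraphrases of the proposal, not anything that exists in Kubernetes today. The idea is to answer "retry later" rather than report unknown pods as not-ready:

```go
package main

import (
	"fmt"
	"sync/atomic"

	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// readinessServer is a hypothetical server for the proposed API. cachePrimed
// flips to true once the kubelet has observed the state of every pod it runs.
type readinessServer struct {
	cachePrimed atomic.Bool
}

// guard would run at the top of each RPC handler: instead of defaulting
// unknown pods to not-ready (the pod resources pitfall described above),
// tell the client to retry.
func (s *readinessServer) guard() error {
	if !s.cachePrimed.Load() {
		return status.Error(codes.Unavailable, "kubelet has not finished syncing pod state")
	}
	return nil
}

func main() {
	s := &readinessServer{}
	fmt.Println(s.guard()) // rpc error: code = Unavailable ...
	s.cachePrimed.Store(true)
	fmt.Println(s.guard()) // <nil>
}
```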
D
We want this API to be, let's say, independent of control plane availability, because it might happen that the control plane is unreachable or down. We know the control plane is the brain of Kubernetes, but for the brief moment when the control plane is down, the user workloads may still be working fine, and there may be some workloads that are interested only in the current state.
D
So we think that if the kubelet is running and healthy, even if the kube API is down, this API should return the current state stored in the kubelet caches, even if that state might not have been reported to the kube API yet. And this API should be rate limited: we don't want to put a very big load on the kubelet. I tested it briefly for resource consumption.
D
If we run gRPC requests at, let's say, one request per second, it shows no resource consumption increase, so we can add more load and it is still okay. And we would like to use gRPC because we can version it, it can be strongly typed, and we can leverage the streaming.
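As one way to picture the rate limiting mentioned here: a small gRPC server-side interceptor built on golang.org/x/time/rate. This is a sketch of the general technique, not the proposal's actual implementation, and the limits are illustrative:

```go
package main

import (
	"context"

	"golang.org/x/time/rate"
	"google.golang.org/grpc"
	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// rateLimitInterceptor rejects unary calls once the shared limiter is
// exhausted, so a chatty client cannot put a big load on the kubelet.
func rateLimitInterceptor(l *rate.Limiter) grpc.UnaryServerInterceptor {
	return func(ctx context.Context, req interface{}, info *grpc.UnaryServerInfo,
		handler grpc.UnaryHandler) (interface{}, error) {
		if !l.Allow() {
			return nil, status.Error(codes.ResourceExhausted, "readiness API rate limit exceeded")
		}
		return handler(ctx, req)
	}
}

func main() {
	// Roughly one request per second with small bursts, mirroring the load
	// level tested above; real limits would need tuning.
	limiter := rate.NewLimiter(rate.Limit(1), 5)
	_ = grpc.NewServer(grpc.UnaryInterceptor(rateLimitInterceptor(limiter)))
}
```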
D
I put together a proposal of how it would look, but essentially we are interested in returning the conditions that are owned by the kubelet. From these conditions the workloads may know which phase a pod is in, and whether it is ready to serve traffic or not.
E
Can you give more details in the proposal about the kind of workloads that need to know this information?
D
So I think there might be different workloads, for example, which want to understand if the pods are actually ready to serve traffic. One example might be that Cilium would like to understand whether the pods are ready to receive traffic, or whether they are in a terminating state or, for example, starting. I think there might also be other customers who want to, for example, run some workflow that understands which phase the pods are in, for example for better custom monitoring.
D
So I think this API might have, let's say, various user stories, but overall I think that per customer there will be just one or two of these pods, because they will be more like system pods. It's not like a customer will run these pods and expose this data outside; this is something that should be kept on the node.
F
Yeah, maybe I can provide a little bit more context. We're trying to bridge a gap in reliability, performance, etc. Right now, if you're a workload on a node and you want to understand whether another workload on the same node is healthy, you need the API server to be up; you have to go all the way up to the control plane and then back, and that's not really helpful when the kubelet is sitting right beside you, is actually doing the health checking, and has the most recent information.
F
What we're particularly concerned about is a scenario where the control plane is unavailable and we want to understand the health of a pod on the same node, but we can't get it because the control plane is not there, even though all the information is sitting right there beside us. That's really what we're trying to resolve. You know, the example Catalina mentioned of, like, Cilium, or a data plane just wanting to understand.
F
For example, the health of an endpoint on the same node could be very helpful and provide just a bit more reliability, and remove that kind of control plane dependency where it doesn't need to exist.
E
Yeah, I think adding those use cases will be helpful to motivate why we are adding something like this. The second thing is, I believe there was a discussion with the architecture SIG, sig-arch, on this proposal about whether to add gRPC or to add a more first-class API. So maybe add an Alternatives Considered section with pros and cons.
D
Yes, we proposed this to sig-arch and they asked about that. So we looked at the HTTP API, but the read-only port is, let's say, considered not safe and is being shut down for some workloads, and to access the authenticated port the application goes straight to the control plane. The credential can be cached, by default for, I think, two or five minutes; but like I said, we wanted to introduce this locality.
D
We wanted the workload to be on the node and access it there. So we thought about using unix sockets: to access the unix socket, the workload needs to be on the node. We thought this would be beneficial because of the control plane dependency issue, and I believe it will be easier to version this API with gRPC than with HTTP.
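The on-node unix socket plus gRPC pattern described here already exists in the kubelet's Pod Resources API, which makes for a concrete sketch of what a client of the proposed readiness API might look like. The socket path below is the conventional default and may differ per cluster; note this dials the existing resources endpoint, not the proposed one:

```go
package main

import (
	"context"
	"fmt"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
	podresourcesapi "k8s.io/kubelet/pkg/apis/podresources/v1"
)

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	// The unix socket keeps the call local to the node: no API server hop,
	// and no dependency on control plane availability.
	conn, err := grpc.DialContext(ctx,
		"unix:///var/lib/kubelet/pod-resources/kubelet.sock",
		grpc.WithTransportCredentials(insecure.NewCredentials()))
	if err != nil {
		panic(err)
	}
	defer conn.Close()

	client := podresourcesapi.NewPodResourcesListerClient(conn)
	resp, err := client.List(ctx, &podresourcesapi.ListPodResourcesRequest{})
	if err != nil {
		panic(err)
	}
	for _, pod := range resp.GetPodResources() {
		fmt.Printf("%s/%s\n", pod.GetNamespace(), pod.GetName())
	}
}
```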
F
Maybe just a high-level... that's very good feedback, thank you. Just a high-level, kind of procedural question: at what point should we move forward with actually translating this into a PR on k/enhancements, like actually start spinning up a KEP itself for this?
A
Is there any more questions? Mike, you turned on the video for a second and then turned it off. Did you want to ask something, or was it just a mistake?
F
It's intended to be read-only, and honestly only a very small subset. This is coming from a SIG Network perspective, and all that we're really looking for is information on readiness, though I can understand how, once we open this little door, it could expand beyond that. But for our use case, that's really all we want, I mean.
G
Right, we have other paths to do, you know, network readiness checks; maybe we should talk about that. I don't know if you've been working with the SIG Network team that's doing some additional design changes to support multiple networks and updates, that sort of stuff.
F
Fair, there is the parallel work on multi-network. Yeah, I think this is orthogonal, because again, at this point we just care about: is this ready or not? We're trying to remove API server dependencies where they don't need to exist; that's it.
F
You know, like if we get a support ticket or something like that: the API server was down, and therefore my endpoints are never getting traffic, or they are getting traffic but they're unhealthy. We're just trying to avoid those kinds of disconnects.
F
I mean, when you have enough things, enough scale, there are any number of things that could go wrong, so this is just about providing a little bit more reliability, I guess. And I know the kubelet also has things that could go wrong on it, but we're trying to keep dependencies as local as possible.
H
I wanted to add a quick point: there are other components, I think, that are relying on the HTTP pods endpoint today, and there are a lot of problems with that endpoint; namely, it's all one big JSON blob. We've had cases of metrics agents and other things reading it and actually switching over to read from the API server instead, just because of the performance inefficiencies of reading the HTTP API. So maybe I would also like us to consider making it extensible; I understand the narrow need in this case, but maybe there will be other use cases that we'll want to follow up with.
A
Yeah, I think my big worry about that is: today we have the Pod Resources API, which is also gRPC and also covers all the pods, and we don't want to add readiness information to pod resources, for multiple reasons. One of them is to separate throttling of one endpoint from another: if somebody is asking about devices too often, we don't want readiness to be affected.
A
If we start using it for more scenarios, then keeping one generalized endpoint may not be a good answer for endpoint throttling issues. So yeah, I kind of like the SIG Architecture feedback: we need to spell out why exactly we're doing gRPC versus a REST endpoint that could potentially have per-client throttling, per service account or something like that. We need to spell it out and understand how we'll generalize it in the future.
A
Okay, any more comments or questions? I think it's a big deal: I think this would be only the second kubelet API with guaranteed backward compatibility. Not even metrics are guaranteeing anything.
A
Okay, next is Xing Yang.
I
Yeah, so generally the problem is that the registration-completed flag is right now an internal field, only accessible within the kubelet, but we are trying to fix an issue with CSINode where we need to refer to the registration-completed status and then decide what the CSINode initialization should be doing. So the first question is: is there any existing accessible status in the Node object to indicate that the node's registration has completed? If yes, that's the best, and we can just consume it; and if not, is there any suggestion on how we should pass this status around?
H
Yeah, so I think we chatted about this. I think there's a couple of options. One: since the volume manager is part of the kubelet, the kubelet in its node status has that registration-completed flag, right? So I think that's one option.
H
Yeah, I think today we don't expose that boolean very well, but maybe that's one option. The second option would be somehow to pass the full node status, I think; maybe you could check the Ready condition, or proxy it.
H
So maybe we need to... I think the problem is that that information is not really accessible to other kubelet components; it's kind of internal to the kubelet in some sense, or even internal within the kubelet, hidden from other packages. So maybe we just need to do a little bit better job of exposing that information to other kubelet components, like having some type of wrapper.
A
Yeah, sorry for being slow; I think I understand the problem now. So you want the registration-completed flag inside the kubelet code to be exposed to other components. David, if you look at this code: how do we know to start registering static pods with the API server? Do we check this flag somehow?
H
I'm not sure exactly where that's synchronized. I know there are sources for the pod information: all the different sources of static pods, API pods, etc. They have a sources-ready field that's set when that source is ready and able to provide new information. I'm not sure how exactly that's linked up to registration-completed; maybe there are two different mechanisms there. I need to take a closer look.
G
I put the link to where the code sets it to true: after registering, it's basically just "yeah, we're done, we registered with the API server as the node."
A
Thank you for bringing it up. Next one is Vipin.
J
Hey guys, can you hear me? This is a mic test... okay, yeah, thanks. So I raised a PR some time ago; it's to fix the CPU manager: when you define some reserved CPUs, they should be excluded from the shared pool, so containers shouldn't be allocated on those CPUs. But due to the current test lane, a CPU manager test is failing, so the previous reviewer is not confident, and the PR is still kind of pending.
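A small sketch of the invariant the PR is after, using the k8s.io/utils/cpuset helpers: with the static CPU manager policy, CPUs listed in the kubelet's reservedSystemCPUs option should be subtracted from the shared pool that ordinary containers run on. The CPU ranges here are illustrative:

```go
package main

import (
	"fmt"

	"k8s.io/utils/cpuset"
)

func main() {
	// All online CPUs on a hypothetical 8-CPU node.
	allCPUs, err := cpuset.Parse("0-7")
	if err != nil {
		panic(err)
	}
	// CPUs set aside via the kubelet's reservedSystemCPUs option.
	reserved, err := cpuset.Parse("0,1")
	if err != nil {
		panic(err)
	}

	// The shared pool must never include reserved CPUs, which is the
	// behavior the fix enforces.
	sharedPool := allCPUs.Difference(reserved)
	fmt.Println("shared pool:", sharedPool.String()) // shared pool: 2-7
}
```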
A
The things that are problems, right: the first problem is that the test is failing, and I think Swati is trying to fix this test, with Francesco helping her; and the second problem is that Francesco is not sure this fix is actually changing things in the proper way. I didn't get into the details.
A
That's why I pinged Kevin again, because Kevin had already approved it, I think.
J
So, according to my understanding, I think he just means the change is too deep, meaning he's afraid it might not cover all the cases. But I think it's kind of vague, so I don't know.
J
If the statement could be more concrete, maybe I could try to take a look; but as far as I can see in the code, I think that might be the appropriate place. I don't know; if someone could be assigned and sync with me, that would help me to proceed.
J
Yeah, okay. How about this: if someone can help me with that, either to see if there's any other place where I should change the code instead, or how to do some tests to verify that it covers most cases, or maybe just to fix that test lane... I just need someone's help, because I'm kind of new to this thing.
J
Yeah, so I posted a couple of messages in the SIG Node Slack, but no, no response.
J
Yeah, thanks, yeah.
A
Okay, thank you for now. Yeah, this failing test lane is quite disturbing; I mean, it's new. What happened is, in 1.28 we introduced this test lane because before that we didn't have any tests in multi-NUMA environments upstream. I mean, we didn't have any tests, so we just relied on Red Hat to test it and give a green light for every PR, like "yeah, this PR is fine, go merge it," and we just had to change that.
A
In 1.28 we had this test lane working for some time, but then it broke, and we're still trying to get it back to green.
A
With that, we've reached the end of the agenda. Is there anything else today?