From YouTube: Kubernetes SIG Node 20230808
Description
SIG Node weekly meeting. Agenda and notes: https://docs.google.com/document/d/1Ne57gvidMEWXR70OxxnRkYquAoMpt56o75oZtg-OeBg/edit#heading=h.adoto8roitwq
GMT20230808-170555_Recording_1920x1020.mp4
A
Okay, hello, hello. It's the weekly SIG Node meeting, welcome everybody. Today's date is August 8, 2023. Today we only have one topic on the agenda, and Linguan will be presenting. Take it over.
B
Okay, so, oh sorry, I think I'm not able to share the document. Maybe I can just...
C
I'll quit my Zoom; when I open it, there are some permission issues.
C
Yeah
cool,
thank
you,
so
yeah
I
think
this
is
the
continue
the
topic
of
the
meeting
of
July
18th,
so
we
actually
brought
both
down
the
all
the
problem
into
four
of
them.
So
here
yeah.
This
is
the
detailed
document.
So
maybe
I
will
briefly
talk
about
each
of
them
and
then
you
guys
can
comment
more.
If
you
have
more
questions
so
the
first
one
is
the
new
static
policy
for
CPU
manager.
C
So we know the CPU manager has some static policy, and we actually add one more static policy option, called spread physical CPUs preferred.
C
So the problem is that in our deployments we have a database team, and we actually found a performance gain if we spread the DB container across different physical cores, instead of the hyperthreads of the same physical core. You can look at the picture. Can you scroll down? Yeah, so here, if you compare these two different CPU sorting images, you can see the difference.
C
So in the first one, the CPU ordering is sorted according to virtual cores, so the hyperthreads of the same physical core are adjacent, and in the second one we sort the CPUs across different physical cores.
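For illustration, here is a minimal Go sketch (not the kubelet's actual code) of the two orderings being compared, assuming a hypothetical machine with 4 physical cores and 2-way hyperthreading where logical CPUs p and p+4 are siblings:

```go
package main

import "fmt"

// Hypothetical 4-core, 2-way SMT machine: physical core p owns
// logical CPUs p and p+4 (a common Linux enumeration).
const physCores, smt = 4, 2

// siblingFirst mimics the first ordering: both hyperthreads of a
// physical core are taken before moving to the next core.
func siblingFirst() []int {
	var order []int
	for p := 0; p < physCores; p++ {
		order = append(order, p, p+physCores)
	}
	return order
}

// physicalFirst mimics the proposed option: one hyperthread from each
// physical core is taken before any core's second thread is used.
func physicalFirst() []int {
	var order []int
	for t := 0; t < smt; t++ {
		for p := 0; p < physCores; p++ {
			order = append(order, p+t*physCores)
		}
	}
	return order
}

func main() {
	fmt.Println("sibling-first: ", siblingFirst())  // [0 4 1 5 2 6 3 7]
	fmt.Println("physical-first:", physicalFirst()) // [0 1 2 3 4 5 6 7]
}
```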
C
So yeah, and also we did some experiments; you can see the appendix for the detailed experiments on why this CPU static policy option is better.
D
You mean about which... yeah, I think I agree with Francesco, with what he already put in and was saying in the meeting now: this looks like just an option to the existing static policy, so the selection of which CPU core is just part of the algorithm that the CPU manager static policy has. I don't think we need a special policy for it.
C
Oh yeah, right, I think it should be one static policy option. So it's not a new static policy. Maybe I should fix that, yeah.
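As a sketch of what that would look like for an operator: static-policy behavior is tuned through cpuManagerPolicyOptions in the kubelet configuration, the same mechanism used by the existing full-pcpus-only and distribute-cpus-across-numa options. The option name below is hypothetical; the real name would be settled in the KEP/PR review.

```go
package main

import (
	"fmt"
	"log"

	kubeletconfig "k8s.io/kubelet/config/v1beta1"
	"sigs.k8s.io/yaml"
)

func main() {
	cfg := kubeletconfig.KubeletConfiguration{
		CPUManagerPolicy: "static",
		CPUManagerPolicyOptions: map[string]string{
			// Hypothetical option name for the proposal discussed here;
			// existing options like "full-pcpus-only" are set the same way.
			"spread-physical-cpus-preferred": "true",
		},
	}
	out, err := yaml.Marshal(cfg)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Print(string(out)) // YAML suitable for the kubelet config file
}
```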
E
Yeah, I can briefly talk about it. At the time we invented this policy, distribute-cpus-across-numa didn't exist in the community yet, and after we evaluated the new policy, we noticed there's a slight difference: distribute-cpus-across-numa is mainly focused on spreading evenly across the NUMA nodes, while our policy treats NUMA as a kind of secondary preference, a criterion for whether to cross or not. It first tries to spread across the physical CPUs in the same NUMA node, and only if there are no available physical CPUs does it try to cross NUMA. So the detailed strategy is a little bit different compared to distribute-cpus-across-numa; I think it's the 125. That's the difference.
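A rough Go sketch of one plausible reading of that ordering (a toy model, not the kubelet's allocator): untouched physical cores are preferred first, and within that preference the local NUMA node is preferred, so NUMA acts only as the secondary criterion.

```go
package main

import "fmt"

// cpu describes one logical CPU on a hypothetical 2-NUMA-node,
// 2-way SMT topology (not the real kubelet data model).
type cpu struct {
	id, core, numa int
	free           bool
}

// pick chooses n CPUs, preferring untouched physical cores, and among
// those preferring the requested NUMA node; hyperthread siblings are
// used only after fresh cores run out -- "NUMA as a secondary criterion".
func pick(cpus []cpu, n, preferredNUMA int) []int {
	usedCore := map[int]bool{}
	var picked []int
	for _, wantFreshCore := range []bool{true, false} {
		for _, wantLocalNUMA := range []bool{true, false} {
			for i, c := range cpus {
				if len(picked) == n {
					return picked
				}
				if !c.free ||
					(wantLocalNUMA != (c.numa == preferredNUMA)) ||
					(wantFreshCore == usedCore[c.core]) {
					continue
				}
				cpus[i].free = false
				usedCore[c.core] = true
				picked = append(picked, c.id)
			}
		}
	}
	return picked
}

func main() {
	var cpus []cpu
	for id := 0; id < 8; id++ {
		cpus = append(cpus, cpu{id: id, core: id % 4, numa: (id % 4) / 2, free: true})
	}
	// Cores 0,1 on NUMA 0 are taken first, then a fresh core across NUMA.
	fmt.Println(pick(cpus, 3, 0)) // [0 1 2]
}
```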
D
Well, here we actually have two different problems. Selecting spreading across physical cores versus multiple threads on the same core is part of the static policy, so you can look at an existing option, full-pcpus-only ("prefer full CPUs"), as an example. What you're looking for is exactly the opposite of what was done some time ago for it, but treating NUMA as a secondary thing.
D
Well, it's a bigger problem, because it's not only in the CPU manager; it's the whole logic of the topology manager.
E
Yeah, I think on the first thing, it is true that this is a totally different direction compared to the full-pcpus-only solution, because we do see noisy neighbor problems in our use case. That's the reason we'd like to distribute the threads across different physical cores. We do see performance gains for our use case, but I know this is not a common policy that could be used widely in other scenarios.
E
So that's why we make it optional. The second thing is the difference between our proposal and distribute-cpus-across-numa. As I said, there's a difference: what we'd like to control is more on the physical core side. We don't care whether it's on the same NUMA node or a different one. If we want more information about our proposed option, we can also try distribute-cpus-across-numa and do some performance comparisons later. From the allocation strategy I do see the difference, but on the performance side we don't have that data yet.
F
Thank you for the proposal. I would just like to add that, in general, adding this feature as a static policy option is feasible; I think it's something we can look into. But my comments (I'm Francesco) were more about: I would like to see if we can implement this option, this allocation strategy, as a composition, building on the existing building blocks we have, and then maybe fill the gaps or tune.
A
Yeah, I have another uber-question: how did you discover it? Did you find it in testing? You probably have a lot of knowledge, but I'm curious how our customers will discover similar things, like how people will recognize that one policy is advantageous compared to other policies.
C
Yeah, I can share that. So previously we only turned on the static policy for the testing, and we found that... if you can look at the appendix, maybe.
C
Right, so previously we found that when we turned on the default CPU static policy, it would assign the DB container to, for example, the same core with different hyperthreads. At that time, when the DB container had one thread, we only assigned it to one CPU.
C
So the performance is like that, and then we also needed two threads, and those two threads would be shared on the same physical core. So the performance doesn't improve linearly. We did more testing, and we found there could be a noisy neighbor issue. During that time, one of our team members...
C
They
consult
to
some
I,
think
the
maybe
Hardware
or
expertise
like,
and
then
they
found
like
it's
actually,
for
example,
if
the
DB
container
is
maybe
there
are
two
threads
if
they
bound
to
the
same
physical
core
with
two
different
hyper
threads.
So,
in
that
case,
there
will
be,
could
maybe
something
like
a
cash
contention
like
there
is
a
L1
cache.
It
can
be
it.
It
actually
shared
by
two
different
hyper
threads.
C
So
in
that
case,
like
the
performance
can
yeah
like
it's
worse
than
if
we
assign
these
two
difference
views
to
different
physical
cores,
so
we
do
more
testing
we
found
out.
C
Okay,
actually
that's
the
trend,
that's
the
issue,
so
we
we
actually
it's
actually
a
different
team,
they
conductive
experiments
and
they
actually
require
like
a
submit
some
requirements
to
our
kubernetes
team
and
seeing
maybe
this
CPU
policy
may
be
better
like
we
need
to
spread
all
the
DB
container
into
different
physical
cores,
because
at
that
time,
like
one
machine
typically
is
like
most
of
CPUs
are
idle,
so
we
only
have
maybe
a
1db
container
running
in
that
machine.
C
So
in
that
case
we
need
to
spread
the
DB
container
into
different
special
course.
So
we
have
better
performance
so
yeah,
that's
the
I
think
the
initial
maybe
request
for
this
feature
for
this
new
CPU
policy.
C
So
I'm
not
sure
about
the
details.
I
I
only
know
like
because
we
work
on
different
teams
and
they
actually
so
from
their
statement.
It's
actually
for
the,
for
example,
there
is
one
physical
machine,
so
most
of
the
physical
course
actually
are
Idle
No.
E
Yeah, I can give some comments. We definitely have some assumptions for this policy. From our online statistics, what we observe is that we don't see all the instances always busy: sometimes some instances are busy and some instances are fairly idle.
E
So
if
we,
if
a
DB
instance,
it
is
busy
and
the
average
thread
allocate
to
the
same
like
physical
card
that
have
the
issue,
but
it
does
have
a
chance
like
in
our
case,
I
would
say
a
lot
of
chances
like
this
instance
are
busy.
Okay,
you
use
the
like
the
cash
more
than
other
instances
that
also
share
the
CPUs.
So
that's
our
assumptions
because
we
are
doing
the
like
the
serverless
DB
instances.
We
don't
see
all
the
instances
are
busy
yeah.
A
I wonder if the same results can be achieved with the same static policy, but with twice the request.
D
Okay, so here are all the experiment results. As I see it, we are all assuming that you have a very active thread of the application, which consumes almost a whole physical core, and you have a separate hyperthread which might be less utilized by your workload, so it will not interfere much with the database.
C
Right
sure
yeah,
so
the
second
one
is
the
kind
of
for
maybe
related
to
the
first
one.
So
when
we
have
this,
you
know
the
new
static
policy,
so
we
also
need
to
or
meditate
or
to
work
with
the
In-Place
vpa,
and
we
found
that
actually
in
community.
This
is
this
is
not
supported,
but
we
actually
kind
of
need
those
features
to
work
together.
C
So
we
just
for
for
this
one,
like
we
proposed
some
solution
to
fix
this
issue
to
make
in
place
of
APA
and
CPU
Manager
work
together
correctly.
C
Yeah,
so
in
the
solution
part
I
think
we
have
several
fixes
for
for,
for
these
two
features
to
work
to
work
together.
A
An easier suggestion, just to make sure: you don't want to restart your workload, right? You want to update in place without a restart, right?
C
So actually, yeah, in our environment we have already turned on these two features, and we also conducted several experiments and some stress tests. So far there isn't any other issue we've identified, so I think...
A
Yeah, I just wanted to highlight the fact that when we designed the in-place VPA, we explicitly said that whether the workload will be restarted or not is not something you can enforce.
A
You
can
just
ask
to
restart
explicitly,
but
you
cannot
ask
not
to
restart
and
I,
see
more
and
more
scenarios
and
use
cases
when
there
is
assumption
that
vpa
in
place
vpa
is
assuming
no
restarts,
I
think
if
we
have
more
and
more
requests
like
that,
you
probably
need
to
deal
with
that
and
make
sure
that
our
API
is
allowing
this
assumption.
So
we
we
should
be
able
to
say
like
this
in
place
vpa
like
specific
API,
maybe
explicit
option
we
can
pass
or
something
like
that.
A
That
will
guarantee
no
restart,
because
I
mean,
as
I
said
right
now
is
designed.
Pp
does
not
guarantee
no
restarts.
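For reference, this is the per-resource knob in the current core/v1 API: resizePolicy on a container, with exactly the two values mentioned, where NotRequired is a preference rather than a no-restart guarantee. A minimal sketch:

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

func main() {
	// The two per-resource choices that exist today: restart the
	// container on resize, or prefer (not guarantee) a no-restart resize.
	c := corev1.Container{
		Name: "db",
		ResizePolicy: []corev1.ContainerResizePolicy{
			{ResourceName: corev1.ResourceCPU, RestartPolicy: corev1.NotRequired},
			{ResourceName: corev1.ResourceMemory, RestartPolicy: corev1.RestartContainer},
		},
	}
	fmt.Printf("%+v\n", c.ResizePolicy)
}
```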
D
The bigger problem is actually implicit QoS: a change might recalculate the pod into a different class. Right now, if I remember correctly, it's validated inside the VPA, so it doesn't allow you to do that, but the risk still exists.
E
So your concern on the QoS is, I think, that you assume a user could change any value of the resources?
D
My concern is that CPU manager policies work only with the Guaranteed QoS class, right, and the behavior for the Guaranteed QoS class has historically been maintained such that if we give some resources to a container, we are not changing them, or let's say they should not disappear. In terms of CPUs it means that we are allocating exclusive CPUs for the Guaranteed QoS class.
D
If
a
CPU
manager
is
active
and
some
of
our
applications,
let's
say
like
the
Telco
dpdk
based
applications,
they
rely
on
web
Behavior
so
way
like
team
themselves,
data
processing
threads
to
those
exclusively
allocated
course,
and
if
inside
policy
during
the
scaling,
you
start
to
add
or
remove
something
from
CPU
set,
you
you
will,
you
will
break
those
applications.
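To make the failure mode concrete, here is a minimal Go sketch of the pinning pattern such applications use (the CPU number and the DPDK context are illustrative). Once a thread is pinned like this, shrinking the container's cpuset out from under it breaks the application's assumption:

```go
package main

import (
	"log"
	"runtime"

	"golang.org/x/sys/unix"
)

// A DPDK-style app pins a worker thread to one exclusively allocated
// core. If the kubelet later shrinks the container's cpuset, this
// affinity silently points at a CPU the container may no longer own.
func main() {
	runtime.LockOSThread() // keep this goroutine on one OS thread

	var set unix.CPUSet
	set.Zero()
	set.Set(5) // core 5: assumed to be exclusively allocated to us

	// pid 0 means "current thread".
	if err := unix.SchedSetaffinity(0, &set); err != nil {
		log.Fatal(err)
	}
	log.Println("worker pinned to CPU 5")
	// ... busy-poll the NIC queue here ...
}
```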
E
Right
right,
I
think
there's
two
yeah
like
two
things.
The
first
thing
is
the
we
want
to
make
sure
the
Qs
class
won't
be
changed
and
I
think
that's
for
sure,
because
we
so
here
we
we
talk
about
like
CPU
manager
and
the
In-Place
BPA,
so
the
assumptions
that
people
always
choose
the
like
the
the
integer
values
and
they
won't
like
go
back
and
forth
between
like
some
different
Qs
classes.
That's
a
I.
Think
in
our
case
that's
an
assumption.
E
The
second
problem
is
I
think
how
application
adapts
to
the
new
changes
of
the
like
CPU
changes
or
memory
changes.
So
yeah
I
agree.
That's
a
like
a
problem
actually
to
application,
because
even
sometimes
we
do
allocate
more
resources
or
like
tier
two
applications.
The
applications
cannot
detect
the
results,
change
and
adapt
to
is
like
Behavior,
I.
Think
that's
more
on
on
the
application
side.
If
the
application
can
dynamically
detect
the
changes,
then
the
feature
Works.
Otherwise
it
doesn't
make
like
sense.
E
As
you
said,
some
like
dbdk
or
some
other
applications,
yeah
I
I,
would
say:
do
we
think
it's
better
to
like
enable
the
like
capabilities
from
the
resource
layer?
First
and
then
I
think
it's
the
applications,
the
responsibility
to
adapt
to
like
these
kind
of
new
changes,
at
least
that
we
can
unlock
some
of
the
applications
to
leverage
this
feature
and
that's
something
we
we
are
thinking
about.
A
In
general,
we
try
to
prevent
people
shooting
themselves
on
the
foot.
So
if
you,
if
we
can
make
it
clear
and
like
more
explicit,
what
should
what
will
happen?
We
will
do
that.
So
we
don't
fully
rely
on
application
to
behave
properly.
D
Like, imagine we enable this functionality in the CPU manager, and imagine we are preventing QoS change. The policy should then be able to track, when we're scaling the pod, or actually the container, up or down, what the original request was, so we never go lower than the original allocation. And the same goes for CPU cores: if we allocated some CPUs to a container at its start...
A
I
think,
in
the
end
of
the
day,
one
way
or
another
In-Place
update
needs
to
support
apology
manager
so
like
it
will
I
mean
some
solution
needs
to
happen
and
I'm
afraid
that
solution
may
be
I
mean
solutions
that
will
not
rely
on
application
to
behave
correctly
will
be
to
restart
everything
and,
in
some
cases,
even
whole
Port.
If
policy
was
better
Port
allocation,
so
you
need
to
re-admit
the
entire
report
and
restart
all
the
containers.
A
Which
is
far
from
ideal,
so
this
is
like
one
side
of
a
spectrum
is
to
restart
everything
like
the
entire
report
needs
to
be
rescheduled
in
the
same
node
and
another
side
of
spectrum
is
just
do
whatever
requested
and
hope
that
application
reacts
correctly.
So
I
think
we
need
to
find
a
balance
in
the
middle
and
I
haven't
read
the
solution
proposal
in
details,
but
I
think
what
Sasha
brings
as
a
concerns
is
a
very
well
concerns
and
we
need
to
decide
where
we
want
to
be
on
the
spectrum
between
like.
G
Did
we
ever
discuss
having
some
kind
of
a
hook
or
signal
to
the
application,
so
it
can
detect
and
like
the
cigarette
liquid
configuration,
something
like
that
could
work
or
then
or
a
custom
hook
or
script
that
the
application
can
add,
which
then
gets
called,
and
it
knows
something.
Change
I
need
to
reread
the
configuration
and
update
my
resources.
A
When we discussed the API for VPA, the API options were: restart the container explicitly, and prefer not to restart. There are two options right now, and one of the options we discussed is exactly that, sending some signal to the container. I mean, we discussed it briefly, but we never implemented anything like that. Do you think it would help in this situation?
D
It
depends
on
what
kind
of
applications
we're
talking
about
and
Depends?
Is
the
application
have
a
possibility
to
say
no
to
a
decision,
so
example
with
the
pdk
application?
You
are
saying
a
US
engine
signal
saying:
I
want
to
remove
CPU
core
number.
Five
and
applications
would
say
like
no,
no
I'm
already
like
using
it
very
actively
foreign.
D
Thank
you.
So
we
we
had
experience
with
in
NRI
we
we
have
a
down
downward
API
as
a
file
which
say
which
provides
a
list
of
CPU
cores
which
is
available
for
application
and
like
using
I-95
Watcher
on
on
this
file,
you
can
detect
what
something
is
changing,
but
it's
like
post
factum,
so
you
cannot
prevent
policy
to
change
your
allocations.
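A minimal sketch of that consumer side, assuming a hypothetical file path (the real path depends on how the NRI plugin or downward API exposes the list); it uses the common github.com/fsnotify/fsnotify inotify wrapper:

```go
package main

import (
	"log"
	"os"

	"github.com/fsnotify/fsnotify"
)

func main() {
	// Hypothetical path; the actual file name depends on how the
	// downward API / NRI plugin exposes the CPU list.
	const cpuListFile = "/etc/podinfo/cpuset"

	w, err := fsnotify.NewWatcher()
	if err != nil {
		log.Fatal(err)
	}
	defer w.Close()
	if err := w.Add(cpuListFile); err != nil {
		log.Fatal(err)
	}

	for ev := range w.Events {
		if ev.Op&fsnotify.Write != 0 {
			// Post factum, as noted above: the CPUs have already moved;
			// the application can only re-read and re-pin its threads.
			data, _ := os.ReadFile(cpuListFile)
			log.Printf("cpu set changed: %s", data)
		}
	}
}
```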
A
I
think
if
I
will
summarize
this
item,
we
definitely
need
to
work
on
that
and
some
work
will
be
needed
if
you
want
to
move
it
forward.
Please
do
I
think
this
may
be
a
good
starting,
a
starting
point,
but
make
sure
you
read
the
comments
and
try
to
address
as
much
attention
as
possible.
C
To
the
next
one
yeah,
this
is
the
third
one.
Is
the
In-Place
vpa
performance
Improvement,
so
we
found
that
I
think
for
right
now
for
equivalent
any
1.27.
The
problem
is
still
exists.
So
if
we
change
the
resources,
it
will
take
about
one
minute
to
finish
like
from
proposed
to
in
programs
and
no,
for
example,
we
need
to
scale
up
or
down
to
the
specific
containers,
so
we
actually
identify
the
issue
it's
like.
Actually,
there
are
for
for
this
performance
effects.
C
We
have
several
fixes
about
this
one,
so
it's
actually
I
think
this
is
the
first
one
and
can.
C
Oh
sorry,
I
I
forgot
to
paste
the
original
one
so
actually
in
the
Google
it
single
pod,
I
think
the
thing
called
is
like
a
one
Loop
to
reconcile
the
skill
resource
changes
right
so
at
the
I
think
at
the
end.
Actually
we
need
to
get
the
results
from
the
container
CRI
to
know.
C
Actually,
the
resources
has
been
like
actually,
for
example,
allocated
or
scale
down
up,
it's
actually
already
complete,
but
at
that
time
actually
in
right
now
in
Singapore
there
is
no
detection
of
that
like
we,
we
haven't,
got
any
part
status
from
CRI.
So
that's
why
there
is
like
a
what
you
need
to
one
minute
to
complete
the
scale
up
and
down.
C
So
we
actually
fixed
that,
like
we
directly
get
these
Port
status
from
CRI
and
then
we
know
actually
in
this
one
single
powder,
the
VPN
has
complete,
and
maybe
it
will
take
less
than
one
one
second.
So
after
we
fix
that,
we
found
another
issues
actually
in
here
with
like
a
fixed
one,
fixed
two,
so
it's
all
related
to
the
part
of
the
status
from
API
server
and
also
the
CRI
being
inconsistent
in
the
code.
So
we
fix
those
issues.
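A rough sketch of the kind of CRI query involved: asking the runtime directly for a container's currently applied resources, rather than waiting for the next sync iteration to observe them. The socket path and container ID are placeholders, and this is a standalone illustration of the idea, not the actual kubelet patch:

```go
package main

import (
	"context"
	"log"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
	runtimeapi "k8s.io/cri-api/pkg/apis/runtime/v1"
)

func main() {
	// Socket path and container ID are illustrative placeholders.
	conn, err := grpc.Dial("unix:///run/containerd/containerd.sock",
		grpc.WithTransportCredentials(insecure.NewCredentials()))
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	client := runtimeapi.NewRuntimeServiceClient(conn)
	resp, err := client.ContainerStatus(context.Background(),
		&runtimeapi.ContainerStatusRequest{ContainerId: "<container-id>"})
	if err != nil {
		log.Fatal(err)
	}
	// ContainerStatus.Resources carries the resources the runtime has
	// actually applied, which is what the resize reconcile needs to
	// observe before it can mark the resize complete.
	if r := resp.GetStatus().GetResources().GetLinux(); r != nil {
		log.Printf("cpuset=%q cpuQuota=%d memLimit=%d",
			r.CpusetCpus, r.CpuQuota, r.MemoryLimitInBytes)
	}
}
```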
A
Yeah
I,
don't
know
about
specifics
here,
I
think
we
will
have
David,
maybe
able
to
comment
right
away,
but
I
I,
don't
bet
on
that
yeah.
Please
send
a
hot
fix
like
fixes
PR,
and
make
sure
that
there
is
a
description
idea.
The
test
for
that
I
have
a
big,
bigger
piece
of
feedback
from
Clayton
that
most
of
the
quotations
written
may
need
to
be
Rewritten,
because
there
are
many
principles
for
races.
A
So
there
are
suggestions
how
to
refactor
things
if
it's
very
targeted
updates
well
features
still
in
Alpha
and
it
doesn't
affect
any
any
other
logic,
definitely
like.
Let's,
let's
make
it
better,
so
people
can
test
how
vpa
Works
in
in
ideal
scenario
rather
than
like
less
ideal.
Thank
you.
Yeah.
C
Okay, sure, yeah. So we can go to the last one. This one is also related to in-place VPA. We found that sometimes it will be stuck in progress, because sometimes we do have custom resources under this resources-allocated keyword. I think maybe you already know that the pod has this in its spec and also in its status, right? For in-place VPA we only focus on CPU and memory scale up and down.
C
So
if
there
is
another,
for
example,
another
keyword
here,
for
example,
maybe
IP
or
GPU
or
network
blah
blah
some
other
devices
in
this
resource
allocated,
then
it
will
for
this
one.
The
vpu
will
be
stuck
in
progress
because
you
can
see
the
code
here
so
it
it
will
compare
actually
the
the
resource,
I
think
from
the
spec
and
also
the
status.
It
will
never
like
this.
This
resource
will
never
be
equal
because
in
place,
vpa
will
only
have
CPU
and
the
memory,
but
for
the
spec
it's
actually
another
customer
resources.
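A small Go sketch of the shape of the fix (illustrative, not the actual patch): compare only the resources that in-place resize manages, so an extended resource in the spec cannot keep the comparison unequal forever.

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
)

// resizeDone compares only the resources in-place resize actually
// manages; extended resources such as a GPU are deliberately ignored,
// which avoids the "never equal, stuck in progress" comparison
// described above.
func resizeDone(desired, allocated corev1.ResourceList) bool {
	for _, name := range []corev1.ResourceName{corev1.ResourceCPU, corev1.ResourceMemory} {
		d, a := desired[name], allocated[name]
		if d.Cmp(a) != 0 {
			return false
		}
	}
	return true
}

func main() {
	desired := corev1.ResourceList{
		corev1.ResourceCPU:    resource.MustParse("2"),
		corev1.ResourceMemory: resource.MustParse("4Gi"),
		"nvidia.com/gpu":      resource.MustParse("1"), // extended resource
	}
	allocated := corev1.ResourceList{
		corev1.ResourceCPU:    resource.MustParse("2"),
		corev1.ResourceMemory: resource.MustParse("4Gi"),
	}
	// Naive full-map equality would report "in progress" forever here.
	fmt.Println(resizeDone(desired, allocated)) // true
}
```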
D
It actually triggers the same question about restart versus no restart. For everything we are not restarting, we need to have a comparison with just its native resources; for everything else, if we have changes, we need to force restarting the containers to trigger the reallocation of those resources, like device plugins or anything else, or extended resources.
A
Yeah, I'm so excited that you're trying it out and giving feedback. This is a big feature; everybody wants to start using it, but I'm a little cautious about it. There are so many issues: whenever somebody tried it, they found something to report. There are over 20 bugs reported in the Kubernetes issues list just for this feature.
E
Do you have an actual issue list? Do we have a tag for this feature, to easily find those 20 problems? Then we can probably check and see whether we can match our fixes with those issues, so we can reference those issues when we file the PRs.
A
Would
be
to
pause
this
list
on
enhancement
issue,
so
we
have
some
summary
of
what
will
be
happening
and
it
will
tackle
most
of
the
problem
in
129.
It's
a
it
will
be
a
great
Improvement.
A
Okay, let's go back to the agenda. Is there anything else?