From YouTube: Kubernetes SIG Node 20210126
Description
Meeting Agenda:
https://docs.google.com/document/d/1j3vrG6BgE0hUDs2e-1ZUegKN4W4Adb1B6oJ6j-4kyPU
A
All right, recording started.
B
So, it's the January 26th SIG Node meeting. So yeah, I think we are at the stage of development in the release when there are a bunch of work-in-progress PRs right now.
B
If you look at created PRs, you can see quite a few of them marked work-in-progress, so this rise in PR count may be expected. But still, there are a few PRs, and we identified a lot of PRs and CI test groups that need to be reviewed and moved forward; they're quite straightforward, so yeah.
B
If you have time, please review PRs. But I don't feel very bad about this plus seven on the PR count, because at this stage people are working on features, so this is great.
C
Yeah, I'll jump in and say: yesterday, in the second half of the CI meeting, we met and discussed. You know, we have the CI test-specific board, and then we have the board which is everything else; Sergey is mostly managing the test board and I'm managing the everything-else board.
C
So from that meeting we came away with a few actions. Action number one was for me to put a little documenting note at the top of each column on the non-test board, to explain what things should go in each column so other people can jump in and participate, and as well to add a sort of longer-form document.
C
So I haven't had a chance to draft that document yet, but that should be coming. And the other thing I should mention is that I think we may want to move the CI meeting, which is currently on Mondays, because it's at kind of a bad time for some people. So I think Sergey is going to send a doodle.
D
Yes, hi. So I missed the SIG Node meeting a couple of weeks back, so I'm not exactly sure what was discussed there. I know I was looking into the issue which caused this revert to happen.
D
I looked into the e2e logs and the CI logs, and I saw that, yep, this fix will actually cause the pod delete to be slowed down, which is expected. That's the reason I even mentioned in the PR comments that after this fix you should see a little bit more time from the moment you delete the pod to the moment it is actually removed from the API server. Because in the current world, what's happening is that not everything is being cleaned up: a few things are just left for the GC to take care of, and those things actually include something important, like network resources and other stuff.
D
For cluster admins, that means they would know beforehand why there are so many pods which are actually asked to be killed but still not yet fully terminated; with some network issue happening on some nodes or something, those kinds of issues would be caught straight away.
D
So I understand that it was reverted because it's causing some P1 issue in the CIs there. Since I was not part of the other SIG Node meeting, I just want to see what was decided on this here; that way I can look into this code more, fix whatever is required, and create a new PR back with that fix.
B
Some context on the investigation: originally it was proposed once, long ago, and it was implemented but then reverted back once already, as far as I remember because of the delay introduced for pod shutdown. Then there was a PR that attempted to fix it, and it was fixing it in most cases. The idea was that when we receive the lifecycle event that a pod is being removed, or a container is being removed from a pod, we will try to clean up the sandbox, and it was working in most cases.
B
But sometimes it was raised at a stage when the pod is not ready to be removed, and we failed to remove the sandbox and were stuck waiting for GC to pick up the slack. So it could take up to one minute to even start the deletion process of some sandboxes, and then, depending on how many sandboxes you have, it may take a while. One minute of extra pod cleanup time is not acceptable.
B
Many tests will fail, and it's not a super good experience; that's why we reverted it. But I mean, I think it's reasonable to want to clean up all the resources, I don't know why not. We just need to make sure that we clean them up as fast as possible and don't wait for GC to kick in.
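The behavior described here can be sketched as follows; this is a minimal illustration with hypothetical names, not the actual kubelet code. On a pod-termination lifecycle event the sandbox removal is attempted immediately, and only when the pod is not yet ready for removal does cleanup fall back to the periodic garbage collector, which is where the up-to-one-minute delay came from.

```go
// Minimal sketch (hypothetical names, not the actual kubelet code): on a
// pod-termination lifecycle event, try to remove the sandbox right away; if
// the pod isn't ready yet, fall back to the periodic garbage collector, which
// is what introduced the up-to-one-minute delay that caused the revert.
package main

import (
	"errors"
	"fmt"
)

var errNotReady = errors.New("pod not ready for sandbox removal")

// removeSandbox stands in for the CRI RemovePodSandbox call.
func removeSandbox(sandboxID string, podTerminated bool) error {
	if !podTerminated {
		// Too early: containers still exist, so removal fails and the
		// sandbox is left for GC to pick up later.
		return errNotReady
	}
	fmt.Printf("sandbox %s removed immediately\n", sandboxID)
	return nil
}

func onPodLifecycleEvent(sandboxID string, podTerminated bool) {
	if err := removeSandbox(sandboxID, podTerminated); err != nil {
		// Fallback path: GC runs on a fixed period (minutes), which is
		// where the extra pod-cleanup latency came from.
		fmt.Printf("deferring %s to garbage collection: %v\n", sandboxID, err)
	}
}

func main() {
	onPodLifecycleEvent("sandbox-a", true)  // fast path
	onPodLifecycleEvent("sandbox-b", false) // slow path via GC
}
```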
D
Yeah, so yes: when the first revert happened, it was mostly the actual Kubernetes e2es doing the release validation tests that were taking more time, so I was running those to verify my fix. I think now it's happening on the CI side, where they use, I think, kind.
In that network scenario, I can try to simulate it and see where exactly the delay is happening, and try to fix that part. Okay, sure, I'll look into exactly the issue which failed, which caused this thing to be reverted, and I'll try to see where it actually happens. Thanks, thanks Eddie.
B
Yes, and I think Elana mentioned some failures on CRI-O as well that were happening because of that.
A
I also want to mention that there's also the e2e test: make sure the test for pod removal and deletion is correct and that those pass. And another thing: the folks who reported it have very easy reproducing cases for that issue, so make sure those reproducing cases are covered in e2e; to me they represent how they try to integrate, how they are using it, so make sure that also passes.
A
I also want to mention that I just noticed on the agenda that the person (I don't know how to pronounce the name) and also another person have a proposal, a PR that tries to fix it based on what you proposed; it has peer reviews and approvals, but the decision was not to take it at the time, because it was too rushed close to the release. So the decision was to revert, because we need to stop the regression immediately. So that's the context.
D
Sure, yep. The only delay here would be mostly because of the CNI thing. I just want to know: what was the CNI being used in that e2e scenario?
B
It wasn't that CNI was taking too long. The issue was that the logic that attempts to delete a sandbox was called when it's too early to delete the sandbox, and because it's too early, it didn't delete the sandbox and waited for garbage collection. So the delay was introduced by garbage collection, not by CNI or any runtime interface.
F
Okay, thanks Dawn. I don't think I need to share; I will just talk about this. So we are working on this volume health feature, and we're trying to move it to beta. I have the link there.
F
So one question that came up during the review of the KEP: for this feature, we have this volume health monitoring agent deployed as a sidecar with the CSI driver on every node, and each agent has a pod informer; the agent will add an event to the pod if an abnormal volume condition is detected.
F
So I think the question is: will there be any scalability concern with having a pod informer for each agent on each node? Does the SIG have any suggestions on how to monitor this? Do you have any common framework that we can leverage, or is it okay that each agent keeps a pod informer?
G
A question, this is Derek. I appreciate you reaching out; I'm not as familiar with the implementation of this, I guess, but the first question I would have is: is this informer listing and watching only pods bound to the node where that agent is running, or are you list-watching all pods?
G
Are the rights for this driver equivalent to, say, what the NodeRestriction admission plugin enforces?
G
Maybe another way to think about it: the reason I ask is that we did hit scale issues on the kubelet in the past, which I'm sure Dawn can speak to, around the number of resources that are watched as the scale of the cluster grows. We did a bunch of work to bound both the access rights of each kubelet to read that information, and when and how it chooses to refresh that information.
G
So you'll see code in the kubelet, around the secret manager and config map manager, that ensures it can only read the things associated with pods that are bound to that node; and then the identity of the kubelet is looked at when calling the API server to say "can I list pods?", so it restricts you to only being able to see the things that map to that identity or that bound node. So the first scale-concern question would just be ensuring:
G
Are
you
watching
more
than
what
is
needed
and
then
there's
probably
like
a
a
secondary
question
on
like
how
privileges
are
scoped
to
whatever
monitoring
solution?
This
is
that
keeps
it
maybe
as
restricted
as
say,
the
cubelet
identity
is
restricted,
but
those
are
the
first
things
that
come
to
mind.
I
don't
know
don
if
others
come
to
your
mind,.
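For reference, the usual way to address the first question in client-go is to scope the informer's list/watch to the local node with a field selector instead of watching all pods. A minimal sketch, assuming the agent learns its node name from the Downward API via a NODE_NAME environment variable:

```go
// Minimal client-go sketch of a node-scoped pod informer, assuming the agent
// gets its node name via the Downward API (NODE_NAME env var). The field
// selector limits the list/watch to pods bound to this node, which keeps the
// watch cost per agent roughly constant as the cluster grows.
package main

import (
	"os"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/fields"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/cache"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	nodeName := os.Getenv("NODE_NAME")
	factory := informers.NewSharedInformerFactoryWithOptions(
		client, 30*time.Second,
		informers.WithTweakListOptions(func(o *metav1.ListOptions) {
			// Watch only pods scheduled onto this node.
			o.FieldSelector = fields.OneTermEqualSelector("spec.nodeName", nodeName).String()
		}),
	)

	podInformer := factory.Core().V1().Pods().Informer()
	podInformer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		UpdateFunc: func(oldObj, newObj interface{}) {
			pod := newObj.(*corev1.Pod)
			_ = pod // inspect volumes here; emit an event on an abnormal condition
		},
	})

	stop := make(chan struct{})
	factory.Start(stop)
	factory.WaitForCacheSync(stop)
	<-stop
}
```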
A
I think those are the top concerns we have. And we also have another one: each vendor agent does its own monitoring, each doing its own monitoring without any overall coordination, and basically it takes over; this is more something to negotiate, it's not an open source question as such. But for production: if GKE deployed some agent that used a lot of a customer's resources, they would make a big fuss out there.
A
So that's why, on the open source side, the requirement I'd state is just: please make sure you're not using too much resource. I just want to say that the same applied in the past when the node problem detector was deployed: even back then, if something was really using too much resources, I would basically just push back and say no, sorry, even if you maybe provide good functionality; we'd need to redo a lot of work to make sure.
A
We
don't
have
those
documentation,
but
I
maybe
dig
into
the
really
old
of
the
know,
the
problem,
detector
and
and
because,
when
they
first
we
being
raised
those
problems
because
northbound
they
can
actually
have
the
both
problem
and
the
way
is
the
skinny
build
problem
initially
and
another
one
is
just
many
times:
it's
not
the
first
time
I
have
the
resource,
consumption
issues
and
resource
consumption.
A
They
had
like
the
cpu
memory
and
also
even
some
discard,
usually
in
the
past,
give
us
and
also
scalability
problem,
and
also
so
also
does
tag
of
the
note
because
sent
too
many
information
back
to
the
api
server
and
all
those
kind
of
things
using
other.
So
we'd
be
so,
but
we
don't
have
like
one
dark
capture
all
of
those
things
I
just
have
to
dig
into
some
old
design,
dock
and
or
maybe
some
old
bag.
F
Okay. So there were some proposals, I think between SIG Node and SIG Instrumentation, on how to monitor node components. Is that complete, or is it still open? I just want to know if that's something we can use as well; is there any architecture there that we can leverage? This is maybe more about monitoring, the metrics side, so I'm guessing SIG Instrumentation, right. So do you have any?
A
That's
a
good
question.
Actually
I,
if
I
remember
correctly
that
how
to
monitor
know
the
component
that's
made
a
couple
years
ago
by
the
vishnu
and
actually
initially
they
found
a
signal,
but
do
we
want
to
partner
with
sick
instrumentation?
If
I,
if
I,
if
we
are
talking
about
the
same
also,
then
that's,
I
believe,
that's
the
richness
make
that
didn't
that's
totally
in
the
past.
At
this
moment
and
and
unfortunately,
only
internally,
we
discussed-
and
he
even
didn't
made
into
the
signal
to
discuss
that.
A
But
but
I
did
talk
to
him
about
the
internal.
We
do
the
review.
I
like
the
proposal,
but
just
someone
have
to
pick
up
that
work
and
make
progress
on
that
one
and
also
need
to
came
to
the
signal
and
and
make
progress
here,
got
the
approval
here.
A
So yeah. In some of the products, I can share, in GKE and several productions, what I understand is that people build plugins on the node problem detector, with dedicated node detectors monitoring some of the per-node daemons. The reason I know is that I made this suggestion at SIG Node and also inside GKE; we haven't made that progress yet, but I know someone approached me because they had already deployed those plugins in their production.
A
So
that's
why
I
know
so.
We
can
talk
about
more
those
kind
of
things
I
didn't
suggest
them
to
upstream
those
things
and
the
suggestions.
And
but
I
have
to
say
that
at
the
signal.
G
Here,
just
to
maybe
clarify
my
own
understanding,
so
we
have
made
progress
within
the
sig
to
distribute
some
sources
of
metrics
that
used
to
be
coming
from
the
cubelet
to
allow
it
to
become
from
third-party
sources.
So
a
gpu
metrics
would
be
a
good
example
of
something
that,
like
as
a
trend,
we're
trying
to
allow
people
to
own
their
monitoring
unique
to
their
component
and
not
have
to
go
through
the
cubit.
I
guess,
and
so
just
trying
to
understand.
G
When
I
look
roughly
over
the
cap
you
have
here,
it
looks
like
you
have
new
metrics
you're
wanting
to
emit
and
is
the
question
more
aligned
of.
Is
that
a
good
or
a
bad
thing?
To
do?
I
I
mean
I,
I
have
no
objection
to
that.
F
Yeah, so what do we have there: the volume health right now, it's not really metrics, because, well, I guess it could be, but it's not implemented that way. Right now it's just events reported on pods, so it's not really integrated with the metrics support.
C
One thing I'd suggest, because I have not seen this KEP at all: it probably would be worthwhile pinging SIG Instrumentation for a review. I don't think they're listed.
F
I think we had the original one, the alpha version, reviewed. I believe there was someone.
C
Yeah,
the
other
thing
that
I
would
say
is
adding
metrics,
because
all
metrics
start
as
alpha
doesn't
require
a
cap,
so
you
can
just
go
ahead
because
they
can
be
turned
off
now
so
or
at
least
they
can
be
mostly
turned
off.
Now
I
think
harder.
Turning
off
is
coming
in
this
release,
so
I
would
totally
suggest
that
you
sync
up
get
that
feedback,
but
you
know
go
ahead
and
run
with
that
stuff.
C
It's something that you can sort of experiment with, because there aren't necessarily great guarantees; that's why we put in the likes of a runtime back-out.
F
Okay, yeah. Another thing is that we were also thinking that in the future some of this information can maybe be used for us to take some actions: if, you know, something happened, maybe a controller will actually do something with those PVCs, and for that we probably can't really use metrics, I'm thinking. So that's one reason we may not go with the metrics route.
G
Yeah, so I guess there's no issue in other components reporting related events around a resource; you know, the kubelet can talk about pods in an event resource, and the scheduler can, and that's all well and good. I do think there's an issue if you're trying to build systems with some guarantee that are looking to respond to events as if they were messaging protocols.
F
Yeah, because we couldn't decide what to do with those events; that's why right now they're just events, they're not a first-class field. I mean, if we knew exactly what to do with that information, then we could put it in PVC status or something; that's still not decided, so at least those are not in the current KEP.
A
This is why I mentioned earlier Vishnu's proposal on how to manage node components, plus the node problem detector: that can actually help you figure out how to take action. So there's the transient problem state, right, and there's also a condition like a permanent problem; it's about how you detect those kinds of things, pop them up to the upstream components, and then take action. This is how we are trying to deal with, for example, the network issues in our production.
A
So
then
we
detect
of
the
transit
issue
and
also
convert
the
transient
issue
into
the
permanent
issue.
If
it's
persistent,
there
was
for
some
duration,
so
so
the
event
is
not
reliable
and
we
are
not
the
we
basically
from
day
one.
We
introduce
event
that's
most
of
the
debugging.
It
is
not
a
for
system
to
rely
on,
it
is
take
discovery
and
the
issue
or
recovery
issue.
So
that's
not
the
kind
we're
using
just
to
share
with
you
here.
Yeah.
A
Okay, let's move to the next topic. Peter and Robert, do you want to talk about the cAdvisor stats? Thanks.
E
Hey, thanks Dawn. First time caller, long time listener. My name's Peter; I work for Red Hat, mostly working on CRI-O. So, for a bit of background: CRI-O, for the past, well, forever, has been using cAdvisor stats in the kubelet, because we found some initial performance regressions when we switched to CRI stats. So we're finally in the works of actually making the full switchover and finding ways to make it more performant.
E
So I started off by making some changes in CRI-O that I thought might help a little bit, but they didn't do quite as much as I wanted, and I'll have Robert describe it a little bit. But basically, we tested four different versions: one with cAdvisor stats and CRI-O, one with CRI stats with an improved CRI-O implementation.
E
And then I found some indication that there was still work being done on the cAdvisor side, even though we were using CRI stats. So I have a WIP PR that I put up on the doc, and testing with that was actually much more promising. So, Robert, if you want to talk about the difference.
H
I'm just going to go into a little background on what's motivating this from our end. I'm Robert Crowets; I used to actually be working on the node team here, but I switched over to the performance team last year. One of the projects I'm working on involves using an alternative runtime.
H
It's not appropriate to port cAdvisor to know how to talk to the other runtime, so that's sort of where I come in, in addition to being on the performance team. I'm going to try to share, I'd like to be able to share, a window here. Please.
H
Okay, this is a CPU utilization plot here, by process; it's using a tool called pbench.
H
Suffice
it
to
say
that
I've
instrumented
the
worker
node
on
which,
on
which
my
test
is
running,
my
test
involves
creating
64
pods
each
with.
H
So this graph here is the base case, using cAdvisor stats and unpatched CRI-O. The red here is the CPU utilization of the kubelet; as we can see, it's averaging about 19 percent throughout the first part of this. On the left is when everything was being created, and then everything just idled for 20 minutes in steady state.
H
Let's see, CRI-O here, yeah, CRI-O here is less than one percent CPU utilization.
H
So, no questions on that one; we'll move on to the second one, which was with CRI-O patched to cache the stats result. As we can see here, the kubelet actually consumed rather more CPU; it consumed about 25 percent of one core here.
E
Yeah, so here, basically (and Robert will show the patched kubelet in a little bit), using CRI stats, cAdvisor is doing some work that CRI-O is also doing: the kubelet is basically asking CRI-O to do that work, to give the results, and then also asking cAdvisor to get the results, and resolving between the two, basically just choosing cAdvisor.
H
What I neglected to mention: initially, looking at the totals, average CPU consumption was perhaps 23 percent or thereabouts in the default case. In this case it was using probably close to about 33-ish percent, so somewhat more CPU utilization coming out of that.
H
The next graph here is CRI-O with patches and the kubelet with patches to not call into cAdvisor. In this case we're seeing kubelet utilization fall to about 15 percent; CRI-O is still somewhere in the range of three percent, but the total average CPU consumption now is only about 20 percent of one core. So this is a substantial improvement. And finally, for the last one here, I'm going to have to ask Peter to explain it.
E
Yeah, so this one is without the CRI-O fixes; the CRI-O fixes are basically just emulating what cAdvisor does by caching the disk stats results. So I ran it with vanilla CRI-O, where it's actually walking the entire file system to calculate the disk usage for every CRI stats call, but this has the kubelet patches that drop the cAdvisor calls. I wanted to show, in isolation, what dropping those calls does.
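As a rough illustration of the caching idea (a hypothetical sketch, not the actual CRI-O patch): computing disk usage means walking the container's filesystem, so the result can be cached with a TTL, the way cAdvisor amortizes it, and stats calls in between serve the cached value.

```go
// Hypothetical sketch of caching expensive disk-usage stats, in the spirit of
// what cAdvisor does and what the CRI-O patch emulates: walk the filesystem at
// most once per TTL and serve the cached result to every stats call in between.
package main

import (
	"io/fs"
	"path/filepath"
	"sync"
	"time"
)

type diskUsageCache struct {
	mu      sync.Mutex
	ttl     time.Duration
	bytes   map[string]int64     // container root -> cached usage in bytes
	fetched map[string]time.Time // container root -> last walk time
}

func newDiskUsageCache(ttl time.Duration) *diskUsageCache {
	return &diskUsageCache{
		ttl:     ttl,
		bytes:   map[string]int64{},
		fetched: map[string]time.Time{},
	}
}

// Usage returns the cached usage if it is still fresh, otherwise re-walks the tree.
func (c *diskUsageCache) Usage(root string) (int64, error) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if t, ok := c.fetched[root]; ok && time.Since(t) < c.ttl {
		return c.bytes[root], nil // fast path: no filesystem walk
	}
	var total int64
	err := filepath.WalkDir(root, func(_ string, d fs.DirEntry, err error) error {
		if err != nil || d.IsDir() {
			return err
		}
		if info, err := d.Info(); err == nil {
			total += info.Size()
		}
		return nil
	})
	if err != nil {
		return 0, err
	}
	c.bytes[root] = total
	c.fetched[root] = time.Now()
	return total, nil
}
```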
H
Okay, so here we're seeing the kubelet taking about 20 percent, CRI-O taking in this case about two and a half percent, and in total we're looking at maybe 24 percent. I'm going to compare this to the original, at least when my browser is willing to switch, where again we see it was in the same range, 20-something percent.
E
Yeah, thanks Robert. So basically, the point of bringing this here is that I put up a PR: there's some duplicated work in the CRI stats path that I think should be optimized out, and I wanted to come here to see why we, by default, call into cAdvisor when we should be able to get the data from CRI stats. The one stat that I don't know if we can actually get from the CRI stats call (I haven't looked into it very hard) is the process stats.
E
That's the one thing I'm not totally sure about, whether we get the process number from the CRI stats call; but everything else seems like it's duplicated with the CRI stats, and we just default to using the cAdvisor version if it exists. I'm wondering if there's a different approach, like dropping those calls, or maybe changing the priority so that if the values exist in the CRI, we use those instead. So I just wanted to bring that up.
H
So let me just comment on that. The use case we envision here is for virtual machines: the container running inside a virtual machine is managed as a pod, so cAdvisor on the host would not be able to extract meaningful data; all it would be able to extract data for would be the VM.
A
So
there's
the
is
the
frontal
state
weather
it's
just
legacy.
Leader
say:
the
weather
is
like
initially
cerebral
start
before
kubernetes
this
one
this
year
and
even
before
we
have
this.
So
then
we
have
the
sid
weather
came
from
the
same
team
and
from
the
book
team,
google
team,
and
so
so
initially,
when
we
start,
the
monitoring
is
just
like
the
whatever
monitoring
available
and
we're
just
using
so
we're
just
using
cellular
monitoring.
A
But
you
can
see
that
when
we
first
start
the
container
runtime
interface,
we
talked
with
docker
team
and
company
community,
so
we
basically
want
to
using
cis
that's.
This
is
why,
from
day
one
we
do
but
over
time.
So
that's
why
we
have
like
the
stats,
related
api
and
so
but
over
time.
There's
the
first
thing
initially,
I
think
the
even
at
the
ci
alpha
release
and
the
monitoring.
There
are
many
debate,
so
so
it's
not
finalized.
So
so
we
have
to
like
the
alpha
and
the
law
out.
A
That's
more
finance
and
also
second,
one
is
continuity.
Implementation,
even
the
api
is
finalized.
Implementation
is
not
ready,
so
initial
name,
so
we
pound
on
that.
One.
Then
later
we
have
the
cryo,
so
when
we
want
to
think
about,
we
can
switch.
We
had
just
performance
concern,
so
that's
why
we
are
kind
of
hold,
because
we
want
to
both
continue
d
and
this
cryo
and
get
graduate
from
the
incubator.
So
we
delay
that
one.
So
after
that
I
keep
heard
like
the
class
performance.
A
Have
some
performance
concerns?
So
that's?
Why?
And
just
recently
I
heard
that
that
problem
is
going
to
be
fixed,
so
big
is
just
legacy
really.
We
do
want
to
switch,
but
I
I
I
knew
continuity
will
be
behind,
like
the
using
continuity
will
be
similar
data
here.
But
hopefully
someone
can
perform
like
the
similar
performance
analysis
and
generate
the
similar
data
here.
So
now
we
could
basically
switch
because
we
do
have
like
the
signal
that
basically
is
kinda
like
both
of
the
two
container
runtime.
Why?
So?
A
We
made
a
lot
of
great
decisions
for
the
both,
and
I
want
to
make
sure
we
also
have
the
container
d
data
here
and
then
we
could
say:
okay,
let's
switch
to
the
cryo
stats.
E
I was just checking whether containerd uses CRI stats instead of the cAdvisor stats.
B
Oh yeah, on this one: we don't have any specific plans to transition there. I think David is working on that, and we don't have any timelines for it. I think generally, strategically, we may want to go there, but for now there are no actual plans.
J
Yeah, sorry, just to add: for containerd, it is using CRI stats, but, like you've seen in the kubelet, when you're getting the CRI stats it's still fetching the cAdvisor stats too, so yeah, okay, cool. The difference, though, between containerd and CRI-O in terms of stats and stuff is:
J
I
did
notice
that
in
kublet
there
is
a
function
called
like
using
legacy
c
advisory
stats
and
that's
using
for
summary
api,
and
for
that
there's
actually
a
specific
check,
that's
hard
coded
in
kubelet,
and
it
basically
checks
if
you're
using
cryo
and
I'm
not
sure,
on
the
whole
background,
why
that
was
put
in
but
there's
a
basically.
If,
if
you're
using
cryo,
then
it
will
default
to
using
c
advisor
stats.
In
addition
for
image
sets
as
well,
yeah.
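A paraphrased sketch of the hard-coded check described here (assumed names and details, not the literal kubelet source): the stats provider is picked by inspecting the remote runtime endpoint, and a CRI-O socket routes the summary API to the cAdvisor-backed provider.

```go
// Paraphrased sketch (assumed names, not the literal kubelet code) of the
// hard-coded provider selection described above: if the configured remote
// runtime endpoint looks like CRI-O's socket, fall back to the cAdvisor-backed
// stats provider for the summary API; otherwise use the CRI stats provider.
package main

import (
	"fmt"
	"strings"
)

func usingLegacyCadvisorStats(runtimeEndpoint string) bool {
	return strings.HasSuffix(runtimeEndpoint, "crio.sock")
}

func main() {
	for _, ep := range []string{
		"unix:///var/run/crio/crio.sock",
		"unix:///run/containerd/containerd.sock",
	} {
		if usingLegacyCadvisorStats(ep) {
			fmt.Println(ep, "-> cAdvisor stats provider")
		} else {
			fmt.Println(ep, "-> CRI stats provider")
		}
	}
}
```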
E
Yeah, so for clarity: in these tests we did remove that line, and that line was added because of these performance issues that we're now rehashing and trying to resolve. We added it because we were concerned about the duplicated work but didn't have the time to fix it; now we're, you know, getting back to it. So part of this effort would also be removing that line at some point, once we've decided it's performant enough, and actually using, you know, CRI stats, the same as containerd.
J
Got
I
got
it?
Oh
yeah,
that
makes
sense
yeah
a
couple
comments
I
wanted
to
make.
I
mean
first
of
all
really
awesome
thanks
for
doing
this
kind
of
performance
analysis
and
getting
all
this
data.
One
thing
that
might
be
additionally
helpful
is
we
can
provide
like
a
cpu
profile
of
kubelet
itself.
J
That way we can see exactly which part of the kubelet is taking up this extra CPU and why. For example, even with the changes you made, I believe cAdvisor is still collecting the stats, so it'd be nice to understand and really pinpoint: is the additional cost in the fetching and merging of the cAdvisor and CRI stats when you're calling out to get the metrics, or is it in the background, the fact that cAdvisor is still collecting the stats?
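For reference, one way to grab such a profile, assuming the kubelet's debugging handlers are enabled and `kubectl proxy` is running locally ("node-1" is a placeholder node name): fetch a 30-second CPU profile through the API server's node proxy, then open the saved file with `go tool pprof`.

```go
// Hypothetical sketch: download a 30s kubelet CPU profile via the API server
// node proxy (assumes `kubectl proxy` on localhost:8001, debugging handlers
// enabled on the kubelet, and a node named "node-1"). The saved file can then
// be inspected with `go tool pprof kubelet.pprof`.
package main

import (
	"io"
	"net/http"
	"os"
)

func main() {
	url := "http://localhost:8001/api/v1/nodes/node-1/proxy/debug/pprof/profile?seconds=30"
	resp, err := http.Get(url)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	out, err := os.Create("kubelet.pprof")
	if err != nil {
		panic(err)
	}
	defer out.Close()

	if _, err := io.Copy(out, resp.Body); err != nil {
		panic(err)
	}
}
```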
E
I was also a little bit confused by that, but my guess, not really knowing, is that it's from the kubelet never requesting it. I don't know how cAdvisor decides to start looking at, you know, a particular container or directory, but maybe, with the kubelet not asking for it, cAdvisor just doesn't try. I don't really understand the interaction right now.
J
Yeah, my understanding is it's still collected, but maybe because the kubelet doesn't call out for it there's some performance headroom, yeah. So that's why a CPU profile of the kubelet itself would, I think, really answer that question: we could see exactly which functions use the most CPU, and that would help narrow down this investigation a lot, I think. So that might help.
A
Oh, sorry, I just wanted to make one comment. So, David, can you do a similar experiment, or maybe work with Peter and Robert to do the same in their experiment, so we can see the containerd stats? Then we can make the decision to move forward. We've wanted to switch to using the CRI stats for a while, right; the only constraint is just the uncertainty about the performance and the resource utilization, all those kinds of things.
J
Yeah, definitely, I think that makes sense; yeah, we'll do the same type of experiment on the containerd side and make sure it works there as well.
E
So the one other quick question I had: another thing we should investigate, on both of the CRI implementations, is whether there are parts of the CRI stats that we may not be filling in, which is what motivated us to use cAdvisor to augment those stats. That's something I will personally look at; I mean, with this PR, it seems like nothing breaks horribly.
H
Again, putting on my hat for the other project here with Sam, with containers running in VMs: cAdvisor will be of no use for looking inside the VM.
A
Yes, and yeah. Also, once we switch to the CRI stats, this effort can help us refactor cAdvisor and make sure we have, like, a library for the core metrics, or the node metrics.
A
The
cellular
already
been
refracted
a
couple
times
to
satisfy
cognitive
needs
in
the
past,
but
it's
not
done
yet
in
the
cri
we
basically
design.
We
basically
say
we
want
to
refraction.
It
is
and
then
make
this
the
only
circuit
of
the
node
related
of
the
core
matrix
and
from
the
part
level
matrix
and
then
give
after
all,
the
container
related
metrics
to
the
csi.
So
then
we
can.
We
can
talk
about
more
how
to
refactor
that
one
and
then
how
to
use
it
to
link
back
to
the
kubernetes,
how
to
evolve.
J
Yeah, definitely. I think that's been a long, long effort, and starting with making sure we can fully rely on CRI stats for all the endpoints is definitely the first step; then we can think about doing that, for sure.
A
And after we have the refactoring, Robert, then there are those libraries, some of which you could use and link against for the monitoring. Then you'd have, like, per-pod ones, or ones for a couple of container runtimes, some like the VM one, and that could be linked; this is the original idea I proposed with the third container runtime side, I forgot the name. Hopefully we'll finish that one, and it will also graduate from the incubator.
C
I can jump in a little bit on the state of issues, which is that we do not have them in a very well-triaged state, and for the most part I think people are focusing on the backlog of issues based on, you know, the kinds of bugs that they're seeing in production, and using that to prioritize. We have, I think, over 500 open bugs right now in SIG Node, or something around there, and so we need to go through and triage them.
C
It's sort of on my long-term roadmap. Right now I'm trying to focus on getting the PRs under control, and then, as soon as we've got that into a good steady state, I'm going to start tackling the issues. But I wanted to say thanks so much for staying up so late; I think it's not a great time for you right now, and it's nice to meet you, and you're great.
A
Nice to meet you too. Also, as I mentioned at the beginning, for the first topic: you tried to help fix that problem, and maybe because it was too late for the release we decided not to take your fix; instead we reverted the original PR, and now we'll look into how to fix that problem.
A
So
just
thank
you
for
your
attempt,
fixing
the
issue
and
and
also
there's
the
subnet
shared
the
dock
and
in
the
signal
when
it
is
actually
shared
created
by
the
menu
and
about
like
the
kind
like
the
plan
for
q1
right.
So
there's
something
in
that
list
and
also
don't
have
owner
and
also
some
require
of
the
more
reviewer.
A
Maybe
you
can
start
from
there
and
the
last
last
quarter
actually
also
directly
share
the
one
dog
you
can
find
that
dog
and
for
the
clean
state
like
we
have
many
many
feature
still
is
captured
as
the
alpha
or
beta.
We
want
to
promote
some
from
there
from
the
alpha
to
beta
beta
to
ga,
so
the
sum
of
those
kind
of
things
and
also
no
owner,
and
even
there
have
the
owner
some
of
they
don't
have
the
reviewer.
A
So
you
also
can
start
from
there
and
take
a
look
and
maybe
understand
something
like
the
pr
something
like
a
feature
why
we
decided
not
to
promote
or
how
we
promote.
So
you
can
start
from
those
and
maybe
that's
the
easiest
way
for
people
to
start
looking
into
some
work.
L
Yeah, thank you, Dawn. So we discussed this previously as well: adding a flag in the kubelet to disable the pprof, and that's the issue that we created. But I was wondering: the issue talks about adding one flag to control both of the endpoints, the pprof as well as the flags; I was wondering if we should probably add an additional, or another, configuration for the flags endpoint separately and not combine it with the profiling flag.
A
Suggestions,
I
need
to
refresh
my
memory
about
this
one.
L
Yeah,
so
this
was
basically
like
for
aws,
specifically
like
we
have
a
fargate
instance
where
we
manage
the
cubelet
tasks
on
the
on
our
images,
and
we
don't
want
the
our
customers
to
go
directly
invoke
the
peep
prop,
as
well
as
the
set
the
debug
flags.
L
So that's why we created this issue a long time back. And then, from our initial meeting, what we decided was: we initially kept these endpoints open for debugging purposes, and it's okay to add a flag to disable them. One of our community members was also working on the PR, but they had one flag to control both of the endpoints, and I wasn't sure if that's the right move, so I wanted to see if we should have separate flags.
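A hypothetical sketch of the two options being weighed (assumed field names, not the actual KubeletConfiguration API): a single knob that gates both debug endpoints versus one knob per endpoint, which would let, say, the flags endpoint stay available while pprof is disabled.

```go
// Hypothetical sketch of the design question (assumed names, not the real
// KubeletConfiguration): one flag for both debug endpoints vs. one per
// endpoint, so /debug/flags can stay available while /debug/pprof is off.
package main

import "fmt"

// Option A: a single knob covers both endpoints.
type combinedConfig struct {
	EnableDebugHandlers bool // gates /debug/pprof AND /debug/flags/v together
}

// Option B: independent knobs, the direction proposed in the discussion.
type separateConfig struct {
	EnableProfilingHandler  bool // gates /debug/pprof/* only
	EnableDebugFlagsHandler bool // gates /debug/flags/v only
}

func main() {
	cfg := separateConfig{EnableProfilingHandler: false, EnableDebugFlagsHandler: true}
	fmt.Printf("pprof served: %v, flags served: %v\n",
		cfg.EnableProfilingHandler, cfg.EnableDebugFlagsHandler)
}
```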
J
I just checked today: there's an enable-debugging-handlers flag, but it's a global flag, though, right?
L
Yeah
so
enable
debugging
handler,
like
that
kind
of
takes
care
of
like
most
of
the
endpoints
that
cubelet
serves
and
this
specific
issue
just
for
turning
off
the
p
prof
and
then
the
the
flags
in
point.
A
Sure,
anyway,
even
we
follow
up,
we,
we
will
share
back
to
the
signals
so
yeah
yeah.
Let's
need
to
follow
up
after
this
one.
Then
we
share
back
to
the
you
know
after
that,
yeah.
A
Thank you, thank you. That's all for today; any other topics people want to raise? Otherwise we'll just call it off for today. Nice to see everyone here. Okay, thanks everyone, and talk to you next week. Bye.