From YouTube: Kubernetes SIG Node 20210309
Description
Meeting Agenda:
https://docs.google.com/document/d/1j3vrG6BgE0hUDs2e-1ZUegKN4W4Adb1B6oJ6j-4kyPU
A
And today, for us, it is also the 1.21 code freeze day, and I saw several exception proposals already for the code freeze, but most of it is actually in, and I also want to go through that. Our planning for 1.21: I think most of it is already updated; it is either almost in, already in, or in some cases already punted to the next release.
C
Let me kick off. So first: we did outstanding work this week. Kudos to people standing by, updating PRs really quickly, and to the people reviewing them. It's great. I mean, we did way more than we typically do, like 67 PRs merged, and 20 posted. It's a lot of progress.
C
I mean, typically I use this dashboard to review what happened in the week, and looking at the 67 PRs merged, it's like, it's a lot. I mean, you need to understand how much work is happening in our SIG. So yeah, great work. Let's keep it going. I mean, we only have one day left for PR merges, but after the code freeze ends, maybe we can keep up the pace again.
D
I am very hopeful that the release team will give us an extension, at least for the kubelet logging migration, just because having a half-migrated kubelet and then waiting until the next release to finish migrating would be really disruptive. But they may hold off the graduation of the rest of the feature to beta, which would be fine. I asked for the whole thing to be extended; we'll see what they say. I have not heard back yet, I don't think so.
D
That has certainly been responsible for a lot of the logging churn, and I can share the board, if somebody gives me co-host.
D
Great, so this is the SIG Node board, and I'm just gonna also pop in, if Firefox will cooperate with me.
D
Structured logging, which I'm tracking separately, and...
D
What was the last thing that I wanted to show? Oh, the CI board as well.
D
So yeah, definitely as Sergey said, velocity has been very good.
D
It has been pretty impressive to watch things go, and for the SIG Node board I've mostly just been sort of dragging things into triage and watching them get triaged, because I have not been doing triage this week, except for the structured-logging-specific stuff, because there's so much of it; like over 35 things that we're tracking specifically for that, so I have a board just for the structured logging work. That's kind of all in progress and making its way. And then, in terms of where we're at on the SIG Node test enhancement stuff, I keep dragging stuff onto the board and it keeps getting triaged.
D
So if you want to get involved with any of these projects, we have specific guidance on the boards, in terms of: how do I help? What should I do? I don't have org status, or I don't have reviewer status; what can I do in order to help out? How do I get to the next stage? All those docs are available.
B
Yeah, and just kudos to you guys for getting the board together. It's been really helpful at optimizing my own personal time, and I'm sure it's helped others, so a big thank you for helping with this.
D
Well, that's great. The chairs meeting: I think the monthly chairs meeting is today, and I am gonna be talking about... I know Lori also put a thing on the agenda for different SIG practices, best practices, project management, and that kind of thing, and so I have an item on there as well, about training new people to become tech leads and SIG chairs and that kind of thing, for some of the other SIGs that are wanting to bring in new people but aren't really sure what to do or how to improve their processes.
D
So that's happening today, if you are a lead, or for some reason are on that lead invite.
A
Cool, thanks. Let's move to the next topic. Jack, I already made you the co-host; I know you have the doc to share, so maybe you want to present there. Next we'll talk about the liveness probe timeout issue. That's kind of a regression, but it's with full good intention. So maybe you want to talk about this one.
A
Sure. So, let's move to the next topic. Francisco, do you want to give an update on the pod resources API?
F
Yes, thank you, Dawn. It's me. So I had a few cycles of review thanks to Kevin, kudos to him; he did great reviews. Unfortunately, we need the approval, and he cannot approve it himself; he gave it a looks-good-to-me, though. So here I am asking for approval, because I added the feature gates which were in turn requested during the production readiness review. So that's it. Please, folks, have a look.
B
I looked at it earlier, and I think I asked Kevin to take a look, and a lot of my comments were just confusing terminology that I think has been updated now in the PR. So at this point I'm okay with it; as long as those changes were made, it looks fine to me.
D
Oh, I just wanted to... I'm doing the bad thing that I'm not supposed to do, which is tell people to review my PR. I only got API review yesterday on it, and I am nervous about the code freeze cutoff, but I've addressed all of the comments. So I think it's just a matter of, hopefully, fingers crossed, the alpha and e2e tests passing.
D
Yeah, I finished all the implementation work and I've had the PR up for about a week, but nobody has really taken a look at it, and Clayton, I poked him again; he did the API review yesterday. There was a bunch of stuff that I was not aware of because it wasn't documented, so I went and did all of that, and I think it's ready for another round of review.
G
Cool. I'm not sure I'll actually share the doc, because it's so large, but I will happily paste in some link matter here, folks. So, hey everyone, real quick: I just wanted to bring to folks' attention that there are some people working within SIG Node, SIG Arch, and WG Conformance to sort of figure out how we want to deal with the exec probe timeout feature flag going forward. So, a little bit of background.
G
A side effect of that is that all timeouts have default values if not declared in the user spec, and that default timeout is one second, which for liveness probes is arguably going to be a little impactful for folks who aren't used to having to declare those.
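[Editor's note: to make the default concrete, here is a minimal sketch using the core/v1 Go types of the 1.20/1.21 era (the embedded Handler struct was later renamed ProbeHandler). A liveness exec probe whose timeoutSeconds is left unset is defaulted to 1 by the API server, so a probe command that takes longer than a second starts failing once the timeout is actually enforced.]

```go
package main

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
)

func main() {
	// TimeoutSeconds omitted: the API server defaults it to 1,
	// the one-second default discussed above.
	defaulted := &v1.Probe{
		Handler: v1.Handler{
			Exec: &v1.ExecAction{Command: []string{"sh", "-c", "sleep 2"}},
		},
	}

	// Declaring it explicitly keeps a slow probe passing once the
	// timeout is enforced.
	declared := &v1.Probe{
		Handler: v1.Handler{
			Exec: &v1.ExecAction{Command: []string{"sh", "-c", "sleep 2"}},
		},
		TimeoutSeconds: 5,
	}
	fmt.Println(defaulted.TimeoutSeconds, declared.TimeoutSeconds)
}
```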
G
And so what we've been doing in the background is trying to make sure that this flag, which allows that behavior to be disabled, stays available; which is kind of funny, because it's really disabling a bug fix. But we want a little bit more time to be able to disable it, so various Kubernetes providers out there can choose to map out their own timelines for communicating to their user community how they need to start including timeouts with all their exec probes, and then possibly, in a best-case scenario, give some more time to investigate possibly increasing that timeout.
G
So in the code at main right now, it suggests we're gonna remove the flag in 1.21, which we're not actually going to do; as Dawn mentioned earlier, we're actually feature frozen, so nothing further is going to merge. But I wanted to advocate that we more formally define when this flag will be officially disabled.
G
So that's what that issue link is for, and then the PR updates the code comments; there's also another PR doing a similar thing, which bumps it to 1.22. So if folks are interested, just follow those two, the issue and the PR, and you'll get all the detail. And then I also, finally, wanted to paste a link to SIG Arch, where we're going to talk about the same thing.
G
I'm on the roadshow right now: on Thursday at 11 Pacific, 1900 UTC, basically the same topic in SIG Arch, as this touches their interests as well.
G
So, in summary: there's no emergency or fire drill from our perspective. The flag exists, so if folks want to disable timeouts, because they don't want that default one second, they can continue to do so.
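[Editor's note: for reference, this is the kubelet-side gate the discussion is about. The gate name ExecProbeTimeout matches the 1.20 release notes, though the snippet below is a paraphrase of the idea, not the actual kubelet code; operators opt out by starting the kubelet with --feature-gates=ExecProbeTimeout=false.]

```go
package kubeletsketch

import (
	"time"

	utilfeature "k8s.io/apiserver/pkg/util/feature"
	"k8s.io/kubernetes/pkg/features"
)

// effectiveExecTimeout sketches how the gate changes behavior: when
// ExecProbeTimeout is disabled, exec probes keep the pre-1.20 behavior
// of running without an enforced deadline.
func effectiveExecTimeout(declared time.Duration) time.Duration {
	if !utilfeature.DefaultFeatureGate.Enabled(features.ExecProbeTimeout) {
		return 0 // no deadline enforced, pre-1.20 behavior
	}
	return declared
}
```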
G
There is one issue that we're tracking sort of more aggressively (that's not too strong of a word), and that is that the SIG conformance tests right now currently fail if you set that flag. So there are folks in the background trying to make this feature flag something that doesn't inherently fail conformance, and we'll be working on that.
H
On conformance: is that a 1.20 conformance suite run against a 1.20 cluster, or is there version skew between the conformance version and the cluster that is causing the issues?
G
It's the former. Lucky, are you more familiar with the nitty gritty? My understanding is that the cluster was built with this feature flag disabled, which is a way of saying "I don't want exec probe timeouts to be enforced," and the conformance suite has a new test that validates that when you submit a spec without a timeout, it times out after one second. So the conformance test seems to be not sensitive to that flag.
B
So the issue was specific to the dockershim. Given that, I think as a SIG we want to figure out a, you know, conclusive state, where the dockershim is either taken out of the kubelet and handled separately... Like, personally, I don't like giving anybody, you know, this.
B
Is that okay? If that was the case, I thought maybe my memory was poor; I thought it had just been...
C
Yeah, there is a difference between dockershim and CRI-O. On dockershim, you know, we start respecting timeouts, but the process is still running after we kill the client that makes the request. So that's the difference. Your memory is correct that there is a difference, but the actual bug was fixed on both.
B
If it gives people, you know, heartburn... I don't think, as a SIG, we want to upset users who have real production issues or concerns. But I had thought that this issue didn't impact CRI implementers in the same way, and so, sorry, I guess I have to refresh my memory on what the behavior difference was; I thought the timeout was respected, but there was something related to the gRPC call flow that made it behave differently. So I have my own homework, too, to refresh on.
G
It was in beta 2 that we suddenly noticed it: certain tests that required a liveness probe to take longer than one second, and didn't have that timeout declared, started failing. We do test containerd and Moby on my project, so I can do some homework and confirm whether or not this is particularly related to that.
H
The only difference worth calling out is that dockershim will run that exec probe indefinitely, and it doesn't actually do a context cancel on the process; whereas containerd didn't respect the timeout either, but after the timeout it would kill the process and stop it from running indefinitely. That's the only difference.
E
And I think, you know, just advocating for the users here: what we've seen is that folks will upgrade their cluster, have workloads without the timeout defined, and then not pass liveness probes on, you know, anything that takes longer than a second. So, putting my work-at-Azure hat on, in something like AKS...
E
We either have to choose whether we want to break users or break conformance, apparently; so we're trying to not do both, but we need help from the community to guide us through a decision here. So we're kind of advocating to leave the flag off. We want to have it on, but it's going to take time to communicate, and we're worried that users today will go onto 1.20 and start to break.
E
So that's one thing, and the other thing is: we don't want to have non-conformant clusters. And I don't know, this is a discussion for conformance, I guess: should there be flags whose state breaks conformance? I don't know of too many flags that you set on the kubelet or otherwise that have a bubble-up effect that breaks the conformance profile in the conformance suite.
A
What you request all makes sense, I believe. If I understand what you earlier proposed: we're going to leave that flag able to be disabled, and at the same time work with people to fix the conformance test for once that flag is on. So basically the conformance behavior will be based on the feature gate for this feature; that's what I understand, and so we are working on those kinds of things. But how, long term, to turn on this feature and still not break existing users...
A
That's another story, and we can come back and discuss it. I haven't gotten to Jack's proposal and the document yet, but I think that we should figure out a way forward with the community, because that's the good fix; but how we are going to mitigate existing customers which have a dependency on the broken behavior, that's another story. Let's come back and discuss more on this one. I guess, Jack, maybe you have some alternative?
G
Yes, my proposal is really simple: it's to continue to keep the flag in enabled mode. We want new folks landing on 1.20, building new Kubernetes clusters, to get this new functionality, because this is functionality that should have been working all along; and then to just extend the life of the feature flag.
G
So certain folks can choose to turn that off, because they run Kubernetes services, for example, that support lots of users who've been on Kubernetes for a long time and have built production workloads that assume no timeout functionality is going to intervene on behalf of their liveness probes. Hopefully that makes sense; so, just more time for that feature flag to be user opt-in.
E
If you agree with that, Derek, we can put in a PR to fix that. We just wanted to get kind of high-level approval of the plan; I think in the doc Jack kind of lays that out, but we didn't want to go just jamming PRs in without talking to you all first.
B
Yeah, I think that makes total sense. I mean, we should still verify that the probe occurs, but we should just not expect a one-second timeout; so finding some other reasonable mechanism that represents the pre-1.20 behavior seems perfectly fine.
G
Honestly, I think that's a possible solution for folks, but I would say let's table that, and let's come to a consensus around the fact that we're going to extend the current feature flag, and then decide. I think we should publish it; like, let's say we're going to make this available until 1.24, we should just publish that so it's super clear. And then I think those conversations are super interesting, Andrew, about how we can refactor the code under the hood, which currently assumes a common type definition for all timeouts.
A
Andrew, actually, I want to correct one thing: a feature gate might not be that cheap if every feature owner did that; I just want to make sure people know this, because we spend time trying to clean up some feature gates over many releases. But I totally agree for this particular issue: due to the legacy behavior, this is, relative to all those risks we could expose to the user, much cheaper and much lower risk. I just want to...
A
I just want to make sure people don't think feature gates are so cheap that other features go this way too. So, like you said earlier, if we keep it for a long time, we can think about maybe some different timeout value, so that it would be relatively safe to enable this feature and reset our status; because with the current status, basically, we just have to live with that bug, right?
A
But the reason we want to fix that bug is just because customers came to us and said this timeout value is not respected, which makes certain features unreliable; they cannot rely on this kind of thing for their workloads. So we still need to fix this problem; but how we are going to safely turn on this feature...
A
It is harder, because this kind of problem is not on the user side; it's our problem. On the node we can easily enable or disable this feature, but some users could really rely on the buggy behavior, and some want the real behavior. So we need to think about some better way to handle this one. I think earlier someone mentioned maybe something like user config, which is really a concern from an API perspective, but we could discuss more, yeah.
B
I have to admit, I'm still really not seeing how the code change was universal to all CRI implementers. I see that there's an issue that says, conceptually, if the probe timeout was longer than any gRPC timeout we allow on the kubelet, we would be confused; and it seems like we should maybe think about upper bounding, and some validation that says this is the standard gRPC probe timeout. But I mean, Renato, you can chime in; I'm looking in the background here and it's like, depending...
I
Yeah, so I think we had a change where we were returning the error that the kubelet expects on a timeout, but I still need to look closer to see what would happen if you're at the one-second boundary; we may still get a deadline-exceeded on the kubelet side, even though CRI-O is trying to return the message that the kubelet was expecting earlier. So I pasted a link in chat, if you guys can take a look and chime in.
B
Yeah; otherwise, I do think, if we're gonna look at this, we need to set an upper bound on all probe timeouts so they won't exceed our gRPC timeouts, and I'm not sure how to roll that out. But that seems like a good thing to also do as a follow-on to this activity.
E
Jack has put... I think you have some standing PRs at the moment that somebody could review, to extend the comments on the deadline of that flag being deprecated. And then the other, additional thing that we haven't done at the moment is updating the conformance test to basically say: if the flag is on... Can you do conditional conformance tests based on flags, Derek?
B
Yeah, that sounds perfect, and I'll give you an example of the pattern; we already have to do this. So here's a PR I did last week that skips a test if a particular feature gate isn't on by default yet, and it's perfectly fine; we do this in other places.
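[Editor's note: a minimal sketch of that skip pattern, assuming the usual e2e framework helpers and using the ExecProbeTimeout gate as the example; note the e2e binary learns gate settings from its own --feature-gates flag rather than by inspecting the cluster.]

```go
package e2enode

import (
	"github.com/onsi/ginkgo"
	utilfeature "k8s.io/apiserver/pkg/util/feature"
	"k8s.io/kubernetes/pkg/features"
	e2eskipper "k8s.io/kubernetes/test/e2e/framework/skipper"
)

var _ = ginkgo.Describe("[sig-node] Probing container", func() {
	ginkgo.It("should enforce the default exec probe timeout", func() {
		// Skip rather than fail when the cluster under test opts out
		// of the new behavior via the feature gate.
		if !utilfeature.DefaultFeatureGate.Enabled(features.ExecProbeTimeout) {
			e2eskipper.Skipf("ExecProbeTimeout is disabled; skipping")
		}
		// ...the actual timeout assertion would run here.
	})
})
```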
E
Excellent; so we'll come back at the next meeting and just give you a laundry list of PRs.
K
Yeah, so hello; I'm happy to meet many of you. I was here last week, when there was much lower attendance, so I'll just do a real quick overview of what I'm proposing here. I'm hopefully going to be building this into a KEP, but I'm just gathering some information beforehand about whether there are any well-known issues that would make this a horrible idea.
K
With systemd, it seems that systemd has fairly high CPU utilization when you have a large number of mount points. As for the reasons for that: if you follow the note in the meeting minutes here, there are maybe some more details in the discussion that I posted to the SIG Node mailing list; but basically, systemd rescans mount points, and when you have a huge number of mount points, that's exacerbated pretty significantly.
K
I'm looking right now to see if there are any specific use cases that we know this would break; I haven't found any yet. I even have trouble thinking of use cases that would break, because top-down mount propagation would still work; so something mounted by the host OS, or something running on the host, would still be available to the container runtime and containers, conditionally based on the mount propagation flags, of course.
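[Editor's note: for context, this is the per-mount knob the "top-down propagation would still work" point relies on; a small illustrative sketch using the core/v1 Go types. The field names are the real API; the volume itself is hypothetical.]

```go
package main

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
)

func main() {
	// HostToContainer asks for exactly the top-down flow described
	// above: mounts made on the host side remain visible inside the
	// container, even if the kubelet and runtime live in their own
	// mount namespace.
	propagation := v1.MountPropagationHostToContainer
	mount := v1.VolumeMount{
		Name:             "host-data", // hypothetical volume name
		MountPath:        "/data",
		MountPropagation: &propagation,
	}
	fmt.Println(mount.Name, *mount.MountPropagation)
}
```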
K
You know, there's no guarantee of what the environment looks like where Kubernetes is running; and since that's kind of a small change, but also a very large change, that's kind of why I'm in this information-gathering stage, if that makes sense.
D
So, Jim, I listened to the proposal last week and then also this week, and I have had a little bit more time to think about it; and to me this sounds like a systemd bug, so I'm not sure why we would work around it in Kubernetes. The things that sort of stand out to me: systemd is a multi-purpose init system. It should be able to handle a large number of mounts, and it should also be able to handle saying, hey, these mounts are requiring a lot of scanning and rescanning.
K
There is work underway to actually fix this inside of systemd; unfortunately, that is not moving very quickly. And there are two other reasons why this would be, I think, a good thing, aside from the systemd workaround, which was sort of the original impetus for actually taking this on. Those are just from the perspective of encapsulation: a lot of these mount points, if not all of them, are things like secrets and config maps and things like that. These are really implementation details about how the kubelet exposes resources to containers; it's not really something that the host OS needs to know about. And that, in a way, goes to what I think is a minor security improvement, which is: if these mount points aren't visible in that top-level host namespace, then a regular user saying, you know, findmnt, or just mount, "show me all the mount points right now"...
K
They currently get really detailed information about the exact locations where all of these secrets and config maps are mounted, whereas they would not get that information anymore. You'd be hiding that information from regular users unless they have the capabilities to actually enter that mount namespace, and I think that's gated behind CAP_SYS_ADMIN and something else. So, not a huge security improvement, but I think one level of inspection that a potential attacker has today would be removed by this.
K
I also did get a response from someone who was saying that for kind it might be a useful feature to allow, specifically for that use case; but that was just sort of an early someone perking their ear up to say this sounds like it might be aligned with what we want to do. I haven't had a lot of discussion with them yet.
D
Yeah, as far as the mount points go, and like, you know, "oh no, there are a lot of mount points that are exposed to the system, and those are a Kubernetes implementation detail": I'd argue that trying to hide those seems like... if that's an implementation detail, then either it is valuable for that to be exposed to the system as-is, or it's maybe a suggestion that Kubernetes should reconsider how it is implementing this, and it's a smell that we don't want to sweep under the rug.
D
So I don't necessarily think that it's, you know, a bug that we see this right now. I think it's important to be exposing that: yeah, that is how we're doing this, and we can access all of these things through the normal mount mechanisms, and they are treated as one would expect of any other mount point. Whether or not that's a good thing right now, I mean, I think that's a much wider conversation; but if we were just hiding those from end users, like...
I
So I think one more angle here is: for systemd to be able to fix this, it needs changes to the kernel. Changes have been proposed to the kernel in the way that it propagates events for new mounts, and that is something that isn't going to land anytime soon. So I think, basically, what we're trying to do is due diligence here, to check that nothing breaks.
I
If we make this change; and so far, based on the responses we've seen on the mailing list, we haven't found a concrete example of a case that'll break. So the only thing that we would really need to change is this one test. We are relaxing it so that, instead of propagating to the host, propagating to the kubelet and container runtime won't break you; and then it's a choice for you.
I
If you want to propagate it to the host, no one is stopping you from doing it; but if you want to hide it, if you want to solve the systemd issue, if you want to hide your mounts, if you want to, say, launch multiple kind clusters and hide their mounts from each other, those kinds of use cases, then I think this is useful.
L
So this is going to be entirely dependent on how a node is configured. Like, does somebody run services on that node that need those kubelet mounts in order to operate? You know, maybe it is a CSI driver; I know you said you checked everything, but it's stuff like that that's going to end up breaking, and it's very...
I
Right, so the case you're talking about would mean that the CSI service is running outside of a container, as a native systemd service, or whatever unit, whatever init system you're using.
L
Sure, or, you know, whatever; it could just be something on the node that's running to support the cluster in some way.
I
Right, so I think we'll point that out, yeah. So I think what we are proposing is not moving from the current model to another model; what we are saying is just to relax what the environment requirements are. So the test which currently looks for propagation to the host would also continue to pass if your kubelet and runtime are in a separate mount namespace; and then you decide, on the basis of your CSI implementation, whether your CSI needs to run as a host service or within a container, and then you can pick whether you want this isolation or not.
B
The only issue that we see here, that we would get bitten by, is this: should there be concerns about this as a slippery slope into just supporting containerized kubelets again?
I
So I think, for this particular thing, we've so far only looked at the mount namespace; and Derek, I think that's what came to mind immediately when this was initially proposed: we had backed away from this in the past. But this looks like a narrower thing than what we did before, so maybe workable; and so far Jim has only found this one e2e test that is failing.
M
I was just saying, to level-set on it: I mean, we're not looking to make a change to the kubelet code, only to that test. And the higher-level question is not "can we change the test?"; it's "is a cluster set up this way a conformant Kubernetes cluster?". If we decide that it is, then we can relax the test, right? If it's not, then we need to have a reason why it's not, and the test stays strict, and then we can't do this.
A
I hope other people from the other SIGs chime in, because this propagation feature was initially proposed by SIG Storage, and the too-many-mount-points issue is actually the best signal we have. We had concerns, but it was insisted on for certain deployment options, and because we had to trust that it came from SIG Storage, that's why we took it; and that was also based on the deployment from Red Hat. And then I noticed it always came from the OpenShift deployment, and then we saw the problem.
A
So I want to see the other side of the deployment: is this change going to bring some negative regression or not? Because we do have monitoring; we used to have the signal, the perf test, to monitor node resource usage, especially for Docker, the kubelet, and all the other daemon sets that have to live on the node.
A
I don't know all the context right now, because I haven't read that email yet; but I do agree that a lot of this, like hiding things from the user, that hack may actually have potential problems. Also, in the past we did find some systemd problems, and the Kubernetes community did push systemd fixes, right? For example, when we first talked about cgroup v2, people pushed very hard on systemd, and it turned out systemd could not support the hybrid mode; so we pushed back and asked them to support both cgroup v1 and cgroup v2. So there have been several such changes; this is one change I remember. There were several changes during the early container runtime interface development too, where we also found some systemd-related issues, so we pushed back. So it is possible.
A
I just wonder: if we do think about the initially reasonable use cases and deployments, and we now think we don't need that one, it could be that the situation changed, because I do remember that it also came from the OpenShift deployment, and I thought that was a reasonable approach. So now, if we think we don't need that one, we need to call out to other deployments, other vendors, and ask: is there really no regression? I don't see many people chiming in here.
B
Yeah, and so, to be clear: my understanding here is that it's not a systemd bug, so it's inaccurate to classify it as such, right? And given that, is the kernel patch in this thread here, so that people could go review it? Because...
K
Yeah, let me make sure that I've got that; but basically, yeah, there's a Red Hat Bugzilla that was sort of my introduction to this, and it looked like that had deeper links to the actual changes. But I can double-check that.
B
Yeah, I just want to make sure that, as a community, we're not disparaging other communities when the bug is actually not that community's bug. And from that perspective, Kubernetes should work well at minimal levels of pod density, where my understanding of this issue is that it's not working as well as it should; and from that perspective, as a community, I would think we'd want to bring down our overhead.
B
I just want to make sure we all have the right visibility into the core issue, and maybe some understanding of when the kernel itself may or may not be addressing that issue, so we can make an informed change.
D
Yeah, is there a write-up of how this specifically affects kube, in a kube context, like on the Kubernetes bug tracker?
K
Yeah, sure, I could write something up for that. I think it's really that the kubelet itself is not affected by this, except that there's less CPU available to it and other pods, because...
D
I think if it's a node-level concern, then it's totally valid to say: hey, you know, it's not necessarily that it affects the kubelet, but it's like, oh, all this CPU is going off to do other stuff, and that's bad for the node, because that's less CPU for workloads, so...
B
Yeah, so my understanding of the use case here is: you're in an edge deployment, and you want to be able to use your CPU as you see fit; and the fact that managing a process under Kubernetes is more expensive than traditionally managing it under systemd is an impediment to adopting kube, which seems like a bad thing for us as a community. And so I'm far more open to this.
B
If we understand the root issue; and to me, the root issue is not systemd, it is the kernel. So if we can just get clarity on the root issue, and maybe revisit this next week, we can come to this with broader perspectives, I guess. But to me, managing a process through Kubernetes shouldn't incur such an overhead as to make someone not want to use Kubernetes, and running 400 mount points...
B
I don't know what number of pods that maps to, but it does not seem like an excessive number of pods; as a community, we should look to fix that.
K
Yeah, I mean, that was out-of-the-box OpenShift with no workloads, basically; and that's, you know, systemd sitting around 10, effectively idle, but just keeping the CPU nice and hot. Okay.
M
Before we move off of this: I mean, I think we're talking about solution space here, but backing up to the initial question of Kubernetes conformance. We have API guarantees and things like that; is this effectively a functionality version of an API guarantee? That's really what it comes down to for me: once a behavior is codified in a conformance test, do we say we are not going to change it? Is that behavior not changeable?
M
Is that a guarantee going forward, or is a conformance test changeable? I'm not sure what users expect, right? If you go into the conformance tests and you're like, okay, any behavior that a conformance test guarantees for me, I can assume will always be this way; is that reasonable for people to think? Do users think that way? I guess that's where I'm going with this thread.
A
I think if we change it, it sounds like we're changing user-visible, API-level things quite dramatically. So I do think that's kind of the question for beta; but basically, unless we know that, even for the API-specific behavior, we already migrated users off that particular behavior and we have confidence, right... I still think it is arguable; it could be a case of redefining that behavior. The system always has to evolve, so that has to be the case in these kinds of cases.
A
I already see that Jim did a lot of due diligence, trying to say: do we have use cases? Looks like we don't. I also don't know of any, but I just want to call out, as earlier, that this behavior was actually insisted on by SIG Storage previously, and by OpenShift; and that's also my worry: right now, at this moment, other vendors maybe don't even know whether they rely on that one.
A
So it's just like the liveness probe timeout, right? Obviously, from day one, that behavior was by design not the right behavior, but customers may already rely on it, and the vendor may not even know customers rely on it. So that's my little concern here. We need to think about this carefully; obviously, per Jim's data, the problem is a really unacceptable problem. But I remember this problem has been raised in the past, and the corresponding answer...
A
It was basically: that's a known issue, and it's going to be fixed later, outside of Kubernetes; that's where it's going to be addressed. So this is the problem. We maybe could say, oh, that earlier claim that we are going to have those use cases was actually a false claim, and we don't have the use cases; but we need to have confidence. So I'm not sure; I don't see the use cases, at least from GKE, but I can't speak beyond that.
A
I know all the use cases from GKE, but I don't know whether another vendor has the problem today. I do think OpenShift has already stated clearly, having done the due diligence, that they don't need these cases, which is good. So we may need to think about some way to fix this problem at a lower level, and we don't care whether it's the kernel or systemd or Kubernetes; we are open to fixing whatever provides the best for the customer, and the conformance test also could be changed.
B
I just think: Jim, thank you for the diligence, and maybe, just as a next step, if you could share the kernel issue, we can revisit this. That seems like a clear course of action, but I don't want to discourage evaluating what we can do to be more efficient.
N
Yeah, hi. So I think this is more of a check on where we are. The ask was to review two PRs. For one of them, Tim Hockin and I worked on the changes to the in-place pod vertical scaling design, and I think, Derek, we're waiting on you to take a look; this is PR 1883.
N
If you get the chance, please take a look. I think you're the last person who needs to look at it and see if everything, you know, smells okay to you, no concerns. If you have any questions, I'm happy to answer them here. Have you had a chance to look at this one?
N
Yeah, we want to try and get this into 1.22; I just want to get an early start on it, so if there are any concerns, if you could look at it whenever you get the chance, that would be great. The second one is CRI-related; it's just a process change. The PRR section was missing from our older incarnation of the KEP, and I added that, so that can possibly just be merged.
N
If there are no flags that you see, I'll follow up next week on this. So, is that okay?
A
Thank you, Vinay. And the next, last topic: this race condition between the scheduler and the kubelet, I think related to the memory manager. So I guess that's what you proposed? Sorry.
O
Hi folks, can you hear me? It's not really related to the memory manager. I found this issue in one of our CI jobs; to be more specific, in the hugepages job.
O
So the problem that I saw: a pod was moved from the pending phase to scheduled, so it passed the scheduler, but after that it failed, because the kubelet said that it does not have enough resources, does not have enough hugepages resources. So you are welcome to leave any comments under the issue, and I'm pretty curious why we need the check in both places, like in the scheduler...
O
It's the same check: we have the Fit plugin in the scheduler, and we also call the same Fit logic in the kubelet during the admit phase, and in both places we are using different caches; so the picture of currently running pods can be different, and that's the place for the race.
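[Editor's note: for a concrete picture, the resource shape being double-checked looks like the sketch below (hypothetical quantities). The scheduler's Fit check and the kubelet's admit check each compare requests like these against their own view of the node's running pods, which is where the two caches can disagree.]

```go
package main

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
)

func main() {
	// Hugepages are requested as an extended resource named by page
	// size; both admission points sum these across running pods.
	requests := v1.ResourceList{
		v1.ResourceName("hugepages-2Mi"): resource.MustParse("128Mi"),
		v1.ResourceMemory:                resource.MustParse("64Mi"),
	}
	fmt.Println(requests)
}
```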
A
I think we've run out of time; thanks for reporting this problem. We can follow up through the issue. I saw your PR to fix the dockershim side has already been merged, and I think we need to call out to make sure CRI-O and containerd don't have this problem; and we also need to look at the initial issue, to figure out what the initial issue is, and just based on your discussion we can follow up on the issue here.
A
Okay, this confused me; I don't understand. Okay, so let's follow up in the issue after the meeting. So thanks, everyone, for today's meeting.