From YouTube: Kubernetes SIG Node 20230307
Description
SIG Node weekly meeting. Agenda and notes: https://docs.google.com/document/d/1Ne57gvidMEWXR70OxxnRkYquAoMpt56o75oZtg-OeBg/edit#heading=h.adoto8roitwq
GMT20230307-180441_Recording_1386x1120.mp4
A
Hello, hello. It's March 7, 2023, and this is the weekly SIG Node meeting. Welcome, everybody. We have a lot of things on the agenda, but I want to get started with some statistics, as we typically do. There are 241 active PRs. That's less than last week, I think we're down by seven, but it's still a very high number.

I think one of the problems is that not everybody is around at this time, and I think Bruno is not around to approve many things, but we try to be creative here. So if you want a sense of what was happening, you can look at the most recently closed PRs. Some highlights: I think two or three KEPs were completely merged into the tree, so we have good progress there. I also wanted to highlight that we have about 40, I think it was 40 last I checked, yeah, 40 open items with the label priority/important-soon, meaning that, typically, this priority means it needs to be addressed in this release.

It feels to me that not all of them can be addressed in this release, and what we'll do on tomorrow's meeting, the CI meeting, and we already did it last week, is remove some of the items, especially very long-waiting bugs, from important-soon, so this milestone will be cleaner. And I think maybe late this week we will start applying the milestone, so we can know what needs to be in this release and what can wait for the next one, or whatever. Any questions on the current status?

Yeah, code freeze is very soon, and the last announcement is that release comms asked to start the feature blog. If your KEP is moving stages in this release, please consider adding a feature blog about that; it will promote the feature.

Okay, with that, I think the first agenda item is Kevin's.
C
Yeah, yeah, I'm online. Hi, everyone. I just have a brief item. I was starting work on a KEP for adding a condition for pending pods that are stuck due to configuration errors, and someone brought up in this KEP that there's a condition, InvalidImageName, which is usually for, like, it's validated against, I think, Docker's regular expression; it checks things like whether or not a user has all capitals in the image name, or they're using HTTP, or whatever else fails for whatever reason. So I kind of created a separate PR, a POC, for adding that validation into the validation of a pod, so that it can actually check these and not schedule them and not get them to a kubelet. And I guess I'm just curious.

If this should probably be a separate KEP on its own, because I know there was some discussion about whether or not it might be a breaking change, if people are relying on scheduling a pod and then having a separate controller change it to get it working, I don't know. But I wasn't sure; it's not pressing, and I don't think it needs to be in this release. But I was just curious.
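For context, a minimal Go sketch of the kind of image-name syntax check being discussed, assuming the github.com/distribution/reference parser; the helper name and sample images are illustrative, not the actual kubelet or API server code:

    package main

    import (
        "fmt"

        // Reference parser commonly used for image name syntax checks.
        "github.com/distribution/reference"
    )

    // validateImageName is a hypothetical helper: it only checks that the
    // reference is syntactically valid (e.g. no uppercase repository name,
    // no http:// scheme), not that the image actually exists or is pullable.
    func validateImageName(image string) error {
        if _, err := reference.ParseNormalizedNamed(image); err != nil {
            return fmt.Errorf("invalid image name %q: %w", image, err)
        }
        return nil
    }

    func main() {
        fmt.Println(validateImageName("nginx:1.25"))           // <nil>: valid
        fmt.Println(validateImageName("Nginx:latest"))         // error: uppercase repository
        fmt.Println(validateImageName("http://example/image")) // error: scheme is not a valid reference
    }

Doing this check at pod admission time, rather than only in the kubelet, is exactly the behavior change whose compatibility impact is debated below.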
A
Yeah, I know some examples of images that don't exist but still conform to the standard format of image names; those images may be present locally or somehow be, like, replaced or tagged locally. I don't know whether it's definitely a big problem, and I don't know how much it applies to HTTP instead of HTTPS. So I think this is a good thing to highlight for the end user, but yeah, I don't know. Does anybody else have an opinion here, or, like, crazy use cases that you are currently using?
C
One
percent
suggests
I
think
it's
static,
pods
about
whether
or
not
when,
when
I
think
when
you're
creating
a
a
pod,
if
it
fails,
should
it
be
allowed
to
fail,
and
then
it
just
shows
up
in
the
Pod
logs
or
if
we
want
the
logs
buried
inside
I,
guess
cubelet
or
whatever.
That
was
one
suggestion
for
not
going
forward
with
it.
So.
A
Yeah
I
think
original
APR
was
moving
the
validation
from
Kublai
to
API,
but
I
think
for
static
posts.
We
just
need
to
keep
the
validation,
so
the
question
is
how
much
it
will
be
exposed.
So,
okay,
if
anybody
on
the
call
knows
what
to
check
and
can
comment
on
this
PR,
please
do
so
I
personally,
don't
know
whether
it
has
enough
information
for
being
accepted
in
the
series
right
now,
just
because
we
we
don't
know
what
we
will
break.
Yes,
sir.
D
So, Kevin, I will have a look. Sorry, I haven't looked at this one, so that's why I don't have more background. But you already have the PR, so I don't know what's preferred, a separate KEP or a PR; let me first look at the PR, yeah.
A
Thank you, Kevin, and thank you for fixing it. Francesco, do you want to talk about rate limiting? It's a very exciting topic. Yes.
E
Hey everyone. So, first of all, this is in the context of graduation of the pod resources API, removing the pod resources feature gate, which was identified as a 1.27 item, something we want to do. When we identified the GA graduation criteria, two items were prominent: one is Windows support, and the other is rate limiting, actually denial-of-service prevention.

So during the API review those items were identified and those questions were asked, and this is why I'm bringing them to this audience. If there is an easy answer right now, great; otherwise it's always very good to chime in on the KEP to answer Jordan's comments. So, very quickly, the items are: we are adding rate limiting for pod resources, which is good, but the kubelet already has a bunch of other endpoints which are even more costly and, let's say, delicate in this respect.

So, is it okay to add it just here? I guess Jordan is asking about the general plans, or so. That's the first one. The second item is, this is actually an implementation thing, but it's good to have a direction: the current implementation is global, so a rogue client can consume all the budget and another client could be starving. And the third item is: okay, I'm exposing QPS and burst settings, but Kubernetes in general, as a project, is moving in a different direction.
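For illustration only, a minimal Go sketch of the global token-bucket limiting being described, using golang.org/x/time/rate; the QPS and burst numbers and the handler shape are assumptions, not the actual pod resources implementation:

    package main

    import (
        "errors"
        "fmt"

        "golang.org/x/time/rate"
    )

    // One process-wide limiter: because it is global, a single rogue client
    // can consume the whole budget and starve every other caller, which is
    // the second concern raised above.
    var podResourcesLimiter = rate.NewLimiter(rate.Limit(20), 40) // 20 QPS, burst 40 (illustrative)

    // handleList stands in for serving one pod resources List call.
    func handleList(client string) error {
        if !podResourcesLimiter.Allow() {
            return errors.New("rate limit exceeded, try again later")
        }
        fmt.Printf("served List for %s\n", client)
        return nil
    }

    func main() {
        for i := 0; i < 50; i++ {
            _ = handleList("chatty-agent") // drains most of the shared budget
        }
        fmt.Println(handleList("quiet-agent")) // may already be rejected
    }

A per-client or per-connection limiter would avoid the starvation problem at the cost of more bookkeeping, which is the direction question being raised.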
A
Okay, I had this comment about it being applicable to all endpoints from the very beginning. I think it is a good thing to have, and it would be strange to only have it on pod resources.

The only thing that is unfair is that we block the pod resources graduation on this bigger chunk of work. So, yeah, I want to make sure that we are not blocking, I mean, pod resources is not completely blocked: it's on by default and it's beta, so people are using it, right? So it's not like... yes.
A
And can you tell if there is any analysis already done, like, what are the expensive endpoints we mentioned?
E
Give me just a second. Okay, examples are, like, exec, the logs, port-forward, and I'm sure there are more.
A
Yeah, because what I worry about is metrics: if you start rate limiting metrics, there may be all sorts of crazy use cases for metrics that we're unlikely to cover with a single setting; with others, like logs, and policies, we may be covered better, I don't know. So what do we want to do with this PR, then?
A
Yeah, I don't know if this was discussed already, but another idea would be to have an extremely high limit, so it will protect from, like, very abusive behavior, but it wouldn't limit anybody in anything, without getting into the whole discussion. It protects from the DDoS attack concern that was raised in the KEP, so you unblock this PR, and the KEP, from graduating, and then, if this limit needs to be adjusted lower, we can have more discussions. Yeah.
E
Yeah, I understand what you mean. The comment, however, was about the general fact that we are worrying about rate limiting just for this; this is what I understand. But I think that we can totally keep iterating offline, if you like; I would really love to, and I don't want to steal more time from the other items.
A
Okay
and
yeah
Francis
I
I,
don't
want
to
push
for
a
graduation
of
this
Gap
I
think
we
all
know
that
every
new
cap
brings
more
questions
and
it's.
It
may
not
seem
fair
that
this
question
is
addressed
during
specific
care,
but
we
have
no
other
option,
so
we
blocked
it
will
be
blocked.
A
Thank you. The next item is mine. SIG Scalability came with the suggestion that we look at the API QPS limit on the kubelet. It's two settings right now, currently set at five and ten: five API calls per second and a burst limit of 10. The suggestion is to bump them, and they're saying that we can even, like, the goal is to not limit it at all, so the API server will handle everything, can handle everything fairly.

So there is no need to worry about it on the client side. But as a first step they suggest bumping it a little bit; I think the current proposal is 50 and 100, yeah, 50 per second and the burst is 100.

And there are two questions. The first question is the mechanics of that. Right now we have everything documented, and the default value is documented as a field, the field name... let me find it, yeah. So today it's just documented as a default of five in our config file.

The suggestion is to just bump it here, without doing anything else, so it will just apply to everybody, instead of us suggesting that the best practice is to increase this limit. And since we did so much work on the API server to handle this load, it shouldn't be a problem for anybody, and if somebody wants to return to the previous behavior, they can always go back down to 5 and 10 as before.
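As a rough sketch of where these two numbers end up, this is approximately how a QPS/burst pair becomes a client-side rate limiter on a Kubernetes API client via client-go; the values and wiring are illustrative, not the kubelet's exact code path:

    package main

    import (
        "fmt"

        "k8s.io/client-go/rest"
        "k8s.io/client-go/util/flowcontrol"
    )

    func main() {
        // The proposed new defaults under discussion: 50 requests/sec, burst 100
        // (the old defaults were 5 and 10).
        const kubeAPIQPS, kubeAPIBurst = 50, 100

        cfg := &rest.Config{
            Host:  "https://example-apiserver:6443", // placeholder endpoint
            QPS:   kubeAPIQPS,
            Burst: kubeAPIBurst,
        }

        // client-go turns QPS/Burst into a token-bucket limiter like this one;
        // every outgoing request has to take a token first, so a burst of 100
        // can go out immediately and then requests are paced at 50 per second.
        limiter := flowcontrol.NewTokenBucketRateLimiter(cfg.QPS, cfg.Burst)
        fmt.Println("allow first request:", limiter.TryAccept())
    }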
A
So, yeah, I see no problem with this change, but I want to know if there are any concerns from anybody else and if it can be moved forward.
D
I just reviewed it, I gave the looks-good-to-me. But then last week, when they proposed it, that's when I said that I do have concerns, because I do see that people didn't really do the testing. They basically just say: oh, we have the enhancements on the API server, so that's why we think all this is fine. But I remember two years ago, when we bumped it, we were still running into some problems. So that's why two years ago we didn't put it in, right?

So that's why. But I'm okay with the current one, because in-house I did see some people bump it to a similar number and it actually worked for a while. So that's why I'm okay with the current number. If it's working, still works well, next time I'm okay to try to keep bumping, but I am not comfortable just going directly to no limit at all.
A
Both
in
the
same
company,
so
if
anybody
else
has
an
opinion,
I
I
heard
on
the
discussion
there
thought,
maybe
you
can
comment
about
that-
that
Amazon
increased
it
to
10
and
20.
G
Yeah, so sometime last year I bumped it to 10 and 20 for both of our AMIs, EKS AL2 and then Bottlerocket, and I didn't go higher just because we didn't see much of an improvement in the number of pods per node in the sort of scale tests we were running. But it hasn't caused any issues in production for all of our customers since last year at 10 and 20. But yeah, as soon as this goes in I'll...
A
Then the mechanics of that: right now it's a beta API, like the kubelet configuration is a beta API, so we are probably fine changing any defaults that we want, right? So I don't think we need to do any extra work.
G
And there's a graph of some of the metrics I collected on there, and there are some links to the PRs I submitted to our repositories, for Bottlerocket in particular, which have some of the numbers that we gathered.

Yeah, in particular, we were interested in the autoscaling case where you have lots of pending pods, a new node launches, and then a lot of pods all get scheduled at once, and that sort of really hammers the API server from a single kubelet.
F
Actually, one decision is that we only wait with deletion from the kubelet if there is a finalizer, because Job uses finalizers, and this way we make sure that we don't actually worsen performance, due to the low QPS, for pods that don't use a finalizer. If you don't use a finalizer, you probably don't care about the final state; so we still transition, but we don't enforce it, since the pod may be deleted from the API server anyway. And in the process it turns out that we also have an issue with running pods.

So that's why I also created the second PR, for the running pods. So yeah, basically I'm looking for reviews; David Porter started reviewing them, but it would of course be good to have more eyes on this.
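A tiny Go sketch of the shape of that decision, hedged: the helper and the finalizer value shown are illustrative of the idea, not the actual kubelet status-manager code:

    package main

    import (
        "fmt"

        corev1 "k8s.io/api/core/v1"
        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    )

    // shouldWaitForFinalPhase mirrors the idea from the discussion: only pods
    // that carry a finalizer (e.g. the Job tracking finalizer) need the kubelet
    // to report the final phase before the object is removed; everything else
    // skips the extra wait, so the low default QPS is not made worse for it.
    func shouldWaitForFinalPhase(pod *corev1.Pod) bool {
        return len(pod.Finalizers) > 0
    }

    func main() {
        jobPod := &corev1.Pod{ObjectMeta: metav1.ObjectMeta{
            Name:       "job-pod",
            Finalizers: []string{"batch.kubernetes.io/job-tracking"},
        }}
        plainPod := &corev1.Pod{ObjectMeta: metav1.ObjectMeta{Name: "plain-pod"}}

        fmt.Println(shouldWaitForFinalPhase(jobPod))   // true
        fmt.Println(shouldWaitForFinalPhase(plainPod)) // false
    }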
A
Okay, I don't think we need to follow up right now, so thank you; but if anybody's interested in participating in the review, please do so. Next up.
I
Hi, okay, this is a quick status update. It looks like we managed to score two bugs, and one of them is a release blocker that showed up. Let me start the video here. Right, so one of them is the release blocker. It showed up in the GKE internal tests, which was kind of a blind spot for us, because we did not have any tests there; there's no test coverage in our current CI that catches this case, a single kubelet. I was able to use the local-up cluster to repro this issue. I had to dig in, had to make some small changes to the cmd/kubelet code base to force a nil kube client; it doesn't work out of the box, but I think it verified it. Jordan Liggitt had some comments; he looked at it and has approved it right now, and it should be merging and addressing that blocking CI, and we'll find out if there is anything else that comes up from it.

The second issue is this: there was a new CI test failure that popped up. Thank you for bringing it up. This happened after we merged. When I did my initial analysis, I looked into it and figured out it's probably not, it's not likely that we regressed it, because this test has been red since day one, since it was introduced in December sometime, and it never had a successful run; and when I looked more closely, I saw that the failure pattern is similar. There are cases where the framework is trying to create a pod and that fails, and this is both before and after the PR merge, and I'm not exactly sure why it's randomly failing in different places. It looked like memory could be an issue, but I didn't really see any more evidence of that. So at this point the investigation is still pending. I'm happy to take a look at it further, but my priorities are a little different at this point.
A
I
I
remember:
you
also
enabled
re-enabled
in
place
first
like
Ci
jobs
specifically.
Are
they
all
green
now.
I
No, actually, I was just looking at it and there is a merge conflict. I am resolving it right now, so it should be done by the second half of this meeting; I'll finish resolving that and then maybe, after the update, you may need to re-LGTM it. I'll just ping you on Slack for that. Besides that, yeah, I saw you approved it, thank you.

Once that is in, I'm hoping to see that they run clean, because the other jobs, the pull job, which is similar to what this CI job does, and the alpha job, for cgroup v1 and v2 respectively, are working fine. So there are no issues with that; I haven't seen any problems, and I'm expecting no surprises here, but you never know, we'll find out. At this point, given that we have two bugs, how do you feel, how does SIG Node feel about this PR? I'm kind of leaning towards okay.
A
Yeah
I,
if
there
is
a
big
green
tests
and
it's
fine
just
let's
just
confirm
that
it's
actually
green
and
especially
when
your
a
new
jobs
will
be
immersed.
It
will
indicate
that
the
feature
is
working.
I
Yeah
yeah
I
think
it
it
is
that
confidence
is
there
I
I
think
I
already
pinged
him
for
I'm.
Looking
at
the
Paul
here
he
wanted
a
couple
of
follow-up
changes
to
restructure
the
naming
of
resize
policy.
That
PR
is
out
there,
I
just
bring
them
to
look
at
it
last
night
and
I
guess
I'm
at
a
point
where
I'm
going
to
create
that
blog
PR.
I
To
yeah
yeah
the
critical
critical
ones,
I'm
definitely
fixing
them,
especially
if
there,
if
it's
clearly
identified
as
my
issue
here,
anything
that's
not
like
this
other
PR,
the
other
bug
I'm
happy
to
look
at
it,
but
my
priorities
are
slightly
different.
After
the
last
three
weeks.
A
V2
May
have
some
role
in
in
here,
but
if
investigation
doesn't
indicate
that,
that's
fine,
because
I
think
we
mostly
rotation
with
C
group
E1
before.
I
Yeah, no, the pull job that I have, which I put in the comments of this bug, uses cgroup v2, and cgroup v1 is currently covered by the alpha features job, against containerd 1.6.6. I think... I think both run... Mike, if you're here, please correct me if that's not the case, but I believe my previous understanding from another meeting was that it ran on both cgroup v1 and v2 in the alpha job; for the pull job that I have, I learned it's cgroup v2. But yeah, once we enable this, we'll know for sure. So this is the next big-ticket item I need to work on, besides creating a placeholder PR before March 8th, tomorrow, I guess. So yeah, I'll shoot you a quick message as soon as I fix the conflict.
I
No revert, yeah. The timeline I had was: okay, if there are any major issues, if something's going to happen, it's going to happen in the first three days. And I came pretty close; Jordan created that bug on day four. So.
A
Okay, yeah, our next one is Todd and Muhammad.
J
Oh, hi there. This is my first meeting here. I'm usually involved in SIG Release and SIG K8s Infra for work, but this is something that I'm working with Todd on. So, for those of you who don't know, we have a lot of money to spend on AWS, so we want to run some of the e2e tests in the kubernetes repository there; the kubelet, or the node e2e tests, are much simpler to run on AWS, as they are usually single-node tests.
G
Yeah, sure. Sort of as part of the discussion, there was another PR today by Dims that was removing some of the vendored AWS cloud provider stuff, and then there's a discussion about whether the remote runner should be part of the test infrastructure at all or should be pulled out, so that you can actually extract all the cloud-provider-specific code out of the kubernetes repository. Currently it's not; there's a GCE runner, and, you know, Muhammad's first PR adds in an AWS runner, so you can easily run the node e2e on AWS.

So I guess part of the question is: what is the plan there? Is it to eventually move all of that stuff out to an external repository so that the kubernetes repo doesn't have those vendored modules? If that's planned, is it okay to go ahead and merge in the additional code now and then sort of do it all at once?
A
I
think
one
of
the
questions
I
had
before
this
work
is
I
know
that
there
is
a
test
on
call
happening
like
for
open
source
like
if
there
are
some
issues
with
the
environment.
I
know
who
the
pink
and
they
can
help
me
troubleshoot.
What's
going
on
I'm,
not
sure
like
if
you'll
have
tests
on
AWS
is
there
some
on-call
or
maybe
the
same
rotation?
It's
different
rotation
that
might
be
interesting.
Conversation
to
have
I.
J
I can answer that question. So the community's infrastructure on AWS, community members can access it; that's one of the things we're looking to do before it goes production.

So let's say you're running e2e tests on AWS and there's an instance that's failing: you should be able to log into that account and look around and see what's going on. With regards to the testing on-call, my understanding is that those people manage the prow clusters and don't really get involved in broken tests.
A
Yeah, so that's about broken tests, but it's more about infrastructure.

Yeah, I guess, yeah. One bigger question is how it will be approached from SIG Testing, because, again, SIG Testing manages a lot of stuff, for instance test images: test images somehow magically get updated and we don't need to worry about it, and if the infrastructure is not working, it's also not something that we in SIG Node need to worry about. If these tests start failing and we are not sure whether it's infrastructure or not, I'm not sure how we will proceed with the investigation and who will be engaged in that.

And especially, like, these days we even have problems because it's a decent set of tests and we heavily regressed recently, because we didn't spend too much time on them; so now we can barely keep those existing tests working.
D
May I ask what's the goal for this one? For this one, so that maybe I can better understand.
J
So the background is: we need to start running many Kubernetes e2e tests on AWS. If you remember, last year Amazon gave a lot of money, right, so we need to use that. And a good candidate for migration, or, well, not migration but additional coverage on AWS, for example, is the node e2e tests. We ran an experiment to see how difficult it is to, for example, write the runner to build the instance and so on. It's very easy, right, so we've got a patch in the repository that does that.
D
So is there anything similar on the AWS side, some in-house equivalent, like a node team responsible for those? Because we used to have the in-house one. I'm from Google, right, so I used to build the node team, like I built SIG Node, and in-house we also built a node team. One of the things I especially told them is to make sure the set is green, and even internally we had to rework that a couple of times.

Correct me if I'm wrong, but honestly, even for the team that took over node, and GKE, particularly GKE node, I also asked the team to make sure that the CI-related stuff is green; that's the highest priority, basically, besides anything else, that's the top thing I send to them. So can we... we have the CI signal, the CI subproject; I think Sergey is the lead of that one, so we can start from there.

We tried a couple of times: we have the e2e tests on Azure for Windows, right, and we also had e2e tests in the past on AWS. I realized the only thing that really happened is the out-of-tree cloud provider; beyond that, you also need some engineers on top of it to make everything smooth and aligned together. Otherwise, with those e2e tests on AWS, and this is not the first time they tried to add e2e tests, the previous ones were only green at the start, and nobody really picked them up.
J
Okay, good. My understanding is that all you need is a standard Linux server that's running a particular operating system, right? You bootstrap it in a generic, standard way using cloud-init, for example, and then you log in, you run the test, and then you kill the server, right? That was my understanding of all you need from AWS.

If the image wasn't built correctly or there's something wrong with the operating system, then that's a fun conversation to have, but I believe that's few and far between. As someone who's run infrastructure on the public cloud for a long time: you take a Linux server, you install whatever you want on there, then you run the test and you kill the server. I believe that was the only expectation that was set, unless I'm missing something.
D
We definitely need someone; otherwise, who is going to make sure those stay green over time and make sure the processes and procedures are built properly, right? So if there's no clear, easy way to use it, people just can't do the triage, because if you really think about it, the open source project is volunteer-based, right; companies volunteer and donate their engineering resources, compute resources, all those kinds of things. So one way or another, we need to make other people's, the community's, life easier, so then we can keep our community healthy, people move forward, and it benefits more users, right. Because in the past we do have these cases where we just don't know what's underneath, the node just died, and should we respect that failure or not respect that failure; and the engineers don't feel like they have the knowledge to debug those things, and they also maybe don't have access to it. So that slows things down, and then they lose interest.
A
And we have meetings on Wednesday at the same time, 10 to 11, for the CI subgroup; you can join and we can discuss further. I think even enabling it for vanilla Ubuntu sounds like a no-brainer, like, why wouldn't we do that? But at the same time, knowing how many issues with the images we had before, and these issues are not very easy to troubleshoot: something happened, tests started failing, and then somebody needs to go and dig into what's happening; in the end we found that some image was changed underneath and then there is some incompatibility or whatever, and it takes a lot of time and effort to investigate. So minimizing that effort will be...
J
I have another question about that. You kind of said running vanilla Ubuntu images is a no-brainer, right? I'm curious why the project... I'm a latecomer, so I don't really know what happened in the past.
A
I,
don't
remember
what
other
images
we
tried:
don't
have
better
memory,
so
some
images
were
like
started
failing
at
some
point
because
of
some
like
I,
don't
know,
maybe
current
components
or
whatever,
and
especially
for
complicated
tests
like
eviction
tests.
That
requires
like
it
was
a
specific
behavior
of
eviction,
so
those
tests
are
really
hard
to
troubleshoot
and
if
we
add
one
more
image
type
that
may
fail
some
at
some
point,
then
we
kind
of
increase
entry
increase
amount
of
ways.
A
Things
may
fail
and
it's
not
clear
who
will
be
investigations
at
if
we
have
some
like
right
now
we're
using
a
g
key
flavored,
Ubuntu
images,
I
think
it's
called
Ubuntu
cloud
or
something
and
those
at
least
we
know
what
to
expect
from
there,
and
we
have
some
contacts
who
to
ask
if
something
is
not
working
Ubuntu
we
don't
really
have
canonical
like
on
call
and
like
we
cannot
just
tell
them.
Can
you
help
us
investigate
or
something
is
wrong
with
this
image?
Okay,.
D
I didn't see many problems with the Ubuntu image, but we did have a lot in the past with Fedora and CentOS. And even when it's not a particular kernel, right, it just has specific kernel configurations, and our tests may depend on certain things, for example the swap tests.

Maybe you rarely run into those problems because such-and-such is not there. And even when we have a problem, Kubernetes will quickly bring up a new node and the job just gets scheduled to the new node; the problem is in the test, right? The test expected a certain behavior at a certain point, and then you expect certain things to return with a certain behavior.

So sometimes these are strange problems for us. And again, the open source community is volunteer-based, so when it's strange, people just kind of say: oh, this is like an image problem, or no, it's a VM problem, all those kinds of things; they don't want to spend a lot of time to investigate the details. So we end up with those tests always being red, and then we paper over the real problem, and that really doesn't unblock, or maybe it slows down, the release and development process. So this is where I expect the tests to be more green; if we want to make this really useful, the tests have to be greener, so we need some expertise, at least to invest some initial time and help to make everything smooth.
A
Okay, we can continue the discussion at tomorrow's meeting, and offline as well. We have 13 minutes left and three topics. Do you want to give an update on what's going on?
K
Yeah, we had two meetings today for our community updates on the KEP for the resource management plugins. In the first meeting we gave some updates on the changes we are planning for the attribute-based API and also clarified some of the open questions in the KEP which were posed, which were asked about the attribute-based API, just to give some more context. Then we went through the latest on the architecture.

We got good input from Kevin on what we could do better there. Basically, we are going to go towards a single CDI driver implementation, so we will not require a separate node driver or something like that to understand DRA claims. This was the conclusion after the meeting. And at the end we had a demo, just to go through some scenarios, like how to bootstrap the system, whether bootstrapping works, whether the system is working, whether we can start some pods which do not require claims, and then also look at what happens if the driver is not up; and, yeah, at the end, try to allocate something. That was it.
A
Okay, thank you for the update. Does anybody want to ask or tell us something about it? If not, we can move on. Yeah, this is just a minor topic: as part of the annual review we will be updating some OWNERS files, and I stumbled in one of the PRs on a needed change to this event.go-related OWNERS document, and I was wondering...

I mostly wanted to ask you and Derek what's the history of this couple of teams in the OWNERS files, the API approvers and the node API approvers. Is it for the CRI specifically or for all APIs, including, like, event names and such?
D
We
do
have
to
know
the
level
of
the
API
approval,
at
least
for
a
while
direct
and
I
was
the
node
API
reviewer
of
just
I
want
to
limit
myself
only
on
the
nodes,
I
guess
yeah.
So
we
will
include
all
the
node
level
of
API.
Let
me
double
check
with
America,
because
for
a
while
I
think,
I
removed
myself
and
leave
Derek.
There
represent
note.
A
I
will
I'll
sometimes
you
to
check
on
that
wow.
Okay,
Philippe
you
here:
let's
talk
about
it
only
CRI,
it's
very
exciting
topic:
yeah.
M
So, some quick background: I've really been building a project that has required access to both the CRI API and the containerd socket. After looking into it a bit, I started kind of questioning why we don't have any RBAC permissions or some sort of read-only version of access to the containerd socket. As we all know, privileged containers having access to the containerd socket is basically being root on the node.

There are other applications, like Falco and other telemetry solutions, that also just read events on containers, for example, and really don't need this type of access. And it seems like the kubelet already has some endpoints, like the port forwarding and logs, that are basically proxying traffic to the CRI API, so it could do some sort of RBAC enforcement on certain endpoints. Right now I'm just kind of trying to figure out how to approach this problem, and kind of what organization or what domain this is, like, where the solution should live. Because there's a discussion: should this be in containerd? The containerd people might say, no, I don't want RBAC things and read-only solutions in containerd. There are arguments for the sake of saying, well, the CRI-O project would also benefit from this. So that's kind of where I'm looking for guidance, in how to approach this problem, or where to put the solution, really.
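To make the access pattern concrete, a minimal Go sketch of what such a telemetry agent typically does today: dial the runtime socket directly and issue read-only CRI calls (the socket path and wiring are illustrative); having that socket mounted at all is the privilege problem being discussed:

    package main

    import (
        "context"
        "fmt"
        "log"
        "time"

        "google.golang.org/grpc"
        "google.golang.org/grpc/credentials/insecure"
        runtimeapi "k8s.io/cri-api/pkg/apis/runtime/v1"
    )

    func main() {
        // Mounting this socket into a pod is effectively root on the node,
        // even if the agent only ever issues read-only calls like the one below.
        conn, err := grpc.Dial("unix:///run/containerd/containerd.sock",
            grpc.WithTransportCredentials(insecure.NewCredentials()))
        if err != nil {
            log.Fatal(err)
        }
        defer conn.Close()

        client := runtimeapi.NewRuntimeServiceClient(conn)
        ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
        defer cancel()

        // Read-only call: list containers known to the runtime.
        resp, err := client.ListContainers(ctx, &runtimeapi.ListContainersRequest{})
        if err != nil {
            log.Fatal(err)
        }
        fmt.Println("containers seen by the runtime:", len(resp.Containers))
    }

Even though ListContainers is read-only, the same socket also accepts every mutating CRI and containerd call, which is why a separate read-only endpoint, or RBAC in front of the kubelet proxy, looks attractive.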
A
And it'll be very interesting to have that. I think my perspective here is: if it's built into containerd, and there is a read-only API in containerd and CRI-O, then many telemetry and security agents will just switch to it immediately, because they don't need all those problems with elevated permissions for their agents.

So it will be very good from a security perspective: we have so many agents that suddenly don't need elevated permissions, and we will have a more secure environment. But many people commented on this issue that a read-only port on containerd is not a universal solution; some RBAC may be needed, some granular access to specific APIs may be needed, and to solve that problem containerd is not the best place.

So if we're thinking about coverage: I think read-only covers like 80% of scenarios, off the top of my head, just coming up with numbers, but then the rest, the 20% of scenarios, probably will be covered better if you start this as a project on the Kubernetes side rather than in the container runtimes.
M
I see it as... I really understand your point about putting this in containerd, and I don't really know how to move forward with getting people on the containerd project to say that this is a good idea; but I also see, looking at the current kubelet code, the proxying, the traffic it proxies today to the CRI.
L
Yeah, I haven't given it a whole lot of thought, not for the last, you know, four years. We could certainly talk about it again, get together a few of the, you know, container runtime members and give this a thought. I mean, I'm not sure how you would want to split up the CRI API with RBAC. It would be interesting, but I'm not sure of the best way. I see the benefit certainly on a lot of the streaming APIs, right, you know, exec with response and stuff like that, for security purposes. But then why would you want to split up, you know, the other APIs? I would need to see a bigger, you know, description of it, or at least we'd have to have a long discussion about it.
M
Yeah,
for
my
use
case
really
having
a
read-only
continue
to
see
it
like
CRI
API
would
be
fine
for
me.
I
I
think
there
were
other
people
who
just
said
like
oh
yeah.
No,
but
more
granular
access
would
be
nice
and
but
from.
L
Pretty much, yeah, I get it. But we've also got, you know, namespaces in the container runtimes; we have lots of scope issues with the various pods. Is it read-only per pod, or for a particular user? This is going to open up a can of worms, let's just put it that way. We'll want to discuss it; I just haven't given it a whole lot of thought yet, yeah.

Yeah, there were a couple of people that responded in the PR that were interested; Samuel on the other side, I imagine, you know, and probably a couple of others on the CRI-O team are going to be interested in, you know, doing that. We also have a ttrpc API that we could probably use as well.

And we've had a lot of namespace discussions of late, right: would you want one per runtime handler, what if it's a VM, right, confidential containers? Why are we going to read-only, right? What was the real reason; is it just for a monitoring tool for every namespace, is that really what we're looking for, or was it something else? That's all I'm saying.
M
Yeah
sure
no
I
get
that
so
that's
yeah
for
my
use
case,
I
I'm,
in
kind
of
a
weird
one
where
I
read
the
container
or
image
layers
soon
through
the
continuity
socket,
basically
and
and
yeah,
but
I
I
get
that
there
are.
There
are
a
a
whole
bag
of
just
random
issues
in
regards
to
yeah
name
spacing
and
increasing.
L
Exactly, and we don't only support, you know, root; we do rootless as well, and that's why you've seen some of those, right, and we've done remote proxy. But yeah, for the kubelet we do have a specific thing we need, you know, for streaming. So if you're talking about the kubelet, that's different; it would only use, you know, all the other APIs. And, let's be fair, the CRI API is really for kubelets, right, and anything like that. It's not just for monitoring, but we do have monitoring APIs.
D
But even Kubernetes has the monitoring API, but it is also kept at the Kubernetes level, which is right. In the past, because of the performance downgrade, we used to export more data than today, and there were some efficiency issues, because we were using too much resource, I don't know, just for monitoring. So that's why we had to do the cleanup: we evolved those and kept only what's really needed at the control level. So there is the monitoring, but at the Kubernetes, first-class object level. Those kinds of things have been discussed back and forth so many times in the past. So, yeah, in the end, what you ask for is not a surprise, but it's just that, like, the metadata is kind of an open question; and also I do have a concern.

It is about how much detail, when you really ask the vendor who offers Kubernetes, or the admin who runs it as well: how much do you want to expose of the node details through the monitoring tools? So how are you going to make sure certain things can be monitored and exported and surfaced to the other monitoring tools? That's actually more complicated, yeah.
M
I kind of guess that there are things that I don't see here; considering the fact that it is not implemented, it kind of becomes, yeah... If this was easy, I think somebody would have done it a long time ago. It's obviously not easy, and that's kind of why it becomes an issue. But the issue I'm seeing is that myself, with my projects, and a lot of other projects tend to just go: okay, let's just mount containerd.sock and be done with it.
H
It's
not
only
about
well-being
rude,
it's
also
well
a
privileged
profile
of
deployment.
If
you
deploy
your
component
as
demon
said,
because
because
we
spot
security,
admission
controller
like
you,
cannot
deploy
something
in
the
Baseline
security
model
which
mounts
with
sockets.
L
There's
a
possibility:
we
have
other
apis
that
you
might
be
interested
in
we're
implementing
a
new
thing
called
node
resource
interface
that
we
should
be
able
to
handle
some
of
these
situations,
just
just
a
thought
as
soon
as
that's
in
cryo,
then
maybe
there's
a
different
route
than
using
the
cry.
Api.
Okay,.
A
Yeah, we're already out of time. I think one of the worries I have is that we're trying to protect ourselves, like saying we have this boundary of root, but then we move all the security problems, the cost, onto the user, so they need to allow these privileged DaemonSets because they want monitoring. So I want to make sure that we take into consideration that we don't only want to protect ourselves, but also help customers. Yeah, I feel...

Thank you, everybody, we're out of time. If there's anything else, we're always on Slack. Bye bye.
I
Sergey, quick, one quick second: I resolved the conflicts; please re-LGTM it when you get the chance, okay, for the node CI. Thank you.