From YouTube: Kubernetes SIG Node 20210914
Description
Meeting Agenda:
https://docs.google.com/document/d/1j3vrG6BgE0hUDs2e-1ZUegKN4W4Adb1B6oJ6j-4kyPU
A
All right, well, welcome everyone to the September 14th SIG Node meeting. The meeting is recorded and will be uploaded later to our YouTube channel for folks who can't attend today.
B
In terms of what's going on: there are a lot of PRs coming in and the queue is growing, mostly because we have many PRs hanging around while we figure out exactly what to do with the static pod regression and how to proceed. We have a cherry-pick, we have end-to-end tests, we have a couple of attempts to fix it. Some of that growth is expected because it's work in progress, but we also have more PRs coming in, and the merge rate is not great right now.
B
So I think, as people come back from vacation, maybe we'll pick up speed and start merging faster. I checked the closed PRs; there is nothing that rotted away that requires attention, so this is great. We are not degrading on that front, and we are actually paying attention to everything that's coming in. All the closed PRs were properly closed PRs, so yeah. If you think you missed something, just check out what people are working on. Thank you.
A
Yeah, thanks. Sorry, the one thing I was just trying to make sure of: I thought we got through every enhancement that was expected for the 1.23 release, but if folks feel like there was a gap, please let us know. I know at least that's where a number of us were focusing. All right, so let's transition to the business at hand. Elana, you have the next item here on the static pod regression.
C
Yeah, I just wanted to put this on the agenda to make sure that there was some visibility and folks were aware of the scope and impact and all that jazz. This is marked as a critical urgent bug; it was introduced in 1.22 as part of the pod lifecycle refactor, and we've been meeting a bunch.
C
Myself, Ryan, and Jordan Liggitt have been working on trying to get this fixed. Essentially, if you take a static pod, remove its manifest, and then re-add that manifest unchanged, that pod will go into an error state, which is a regression. So if you have static pods and you're not touching the manifest, no problem, you'll be fine; but if you do that, it will cause the pod to error.
C
We don't want that, so we're trying to get it fixed, and it is proving to be relatively complex, and we don't really have a test case to prevent something like this from happening. So I've been working on an end-to-end test for this. I have something that's definitely catching it on head and on 1.22, but it passes on 1.21 and earlier. However, we're still talking about whether we also want to add more test cases for more coverage.
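To make the scenario concrete, here is a rough, standalone sketch of the reproduction that test needs to cover; it is not the actual e2e test in kubernetes/kubernetes, and the manifest path, node name, namespace, and image are assumptions about the test environment.

```go
// Standalone sketch of the regression scenario (not the actual e2e test):
// write a static pod manifest, wait for the mirror pod to run, remove the
// manifest, re-add it unchanged, and check that the pod comes back Running
// instead of landing in an error state.
// Assumptions: kubelet watches /etc/kubernetes/manifests, the node is named
// "node-1", and the pause image is already pulled.
package main

import (
	"context"
	"fmt"
	"os"
	"path/filepath"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

const manifest = `apiVersion: v1
kind: Pod
metadata:
  name: static-web
spec:
  containers:
  - name: web
    image: k8s.gcr.io/pause:3.5
`

func waitForRunning(cs kubernetes.Interface, name string) error {
	return wait.PollImmediate(time.Second, 2*time.Minute, func() (bool, error) {
		p, err := cs.CoreV1().Pods("default").Get(context.TODO(), name, metav1.GetOptions{})
		if err != nil {
			return false, nil // the mirror pod may not exist yet
		}
		return p.Status.Phase == corev1.PodRunning, nil
	})
}

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", os.Getenv("KUBECONFIG"))
	if err != nil {
		panic(err)
	}
	cs := kubernetes.NewForConfigOrDie(cfg)

	path := filepath.Join("/etc/kubernetes/manifests", "static-web.yaml")
	mirror := "static-web-node-1" // mirror pod name is <name>-<nodeName>

	os.WriteFile(path, []byte(manifest), 0o644)
	fmt.Println("first start:", waitForRunning(cs, mirror))

	os.Remove(path)              // remove the manifest...
	time.Sleep(30 * time.Second) // ...give the kubelet time to tear the pod down...
	os.WriteFile(path, []byte(manifest), 0o644) // ...then re-add it unchanged

	// On 1.22/head the recreated pod ends up in an error state; on 1.21 and
	// earlier it returns to Running.
	fmt.Println("after re-add:", waitForRunning(cs, mirror))
}
```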
C
Is this exactly the way that we want to shave this yak, or do we want to approach this test case in a slightly different way, and so on and so forth? I also don't know if we have a working fix yet. The latest that I heard from Ryan was that with Clayton's patch the pod probe workers might not restart. Is that right, Ryan? Do you want to jump in?
E
Yeah, maybe. So there is an end-to-end test on the PR that Clayton wrote, and I think it's working for the most part. We're running it through OpenShift CI and testing it there as well, and we're getting some flakiness on probes, but it may not be related to the fix. So I'm diagnosing some of that now.
A
So I just had a question. I thought a lot of this came down to UID assignment. When I was looking at this, I think we were allowing static pods to assert their own UID.
C
Yes. And the other thing that's a little bit weird is how we calculate the UID of a static pod. It basically only includes the pod name, the contents of the file, and I think maybe the node name. So if you don't change that manifest, if everything stays the same, or if you have a static UID set in that pod manifest, then from the kubelet's perspective it has the same UID, and that's causing a lot of the weirdness.
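For illustration, here is a simplified sketch of that UID derivation: hash the manifest contents together with the source file and node name, so an unchanged file always yields the same UID. This is not the exact kubelet code, just the shape of the idea.

```go
// Simplified sketch of how a static pod gets a deterministic UID: hash the
// pod definition together with its source (file path and node name), so an
// unchanged manifest always yields the same UID. Illustration only, not the
// actual kubelet implementation.
package main

import (
	"crypto/md5"
	"encoding/hex"
	"fmt"
)

func staticPodUID(manifest []byte, nodeName, filePath string) string {
	h := md5.New()
	h.Write(manifest) // the pod spec as read from disk
	fmt.Fprintf(h, "host:%s", nodeName)
	fmt.Fprintf(h, "file:%s", filePath)
	return hex.EncodeToString(h.Sum(nil))
}

func main() {
	manifest := []byte("apiVersion: v1\nkind: Pod\nmetadata:\n  name: etcd\n")
	// Deleting and re-adding the identical file produces the identical UID,
	// which is why the kubelet treats it as "the same pod".
	fmt.Println(staticPodUID(manifest, "node-1", "/etc/kubernetes/manifests/etcd.yaml"))
	fmt.Println(staticPodUID(manifest, "node-1", "/etc/kubernetes/manifests/etcd.yaml"))
}
```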
C
But I guess that behavior has been in kube for so long, probably six years unchanged, that people will expect it to still work. So I was like, well, isn't that a bug? Shouldn't it just get a new UID if we delete and recreate the thing? And the answer is maybe not, since people might rely on this behavior and that would break them. Yeah.
C
Yeah, I know that some people are definitely using this; I think OpenShift was using it at some point. But in terms of whether we want to continue to support this behavior, I think the consensus, at least from Clayton and Jordan, is that we should: there's a pretty high risk of breaking people if we go and change this. So, okay.
C
I asked the same question, because we also saw some weirdness in terms of pod lifecycle on static pods, where, for example, if you restarted a node, then from the node's perspective it would go and possibly incorrectly update the mirror pod. That pod would transition from Running to Pending, which is not exactly what we wanted, because the UID was the same. But that may in fact just be a bug with how the mirror pods get created for static pods and not actually an issue with the UIDs.
A
Okay. Was there more that we wanted to bring up, Elana, or is that it?
C
There was just one more thing we wanted to check: that we have all the right reviewers on Clayton's patch. Sergey, do you know if Lantao could maybe take a look at this?
D
Yeah, sorry, I just replied down in the chat. I can take a look.
C
Great, yeah. You might want to take a look at the bug as well, and the discussion on the bug, and that should hopefully catch you up. I included links to, I think, everything relevant in the agenda, so you should be able to find that. Not a problem.
A
Okay, excellent. Well, thanks, Elana. The next item here was looking for reviewers on a new KEP.
G
I'm speaking for the KEP author; he put the KEP up a couple of months back, and the basic idea was that we want to introduce a way to reject admitting pods based on the node's properties.
G
So for us, the use case is that Fargate, for example, does not allow privileged containers. Fargate as a compute platform, or technology, does not allow privileged containers, and by the time the pod reaches Fargate it's too late. So we would want this both from a security point of view and from a usability point of view.
G
You would like to add handlers that look at pod properties, for example labels or the OS field, and then reject, either with a hard reject or a soft reject, depending on whatever it is.
G
We would want to try and target it for 1.24, but in general we want to see if this seems like a good idea, and we think there are other use cases that the community can also benefit from. Recently, for example, Windows introduced the OS field, but you could also have things outside the pod that you can exec into for different operating systems and determine different properties of different OSes. All the node has to do is install that particular plugin, regardless of whether it's exec or gRPC or whatever it is. I don't know if I summarized it well enough.
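As a hedged illustration of the kind of node-level admission check being described (rejecting privileged containers on a platform like Fargate), here is a small sketch. The type and function names are hypothetical, and the KEP itself proposes a pluggable exec/gRPC mechanism rather than anything hard-coded in the kubelet.

```go
// Illustrative sketch of a node-level admission check that rejects a pod
// when the platform cannot run privileged containers. The shapes here are
// hypothetical; they only mirror the general form of a kubelet admission
// decision.
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

// AdmitResult is the outcome of an admission check.
type AdmitResult struct {
	Admit   bool
	Reason  string
	Message string
}

// rejectPrivileged refuses pods that request privileged containers.
func rejectPrivileged(pod *corev1.Pod) AdmitResult {
	for _, c := range pod.Spec.Containers {
		if c.SecurityContext != nil &&
			c.SecurityContext.Privileged != nil && *c.SecurityContext.Privileged {
			return AdmitResult{
				Admit:   false,
				Reason:  "UnsupportedPrivilegedContainer",
				Message: fmt.Sprintf("container %q requests privileged mode, which this node does not allow", c.Name),
			}
		}
	}
	return AdmitResult{Admit: true}
}

func main() {
	priv := true
	pod := &corev1.Pod{Spec: corev1.PodSpec{Containers: []corev1.Container{{
		Name:            "app",
		SecurityContext: &corev1.SecurityContext{Privileged: &priv},
	}}}}
	fmt.Printf("%+v\n", rejectPrivileged(pod))
}
```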
A
I'll take a review, and we can make sure it's on the 1.24 queue, I guess. Unfortunately, I had missed this for 1.23, so I'll follow up on that, and we can iterate on the KEP between now and then. Is that fine?
G
Yeah, that's perfect. We were thinking the same; 1.23 was too soon for us anyway, so 1.24 would be perfect.
A
Okay. The thing I was trying to recall here was this:
A
I couldn't remember if the KEP author talked through the desired way to deliver this plugin: maybe as a static pod itself, or outside of a Kubernetes management model, maybe just as a daemon on the operating system, externalized. Was that discussed?
G
I'll take that question, because it's fundamentally a design question, and the initial idea, if I remember correctly, was that it would reside on the node itself.
G
Good question to ask. I know for a fact that Fargate does not support DaemonSets today, but still, I'll get back to you either here or on the KEP itself.
A
Okay, yeah, all right, well, excellent. I look forward to walking through that, and if anybody else wants to review that KEP or add their use cases, please do. All right, anything else we want to bring up on that while folks are on the call?
A
All right, Adrian, I think you're up next.
H
Yeah, I just wanted to mention that the checkpoint/restore KEP had one reviewer who was more or less in favor of it. I pinged Mrunal now, and he made a review today.
I
Okay. This is the stage one corresponding to your WIP, where you have your preferred concept right now, and if you can clean that one up, then I think it will be in better shape to merge. I took a brief look at your work in progress, your PoC PR; it looks good, yeah. It's as we had discussed.
H
Yeah, okay. I will talk to you offline about additional changes, and then we can probably proceed. And then I think Dawn was assigned as an approver, and we can talk to her.
C
Okay, yeah, let me jump in, because you mentioned that somebody had said it mostly looks good to them. I don't think that person is a Kubernetes org member, so it's good that they looked at it, but we need somebody who's an org member and has sign-off powers.
H
Okay. I kind of expected that the person who did the review is the one we added as a reviewer in the SIG Node 1.23 planning document, but I'm not sure.
A
So I think, Adrian, Dawn's obviously not here today to speak to her time, but I'm sure together you can make the increments that we needed. I don't think there was any major disagreement on the desire to get checkpoint functional, and we had use cases that we had discussed, for example a security analysis, where we all agreed checkpointing was useful. So let's just work together, if you can, on getting the updates that we all discussed, and then get Dawn to review as well.
A
Thank you, Adrian. All right. Then the last topic today looks like a presentation that Mrunal and Marcus were putting together. Do you two want to go ahead?
I
Marcus has been looking into splitting up pod startup latency. He's done some analysis, and he just wants to present what he has done so far and is looking for feedback. Marcus, do you want to take it away?
J
Right, yeah, I'm here. Good evening, everybody. Can I grab the screen share, or does anybody else want to?
J
Nice, okay, cool. Is the font size and so on good as well?
J
Basically, if you look at serverless systems, again as a quick reminder, they try to scale pods up and down as quickly as possible, even down to zero, which means that if you want to scale up from zero to something, the pod startup latency is going to be part of the latency the user sees. And the narrative in the different serverless platforms I've worked for thus far has always been that Kubernetes is too slow, it just doesn't work, and I got kind of sick of that narrative.
J
We are probing non-ready IPs from our routers to basically bypass the readiness checks of pods, and then there are a few ideas that are not implemented yet and that I kind of don't want to implement, because they are a lot of incidental complexity just to work around Kubernetes as a whole; in this case node-local scheduling and pod pre-warming. So the question I ask is: is that even necessary?
J
Can we improve Kubernetes itself to drop those hacks and not need them, and make the whole community better? I think yes, that is very much possible. I looked at it a little bit; there's a KEP for sub-second probe granularity somewhere. I couldn't find a link, but I know that somebody has written something up. We'll not care about that here, though.
J
For the sake of this presentation, I wanted to look at actual performance improvements to the existing behavior, for example in the kubelet, the CNI, the scheduler, and so on. I kind of randomly picked the kubelet because I was interested in how it works, so I just started there.
J
All of the pictures that you're going to see right now are taken on kind, so take them with a grain of salt. I'm mostly using the numbers to get ballpark-ish figures, and for now I am mostly reasoning about improvements that could be made by going through the code in theory, rather than actually measuring things; you'll see in a second why that somewhat makes sense. For now I've written two little tools, just for me to get an understanding of the kubelet, because the code base is somewhat complex.
J
One is pod-speed, which spawns pods with a defined schema, repeatedly, so it can generate some numbers from that. And then kube-tracer is what generates the pictures that you're going to see right now; it's basically just using the recently introduced structured logging.
J
It uses structured logging to fetch only the logs that are relevant for a certain pod, so I can get a look at what the pod startup latency looks like. One of the things that I noticed almost immediately, which you can see in this case, is that even just launching a pod attaches the default volume, the service account token, from the kube API.
J
That service account token volume has a 300 millisecond floor, and if you look through the code, there's an arbitrary 300 millisecond retry on a loop, which then spawns another loop which has an arbitrary 100 millisecond retry, and so on; go down that path and it basically adds up to this 300 millisecond floor. But the actual work takes only something like 10 or 30 milliseconds, if even that. So, yeah.
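A standalone toy that shows where those fixed retry intervals become a latency floor, using the apimachinery wait helpers; it is not the kubelet's volume code, just the pattern being discussed.

```go
// Illustration of how a fixed poll interval adds a latency floor even when
// the underlying work finishes in a few milliseconds: wait.Poll waits the
// interval before its first check, while wait.PollImmediate checks right
// away and can use a much shorter interval. Toy example, not kubelet code.
package main

import (
	"fmt"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

func main() {
	done := time.Now().Add(10 * time.Millisecond) // the "work" is done after 10ms
	check := func() (bool, error) { return time.Now().After(done), nil }

	start := time.Now()
	wait.Poll(300*time.Millisecond, time.Minute, check) // waits 300ms before the first check
	fmt.Println("Poll:         ", time.Since(start))    // ~300ms

	done = time.Now().Add(10 * time.Millisecond)
	start = time.Now()
	wait.PollImmediate(10*time.Millisecond, time.Minute, check) // checks immediately, short interval
	fmt.Println("PollImmediate:", time.Since(start))            // ~10-20ms
}
```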
A
Let me pause here for one second, because this slide is very good if my memory is right. Seth, I think you're on the call; earlier we had gone through and explored this, and I think at one point we saw literal sleeps in some of the volume management code.
A
I thought we had pulled all those out. I guess, is my memory bad on that?
K
No, I pulled out at least one where we were doing a poll that initially waited at least 100 milliseconds.
J
Yeah, and I think that's correct in this case too, because the initial check wouldn't be true. The problem in this case, and I haven't actually cooked up a solution for it yet, is how the volume manager works: it schedules jobs on pre-created runners, and the runners are basically just goroutines per pod with internal reconcile loops, which fire only once every 100 milliseconds, and then you eventually have to go through, I think, at least two steps.
J
One is "verify controller attached volume" and one is "mount volume", and they are individual jobs scheduled to that worker. I think the solution might be to introduce a signal: once that worker job is done, signal back so the reconciler immediately launches the second worker job, and once that one finishes, signal back to the top-level loop that something happened and it should double-check. Something like that.
J
I think, and I've also written it here, that there just doesn't seem to be a signal feeding back from these workers that the worker has actually finished. It just waits for it to be done through these retries, and there might be potential to tighten that up.
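A minimal sketch of the "signal back instead of waiting for the next tick" idea, assuming a single reconcile loop and one worker; it is an illustration of the pattern, not the kubelet volume manager itself.

```go
// Sketch: the reconcile loop normally wakes every 100ms, but a worker can
// poke it over a channel as soon as its job completes, so the next step
// starts immediately. Standalone illustration only.
package main

import (
	"fmt"
	"time"
)

func reconciler(poke <-chan struct{}, stop <-chan struct{}) {
	ticker := time.NewTicker(100 * time.Millisecond)
	defer ticker.Stop()
	for {
		select {
		case <-ticker.C:
			fmt.Println("reconcile (periodic) ", time.Now().Format("15:04:05.000"))
		case <-poke:
			fmt.Println("reconcile (signalled)", time.Now().Format("15:04:05.000"))
		case <-stop:
			return
		}
	}
}

func main() {
	poke := make(chan struct{}, 1)
	stop := make(chan struct{})
	go reconciler(poke, stop)

	// A worker finishes its job after 15ms and signals the reconciler,
	// instead of letting it wait out the rest of the 100ms interval.
	time.Sleep(15 * time.Millisecond)
	select {
	case poke <- struct{}{}:
	default: // a poke is already pending; no need to queue another
	}

	time.Sleep(250 * time.Millisecond)
	close(stop)
}
```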
A
I think it'd be good to also reach out to SIG Storage, because they're actually properly the owner of the volume manager; it's just that our code structure makes it intertwined.
A
Reach out to some of those folks as well to make them aware of this. What I was curious about was the use case of a representative pod. Obviously you're getting the service account token, so there's always at least one volume, but is the use case you were exploring largely pods with just secrets and config maps, or are you also focused on pods that need access to persistent storage generally?
J
That's a very good question. To be honest, I'm kind of working my way up from the simplest pod possible, and if I can find obvious problems, and this is kind of an obvious problem, I try to get those fixed first and then add more to the pods. The tool I've written, pod-speed, has basic pods that literally just have one container with one image, and that's it; and by the way, I'm assuming pre-pulled images too.
J
I'm kind of leaving image pulls out of the equation for now. It can also spawn Knative-style pods, which have two containers, one being a sidecar, with a readiness probe being set up, et cetera. So I'm working my way up from the simplest pod possible towards the Knative spec, and then maybe even further; as you mentioned, config maps and so on will be in the picture at some point as well.
J
As I said, it's not an exhaustive list of the things that we might do. The second item here is that, as you all may know, the kubelet is largely driven by the PLEG, and the PLEG has a one second timer that kicks it every second. I think syncPod is only allowed to proceed once the PLEG has updated its state, once it's updated the cache, and that causes some arbitrarily long latencies as well, as can be seen in this case here.
J
I don't actually have a very good... basically, it comes down to the same quote-unquote solution as the first one, which is to add signals to kick the loops earlier if we know that it's valuable to kick them. For instance, I've cooked up a very simple thing.
J
A very simple quote-unquote solution where, right before syncPod exits, I poke the PLEG to make it relist immediately, and that would drop this entire 700 milliseconds down to zero, at least for the very simple pod cases.
J
So that's one thing, and that's kind of the whole theme here: to make things a bit more event-driven rather than strictly timer-based, obviously only where appropriate.
A
Yeah, so the one thing on this one: I would love to get to an event-driven PLEG. What I'm wondering, though, is whether that will just shift the problem to the runtime. Is there anyone here from the runtimes?
I
So, on the runtime side, we could be faster, because we can use inotify to detect stops, and when we do a start, we can check right after the start. I think we can be quicker than one second.
D
I think containerd and Docker already support this kind of event today. Actually, when we worked on the PLEG at the beginning, we already looked into the Docker events and it actually worked. It's just that there was some actual work involved, and at that time our main focus was to reduce the kubelet and container runtime resource usage, and the relist was enough for that, which is why we didn't actually finish the complete event-driven PLEG.
D
But now, if the latency becomes a concern, I think it's possible to do that, and I think at least containerd supports it. We probably just need to define a gRPC-streaming-based event API in CRI and then get the events from the runtime.
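For a sense of what such a streaming event API could look like from the kubelet side, here is a purely illustrative Go sketch. The names and shapes are assumptions, not a proposal; the real API would have to be defined in the CRI protobuf and agreed on with containerd and CRI-O.

```go
// Hypothetical sketch of a streaming container-event surface in the CRI.
// The runtime pushes events as state changes happen, instead of the kubelet
// discovering them on the next 1s relist.
package crievents

import "context"

// ContainerEventType marks what happened to a container.
type ContainerEventType int

const (
	ContainerCreated ContainerEventType = iota
	ContainerStarted
	ContainerStopped
	ContainerDeleted
)

// ContainerEvent describes a single runtime state change.
type ContainerEvent struct {
	PodSandboxID string
	ContainerID  string
	Type         ContainerEventType
	NanoTime     int64 // runtime timestamp of the event
}

// RuntimeEventService is the hypothetical streaming side of the CRI.
type RuntimeEventService interface {
	// GetContainerEvents blocks and delivers events on the channel until the
	// context is cancelled; the kubelet's PLEG would consume this instead of
	// (or in addition to) periodic relisting.
	GetContainerEvents(ctx context.Context, events chan<- ContainerEvent) error
}
```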
A
Yeah, agreed. I know we're reaching a conclusion here, Marcus, but the other area where I know at least Mrunal and I have discussed that an event-driven PLEG would be useful is to bring down kubelet resource utilization.
A
And so there are definitely a lot of benefits to it, because the PLEG itself is a generator of a lot of garbage. I guess if we don't worry about the dockershim, the containerd and CRI-O maintainers should be able to do this, if folks from both communities are signed up to help drive it; it shouldn't take that long. As Lantao mentioned, containerd already has the events, and CRI-O is up for adding whatever we need to do to support this.
F
Yeah, or you could do a subscription model where you get the initial state and then updates to your subscription.
J
Yeah, and as I said, the very short-term solution I've cooked up right now is kind of kicking the PLEG when we know it should have something new, as in this case.
J
Maybe I can show the code here, if that works; let me make that a bit bigger. I literally just added a poke channel to the PLEG, so it can be poked from the outside, and then at the end of syncPod I'm checking if one of the actions was a container start; if it was, we kind of know that we changed the runtime state in some way and hence poke the PLEG to relist immediately. I'm not sure if we can also do a more focused list.
J
Something like "just give me that container" or "just give me that pod". But for the sake of showcasing it, this is brute-forcing a whole relist, and that kind of worked; it improved things. Which brings me to one of the more important bits: how do we actually surface that things improved, right?
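A standalone sketch of that poke-channel experiment, assuming a relist loop on a one second ticker; it illustrates the pattern, not the actual PLEG code in pkg/kubelet/pleg.

```go
// Sketch: the relist loop normally runs on a 1s ticker, but syncPod can poke
// it right after it knows it changed runtime state (e.g. started a
// container), so the new container status is picked up immediately instead
// of up to 1s later. Illustration only.
package main

import (
	"fmt"
	"time"
)

type pleg struct {
	poke chan struct{}
}

func (p *pleg) run(stop <-chan struct{}) {
	ticker := time.NewTicker(time.Second)
	defer ticker.Stop()
	for {
		select {
		case <-ticker.C:
			p.relist("periodic")
		case <-p.poke:
			p.relist("poked")
		case <-stop:
			return
		}
	}
}

func (p *pleg) relist(reason string) { fmt.Println("relist:", reason) }

// Poke requests an immediate relist; it never blocks the caller.
func (p *pleg) Poke() {
	select {
	case p.poke <- struct{}{}:
	default: // a relist request is already pending
	}
}

func main() {
	p := &pleg{poke: make(chan struct{}, 1)}
	stop := make(chan struct{})
	go p.run(stop)

	// ...syncPod just started a container, so poke the PLEG instead of
	// waiting for the next 1s tick.
	startedContainer := true
	if startedContainer {
		p.Poke()
	}
	time.Sleep(1500 * time.Millisecond)
	close(stop)
}
```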
J
I don't have that here. I have to rely on your guidance to tell me where best to do that, if there are already dashboards or performance tests that would reveal it.
A
Yes. I think, Marcus, when we had met before, if I'm not mistaken, I talked about how we had the node performance dashboard that maybe we could revitalize, right? That wasn't necessarily testing the representative pod that you're presenting here, I think.
A
You know, Mrunal and I talk a lot lately; he and I were having a conversation before this where it's like, I think we want to be able to get more prescriptive guarantees.
A
Ways that we can measure success. So what I'm curious about is this: the answer "I don't want to wait any time at all" isn't really the ideal answer, I guess, but is there a pod startup budget that you feel is right?
J
Right. So one definitive goal is to get sub-second, which, you know, you have to kind of specify on which platform, with which networking provider, et cetera; that throws a whole curveball into the equation. But I'm trying to get sub-second at, say, p95, something like that. So one second is the mark for me right now.
J
Good point. I don't have a pod-speed output here, but generally I'm measuring via the kube API, because that's what load balancers and so on would see, right, to configure networking and to actually use the pod.
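A minimal client-go sketch of measuring startup latency "as the API sees it", in the spirit of what pod-speed does (this is not that tool); the namespace, image, and KUBECONFIG usage are assumptions.

```go
// Measure time from pod creation until the API reports the pod Running.
// Assumptions: default namespace, pre-pulled pause image, KUBECONFIG set.
package main

import (
	"context"
	"fmt"
	"os"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", os.Getenv("KUBECONFIG"))
	if err != nil {
		panic(err)
	}
	cs := kubernetes.NewForConfigOrDie(cfg)

	pod := &corev1.Pod{
		ObjectMeta: metav1.ObjectMeta{GenerateName: "startup-latency-"},
		Spec: corev1.PodSpec{Containers: []corev1.Container{{
			Name: "pause", Image: "k8s.gcr.io/pause:3.5",
		}}},
	}

	start := time.Now()
	created, err := cs.CoreV1().Pods("default").Create(context.TODO(), pod, metav1.CreateOptions{})
	if err != nil {
		panic(err)
	}

	w, err := cs.CoreV1().Pods("default").Watch(context.TODO(), metav1.ListOptions{
		FieldSelector: "metadata.name=" + created.Name,
	})
	if err != nil {
		panic(err)
	}
	defer w.Stop()

	// Wait for the first event where the pod reports Running.
	for ev := range w.ResultChan() {
		if p, ok := ev.Object.(*corev1.Pod); ok && p.Status.Phase == corev1.PodRunning {
			fmt.Println("time to Running:", time.Since(start))
			return
		}
	}
}
```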
J
There's
one
optimization
that
we
can
go
for
if
we
run
into
a
wall
which
is
I'm
looking,
I'm
also
looking
at
time
to
ip,
where
I
think
it's
valuable
to
optimize
the
time
it
takes
to
get
the
ipd
into
the
ap,
the
pod
ip
into
the
api
server,
because
that
kicks
off
services
and
endpoints,
which
can
also
kick
off
load
balancer
programming.
And
we
already
have
that
thing
where
the
load
balancers
take
non-ready
ips
and
do
use
the
well-known
probe
to
find
out
if
the
thing
is
already
ready.
K
So when I was doing this analysis before, I would always base it off of the syncPod ADD event: from there to when the pod status is updated to Running, how long does that take? That kind of isolates it to the kubelet and to what we would have control over in SIG Node.
J
Right, yeah, I'm open to suggestions like that; that's fine with me, if it's measurable. The main problem is being measurable in a generic case, which is why I started with kube API visibility, because that means I can run the tool against any Kubernetes cluster and see what the latencies look like in general. But as you say, it doesn't rule out things like the API being slow, noisy neighbors, and so on.
J
I'm looking at best-case scenarios only for right now, because I have found enough potential for improvement in the best-case scenarios before looking at things like what happens if you start these pods in parallel, whether there is a lot of lock contention, and things like that.
A
Yeah, that makes sense. So what I'm wondering, for folks that might be on the call and reflecting on this a little bit, and at least I'll put my Red Hat hat on right now: there are particular resource budgets and maybe density goals that I'd love us to be able to get the community to achieve, which would be, at a particular pod density, how quickly can pods start and stop, and what amount of overhead do the kubelet and runtime add?
A
I think that's in the best interest of everybody, and so, if folks have usage of Kubernetes in their organizations where you have particular budgets that you're trying to meet, it'd be great if we could get them shared and talked through and just start measuring our success relative to those metrics. Some of the stuff I think we've talked about earlier around density, which I think we're starting to focus on with housekeeping intervals in cAdvisor and how we get metrics on pods, all just seems to go together.
A
Well, maybe that's my last point, which is: is there an average life expectancy for these pods that you're also trying to optimize for, or is it just the zero-to-start? Once a pod is up, is there not a churn rate that you're trying to build towards?
J
Yeah, at least not right now. I have noticed that if you create the pods too quickly, at least right now, the startup times somewhat deteriorate, and I was kind of attributing that to the housekeeping and so on.
J
I haven't dug deeply enough into that code yet, though, to know for sure. But as I said, right now it's the simplest pod possible and starting it: getting it ready, getting it to serve, and that's it for now.
F
Yeah, no, I think... doesn't the Knative model also include keeping a pod up for an extended period of time, running containers for the services requested?
J
And yeah, maybe one sentence towards that: I do know that the kubelet code is somewhat delicate, sometimes a bit hard to change, so I am looking at the least impactful approach possible, also with regards to what you mentioned before.
J
I think it's not an option to just make the PLEG quicker, like making the interval shorter or things like that; I think that's kind of off the table. I'm looking at actually saving time and work, versus making more work to achieve a better latency. Yeah.
F
Derek, Dawn, in case you didn't see it, Lantao posted a link to an old event-driven stream idea for Docker from 2015.
J
Right, I'll go look at all of these. If anybody has a pointer to the dashboard that you mentioned earlier, that'd be cool; I guess we could add to it.
J
I don't know, maybe add the simple pod to it, if it's not already a very simple pod, and see how we can get things measured, so that I'm not shooting into the dark and you don't have to trust the numbers that I'm posting. Other than that, this was mostly supposed to be an introduction, so you all know who I am, where I'm coming from, and that I'm going to push on this, or planning to push on this, if you all think it's useful. So, yeah.
A
All right, well, thanks so much, Marcus, and if you all do get together, it would be great to just make sure we all can participate. I look forward to hearing what might come out of some of those discussions, but I think it'd be awesome if we can start looking towards an event-driven PLEG in the upcoming 1.24 release.
A
I think that got us through everything on the agenda. Were there any other topics that folks wanted to raise? Otherwise, I'm happy to give back 10 minutes.
C
I just wanted to give folks a quick shout-out: we had the KEP deadline last week, and I was really impressed. I think we got everything merged and ready to go 24 hours before the deadline, except for the one thing I think we ended up dropping. So that was great.
A
All right, hopefully we're just as good on the execution side of our plans. Yeah.
C
I'm going to send out a follow-up email. We had discussed having a sort of soft deadline to ensure that we have a bunch of PRs up and are able to track stuff through the release, so I'll send out an email talking about how we got these beta KEPs in and expect them to merge approximately around this date, and likewise for all the alpha KEPs.
A
That's a good positive note to end the meeting on. Thanks, everyone, for joining today, and we'll talk to you all next week. Bye.