From YouTube: Kubernetes SIG Node 20200310
Description
Meeting Agenda:
https://docs.google.com/document/d/1j3vrG6BgE0hUDs2e-1ZUegKN4W4Adb1B6oJ6j-4kyPU
A
B
Hi. So we found a race condition in the kubelet that could cause a pod worker to stop functioning. We sent a fix for it, but I want to bring up a bigger topic that arose from it. We think the bigger problem here is that when a goroutine in the kubelet crashes, say in this case a pod worker goroutine crashed, the kubelet itself doesn't crash.
B
It recovers and keeps going, which makes this kind of issue hard to debug, because in this case the pod worker for a specific pod stopped doing any updates to the containers for that pod. But everything still looked fine, until you found out that, for example, the container actually crashed and the kubelet never restarted that container, and that may happen days or weeks after the panic happened.
B
So I think it's better for the kubelet to crash when a sub-goroutine crashes, to make debugging easier. I looked a little bit into the history of Kubernetes to see why we have this recovery behavior, and I put some links down below in the background. Actually, in 2016 we changed the behavior for all components in Kubernetes from recovering from the panic to actually panicking, and if you click into the 2016 link, I think we had some agreement about the kubelet as well.
C
D
How was the issue reported? I think this was discussed before; there was a PR, and the bug had been reported before as well, and I believe people looked into that one. They couldn't reproduce the problem, so there was a big discussion about whether to crash the kubelet or not crash the kubelet, similar to the discussion we just had. My understanding is that the current problem is this.
D
It appears when the pod worker goroutine crashes and the kubelet keeps working as normal, and fixing it would require the kubelet to self-recover: the kubelet would need to start a new goroutine and reconcile the state. That's the major problem. I don't recall exactly why, but because of its nature, if the kubelet crashes, we just start from scratch and then reconcile the whole state.
D
D
D
This is particularly problematic. Once you crash, the kubelet will recover and everything moves forward, but our concern is whether we really want to introduce that kind of fail-fast model into the kubelet itself, because we used to crash the kubelet too. If I remember correctly, we used to have something like the docker container ID where we crashed the kubelet a couple of times, and with the panic we ended up with a series of kubelet crash loops, and then it could not really come back, or even if it did come back...
D
B
The thing is, I can also check. My original proposal is that we change the behavior of this HandleCrash function. I think we can do a search in the kubelet code base to see where it uses this function and recovers from a crash. In this case, we found that it's used before the pod worker goroutine, and I think we should crash in that case. But we could do some research and gather in which code paths we have this kind of issue and categorize them, like in this case.
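For context, a minimal, self-contained sketch (not from the meeting) of the recovery pattern being discussed, assuming the real k8s.io/apimachinery util/runtime package; the worker names and the explicit ReallyCrash setting below are only there to make the example deterministic, while the proposal is about auditing which real call sites end up with this swallow-the-panic behavior:

```go
package main

import (
	"fmt"
	"time"

	utilruntime "k8s.io/apimachinery/pkg/util/runtime"
)

// startWorker launches a goroutine guarded by HandleCrash, similar to how
// several kubelet loops are started. HandleCrash recovers the panic, runs the
// registered panic handlers, and then either swallows the panic or re-raises
// it depending on utilruntime.ReallyCrash.
func startWorker(work func()) {
	go func() {
		defer utilruntime.HandleCrash()
		work()
	}()
}

func main() {
	// Force the "recover and keep going" behavior under discussion so the
	// example is deterministic.
	utilruntime.ReallyCrash = false

	startWorker(func() { panic("simulated pod worker bug") })

	time.Sleep(time.Second)
	fmt.Println("process is still alive even though the worker panicked")
}
```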
B
D
B
So it's also possible, but that's basically changing the place where we have this recovery function, where we detect that the goroutine has crashed. Well, maybe there's another tracking mechanism to check, say a heartbeat for a goroutine, and create another one when the old one is gone.
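As a rough illustration of that idea, and not a proposal for the actual kubelet code, here is a small supervisor sketch that restarts a worker goroutine whenever it exits or panics (all names are made up):

```go
package main

import (
	"fmt"
	"time"
)

// supervise keeps a worker goroutine alive: whenever the worker returns or
// panics, it is restarted after a short backoff. This is the "detect the old
// goroutine is gone and create another one" idea from the discussion.
func supervise(name string, worker func()) {
	go func() {
		for {
			done := make(chan struct{})
			go func() {
				defer close(done)
				defer func() {
					if r := recover(); r != nil {
						fmt.Printf("%s panicked: %v\n", name, r)
					}
				}()
				worker()
			}()
			<-done // worker exited, normally or via panic
			time.Sleep(time.Second)
			fmt.Printf("restarting %s\n", name)
		}
	}()
}

func main() {
	supervise("podWorker", func() {
		panic("simulated crash")
	})
	time.Sleep(3500 * time.Millisecond)
}
```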
B
A
D
Okay, there was one case: when docker is stuck and not responding for a container, we used to crash, but then later we changed that because it caused a lot of other side effects. So instead we record a signal and say we want the kubelet to terminate and restart, right, so we ask the kubelet to crash itself, but we don't artificially crash the kubelet immediately, because we still want the kubelet to report, we want the kubelet to keep reporting what is happening.
D
And then there's the node: if we want to say whether this node is healthy, the kubelet is the key for the node, so we want the kubelet to respond so we know the node's healthiness. Then there are a lot of metrics that are exported through the kubelet, so that's kind of how we made the decision. Of course, all those things made total sense at a given time. Now Kubernetes is totally different, it has evolved a lot, so we could change it, but my main concern is this.
D
In this case it's definitely helpful, but then we start to reintroduce that crash-the-kubelet mode, and there are many other places that could crash the kubelet. It could be that one module introduces a new mode, then another introduces a new mode, and then we end up in a crash loop. That's my concern. A lot of people say crashing may also be okay, because the kubelet can recover on restart, but my concern is certain features inside the kubelet.
D
We clearly don't want the kubelet itself to checkpoint, but the kubelet is also extensible, it has a plugin model, so there are components plugged in and invoked by the kubelet that actually do some checkpointing. One of my concerns is that, even if the kubelet crashes itself, we should at least be able to detect whether those checkpoints got corrupted. The checkpoint handling right now is more in a plugin mode, the same as the kubelet's.
D
A
A
E
D
A crash is really easy for people to spot. When we looked into these kinds of problems before, we did notice that a goroutine crashed and there was a state mismatch, and a lot of the time we end up suggesting that the user or customer delete that particular pod or deployment instance. But that's the problem: the pod actually appears to be running successfully.
D
The pod is in the Running state, but the container could actually be in a bad state that isn't detected, because the pod worker just crashed, and that is much harder to detect, right? So this is really, yeah, thanks for finding this bug. I just want to say that first, because we know this class of problem.
A
F
We should definitely make sure that there's a metric for the number of times the kubelet is starting, surfaced up through the components. Anybody who has a Prometheus system should have an alert for that today; I'm pretty sure that doesn't exist, because they've never seen it fire. This would be an obvious one: if the kubelet is starting multiple times at a rate higher than, say, one every ten minutes, or one every hour, something is wrong.
D
Yeah, so how many times... I forget which metric it is, but basically that is what is used to detect this kind of problem. Anyway, I just want to throw out one of my major concerns here: each component which is plugged into the kubelet, in the past, did checkpointing, and those checkpoints, due to a kubelet crash, ended up corrupted, and then that corruption caused the kubelet to not really be able to start again.
A
I forget... basically everyone in the world is probably launching the kubelet on a Linux host under some systemd unit that's going to have some restart-on-failure policy. I can't recall right now if there's a simple way to figure out from systemd how many times it has done that restart; I want to go check that. But assuming anybody in production could wire that into their monitoring system, I think that's probably easier than taking on a new problem.
A
B
G
D
We can quickly discuss it. I think what you really want is this problem exported to the cluster, which would say: oh, this kubelet is in a crash loop. And I'm not sure about systemd.
D
I know systemd has the restart count, but today I don't think... well, they do have the interval between each restart, but they don't have the: how am I going to measure whether this is in a loop?
D
Whether this is a crash loop, like the total interval and how frequently the kubelet restarted in it, and then, I think, well, you need attention here, so let's alert on that signal. I think that's what admins want, and I know the node problem detector: there we detect a problem based on a given total interval and how frequently you restart the kubelet or docker, and then say whether this node is okay or not okay.
A
H
H
So, basically, what we're talking about is a way to make it possible to share limits across containers in a single pod. A little bit about my motivation for proposing this: I'm working on a product called Business Application Studio at SAP, which provides a development environment, an IDE, as a pod, so each user basically gets his own pod that simulates a full development environment in our product.
H
So if one user wants to compile using one tool and another user compiles using a different tool, they would probably benefit from different limits and different memory requirements for each of the containers providing those tools. And what this is about is enhancing or extending the way that Kubernetes sets up the cgroups for the pod, so that instead of setting the resources on the container level, it would be possible to set the resource limits on the pod level. There is already a cgroup for each pod.
H
Basically, if we look at what exists today, every quality of service class, Guaranteed, Burstable, and BestEffort, has its own cgroup, and under that there is another cgroup for the pod itself, and under that one for each container, including the pause container. There is an additional cgroup there, and the limit is actually set at that level. For Guaranteed, it already sets the limits on the pod level today as well, but it doesn't actually matter, because there is no way you could escape the limit that is set on the container level.
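For reference, a small sketch of the hierarchy being described; the exact cgroup names come from the kubelet's cgroup manager and depend on the cgroup driver, so the literal paths below are illustrative only:

```go
package main

import "fmt"

// podCgroupPaths prints an illustrative cgroupfs layout for a burstable pod:
// a per-QoS-class cgroup, a per-pod cgroup under it, and one cgroup per
// container (including the pause/sandbox container) under the pod.
func podCgroupPaths(podUID string, containerIDs []string) []string {
	pod := fmt.Sprintf("/kubepods/burstable/pod%s", podUID)
	paths := []string{"/kubepods", "/kubepods/burstable", pod}
	for _, id := range containerIDs {
		paths = append(paths, pod+"/"+id)
	}
	return paths
}

func main() {
	for _, p := range podCgroupPaths("1234-abcd", []string{"pause", "ide", "compiler"}) {
		fmt.Println(p)
	}
}
```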
H
A
H
So basically, my proposal is to allow the user, if they so decide, to specify to the kubelet that for specific pods there should be a limit set on the pod level, even if it was not set on each and every one of the containers in that specific pod, and this is opt-in behavior. So if the user wants to keep the current behavior, that's okay, no change.
D
Yeah, I understand where you're coming from, but think about why, as you commented, it would be easier for the user to set that limit or request at the pod level instead of at the container level. The way we see it, each container actually represents a binary or an application, so it's much easier to size it through the container cgroup.
D
For me, it's much harder to size it at the pod level, because the pod-level resource usage is not always controlled by me as a developer. I want to roll out a certain application, and I want to benefit from all the functionality and services Kubernetes gives me, and when I do the deployment I could have some other containers, ones not under my control, sharing the same pod. So from that perspective it's much harder to size it at the pod level. I just want to... yes.
H
I completely agree; I will get to this. I have a slide, I think slide 294, to discuss the two alternatives for implementing it. Let me just get to that. Okay, the advantages of allowing these pod-level limits: well, for me it makes sense because it means I don't have to micromanage the container limits in my specific use case.
H
And I don't need to give unlimited resources to specific containers, because the current behavior is that if I don't specify a limit for all of the containers, then basically those containers that don't have a limit are unlimited, and that also causes some sort of noisy-neighbor problem. So pods in the Burstable quality of service level that have containers which are not constrained are actually able to consume all of the resources that belong to the host, and this would make it possible to prevent that. Yes.
A
H
That is true, but still there would be contention between the pods that are running on the node. Even if the kubelet itself or the system itself would be able to continue working because they have their own slice, the Burstable pod would still be able to consume more resources than even a Guaranteed pod living on that same single node.
A
H
H
I
For the pod cgroup, well, it sets up the cgroup that the containers run in, and it'll add the overhead into it, because instead of just running those individual containers, you could be running pieces of CRI-O or containerd, perhaps, or you could be running a virtual machine and using that to isolate instead of just namespaces. So essentially the kubelet would go in and size the pod cgroup according to the sum of the requests plus that overhead, and then it'll update the CPU shares as well, and...
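A simplified sketch of what is being described here, not the kubelet's actual implementation: the pod-level cgroup size is derived by summing the per-container values and adding the RuntimeClass pod overhead. The type and field names below are made up for illustration:

```go
package main

import "fmt"

// Resources is a simplified stand-in for a container's (or the pod overhead's)
// CPU request in millicores and memory limit in bytes.
type Resources struct {
	CPUMilli    int64
	MemoryBytes int64
}

// podCgroupSize mirrors, in rough form, the behavior described above: the
// pod-level cgroup is sized to the sum of the per-container values plus the
// RuntimeClass pod overhead (sandbox/VM/runtime processes that live in the
// pod cgroup but outside any container cgroup).
func podCgroupSize(containers []Resources, overhead Resources) Resources {
	total := overhead
	for _, c := range containers {
		total.CPUMilli += c.CPUMilli
		total.MemoryBytes += c.MemoryBytes
	}
	return total
}

func main() {
	containers := []Resources{
		{CPUMilli: 500, MemoryBytes: 256 << 20}, // e.g. an IDE container
		{CPUMilli: 250, MemoryBytes: 128 << 20}, // e.g. a compiler container
	}
	overhead := Resources{CPUMilli: 100, MemoryBytes: 64 << 20} // e.g. VM runtime overhead
	fmt.Printf("%+v\n", podCgroupSize(containers, overhead))
}
```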
H
D
We don't need to figure out the interaction between that feature and this one yet, but I think we treated the overhead as the tax you have to pay once you install that runtime and use that feature. It's kind of a tax that all the pods have to pay. So I think, in kubelet terms... yeah.
I
D
I also think the pod overhead is actually what makes this pod-level limit possible, because we talked about pod-level limits many times before, and one proposal actually stemmed from the pod overhead work. That's when we started to talk about virtual machines and about different container isolation technologies, all those kinds of things; when we think about overhead, it's harder to...
D
I
Yeah, I was going to say that I'd heard a lot of other people as well describing a desire for pod-level limits. I think the difference is that they don't want to care about anything with respect to container resources. So whereas I see this KEP as kind of talking about Burstable specifically (for Guaranteed it doesn't really matter, it's already all set up), I think the other end of the spectrum is where they don't want to specify anything, and they want to have it set up correctly.
I
H
So maybe I'll get to the next two slides, which are the two different implementation options, and then maybe we can discuss exactly what the best way to implement this would be. Obviously the first implementation option would be to put this on the pod level, something like this: a resources section on the pod itself, and then you would limit it however you want. But actually, I agree with what was said here before.
H
It's not that convenient, because that would mean that I would need to calculate all of the limits myself, and I would need to somehow figure out what the correct limit for the pod is. In cases where you have something like Istio that is adding additional containers to my pods, I'm not actually in control of that and don't necessarily know how to deal with it. Also, if you upgrade Istio and suddenly the container limit that it injects is different, then everything breaks. So I'm not really a fan of this option.
H
I was thinking of something a little different. What I was thinking was to set a boolean on the pod level. I'm open to suggestions for the name, I'm not married to this name, it's just kind of hard for me to come up with a more descriptive one. Basically, what this would do is set up the cgroups in the following way.
A
H
It would not impact this, because we would sum that limit and add it into the cgroup on the level of the pod. So if it says that it needs another, I don't know, 128 megabytes, then that is in addition to all of the limits that are defined on the rest of the containers that are actually in my pod, okay.
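To make the option being described concrete, here is a hypothetical sketch; none of these field names exist in the Kubernetes API, they only illustrate the "opt-in pod-level amount on top of the container limits" idea:

```go
package main

import "fmt"

// hypotheticalPodSpec is NOT a real Kubernetes API type; it only illustrates
// the option under discussion: an opt-in pod-level amount that is added on
// top of the per-container limits when sizing the pod cgroup.
type hypotheticalPodSpec struct {
	SharePodResources bool    // the proposed opt-in boolean (name is made up)
	ExtraPodMemoryMiB int64   // extra shared memory for containers without limits
	ContainerLimitMiB []int64 // per-container memory limits; 0 means "no limit set"
}

// podMemoryLimitMiB sizes the pod-level memory cgroup: the sum of the explicit
// container limits plus the extra shared amount. Containers without their own
// limit are then only bounded by this pod-level value.
func podMemoryLimitMiB(p hypotheticalPodSpec) (int64, bool) {
	if !p.SharePodResources {
		return 0, false // opt-out: keep today's behavior, no pod-level limit added
	}
	total := p.ExtraPodMemoryMiB
	for _, l := range p.ContainerLimitMiB {
		total += l
	}
	return total, true
}

func main() {
	spec := hypotheticalPodSpec{
		SharePodResources: true,
		ExtraPodMemoryMiB: 128,
		ContainerLimitMiB: []int64{256, 0, 64}, // the middle container has no limit
	}
	limit, ok := podMemoryLimitMiB(spec)
	fmt.Println(limit, ok) // 448 true
}
```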
A
A
H
So I think, if we abstract it on the runtime level, then it becomes a lot harder to use, because that would mean that if I want to set a budget or a limit, and a different budget or limit for different pods, I would need to define a RuntimeClass to go along with each different budget or limit, and that would create an explosion of RuntimeClasses. Also, you might need to start giving permissions to developers, which is not necessarily a good idea.
A
I
So today we just do a summation of the containers, and that's what we set it all to. Well, actually we don't; on the runtime side we just take and mimic what is done by the kubelet already for that initial pod cgroup value. If it was explicit, you know, if the pod spec said that this is what the pod-level resources are, that'd be great.
J
J
H
J
H
A
The other question I have is: what should the OOM killer do? How do I decide which container to kill when one is consuming too many resources with this model? Do I do anything differently? Today we set the oom_score_adj relative to usage versus request, and I'm not sure you would want those same semantics when sharing resources across your containers. Had you thought about that? I did not...
H
A
When we previously discussed this, David, I feel like I've always been interested in overcommitting the pod itself, and so, yes, being able to set requests and limits on the pod was useful if I wanted to overcommit it. But if that option isn't one you want to deeply pursue, then maybe just focus on option two.
I
If we're going to go ahead and do this, it would be nice to be able to do something like best effort, which I think option two isn't necessarily going to be able to cover. And, naively, not having thought about this nearly as much as you, I would think option one would require a decent amount of validation.
I
Checking, you know, what the pod level is versus what the different container requests are, making sure that one is bigger than the other. But for me that would maybe be the most useful, so that the end user can just go ahead and say: all I care about is setting the pod, and the workloads can figure it out, I don't really care. You know, if you imagine handing out a pod to somebody who's doing different container builds or testing or things like that.
I
H
There has to be at least one container that sets limits, because otherwise you're in the BestEffort quality of service class. Oh, I thought that was what... oh, so you mean, sorry, just so that I understand: that instead of being in the Burstable quality of service class, we move this towards BestEffort and then have the resources set on the pod level, or...
A
C
H
It would have basically no effect in that case; it would be a definition that you would need to validate, you know, when the pod is being checked to see that everything is okay. In the end, it would mean that the developer would need to verify in his head that the sum at the pod level also matches the specific values in each of the containers. It's additional development overhead, and then...
A
C
D
I do have a concern. There are a couple of concerns I have, but I just want to name one, because I haven't thought the others through clearly yet. So, one concern I have is this: for the use cases you described, I think it's totally fine with option two, but there are other use cases where the user really knows what the containers are and what each container's resources should be, so they want to carefully think about what the peak usage is.
D
D
Well, I don't know, and also if they don't set a specific limit, in my thesis you could end up with, say, a sidecar container in that application using it all: because you marked this as a shared pool, the limit is potentially tied to the pod, and a helper container or whatever container could be using all of the usage, all of the limit in that pod. So to me there are potentially even worse priority issues here.
D
So I carefully define my container, like the web servers and all those kinds of things, and I measure and say: okay, the average usage is, say, one gigabyte, that's what I set as the request, and then the peak usage is the limit, and I know my workload well as the one who designed it. So now somebody provides something and injects it into my pod for the deployment, but they couldn't do the same, because they have no idea: they provide infrastructure or services for everybody, so they cannot predict what the real usage is.
D
So they are unlimited and could be using most of the resources, and then that's back to what Mike Brown actually mentioned in those cases: a lot of the time it's not easy to release the memory once it's been burned through. Certain types of memory cannot even be reclaimed, so it always gets charged up to the pod, even to the pod-level cgroup, so even after a container dies, it's not like the resources are freed; you see there's certain slab memory.
D
It's hard to release it, and it goes up to the pod level, and how do you destroy it, it goes up to the root level. So a lot of the time, if you measure carefully, you can see that slab usage keeps increasing, and you end up having to force the node to reboot to release that kind of memory. In a case like that, it could really hurt you even when you designed your application well. I just want to surface this one; this is one of the concerns.
H
H
A
That was going to be my follow-on question, which was: do you want scratch space to be shared across containers, like ephemeral storage, in addition to CPU and memory? If you were using memory with an emptyDir backed by tmpfs, then you probably would have hit the situation that Dawn talked about. And if you're using ephemeral storage, either scratch space in the container or a local volume or something, it wasn't clear if you want the scratch space to be shared, because I could see an argument that that's also hard to size on a per-container basis.
A
So how is it, is that also a cgroup, or something different? It's not enforced at a cgroup level, but there is a loop in the kubelet that enforces ephemeral storage. So, in the interest of time, I know there were other topics and we've spent a lot on this, maybe we could give a quick chance to let others get their items raised, whether it's quick reviews or... I see someone is also typing a huge number of notes here.
A
D
I want to say thank you for bringing this up, and it's good timing to revisit this and iterate on those things, but I think this is a big topic and a complicated problem actually, and we cannot draw conclusions now, so we need to carry on the discussion. Just wanted to make sure you know. Thank you, thanks.
C
Yep, just to finish up: for beta, the last one, with the recent merge we did break the Windows platform, and what I'm wondering is: is there a CI job or something that we should have run to prevent this in the future? I'm sitting there wondering, how did we miss this and let this in, and...
A
A
A
C
This is where the pod admit handler was nil. Windows is returning nil for the handler, and at some point it dereferences it and blows up, so the kubelet fails at startup. They've got a pull request over there and there have been some comments. So the other ask is: okay, we broke them, let's maybe help get it in.
G
L
F
So I tried to capture all of it in the agenda. I've just been spending time on it; as I was looking at something in the kubelet, it caught my interest around the status manager, since we've gone in and touched the status manager. I think there's a lot of room on the table for improving the end-to-end latency from the time we detect that a status update is necessary in the kubelet to the time it's written to the API server, especially on busy nodes.
F
So one of the things is, we kind of have a really simple model: here's the stream of updates that need to happen, and then periodically resync everything. That's a good mechanism. The problem is that after a certain point, because it's a single-threaded process, when you get to more than about ten pods on a node, there's a good chance that you're just in the reconcile behavior continuously, which isn't bad, but it maximizes the p99 latency, which is pretty bad. So I was looking at a couple of simple changes, getting familiar with the code.
F
The big thing was that there's definitely some room for reorganizing the code just to make it more readable. While doing that, Jordan and I caught a couple of bugs that are, as we've evolved, some obvious things in it that we can fix. I talked to David Ashpole and he gave me his brief on what he'd done before; there were probably like four or five things we could look at.
F
I did some simple ones, which is trying to avoid the live GET, which is actually very expensive in big clusters, because reads are blocked on range locks today in some cases. So a big cluster with lots of nodes doing live reads of pods all the time is very bad, and there's not actually any safety when we do a live read, because in between when you do the live read and when you generate a patch and send it back to the API server, multiple other writers could have written.
F
So I think that right there is probably the biggest win. Another one I was looking at was prioritizing which updates go out: the updates that really matter are when a pod becomes ready or not ready, and when a pod transitions to Succeeded or Failed and is ready to be cleaned up. Right now I'm just prototyping and I will pull this together and write it up, but I think there's a ton of low-hanging fruit.
F
That should help us cut the p99 of status updates and pod end-to-end times pretty significantly. It certainly may not show up in the overall numbers, but at least in an e2e run, just as a micro-benchmark, I was able to cut the total time spent waiting to send a status update to like a fourth or a fifth of what it was before, just with some basic improvements, so I think there's some more room there too.
F
So we can get tighter tolerances on: something changes, and how quickly can we get the API server to reflect that? Did you mean to say 800 seconds? So if you sum, in an e2e run, how long we spend waiting from the time we detect that a status update is necessary to the time where we send it...
F
It's 800 seconds over the whole run. With some basic improvements, which I think are still sound from a correctness perspective (and obviously we have to be really careful, because we've had lots of issues with stale reads; the thing is, the live read doesn't prevent stale writes either, the way we're doing it, so we still have some issues there), I was able to get down to about 200 seconds of waiting, and that was just by avoiding the live read and using a recency cache.
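As a rough illustration of the recency-cache idea, and not the actual prototype being described, a small sketch of remembering the most recently observed copy of each pod so that computing a status patch does not require a live GET; the types and names are made up:

```go
package main

import (
	"fmt"
	"sync"
)

// observedPod is a trimmed stand-in for the data a status writer needs as a
// patch base: the last version of the pod this process has seen.
type observedPod struct {
	ResourceVersion string
	StatusPhase     string
}

// recencyCache remembers the most recently observed copy of each pod, for
// example fed from informer events and from our own successful writes. A miss
// (or a write conflict) is the signal to fall back to a fresh read.
type recencyCache struct {
	mu   sync.RWMutex
	pods map[string]observedPod // keyed by pod UID
}

func newRecencyCache() *recencyCache {
	return &recencyCache{pods: map[string]observedPod{}}
}

func (c *recencyCache) record(uid string, p observedPod) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.pods[uid] = p
}

// patchBase returns the cached copy if present; ok=false means the caller
// should do a live GET (or requeue) before sending the status patch.
func (c *recencyCache) patchBase(uid string) (observedPod, bool) {
	c.mu.RLock()
	defer c.mu.RUnlock()
	p, ok := c.pods[uid]
	return p, ok
}

func main() {
	cache := newRecencyCache()
	cache.record("uid-1", observedPod{ResourceVersion: "42", StatusPhase: "Running"})
	if base, ok := cache.patchBase("uid-1"); ok {
		fmt.Printf("patch against resourceVersion %s instead of a live GET\n", base.ResourceVersion)
	}
}
```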
F
F
It doesn't even matter, though, because the thing is, the current logic somewhat depends on a live read, and that doesn't guarantee you anything, because in between the time the read completes, someone else can go write it. So doing an inconsistent operation after a consistent operation doesn't protect us; we just have to be sure we're consistent with ourselves. The other part of this is...
F
As part of this, we need to add a lot more e2e tests about multiple writers explicitly. On the previous work I've done for pod termination, just having an e2e test that exercises pod termination found five or six issues. I think we need to have a multi-writer e2e test (the kubelet plus another writer of pod status) that specifically stresses this, and I'm sure we'll catch stuff today that we still need to fix.
F
There was definitely... so I flagged something for Mrunal to look at with CRI, which is that the CRI doesn't return typed errors. So in a lot of error cases we do the worst possible thing, like when you try to start a pod that's already been deleted, or start a container that's already been deleted.
F
The error message says this container doesn't exist, and in a teardown case we could probably cut one or two seconds off a lot of teardown loops because of that, until, you know, the PLEG catches up. But because we don't have a typed error, we just do the naive thing, which is correct, which is retry later. There's probably a big win on pod teardown: if we can get typed errors back from CRI for not-found, that alone would probably save a lot per pod.
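Since the CRI is gRPC-based, the natural way to get typed errors is gRPC status codes; a minimal sketch of that idea (the function here is a made-up stand-in, not an actual CRI method, and this is not the kubelet's code):

```go
package main

import (
	"fmt"

	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// removeContainer stands in for a CRI runtime call. Today the error that
// comes back for a missing container is effectively an untyped string; the
// suggestion in the discussion is for runtimes to return a typed error
// instead, e.g. a gRPC status with codes.NotFound.
func removeContainer(id string, exists bool) error {
	if !exists {
		return status.Errorf(codes.NotFound, "container %q does not exist", id)
	}
	return nil
}

func main() {
	err := removeContainer("abc123", false)

	// With a typed error the caller can treat "already gone" as success in
	// teardown loops instead of retrying later on a generic failure.
	if status.Code(err) == codes.NotFound {
		fmt.Println("container already gone, nothing to tear down")
		return
	}
	if err != nil {
		fmt.Println("transient error, retry later:", err)
	}
}
```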
F
F
And the eventing is actually pretty good; it's just the interactions, like the pod worker interacts poorly with the status loop and vice versa, I think. One of the things we can do is look for places where the information flow is just bad. The nice thing was that most of this was pretty obvious.
F
Just looking at the kubelet tells me that nobody's really sat down and put the kubelet through its paces on some of these, like: what happens if I delete and recreate, or create and delete, a pod immediately, over and over? A bunch of issues jump out. Getting more e2e tests in that verify those kinds of flows will put us in a much better spot in general. Yeah, we test it in CRI, but it makes sense at the e2e level as well. Yeah, and I...