From YouTube: Kubernetes SIG Node 20210525
Description
Meeting Agenda:
https://docs.google.com/document/d/1j3vrG6BgE0hUDs2e-1ZUegKN4W4Adb1B6oJ6j-4kyPU
A
Good morning, everyone. Today is May 25th and this is our weekly SIG Node community meeting — welcome, everyone. We have a full agenda today, so let's start our meeting as usual. Sergey, do you want to... Sergey, are you around? Do you want to update us about the PR status and also the triage status?
B
I think Sergey might not be on the call, so I am happy to jump in on his behalf. I don't have the numbers from the script for the meeting notes, but anecdotally I can say that we've been merging some PRs. I've certainly been—
B
Oh, I see — sorry, my wi-fi is a little spotty. I was just saying that PRs seem to be moving along well; the triage column is mostly empty. We definitely still need more reviewers and more help with triage, but velocity seems pretty good right now, and the more folks who get involved, the more we can burn down the backlog, because we still have lots of PRs that are waiting on reviews, even if things are mostly chugging along.
A
Thanks, Elana. So maybe we move to our agenda, and the first item actually is yours — do you want to talk about it?
B
So I put this one on the agenda and put together a doc, because we have been chasing down a lot of race conditions in the kubelet and issues with pod spin-up and deletion, and I've been chatting with Ryan and Clayton — and Clayton is here. Hey Clayton, do you want to jump in and talk about some of the stuff from the doc that we put together?
C
Yeah — Ryan and Elana and I have been chasing some of these for a while. We had a bug two years ago where you could have a pod that always failed: you could have a job that returned exit code one all the time, and then very rarely it could actually return zero. That was due to a race condition in kubelet status reporting. We chased some stuff and eventually had some feedback.
C
We knew there was another race somewhere else, so we kind of put in a best-effort check that handled it. The test case that we put in to simulate that actually creates and deletes — it creates a pod and immediately tears it down, and the container is supposed to return zero, and so the test was checking for zero.
C
We had a couple of places where that would hang or fail for a long time, so it was like, oh, you know, there's something weird going on, but it was really rare. As I was digging into that last week, I realized the actual issue is: what would happen, just because of the way teardown happens, is that the various bits of the kubelet that ask "hey, is this pod terminated, so I can start cleaning stuff up?" were actually vulnerable to a race condition where the first container hadn't been started.
C
So there was no strong synchronization in the kubelet that allowed you to know that the pod has transitioned. There are basically three phases of a pod's lifecycle: it doesn't have any containers yet; once the kubelet sees it, it can start having containers; and then, when it's stopped or evicted or, you know, shutting down, you need to know that no more containers can be created.
C
From that point on, most of the other kubelet cleanup loops were doing a very, very incorrect check for that: they were just looking at whatever copy of the container status they had, which could be wildly out of date. And it turned out that status check wasn't checking init containers or ephemeral containers, so it was also just wrong.
C
So after some quick discussion with Elana and Ryan — my proposed approach, and what's in the doc that Elana wrote up (it was really great that she did that; I was adding some notes) — is this: the lifecycle of a pod is pretty predictable and we need some strong guarantees. We already have the pod worker. I think the pod worker today exits too early, and so my proposal, and what I was hacking around with and testing, was effectively that everything, from the moment we start bringing up a pod to the moment we've fully torn the pod down and generated the final status — because there are no more running containers on teardown — would all happen in a single pod worker loop. We actually found other problems; I was linked to other problems that people have found. So there's an issue with graceful termination, which is: you're supposed to be able to shorten graceful termination, but the way that was being handled didn't necessarily guarantee the shorter duration. Like, so you can...
C
You create something, you say wait 30 seconds; you're allowed, at the API server — and this is part of the original design — to say, oh no, only wait 20 seconds, and if 20 is shorter than 30, the kubelet is supposed to shorten it. That logic didn't actually fully work. There was a PR for it. I actually think that to correctly do that you need to be able to know what the last value was in a deterministic fashion in the kubelet; the other person had already started changing that in the sync worker.
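A minimal sketch of the shortening rule described here, assuming the kubelet tracks the last grace period it acted on; the function and variable names are illustrative, not the actual kubelet code:

```go
package main

import "fmt"

// effectiveGracePeriod returns the grace period the kubelet should honor,
// given the value it last acted on and the latest value seen on the API object.
// A later deletion request may only shorten the period, never extend it.
func effectiveGracePeriod(lastActedOnSeconds, latestSeconds int64) int64 {
	if latestSeconds < lastActedOnSeconds {
		return latestSeconds // e.g. 20s requested while a 30s deletion is in flight
	}
	return lastActedOnSeconds
}

func main() {
	fmt.Println(effectiveGracePeriod(30, 20)) // 20: the shorter request wins
	fmt.Println(effectiveGracePeriod(30, 45)) // 30: cannot be extended once started
}
```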
C
So again, I think it's right for us to combine this and have the sync worker be the source of pod lifecycle: owning the immutable transition into starting to tear down, and the transition from when there are no more running containers to when the pod is cleaned up. And actually I found a couple more race conditions as I was going, so I'm pretty confident that pod logs are actually being torn down too early today, because the loop that tears down containers also tears down pods, but that's not actually tied to pod shutdown.
C
So I actually was able to trigger a race condition in a test that I had seen before, where a pod reaches success and then we try to go see the logs for it, and just because the sync loop was kind of slow, most of the time you'd have like 10 or 15 or 20 seconds before it got torn down. I was actually able to make it happen instantly, because in my version the sync worker was tearing down the containers — and the fix is pretty obvious, which is that we want to preserve...
C
We want to guarantee that container logs stay until the pod is deleted, or an eviction or a garbage collection happens, and by unifying some of this logic that became a lot clearer. So there are a bunch of little bugs all over the place that I think this helps with. It's not a lot of change: basically, all the places in the kubelet that are saying things like "hey, given the current pod I have, go ask the status manager for the container state, go check running containers"...
C
...basically just become calls to the pod worker, which has a locked, synchronized, immutable, one-way transition map that says, you know, the sync worker has started up, so there could be running containers — sync pod. I also split sync pod into three: there's setting up a pod, there's stopping the containers, and then there's cleaning up the resources — and cleaning up the resources matches what you have to do in the status manager before the final delete is sent. So those are all basically unified.
C
As the pod worker goes through those phases, it checks to see that we're not going through an illegal state transition, it sets the map, and then everybody else in the kubelet just makes the same calls they did before, but they're consistent and they're one-way, so they can never regress.
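A rough sketch of the one-way, lock-protected lifecycle map being described, just to make the idea concrete; the types and method names are illustrative, not the actual pod worker implementation:

```go
package podstate

import (
	"fmt"
	"sync"
)

// phase values may only advance, never regress.
type phase int

const (
	syncing     phase = iota // containers may be started
	terminating              // containers are being stopped; no new ones may start
	terminated               // no running containers remain; resources may be reclaimed
)

type Tracker struct {
	mu     sync.Mutex
	phases map[string]phase // keyed by pod UID
}

func NewTracker() *Tracker { return &Tracker{phases: map[string]phase{}} }

// Advance moves a pod forward in its lifecycle and rejects illegal (backward) transitions.
func (t *Tracker) Advance(uid string, next phase) error {
	t.mu.Lock()
	defer t.mu.Unlock()
	if cur, ok := t.phases[uid]; ok && next < cur {
		return fmt.Errorf("illegal transition for pod %s: %d -> %d", uid, cur, next)
	}
	t.phases[uid] = next
	return nil
}

// CouldHaveRunningContainers is the kind of question cleanup loops would ask
// instead of inspecting possibly stale container statuses.
func (t *Tracker) CouldHaveRunningContainers(uid string) bool {
	t.mu.Lock()
	defer t.mu.Unlock()
	p, ok := t.phases[uid]
	return ok && p < terminated
}
```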
C
I'm hoping that I can get the — I think I have almost all the bugs. I'm still chasing down some cases where I didn't understand the kubelet well enough; mirror pods are still broken, as they always are, and I need to go fix something in there. But if folks can read the doc and give feedback on it — did it sound sane? I saw Dawn nodding, but Dawn is easily convinced by my subtle lies sometimes.
A
Honestly, this basically matches the original design. I believe the sync pod and the pod worker were the first time we introduced that kind of synchronization into Kubernetes, and the original design was driven by the pod events in the old code. Maybe some optimizations messed up those events, so from what we saw it looks like it is just the optimizations.
A
Over the last two years — and even recently, I think half a year ago — we had huge pressure from people who wanted to optimize the status reporting even more, but we did not really enforce that synchronization and the pod-related events, so that causes some of the race conditions. So what you see here makes sense. I also see [inaudible] is here today.
D
Yeah, actually I just recently had a discussion with Jing from SIG Storage, and basically we also encountered a race condition similar to what you described: we'll create a container — we'll create a pod, and it's not running yet, it's pulling the image — and now you try to delete it immediately, and somehow the volume will go into a weird state. We'll say that the actual volume is not mounted, but the kubelet thinks it is mounted and creates a local directory.
D
So the problem is it's using a local directory, but it actually should use a volume. Anyway, it goes into inconsistent data, and we discussed internally and also discussed several options, and one of the options is to make sure that the pod worker handles the lifecycle of the pod, and the other logic, like the cleanup...
D
We have a big cleanup loop, so I think for that one the race condition is caused because both the cleanup and the pod worker are trying to do things — and we thought the pod worker had stopped working, but it was still working. So if we have a clear indication about whether the pod worker is still managing the pod, then cleanup shouldn't kick in; to avoid the race condition, only when we know that the pod worker routine is gone can we clean up.
C
It is kind of reassuring, because most of the race conditions that were originally caught were with short-running pods — it could happen later in some cases, but that would be unlikely — and cleanup in general is working, as we've improved cleanup over the last couple of years.
C
The nice thing is that we've kind of improved and concentrated where cleanup happens, and as I went through I was trying to verify all those places, and I think we have gotten a lot cleaner, so it's easy to say: here's the set of things that we have to clean up. Having that list actually means we could just do that synchronously, which also reduces tail latency on shutdown, but there are some trade-offs there that I think we should discuss.
C
One of them — Dawn, you brought up the point about status. So the thing I was actually seeing before with status was some of these second-order effects with cleanup. You know, with status, I think it was taking us 10 seconds on average to sync some of the important events.
C
What I'm seeing today in the code is that it can take somewhere between 10 and 20 seconds under moderate load, or 30 to 40 as more pods show up, to actually finish the termination of pods. Part of that is actually just the sync loop intervals, which is a defense — you know, it degrades gracefully to a more predictable performance mechanism — but also status is easier to reason about now, because we can actually look, with the pod worker having the full lifecycle.
C
What I was finding was that I could actually make an argument that we would know the time between different phases much more accurately, because it's one spot, which would also help us put performance measurements in place. That would let us assess the tail latency of pod shutdown and measure it accurately, as well as things like: when we do go optimize status reporting, we'll have a much better understanding of when the status should be done, and we can compare against it.
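The instrumentation being suggested could look roughly like the sketch below — a histogram of pod termination duration recorded by the pod worker. This uses the plain Prometheus client and a hypothetical metric name for clarity; the kubelet's real metrics go through its own wrappers:

```go
package metrics

import (
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

var podTerminationDuration = prometheus.NewHistogram(prometheus.HistogramOpts{
	Name:    "pod_termination_duration_seconds", // hypothetical name
	Help:    "Time from the start of pod termination until the final status is generated.",
	Buckets: prometheus.ExponentialBuckets(0.5, 2, 10), // 0.5s .. ~256s
})

func init() {
	prometheus.MustRegister(podTerminationDuration)
}

// ObserveTermination would be called by the pod worker once no running
// containers remain and the final status has been generated.
func ObserveTermination(terminationStart time.Time) {
	podTerminationDuration.Observe(time.Since(terminationStart).Seconds())
}
```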
C
The places that were a little weird were where some of our terminology is a little inconsistent.
C
So in eviction we would say "terminated pods", but those are actually terminating pods; and then "pod deleted" — "is pod deleted" was being used in a couple of places, and it wasn't really that the pod was deleted. It was either that, kubelet-side, eviction has been requested, in which case the kubelet wants to get rid of those resources, or the user has told us to delete it, in which case we can get rid of those resources — and that was kind of being used inconsistently. So just going through and reviewing, it's easier to put better names on those once they're centralized, like "should pod resources be reclaimed", which makes it easier for someone adding a new loop, or looking at it, to reason about. You know, it took me a couple of days to bring everything into context, and I still think there's some subtle stuff, but it has helped to look at it.
E
I was looking at your first PR, and I know I've been telling you for the last couple of days that I'd look at this, but, you know, live code review is fun. You said static pods weren't working right with—
E
I was trying to think through the changes we did last year to handle, I guess, proper shutdown of a static pod when you change the file on disk, and I remember having to spelunk my way down into the pod killer stuff, and I see that you've gotten rid of the pod killer — or I can't find it.
E
What happens when a kubelet starts and sees pods that should no longer exist? Like, all the cleanup loops for—
C
All the cleanup loops are basically the same: the cleanup loop would look at all the containers or volumes, and it would say, here's the set that shouldn't exist. As far as I can tell, the cleanup loops are actually correctly cleaning up all the pods that are non-existent, because I didn't change the cleanup loops except that the cleanup loops have a bunch of gates in them that say: is the pod in this state, or does it exist or not exist?
C
I made those correct, because they were not correct — the places where we were checking "are containers running" were racy and were vulnerable to a race on startup, or on restart of a kubelet, or on restart of a node. So that is a place where a race would exist; in practice most people wouldn't hit it, but you could actually start tearing down the containers, or the state you needed from the previous reboot or from a previous start of the kubelet, due to that race condition. That's been mostly fixed, so I didn't really have to change that logic — it's just that the check is now correct in terms of "should this pod exist" and "what phase of its lifecycle is this pod in". Yeah.
C
I'll go double-check that. As far as — and this may be one of the tests that was broken — the core tests all work, which was another example: every place where I knew it was broken, the e2e tests did not catch it. That is a place where we probably need to add some better e2e tests — the node e2e tests caught a few of them, but, for instance, some of the race conditions, like the fact that init containers were not considered at all, mean we need to have an e2e test
C
that's a little bit like the existing create-then-delete test for init containers. There was one where pod logs aren't being preserved — that was a race condition that I only caught because the test flaked rather than failed.
C
So I didn't change any names there. So, like eviction — everything: if you want to terminate a pod from the kubelet in a scenario, you basically send the sync-pod kill event to update pod, so eviction goes through the kill-pod wrapper now, and then you pass it on.
C
Obviously you'll do status mutations that, say, make it permanent or not. Soft admission failures in the sync loop are the only thing that doesn't go through that mechanism, but it puts the container into the same space, which is a soft pod termination. So if pod admission fails, you're still in the sync loop, you're still building up, but you get to the point where you realize you should not be running, and you shut it down.
C
So it's like one of the first checks. All the others — eviction, graceful shutdown — basically just call kill, and they either mutate the status or they don't. The issue I saw from a terminology perspective was that graceful node shutdown today is behaving like eviction.
C
When you shut down the node gracefully, I don't think it should, because there's no expectation that that pod can't restart the next time the node comes back up. That was a minor thing I caught. I tried to unify all of the code that is terminating pods, so there's only one method right now that anybody can call to tell a pod to shut down that's not inside the sync pod worker — kill pod is only inside the sync worker, except for the cleanup loop.
C
I think that was maybe the one — and maybe cleanup even should be more closely aligned with the pod worker, to your point, Derek, but we could tag it with a to-do or something: cleanup of pods that don't exist is similar enough to pods shutting down that they should be aligned, so that you know when you can use kill pod or not. But I didn't change the definition of eviction, I didn't change graceful node shutdown, and I didn't change—
A
So we do have eviction and also out-of-resource tests, but I don't know about the e2e — we never...
A
I know, and I just want to say that there are many to-dos we didn't finish. The reason is that we need support from the rest of the community, the broader Kubernetes community, because I think it comes back to the community being unclear about a lot of what conformance is, right?
A
So that's why — this is why I introduced something called node-level conformance. But back to this: the PLEG, the pod lifecycle event management, actually used to have integration tests for those kinds of things, and unfortunately those integration tests also have no maintenance. So, to keep the community focused, we later only focused on node e2e; later, once Kubernetes had the cluster-level conformance tests, we focused on conformance. So you can see, over time,
A
we focused on smaller and smaller core functionality, trying to make sure those have e2e tests, so we gave up a lot of other tests. At the same time, for the related stuff we also had integration tests, and for the PLEG we also had integration tests, so maybe we could revisit those kinds of things. There are even, you know, performance tests for the container runtime, and also node-level performance tests at the kubelet level.
C
The nice thing is, at least from the behavior a user expects, most of our e2e — the gaps I was seeing were places where, in the e2e, have we ever really defined how long you can get logs from a terminating pod? No. Should we? Probably, because if it's racy right now, people are probably expecting it to work. So even that is a great example of a conformance-style test that's pretty easy to write with a little bit of cleverness.
C
I did really regret not having reliable and easy-to-run integration tests for some of the interaction between the cleanup loops; some of that's tough — node e2es could cover that.
C
But as Elana and I were talking about yesterday, there were some definite places where this was harder for me, even kind of knowing what the intent was and having a lot of the backstory, because I couldn't check the mid-level invariants without just firing up a local cluster and running it — which felt like we could do better for testing specific scenarios that are kind of holistic. But, knowing how complex all the flags for the kubelet are, there are also some limitations there. So I'd be happy—
E
So I think Lantao or myself want to help review on this — and I know Clayton will help you with the... yeah.
C
But I'm not ruling out the possibility that some of the other weirdness we've seen on static pod restarts might — I might have broken a workaround — or this might actually help us fix it more cleanly. Static pods are subtly different in their termination logic because they will come back. That'd actually be a great opportunity to go and make sure that I understand it as I'm putting some of this other stuff in place, so I can help add more tests.
E
And then, Dawn, you had mentioned the perf tests — Ryan and I were having some discussion with some folks at Red Hat that wanted to take a better look at pod startup latency, and one of the areas we were trying to see is if we could have them bring that test dashboard back to life. I don't know what the timing on that will be, but sure, definitely.
A
That's what I also have as a to-do, based on the original design. I can share — I can schedule a meeting to give the current status and discuss how we can move forward. There's also SIG Instrumentation, which owns the dashboard part, rather than SIG Node. So we also want to do more on the instrumentation side and those kinds of things, and then from SIG Node, basically from us, we want to enable that instrumentation so you can easily monitor the latency.
C
The time when we delete the etcd object is unmeasurable today, but it could be measurable, and there were some options, so I'd be happy to tackle some of that and then tee up some of the discussion as well. Certainly with this change, at least for an unloaded kubelet, shutdown latency becomes much more predictable — it's not a regression, and the tail is pulled in a bit. It's not fundamentally changing things; under load, the status loop is now the biggest piece — the pod status worker, or pod status manager, and its final updates are the longest component of shutdown from an observable perspective. The other fixes I had been playing around with for improving that would probably be even better now, because we're not dependent on the cleanup loops in this approach.
C
So I'm kind of incentivized to get the right instrumentation in, so that we can say, hey, here's what we're getting now and here's how that looks. And then, when the status manager improvements come — if somebody gets time to come back and reassess them — we might actually see a significant reduction in the time to shut down, which matters for anything short-running.
B
Clayton, do we want to touch on the pod-level terminationGracePeriodSeconds-being-set-to-zero bug? Because that was kind of the thing that—
C
Yeah. Jordan had been recommending it to people in e2es — because Jordan's not the best person sometimes — but in e2es we were force-deleting pods, and so I thought we were force-deleting pods in e2e tests where people were actually invoking a deletion, but it turns out that we actually allow terminationGracePeriodSeconds on the spec to be zero, which is not the intent at all, and I don't know how it's gone five years without me noticing that. So that's what the PR originally was.
C
I was fixing it to make it impossible to set zero from a pod spec, because that completely takes the kubelet out of the chain; it has a whole bunch of resource implications for the kubelet. The original intent of graceful deletion — or force deletion — was not for it to be a standard part of the workflow.
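A minimal sketch of the "make zero behave like one" clamping discussed next, assuming it is applied wherever the kubelet computes the grace period it will honor; the function name and placement are illustrative, not the actual PR:

```go
package main

import "fmt"

// clampGracePeriod maps disallowed spec values (zero or negative, which would
// take the kubelet out of the termination path entirely) to a one-second minimum.
func clampGracePeriod(specSeconds int64) int64 {
	if specSeconds < 1 {
		return 1
	}
	return specSeconds
}

func main() {
	fmt.Println(clampGracePeriod(0))  // 1
	fmt.Println(clampGracePeriod(-5)) // 1: negative values are also invalid
	fmt.Println(clampGracePeriod(30)) // 30
}
```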
C
As part of that, I was proposing making zero behave like one, as sketched above. Elana pointed to another bug where negative values also work. Jordan and I were just having a brief chat yesterday, where he was suggesting—
C
So I'd like to see that PR and the approach I had combined. Jordan was making some suggestions on the PR because he had questions about it, but it would effectively be an API change where we would say you can no longer request forced deletion of a deployment or replica set or stateful set. So I'm a little—
A
Clayton, let me give some background. Basically, I think there was a big debate between us and the community. For many features, I took that position because we also don't want the kubelet in the middle for a lot of the force-deletion behavior scenarios. So at that time I debated it a lot, but I also don't want it in the middle. I just want to say there's also—
A
If the kubelet really needs to accommodate a real need to delete, all those kinds of things come up. What I initially wanted was for the kubelet to at least observe it — an observability event — and a pod event would be generated when it is deleted. But then there are a lot of people who don't want to wait even that single-pod duration and want to delete anyway. I forgot all the counter-arguments, and so—
C
So then the question would be: is it that the performance is slow, and by improving the kubelet we can get it down? Because I do think that if we don't wait for containers to be removed — which is another thing in the PR that is worth some discussion — we're talking about volume cleanup, which is the slowest part of it. And volume cleanup, even with the check I was doing, was still seconds.
C
QoS cgroup removal is very fast, and the pod sync loop, if it had the right info, could basically be near-instant with the exception of throttling. So I do think it's actually pretty possible that, if we could improve volume cleanup, we could get down to two seconds — like, if you had a pod with a termination grace period of one second, you could probably finish in two seconds. Would that be short enough, or does someone still need it even faster?
A
This is the evil of deletion blocking on a single pod — that's wrong; the original design is not this way. The original design is to just generate a cleanup action. That came from me: I said, okay, once we watch a pod event — no matter whether it is container creation or container deletion, all those kinds of things — once you have those, you generate an event, and the pod worker will generate a cleanup event associated with it and hand it over to the cleanup thread.
C
That's probably what my change looks like — which is, if you set grace period one, the first event comes in and it makes it to the pod worker. Right now we're not interrupting the pod worker, but in order to do that kind of shortening we should be able to cancel the sync pod worker by passing a context to it.
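To make the context-based cancellation concrete, here is a small, self-contained sketch: an in-flight sync waits out the grace period unless its context is cancelled because a shorter grace period arrived. The function names are hypothetical:

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// syncTerminatingPod waits out the grace period unless the context is
// cancelled, in which case the caller re-runs the sync with the shorter value.
func syncTerminatingPod(ctx context.Context, gracePeriod time.Duration) error {
	select {
	case <-time.After(gracePeriod): // normal path: grace period elapsed
		return nil
	case <-ctx.Done(): // a shorter grace period was requested mid-sync
		return ctx.Err()
	}
}

func main() {
	ctx, cancel := context.WithCancel(context.Background())
	go func() {
		time.Sleep(200 * time.Millisecond)
		cancel() // simulate the API server requesting a shorter grace period
	}()
	fmt.Println(syncTerminatingPod(ctx, 30*time.Second)) // prints "context canceled"
}
```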
C
So that would be — that was listed in there — the event: you see the deletion at certain events, and the transition into terminating would be allowed to cancel, shortening the grace period, with the ability to cancel the current sync, which would then just tee up the next one. We'd call kill pod, it would be done, and if at that point we have enough info, we fire the final status. That's taking a while today under load, because the status manager is not optimized for per-pod latency, it's optimized for throughput, but I think my change gets 90% of that, which would basically be—
A
While I'm cleaning up, there's also the shift and hand-over to different things. It's just because, on disk, you also need a single view of all the disk management, including the container image downloads and the container logs, all those kinds of things on the disk. So this is what I initially designed:
A
the pod worker handles the single pod's events and watches all those events generated from the API server and related stuff, and then there's a separate — maybe a group of — disk workers; that's a separate disk worker, but there's one thing: a single place that monitors the entire node disk, including new image downloads, tracking the image downloads and the logging, and also what's on the disk. So that kind of makes—
C
It was actually good for me to bring it all back into context, because I could see many of those elements. Maybe we can do a follow-up at some point soon and go through the PR in detail and ask whether there are other elements we can improve, because I do think that in order to fix some of the race conditions we found, we have to make some of these trade-offs anyway — because the loops are cleaning up stuff too early.
A
So I think: please review the PR and also read the doc shared by Elana, and thanks a lot for converging everything, and thanks for offering the reviews on this one — I know you are all super busy. So, for time-checking, let's move to the next topic.
F
Let me — so, in issue #101851 we want to expose container start time in the kubelet's metrics/resource endpoint, and that gives us two benefits. First, it allows detecting container restarts, since the container start time changes; second, it allows speeding up metrics for fresh containers. So I think it works.
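Roughly, exposing container start time as a metric could look like the sketch below, written with the plain Prometheus client for clarity; the metric and label names here are assumptions for illustration, and the real metrics/resource endpoint is generated differently inside the kubelet:

```go
package metrics

import (
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

var containerStartTime = prometheus.NewGaugeVec(prometheus.GaugeOpts{
	Name: "container_start_time_seconds", // hypothetical name
	Help: "Start time of the container since the unix epoch, in seconds.",
}, []string{"namespace", "pod", "container"})

func init() { prometheus.MustRegister(containerStartTime) }

// RecordStart is called with the started-at time from the container status;
// a changed value for the same labels indicates a container restart.
func RecordStart(namespace, pod, container string, startedAt time.Time) {
	containerStartTime.WithLabelValues(namespace, pod, container).
		Set(float64(startedAt.Unix()))
}
```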
B
Thanks so much for joining today to bring this up, and I can provide a little bit more background as well. On the SIG Instrumentation side, Marek was looking into using the metrics/resource endpoint as a replacement, I think, for the summary API, and there were some issues with the CPU usage metrics.
A
The first point is about the node — node startup time — similar to how we define the pod startup time: the duration is from when we first have the pod created and seen in the API server to when all the initial application containers have started once. That's the pod start latency, at least as we initially defined it. For the node, that's roughly what I defined, just as brainstorming here.
B
My concern with having the node start time be when the node marks itself as node-ready is that it can run stuff before then, which would be confusing, right? Like, static pods could potentially start before then; things that, I guess, tolerate node-not-ready could potentially start before then. So—
A
I think, though, a lot of people forget — within SIG Node we know there are use cases for static pods; we even supported some static-pod-related features last year — but again, static pods are discouraged.
A
We know there's the initial cluster bring-up that depends on static pods. No matter whether it's OpenShift or GKE, there's a static pod dependency, but I believe in SIG Node, from day one, we said we discourage people from using static pods all the time; it's only for cluster initialization. We know there's that dependency until we address the bootstrapping for the health of the cluster. So that's mostly for those cases, but the rest of the stack shouldn't rely on static pods.
A
I think it's not just the CLI — actually, node ready: in the past — and I know we relaxed a lot of things — in the past, node ready meant at least that the container runtime had come up. So you would see that the container runtime is ready, and then you would see the CRI, or that a pod sandbox can be allocated, yeah. So that's node ready — that's the first time the node is ready. So—
E
The container start time metric that's in the proposal makes total sense to me. Just clarifying the "time of node" — I mean, it's a net new metric; I just didn't know what that was actually measuring, and it wasn't clear what that metric is used for right now to meet the two use cases above. Like, if node start time is when the computer was powered up,
E
let's say, then I guess you're measuring the delay from power-up to running a container, which seems very useful, versus the start of the kubelet reporting ready and then running a container, because you would probably have already been running containers.
A
I also want to say that even today, when the node first comes up, we'll generate an event, and then when the node is ready we generate an okay-node-is-ready event. This is what I talked about with the node status object. So basically we used to measure: has this node successfully joined the cluster, and is it ready to take any work from the API server. So we believe the node itself can expose other metrics to say, oh, how long did you—
A
Basically you have other metrics to measure things after your kernel startup time, because you can just parse your kernel log, and there are also — you can measure, if you look at the node perf tests, they actually measure all those kinds of things based on log parsing. But when we talk about Kubernetes by default, open source, not everybody's production is the same, and we should actually be as generic as possible. So node startup—
A
For me it is more like: okay, this node is ready to join this Kubernetes cluster. Right now — Derek, you were here when I talked about whether we can make Kubernetes self-contained, I mean not a product — and this is where the static pod discussion and all the other discussions come up: Kubernetes itself actually could have an API, like the kubelet publishing... we turned down that proposal.
A
I mean, I proposed that and the community turned it down; it was proposed a long time back. So basically the kubelet is always part of the cluster, and you have to say this kubelet is ready and serving its API. Originally we actually even wanted the kubelet to publish its API — so it could publish the pod spec, the pod API, right. That's what I originally thought; then we could have the start... but that was the older time, when we could start Kubernetes through Kubernetes and evolve it with the whole Kubernetes cluster.
D
I just want to say that in this context I feel like the metrics are mainly used for monitoring — monitoring metrics — so it's for calculating the CPU usage. In that case, I mean, if we add node startup time, it should be used to serve the purpose of CPU usage calculation, right? Then, in that case, the node-ready meaning may not make sense.
B
I want to make sure that we get clarity on this before we spend too much time digging deep into it. So I put an action item here: I'll summarize the discussion we had today on the bug, and I'll take this to SIG Instrumentation to clarify why Marek wanted the node boot time, because I think there are a bunch of different ways
B
we could implement that, and in the issue he says it's optional. So it sounds like we're mostly aligned: the container start time metric makes sense; on the node time there are some questions. If we either come to the conclusion that it's not needed, then that solves the problem, or if we can get more context on why it's needed, then maybe we can move forward. Does that sound good?
A
Sounds good, really good — thanks. And also, coming back, thanks for taking this one. One thing: I don't know whether the node problem detector — actually, if I remember, somewhere the node boot time is already exported as a metric; if some product wants to use it, I don't know, I think it could go into node problem detector. This is the compromise we had in the past. Let's move to the next one — Lee, hi.
G
Hi everyone. So, as part of the ephemeral containers PRR, I wanted to add a couple of new metrics to the kubelet to answer the following two questions: how many ephemeral containers are running on this node, and how many has this node started — how many were ever run on this node? So I added a couple of gauges and a couple of counters, and the counters are fine, but I think there's some overlap with existing metrics for the gauges.
G
Those existing metrics measure the kubelet's internal representation of containers — so running pods measures sandboxes, not pods, technically, and running containers measures just the list of CRI containers, not API-object containers — so that doesn't have the information I need.
G
I could plumb the information down and add them as a label or an annotation, which would be fine — and we actually even had the containerd authors ask for this behavior, so that would be beneficial — but before I knew that, what I did instead was add a new metric to the pod manager, which has access to all of these API objects — the pods and all of their different containers — and surface those as a gauge counting the number of pods and containers running in this kubelet.
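A hedged sketch of the two kinds of metrics under discussion — a counter of ephemeral containers ever started on this node, and a gauge of containers the kubelet currently tracks, broken out by container type. It uses the plain Prometheus client; the metric and label names are illustrative, not the ones in the PR:

```go
package metrics

import "github.com/prometheus/client_golang/prometheus"

var (
	startedEphemeralContainers = prometheus.NewCounter(prometheus.CounterOpts{
		Name: "started_ephemeral_containers_total", // hypothetical name
		Help: "Number of ephemeral containers this kubelet has ever started.",
	})

	managedContainers = prometheus.NewGaugeVec(prometheus.GaugeOpts{
		Name: "managed_containers", // hypothetical name
		Help: "Containers currently tracked by the kubelet's pod manager, by type.",
	}, []string{"container_type"}) // e.g. "init", "regular", "ephemeral"
)

func init() {
	prometheus.MustRegister(startedEphemeralContainers, managedContainers)
}

// OnEphemeralContainerStarted is called after a successful start attempt.
func OnEphemeralContainerStarted() { startedEphemeralContainers.Inc() }

// SetManagedContainers refreshes the gauge from the pod manager's current view.
func SetManagedContainers(countsByType map[string]int) {
	for typ, n := range countsByType {
		managedContainers.WithLabelValues(typ).Set(float64(n))
	}
}
```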
G
So it's clear that we don't want both of these metrics — we could have them, they're measuring slightly different things — but the feedback I got was that we don't want multiple metrics; it confuses people. So my question for SIG Node is: which do we want to measure, what exactly do we want to measure, and maybe even what do we want to call it?
B
So my concern was not having multiple metrics — it was just that the naming made absolutely no sense to me and I didn't know what they were talking about. So I just linked, in chat, the PR which was adding the clarification that the running pods metric actually measures the number of sandboxes, not pods as API objects, or something like that.
B
So that's the sort of thing I think would be super helpful. And in the PR you had introduced this terminology "managed pods", and I was like, what is a managed pod — are not all pods managed? So I wouldn't have any issue with having multiple metrics; my concern is just that I want to know, as a cluster operator, what they mean.
G
Yeah, sorry to misrepresent your position — I've actually heard from multiple people that they prefer not to have multiple overlapping things, but, totally, yeah, I didn't mean to misrepresent what you said in the PR. Oh, and by the way, "managed pods" comes from the pod manager; I am not at all attached to this name.
A
I do think there's a need to introduce metrics for ephemeral containers running on the node. But I understand what Elana said, because the running pods and running containers metrics are the kubelet's internal representation — again, it's the kubelet's internal representation.
A
The reason — just like what Clayton talked about earlier, the synchronization issue, the race condition issue — even without those race conditions, Kubernetes by design accepts that the kubelet's view and the API server's view can differ. This is why we introduced it: because we cannot have the kubelet in the middle blocking a lot of API behavior. So there may be some time difference from the API server's view, and the desired state gets reconciled and eventually synced with the API server.
A
This is why, when we introduced those running pods and running containers concepts, it was so that we can monitor from the API server and also from the node side and see what the diff is, and eventually it should be consolidated, right — it should be the same. So, at the same time, I want the ephemeral container metric, because running containers could be the total on the node; an ephemeral container is a temporary container, and we should have a tracker for it.
A
I read the thread, and I don't disagree with that, and I know we should have a document to represent the history and legacy of what we have there. I believe we do have it, but maybe it's scattered across different documentation, and over the years maybe people just don't know, don't have the background anymore. But I do think that ephemeral containers should have metrics.
A
When I want a new feature — such as ephemeral containers, which took a really long time — the first thing is that it is guarded by metrics. So now we can see whether it is behaving well; then we can monitor it. We need to build a monitoring dashboard to say, oh, for a given cluster or a given node, how many ephemeral containers are running, and are those ephemeral containers just staying there; then we'd have a path to clean those things up so people don't abuse the feature. Anyway, I'm just sharing my top-of-mind thoughts here.
E
Yeah, so I guess thanks a lot for linking to the PR where the existing terminology was easily confused. Like, oh, I think the one with the running pods / sandbox clarification is a net win; I wish we could fix the keys, probably. For this PR, I have no—
E
This all seems super useful, and the "managed" terminology doesn't upset me, because the kubelet's job is to manage pods, so I think that's a fine prefix.
E
I like being able to distinguish by container type in the labels there, so I like that. While you were focused on just ephemeral containers, I like having this for the other container types, so I have no objection to what you're proposing here, Lee. I'd have to dig back into your code to make sure it's counting the right thing — the running containers metric we have, I think, historically would have been tripped up by the pause container, or not, when running on Docker versus the CRI.
E
So having this cleaned up — having something that represents the end-user view more than the internal subsystem view — makes total sense to me.
G
Okay, thanks, Derek. And I could also pursue a cleanup of the running ones, if that's of interest — these are all still alpha. The "running" nomenclature kind of gets to me, because it includes containers that aren't running.
E
I guess the one question I have is the motivation to know if the kubelet had launched an ephemeral container — I guess it's to know if it was a pristine node, I assume, or a pristine workload environment?
G
No — this is to help cluster admins with upgrades and downgrades, or feature enablement and turn-off: just to detect whether the feature has been used anywhere. We are pursuing a pod-API-level, sort of tainted, flag to inform whether or not a pod has ever had an ephemeral container.
A
Part of monitoring-driven development — that's what I call it, I don't know what other people call it, but I call it monitoring-driven development. So that's why I hope we just make the scope smaller: only the ephemeral container information, and not over-complicate the existing API. I think that's also Elana's concern — like, oh, we changed running pods to become managed pods and all those kinds of things, but we don't have documentation, because running pods—
A
The running containers metric has actually been legacy, used for a long time, and people may already be monitoring those things. If you want to change that, people will just say: oh no, no, don't change it, I already monitor that. So let's just add this kind of thing. If we want to do the bigger surgery and consolidate everything, I think we need to call that out — even, like, treat it as a node API change more seriously — and then let's call that out separately.
A
Basically, I would just suggest narrowing the focus down to only the ephemeral container part — is that okay for this one? Because I'm basically driving this as monitoring-driven development: for every feature I did in the past, my first question to engineers was how to monitor it, how the cluster admin would. And Lee actually is an SRE at Google, so he basically has this kind of momentum, like, a model of how to do those things.
G
Okay, and just to make sure I understand: so I am solving the problem that I need for ephemeral containers, but also surfacing the other types, not just ephemeral containers — and you're okay with that too, the container type label, as opposed to only a metric that's only ephemeral containers?
G
Yeah — so I think the metrics as-is here, modulo some better documentation. I'll go back and try to explain what they all are better — or have we agreed that this is okay to proceed?
D
Actually, I also have some questions about this, because to me this is not the kubelet's — it's not the actual state. I feel like it's important for the kubelet to report metrics for the actual state, because those are things you cannot get from anywhere else, but this one, to me, is just some categorized counting of some API objects; in that case you can just easily do it in the API server.
B
kube-state-metrics will scrape this from the API server; the API server is the authoritative source, not the kubelet, so we shouldn't be trying to get that from the kubelet.
D
Yeah, in theory even a controller can just read all the pods, count them, and also check which node each is scheduled to, and get exactly the same number, because it's just desired state and everything is in the API server. I feel like this — and especially, as Dawn mentioned, whenever we add metrics it's hard to deprecate them, because people may start depending on them.
G
I was just gonna say I think it's useful to know how many containers and pods the kubelet thinks it's managing, right, because the kubelet is not the API server. You could poll the intent, but the intent is different from the actual state. And also — I'm sorry, I don't actually know the answer to this — there are other types, like the static pods and mirror pods: are all of these reflected in the API as well? Well—
A
The way I slightly disagree here is: I also want every single thing the kubelet exposes and exports to be actual state. I believe this ephemeral container metric is also in that category, and the API server — like what you just said, that's the desired state. That's exactly how the kubelet today exports the running pods and the running containers: that's the actual state, right? So what—
E
This is in the pod manager, and that should be reflecting actual state — I mean, I'll have to run it to check. But I find this useful in the case of knowing: did a kubelet actually get the watch notification for what the API server said should be there, or not? I don't think the API server is always the one true source, and so as long as this is actual state, it's fine.
H
This is Lantao — I took a quick look at the code: he's making the call to start, you know, the ephemeral container, and then he's checking the result, so the counter is incremented after a successful result. It may not currently be running, so it's not actual state, but it is, you know, the fact that they were successful in running an ephemeral container of these various types.
A
I think this is back to the earlier state-terminology misalignment. I think it's most representative of some level of the actual state — now I understand where that came from. So it's not that the container is actually already running, but it is what the kubelet understands to be the ephemeral containers on this node that are supposed to run.
G
Okay, sorry — we're actually talking about two different metrics: the one that is updated there is the counter of "has an ephemeral container been run", and there's another metric, a gauge, of how many ephemeral containers are running right now.
A
Can we carry this one on? I think we really ran out of time. Can we carry on the rest of this? At least, from a high level, SIG Node will agree that the kubelet can expose some metrics representing the actual state, and that the ephemeral container count metrics actually are useful for the cluster admin — but how to represent them, whether we can narrow it down or not, and how we go, we can carry on on the PR. Is that okay?
A
Thank you, Lee. And [inaudible], sorry we didn't track our time well, so you missed today — next week your topic will be the first one; we will make sure your topic is the first one to discuss.