From YouTube: Kubernetes SIG API Machinery 20200729
Description
Today's Agenda:
- [wojtekt] Proposal to fix api server starting up with empty change history in watch cache: https://github.com/kubernetes/enhancements/pull/1878
- [Bhagwat] Discussion about deep healthz check on API server.
- [Bhagwat] Discussion about graceful shutdown of API server.
- [fedebongio] metacontroller update here.
- [mvladev] ResourceQuota admission controller and aggregated apiservers
A: One, two, three, recording. Hello everybody, welcome to the Kubernetes SIG API Machinery bi-weekly meeting. Today is July 29th, 2020, and we have a number of items on the agenda that we are hoping to go through today, so without any further delay, let's get to it. I'm presenting the agenda, and I think the first one is something that was brought up by Wojciech and Daniel. So let's get to it.
B: Yeah, so I can talk about it. I wanted to talk a little bit about the proposal that I have put together. It doesn't seem to be super controversial based on Daniel's and David's comments; they have already at least initially reviewed it. But I wanted to give you a chance to actually object
B: if you see any problems with it. So the problem that I'm trying to solve is basically the way the watch cache is initialized. Whenever you are using the watch cache, and that is really needed in large clusters, at least for some resources, in particular for some high-cardinality and high-churn resources: leases probably aren't crucial, but pods, nodes, endpoints, if you are extensively using them, and stuff like that.
B: So the way the watch cache is currently initialized is that it doesn't start with any history; it is just initialized based on the current state of etcd. It's basically initialized from "now", for some definition of now, using a quorum list against etcd, and that is causing two main problems.
B: The first problem is that, in case the API server is rebooted, or if there are multiple API servers and we are doing a rolling upgrade: whenever there was some watch that was served by the old API server, then after the reboot, or, during the rolling upgrade, as a result of rebooting the API server it was connected to, it is reconnecting to a newly initialized API server.
B: If there weren't any changes in between, the watch will actually fail with "too old resource version" and a relist will be required. So in particular, when there are no changes happening to any objects of a given resource type (because the watch cache is per resource type),
B: after the upgrade of the last API server, pretty much every single watch will require a re-initialization with a list, which is causing significant performance issues in large clusters.
B: The second problem is basically the fact that we are initializing from a quorum read from etcd, and even though we are listing just objects of that type, the resource version the list returns is actually global for that particular etcd backend. Which means that something may change in between upgrades of the different API servers while no object of that particular type is changing.
B: And after initialization, the cache is then updated only by watch, so it can only have resource versions for which there was an object change: a creation, update, or deletion of some object of that resource.
B: So what I'm proposing is to use the progress notify feature from etcd. The progress notify feature allows you to configure a watch to send you periodic "progress notify" events (that's how they are called in etcd) that tell you that your watch is actually synced to a particular resource version, and the resource version that this progress notify event contains is also global. So that obviously fixes the second problem.
B: It's pretty obvious why it fixes what it is fixing. The thing is that we need to change etcd a little bit, because currently the interval is hard-coded to 10 minutes and we need much less, but there is a fairly simple proof-of-concept PR put up some time ago by Jingyi, so it shouldn't be very controversial.
B: And we need some relatively local and small changes in the watch cache and in the reflector to actually handle the progress notifies correctly. But I have a PoC, it's linked from the KEP, it's pretty small, and it actually shows that it works.
B: We were also testing it (or I was also testing it) on reboots, to see the improvement on very large clusters, and it actually is a tremendous improvement, so it really helps. And then, as I mentioned, we need to adjust the period; the exact frequency is probably still to be decided. I was thinking about something between one second and 10 seconds. The requirement for the solution to fix the problem is basically that we need to receive a progress notify event after the previously rebooted API server was initialized, but before the next one is actually going to be initialized.
B
So
so,
depending
like
how
this
how
this
time
interval
looks
like
it,
it
may
we
we
might
have
like
place
to
like
tune
this
exactly,
but
but
probably
like.
It
won't
be
lower
than
like
couple
seconds
anyway.
So
so
anything
like
that
seems
fine
and
the
final
thing
we
need
to
do,
which
is
a
little
bit
or
toggle.
B: to this (it would be useful no matter whether we proceed with that proposal or not) is to change the API server to also send watch bookmarks on shutdown. So right before shutting down, it will send watch bookmarks to all watches that will be broken because of the API server shutting down. So that is roughly it; there's a much longer, and hopefully much cleaner, explanation in the KEP if you want to follow up, but I think that's mostly it!
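The failure mode and the fix described above can be sketched as a toy model (illustrative Python only; the class, method, and variable names below are invented for this sketch and are not the actual watch-cache code):

```python
class TooOldResourceVersion(Exception):
    pass

class WatchCache:
    """Toy per-resource-type watch cache: on startup it does a quorum LIST
    against etcd, so it learns the current global resource version (RV) but
    has no event history from before that point."""

    def __init__(self, list_rv):
        self.oldest_rv = list_rv    # cannot resume watches from before this
        self.current_rv = list_rv

    def observe_progress_notify(self, global_rv):
        # a progress notify / bookmark advances the RV even with no events
        self.current_rv = max(self.current_rv, global_rv)

    def watch(self, from_rv):
        if from_rv < self.oldest_rv:
            raise TooOldResourceVersion(from_rv)
        return "watching"

# A client was watching pods at RV 100. No pod changed, but other resource
# types advanced the global etcd RV to 150. The API server is rebooted and
# its cache re-initializes from a fresh LIST at the global RV 150:
cache = WatchCache(list_rv=150)
try:
    cache.watch(100)
    resumed_without_bookmarks = True
except TooOldResourceVersion:
    resumed_without_bookmarks = False   # forced relist

# With progress notifies / shutdown bookmarks, the old server would have
# advanced the client to the global RV 150 before it reconnected:
resumed_with_bookmarks = cache.watch(150) == "watching"

# And later progress notifies keep advancing the idle cache's RV
# even when no objects of this type change:
cache.observe_progress_notify(210)
print(resumed_without_bookmarks, resumed_with_bookmarks, cache.current_rv)
```

In the first case the client falls into exactly the relist path the proposal is trying to avoid; in the second, the bookmark has moved its resume point up to the new cache's initialization RV, so the watch resumes cheaply.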
C: So I read the idea; the idea made a lot of sense to me. I did have a question about the behavior when we see these progress things. So if I get a progress indication and it says resource version 150, and I'm getting these notifications once every 30 seconds: one of my kube API servers gets 150, but the other one's not going to get another update for another 10 seconds.
B: Yes, that is correct.
C: I just want to make sure we don't end up in a case where, like, one out of three times this is going to happen to you, because one out of three times you will hit a kube API server that's more current than the other two.
D: I think it's relatively uncommon for the list and watch requests to go to different API servers. So if they do, then yeah, you've got a problem, but I think that's relatively uncommon.
F: Why would it be uncommon?

D: Because Go caches connections: for both HTTP and HTTP/2, the Go client does connection caching. That makes it very likely that the request goes to the same API server.
C: Okay. I'm willing to accept that, once in a while, a disruption is just going to cause weird things to happen, and this is fine. In that case, Wojciech, if you wouldn't mind including the explanation of cached HTTP connections. (Sure.)
G: I was going to actually ask one more question, though. I don't know of any reason why we wouldn't more aggressively use watch progress notifications to ensure that clients in general know the resource version relatively frequently, in as many cases as we can. So, the point about doing it right before restart: we should just do that. The problem is that we have the semantics around it.
D: So there's kind of an alternative approach that Wojciech and I discussed fairly extensively in the comments here, which is: instead of making sure that clients are exactly up to date, we could instead make the API server preload some history, so that it's okay if clients aren't completely up to date. And it turns out the work to do that is about the same.
G: I was gonna say, every time I have had a client that is behind in some way... So there are a couple of problems, too, that we haven't really sorted out. Like, what happens when someone has to restore from an older etcd backup and you go back in time: every single watch cache in the cluster is broken in a horrible way, because we've completely changed the fundamental assumption of the Kubernetes system, which is that we're moving forward in an immutable thing, and so you get horrible, horrible things happening.
G: For that, sure, we might as well reboot everything in the cluster for fun. But just going on:
G: keeping the checkpoint moving through the cluster is what I would call strategically better for the whole system: making sure that clients are relatively bounded in terms of resource version, maybe even within the compaction window, which is harder to do. But I don't know of a reason why we wouldn't want to leverage the bookmark mechanism.
D: Yeah, I'm not arguing with that. I just think the original proposal was like 250 milliseconds, and that just seems a little brittle to me; if the client has to get a notification every 250 milliseconds, that doesn't seem good.
D: So I think it's reasonable to approach it in both ways: make a reasonable effort to keep clients in the same universe of being up to date, and at the same time preload some history, so that if a client is a few seconds older, that's fine.
G: I feel like doing what we can to put that in is part of a broader problem: it would be better if clients could reasonably understand what their drift from the kube-apiserver is without having to do complicated logic. Which means we're doing a better job of checkpointing within some reason, more than three seconds and less than the compaction interval divided by two, or whatever. For instance, I would expect to see some of that in logs.
D: So, pro tip: you can force every client to clear its cache and reconnect by doing clever things with etcd.
B: For the record, it's slightly related: in 1.19 we were also scale-testing the existing bookmarks, the Kubernetes bookmarks, because initially there were some performance issues with them in large clusters. It seems like at least some of them, or most of them, or maybe all of them, were solved. And on top of what we had originally (we were just sending them right before the watch was timing out), starting with 1.19 we will be sending them periodically.
B: Sorry, periodically, every one minute. So if we start propagating progress notifies to the watch cache, we will get this, what you are asking for, Clayton, pretty much for free.
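The periodic-bookmark behavior just described can be sketched as a small simulation (illustrative Python; `merge_bookmarks` and all its parameters are invented for this sketch and are not the client-go or apiserver API):

```python
def merge_bookmarks(events, global_rv_at, horizon, period=60):
    """Simulate a watch stream for one resource type: deliver real events as
    they occur, and every `period` seconds emit a BOOKMARK carrying the
    newest known global RV, so an idle watcher's resume point keeps moving.

    events:       list of (time, rv) changes to objects of this type
    global_rv_at: function time -> newest global RV the server knows
    horizon:      how many seconds to simulate
    """
    out, i, last_rv, t = [], 0, 0, 0
    while t + period <= horizon:
        t += period
        # deliver real events that happened before this tick
        while i < len(events) and events[i][0] <= t:
            last_rv = events[i][1]
            out.append(("EVENT", last_rv))
            i += 1
        # without bookmarks an idle watcher would be stuck at last_rv
        out.append(("BOOKMARK", max(last_rv, global_rv_at(t))))
    return out

# No events at all for this type, but the global RV keeps advancing
# (other resource types are changing); the watcher's RV still moves:
stream = merge_bookmarks([], lambda t: 100 + t, horizon=120)
print(stream)  # [('BOOKMARK', 160), ('BOOKMARK', 220)]
```

The point of the sketch is the last line: even with zero events of the watched type, the watcher's resume resource version advances once per period, which is what makes reconnecting to a freshly initialized API server cheap.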
G: I'm not proposing we do anything with this here, but just in general: resource version is at the heart of our system, and how our caches are kept up to date is a fundamental, cross-cutting concern.
G: I'd like this purely because, what you're talking about, Wojtek, it moves us from a model where we're kind of wondering what clients are doing, to one where we are putting some bounds on what you can reason about in the system in a very concrete way. Because it is a fundamental part of when kube goes badly, badly wrong: it goes badly wrong in those kinds of dimensions. You're out of date, you don't understand, we miss events, etc.
A: Sounds good. Wojciech, did you get what you needed? (Yes! Thank you.) Okay, perfect, thank you so much. Let's go to the next one. Bhagwat, are you here?
H: Yeah, I'm here, hey. So this might be just a discussion, and maybe learning for others as well. As an operator of Kubernetes, I would like to understand more in depth what happens when a system is bootstrapping, in a situation when, let's say, I just created a cluster, versus a cluster that is being rolling-updated or is under a very heavy load.
H: Let's say we have like 5000 clusters, and I would like to basically segregate and understand. Let's say we are only talking about no webhooks, only critical components, the API server as a whole not having any dependency on webhooks for the healthz; and then what happens when custom components are involved in that health check, where the API server depends on the other components to be,
H: you know, active and responsive. And when we say shutting down, what does a graceful shutdown mean for this system? And the performance and benchmarking of that: how do we say that, hey, for this version of Kubernetes, let's say 1.19 or 1.18...?
H: Is there a way we can quantify: hey, this is basically our bootstrap time and it should be up by then? Or what are the caveats around that; what does an operator need to look at and understand thoroughly, that it depends on these things, this is how it happens? Or can it not be quantified?
H: An in-depth understanding of these systems around that, if someone can describe more on this.
H: Yeah, that's okay. We can mostly focus today not on extensions but on the core of the API server, maybe.
C: I guess, then, I would be trying to figure out what difficulty you're having. We've improved these messages in the log this past release; actually, I think Walter tagged the PR, so that in the log, instead of getting just "reason withheld", you actually get the reason, so you can look at it. That makes it slightly easier for a cluster admin to know why. But a decision about when to give up: that seems like a very deployment-specific choice.
I: Those aren't prioritized; there are also interesting things going on with things like priority and fairness, which I believe may help this. But yeah, I mean, certainly if you have ten thousand nodes and huge numbers of operators, it's pretty easy to load the API server so that it takes a lot longer than normal to start.
H: I'm saying, even when I'm just bootstrapping, not forwarding my requests yet to that instance: I'm doing, let's say, a rolling update with just five thousand or ten thousand nodes, as in the example you gave. I am waiting for my API server to, you know, bootstrap completely and say, okay, I am now healthy and ready to accept traffic; and then I'm saying, okay, if I have a load balancer or anything in front of that, then I will say, okay, now I am ready to accept requests.
H: So there are two aspects of that as well: if I have an external load balancer, that's one part of it; and the second part is when the API server service endpoint gets updated in etcd, which I don't have control over, right? When does the API server service endpoint get updated? Sorry, I haven't basically verified that. Does it happen after some critical components are up, and what are those critical components? Or does it just look at the ping to etcd and then say:
H: okay, we are okay to basically say that we are healthy? And then every kubelet, or not just the kubelet but basically all the components which are relying on the service endpoint, will start talking to the API server while it's not completely ready, because most of the critical components are not.
C: It is ready-gated, right? It doesn't add itself to the endpoints until it's ready, and when it shuts down gracefully, it removes itself before it goes not-ready. But we aren't going to be able to give you a "you should give up after X period of time", because that period of time varies significantly based on where you've deployed it: how big your cluster is, how small or large the machine is, what your network latency looks like, how heavily loaded etcd is, whether there is sufficient
C: I/O for both the kube-apiserver and for etcd, whether you have significant admission webhooks that you've added, which are synchronous calls and can intercept some of these. We aren't going to be able to give you a "this is the time you should wait"; it's dependent upon the deployment that you have.
H: Yeah, so mostly I wasn't talking about it in terms of time, as we discussed. So we have some post-start hooks, critical components, which are going on. So, as you described, we update the endpoint, right? So let's say you're talking about heavy load on etcd.
H: I'm even saying the etcd is healthy, right? I'm just doing a rolling update of my API server, and at that point in time I'm not touching my etcd; I'm independently rolling-updating, let's say, the Kubernetes version of my system from 1.17 to 1.18, or 1.18 to 1.19, but it is a bigger cluster. So what I wanted to understand, from a performance point of view, even at 5000 or 3000 nodes, which it basically should support: I just wanted to understand,
H: do we have something which, as an operator, I need to worry about and take care of when I'm running heavily loaded, performance-sensitive systems? Where I could say that, hey, these are the critical components, or, let's say, a watch or a call will get affected
H: if I'm not basically looking into those critical components; at least, these are the critical components which need to be okay? Or are we saying that when the readyz is okay and we updated the endpoint, it's completely okay to go ahead and basically register that instance, or to start serving the traffic?
C: So that is deployment-dependent, right? These checks on the kube-apiserver are things the kube-apiserver knows about itself, for every kube-apiserver. But whether or not you're going to have full usage of your cluster is going to be dependent on what has been configured in that cluster. So if you have some resource for Prow, and you have some admission webhook on there, this isn't going to check to make sure that admission webhook is functioning properly, right? That's not what this readiness check is for.
I: ...will be affected, right. So, as an example of that: yeah, it can be. If that webhook isn't working, that webhook may cause an update or create that one of these hooks depends on to fail, and so it may either take a long time or you may never get to a healthy state, depending on how your system is configured.
H: Yeah, yeah, basically I understand the case when we have an admission webhook or any other API extension or dependency. That's the second step, of when the healthz, the deep healthz, comes into the picture. Even in the first take, definitely etcd and the API server have to be there; that's the first part, what makes the whole API server work, where you have connectivity with your database.
H: So I mostly wanted to focus on and get some details about (from my understanding, and David and you guys can correct me if I'm wrong): let's remove the webhooks and any extensions, so I'm bringing up an API server without any other dependency.
H: Can I not benchmark my readiness like that: hey, here is what an API server should look like? It doesn't depend on any caching or anything, right? When I'm bootstrapping my API server, it doesn't know anything at that point in time; it just basically bootstraps empty, and then the requests come and then the caching happens, isn't it?
C: There are going to be cases where you have writes that you're going to want to have happen, for things like the bootstrap RBAC roles. I guess what I would say is that when the server reports readyz, that is usually enough to decide that this server is going to respond correctly enough to be worth sending traffic to, and there are many cases where it won't be responding
C: exactly how you'd like. But I think that anything more than what we provide is really deployment-specific, and even the timing is going to be dependent on the kind of hardware that you've got it deployed on and the kind of load that you end up under, which makes it distinct per deployment. And so you'll see different distributions choosing different values, and maybe even different exclusions on their healthz check, for instance.
H: Yeah. So, basically, David, when you were saying: in terms of how we benchmark that, like any scalability testing, we said this is the CPU core count and this is the memory it has been tested with; how we say that 30 pods per node, and we have tested this as a benchmark. So I was just more on what resources we... Yes, it definitely differs; a distributed system varies in various ways.
H: I was just looking for whether we can benchmark this: okay, here is the system and the resources which have been used, here is our benchmarking, and then this should be okay. People can use a different way to do that. I was just, you know: this is my core component, this is how we restricted it, this is our benchmark.
C: Yeah, we have not chosen to try to measure something like that. To my knowledge, we don't have standard hardware configurations that we would even be able to write it down for. I know that a lot of distributions have opinions about what their minimum requirements are.
D: If you're trying to validate your setup and see how small of a thingy you can run it on, I would recommend checking out the end-to-end tests.
D: If some of the healthz hooks are indicating a failure or taking a long time to complete, probably something is not going to pass some end-to-end test; that would be my guess. But in general, I think if one of these is failing and you don't understand why, and you don't understand whether it's important or not, you should probably go figure out what that thingy is doing before deciding either way whether it's important or not.
D: Not the most helpful answer, I know, but these things all do different stuff.
H: Yeah, yeah. So I was very curious, and got interested in how I can make sure, because I can't go through everything. Tomorrow, let's say I'm working on 1.18 and some other controller or post-start hook gets introduced, right; I would never be able to quantify each one of them to make sure that, okay, these are the ones.
D: Readyz is mostly about the API server itself. It doesn't tell you very much about the rest of the control plane.
H: Yeah. So, let's say I have a request. If I only introduced the healthz for basically the API server, and then I basically said, okay, I'm ready to accept, but actually the other stuff isn't ready (not just the dependent ones, but any control-plane-specific thing), then will all requests fail, or what happens to those requests, is also a question. So let's say CA registration is not ready, right, "reason withheld", and I said, okay, I'm only looking at a ping.
C: I think that trying to go through and distinguish what each of these individually does probably isn't a good exercise for this call. If you want to try to build a way into the code to make these self-documenting in some way, that could be interesting; I can see that being useful on a detailed reporting page, to say "the impact of this not being up to date is X". CA registration in particular: no, the request isn't going to fail on that.
H: Yeah, no, no, I do not want to basically go into it; that was just one example I shouted out. Yeah, we can move on to the next one. I think I got the answers I was looking for, or maybe some action items to go do something. Yeah.
H: Yeah, I want to basically quickly touch on this one as well. So, with respect to the endpoint, right:
when we shut down an API server, that removes the endpoint from the service's endpoints, right? I've seen cases where it basically takes a really long time, or where, in a fast rolling update, the instance which was shutting down wasn't removed from there.
H: So when we say a graceful shutdown, does that mean that only once that API server removes itself from the endpoint do we say, okay, the shutdown happened gracefully, isn't it? Or should we depend on some other API server finding out, okay, this node is not there? They don't know about that.
C: So there are a couple of different aspects to graceful shutdown. At the very highest level: you get an indication that you should shut down, and then, what happens next, I'm trying to remember. I believe at this point you end up in a stage where you continue accepting new connections, but you report that readyz is false and you remove yourself from the endpoint list; you can actually keep accepting new connections for a certain period of time, so that you can handle a lag of X many seconds.
G: You got that absolutely 100% correct. You have to wait: from the time you start shutting down, you signal you're done as early as possible, and then you must wait as long as the longest load balancer could possibly take to remove you from rotation when it observes either your ready check going down or the endpoint removal (if it's a service load balancer), or, you know, propagation delays. The cloud load balancers have propagation delays and check intervals, so you have to wait longer than the longest possible interval.
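The rule just stated ("wait longer than the longest possible load-balancer interval") is simple arithmetic; here is a hedged sketch (the helper name, parameters, and example numbers are all illustrative, not actual OpenShift or cloud-provider values):

```python
def min_shutdown_delay(check_interval_s, unhealthy_threshold, propagation_s, margin_s=5):
    """Lower bound on how long a terminating API server should keep serving
    after it starts reporting not-ready: the load balancer needs
    `unhealthy_threshold` failed checks spaced `check_interval_s` apart
    before it pulls the backend, then `propagation_s` for the removal to
    take effect everywhere, plus a small safety margin."""
    return check_interval_s * unhealthy_threshold + propagation_s + margin_s

# e.g. a load balancer probing every 10s, marking a backend unhealthy after
# 3 failed checks, with about 20s of propagation delay:
print(min_shutdown_delay(10, 3, 20))  # 55
```

The point is only that the delay is driven by the slowest observer of your not-ready signal, which is why the values discussed below are tuned per environment rather than fixed upstream.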
H: Yeah. So basically, as David explained, there are three or four steps. You said one is when we get a signal that I have to shut down: readyz starts saying that I'm unhealthy, while I'm still accepting requests, and then...
H: Okay, okay, I understand. So when we say not ready, and, yeah, when we are still accepting requests, I'm understanding we will have some threshold of time here, till which we will accept and then start rejecting.
J: Yeah, I believe it's 30 seconds by default. I could be mistaken, but that's what I remember, and there may be...
G: A really concrete example: in OpenShift we standardized this across all the cloud providers and on-premises environments as being somewhere between roughly 60 and 120 seconds; I think, David, right now we're at like 120, or 60 to 120. We do tune it based on which environment it is, if we're the ones who set up the load balancers and know, through experimentation and testing, how long the load balancers are configured for, that maximum window. So on a cloud provider: did we create the load balancer, did we set the health check interval?
A: Sorry to interrupt; I think we should unbox this, and, you know, we've been... yeah. Maybe next time you can do some more preparation in advance, and the more concrete the questions, the more concrete the answers are going to be. (Sounds good, yeah. Thank you, thank you.)
A: We still had two more items here, and we can follow up offline, also in the Slack channel, if you have more questions. Mine is super quick, so hopefully we can get to Martin a little bit. For those of you that have or have not been following along: there was a proposal to move...
A: Metacontroller was looking for a house, basically a new home. It was originally in one of the Google Cloud repo organizations and was not being maintained, which was causing a lot of problems. So we met with the main contributors of Metacontroller, together with David, Daniel, and myself, a very open conversation, and with the help of everybody we decided to give them ownership. So basically we removed it from the Google organization, and they have forked and replicated the repo somewhere else.
A: We transferred them the ownership of the groups; "we" as in we, Google, but it was also, you know, part of the SIG discussion. So yes, the conversation was closed. I was working with Amit and alaimo (I don't remember his name, but that's his nickname on GitHub), and, you know, end of the story. So Metacontroller now lives in a different place; if you go to the original repo, it will tell you where to go. Exactly, okay, so that was my update.
K: Okay, great. So it's more or less a behavioral question. Today the ResourceQuota admission controller pretty much doesn't work if, for example, you have your custom resource exposed via an aggregated API server; it works for local API services, for CRDs. However, in order to get it working in most of the cases, you pretty much must vendor the entire Kubernetes repo and then manually add it as a plugin.
K: One of my colleagues is actually, I think, working on a PR to actually move it into the API server. If this is reasonable, or should I discuss with him where would be the best place to actually put this plugin, if it's not part of the kubernetes/kubernetes repo?
K: That sounds good to me. I'll talk to my colleague; he is actually working on the pull request right now.
C: ...trying to get it created; if you could keep the pulls from getting too large, so, like, if you have pulls that actually snip the configuration apart a bit, it would be helpful if you can...

C: I was just saying it would be helpful if you could keep your refactoring PRs fairly modest in size, so we can more easily review them. If you make one giant PR that does all of it, it's going to take a long time.
K: So your suggestion is to copy, or move, or do both, or... I mean, yeah, we will try to keep it as small as possible.
A: Thank you. Okay, with that, I think we hit the end of our meeting today. I appreciate everybody's time and brains here, answering questions and having good discussions. I hope everybody stays safe and healthy; try to stay positive. We'll see you in the next meeting. Thank you for joining, and have a good rest of your day or night, depending on where you are.