From YouTube: 20220921 SIG Arch Prod Readiness
A
This is the Kubernetes Architecture Production Readiness Review team meeting for Wednesday, September 21st, 2022. So, thank you all. Let me share the agenda.
Sorry, you couldn't hear me. Zoom always does that when I share; I'm sure there's some setting I can fix.
So, thank you. I just wanted to welcome our shadows. We have a great crew of shadows, which is great. We've been doing PRR now for several releases, and the team kind of started with three, went to four, and then went back to three. And then, at least for this cycle, Wojciech is out until the end of the year, so it was just David and me, so I'm really, really grateful to have all of you here. Happy to... maybe you want to introduce yourselves? I know all of you; I don't know if David knows all of you, or if you all know each other, in which case...
A
All right, I'll take that as a no. So it's all you, Mr. Eads.
B
Let's try that without being on mute. All right, so you guys are all fairly new here. You're probably wondering why on earth you would bother to do this. Every year for the past three years we have run a survey of whoever wanted to answer, and we get demographic data. It was largely people who administer Kube and/or run it as a platform as a service. And I'm trying to find the share button. John, it tells me the host disabled participant screen sharing. Can you... yeah?
B
Okay, let me close this window just in case I get that wrong, and you guys should be seeing two survey results, a left and a right. Do you see a little report? Excellent. So on the left is the one we did this year; on the right is the one that we did last year. You can see right away we got a lot less participation this time. Not sure why, but we have about two-thirds of the results of last year.
B
Overall, the breakdown, to my quick eyeball, of the sorts of people that responded looked about the same. We ended up with about 80 percent of people who are either cluster admins or run it as a platform as a service.
B
We added a new question this year, and the question we added was: is Kube more reliable than a year ago? I was very satisfied with the answers. 75 percent said yes; a bunch of people didn't answer or went "not sure," about 20 percent; and only 3 percent said that we were not more reliable than last year. So one possible claim out of this is that PRR is useful to do, to help improve this or to help keep Kube reliable.
B
That's
the
the
glass
like
half
full
way
to
look
at
it.
The
other
way
to
look
at
it
is
just
we're
a
more
mature
product,
but
I
think
if
you
look
at
the
number
of
enhancements
actually
merging,
I
think
that
the
focus
on
how
do
I
actually
administer
the
thing
that
I'm
creating
has
been
valuable.
B
Overall, the demographics continue to be really close. There are not as many people who administer very small sets of nodes, but overall our sweet spot is still in the 10 to 1000 nodes, 10 to 1000 clusters range.
B
The reason why we drew that line in the first place was that we figured people who are at the very large numbers of clusters and very large numbers of nodes already pay specialized people to run Kube; they're very familiar with it. They appreciated our help, but they didn't really need it in the same way. The chart for clusters versus nodes under management is shifting slightly. You can see it most in the... I don't know if you can see where my pointer is going. Hopefully you can. Excellent.
B
We are running fewer small clusters, as I recall reading it; I wrote my notes down that way. So we have more nodes for smaller numbers of clusters, which I think is fairly good.
B
The version breakdown is significantly different this release. This is the raw version numbers. If you recall, John, last time we did this we said, gee, I wish we had a thing that showed us version level minus however many versions back you were. So, do you want to try that chart this time? That's what this page is. This is 1.24, the most current release when we released the survey, and it goes down to 1.24 minus three releases, and then you can compare it against 1.21, which was last year.
B
The newest release minus two releases in this case. And you can see we have a big difference.
B
I have not had the heart to look up whether 1.21 is out of support, but my memory is that 1.21 went out of support when we shipped 1.25. I don't know; I bet Jeremy probably knows.
B
It is... it is not. We shipped 1.25, what, a month ago? A month and change ago. So it'd be a lot to try to move in that time frame.
B
I don't have much out of that; I just looked at it, and this chart made it really clear. I do think that's gotten better, right, with the beta policies of, you know, not allowing perma-beta and disabling beta APIs by default.
B
So that's my takeaway from the top part of this chart: we are stuck here on the oldest versions. The newest versions show a little bit of shuffling. I think it's more noteworthy to me that if you add up n minus one, n minus two, and n minus three, we have a larger portion taken up by those releases versus the ones that are even older. So, yeah.
B
Yeah, but this portion of the chart is actually pretty small, so people are staying current with their newest clusters. And when you start breaking it down by how many clusters you have, it's the same as we've seen every year: if you have a lot of clusters, you have all the versions. You have things that are really old, and then you have things that are really new. You just get enough clusters; that's what happens.
B
So this chart surprised me. It surprised me because people had answered that they thought Kube was more stable than it was a year ago, but we actually had a slight increase in the percentage of clusters that had rolled back minor releases in production. Not huge; overall it's very similar to the previous year, but...
B
But the reasons it failed were, I think, slightly better. We did not have as many cluster failures. Now, users get to self-select. I realize that the shadows here did not see the survey questions, but users get to select: if you rolled back, there's a time frame, I think in the past year; did you roll back for these reasons, check the ones that apply. And they get to select: was it a bad feature? Was it a bad component? Did my entire cluster fail?
B
Was I just testing, or was it a scalability problem? In previous years, including last year, one of the reasons that people checked was "my cluster failed," and they get to decide, right: was it a feature? Was it a component? Was it a cluster? They probably don't know exactly what each of those is. But this release we didn't see people choosing "cluster failed." It was odd enough that I actually feel a need to go back and look at the data to see if I have parsed it incorrectly.
B
But since I finished building the first cut of all these graphs yesterday, I haven't gone back to double-check whether we changed some kind of phrasing somewhere and broke it; that could at least explain the result. So could the next page. The next page is about: did you roll back a patch version? And this year so few people in absolute numbers rolled back patch versions that I don't even know if this page is useful, right? And for comparison.
B
It could be. I can imagine someone going, "Oh, what just happened? What just happened?" as they upgrade and the CRD application starts failing.
B
It is, and I've done my two years, so...
B
Oh, I see. Yeah, they're going to be excited; they're going to be like, "I've done all this, and the next thing I want to do is review these survey results." Actually, it is very neat to get to see an evolution over time. We finally got there, right? Three years. It's pretty good.
B
This chart, what I took away from it: we were just about the same as before. Beta and GA policies, they've been persistently "all on" or "matches the defaults."
B
I don't see a whole lot of that changing. I don't know if I could look at this and say it would be a good idea for us to start disabling beta features. I don't know that I...
B
Correct, and the reason they were distinct is because the beta features do not have a forced client migration in them, right? So...
B
Alpha enablement: this one surprised me for how big a difference it was, in terms of the number of people, sorry, the percent of people who chose to turn on alpha features in production. It was very small compared to previous years.
B
And I don't know whether you want to really dig into each of the individual categories, but just overall, this looks to me like we saw a significant reduction across all categories, and I would just go ahead and take it as...
B
I guess a glass-half-empty kind of guy could look at that and say the alpha features are so bad that no one has the guts to turn them on in production anymore. But I am definitely a glass-half-full guy, so I'm going to go with: we have finally cleared enough features out of alpha that people are able to run their clusters without needing "plus this one more thing that's in alpha that I haven't gotten finished yet."
B
So this is a breakdown of the people who answered a question that says: have you used metrics to debug a problem in the past month, quarter, year, or never? And then people pick one from a couple of different categories. We asked them about events, we asked them about metrics, and we asked them about logs. And consistently, as you would expect, the fewer clusters you have, the more likely you are to use something like... or I guess this is nodes.
B
The clusters are the next page. The more likely you are to use something like metrics if you're big, logs if you're small, and events if you're somewhere in between. But at the lower end of the scale, we did seem to see higher percentages of people actually using metrics, which I took to be a good thing. You can stand...
B
Yeah, I think so; we've been focusing on it in the PRR reviews, yeah. I think it stands out more on the next page, when you look at it by cluster, if I recall... nope, that was...
B
So it's similar to what we've had in the past. No huge difference by cluster; by node I think I did see a little improvement. But it does show us that metrics are, by far, for this category... they are very valuable, and we should continue our focus on them. And that is all I've got. Any questions? I can click through to any page here.
B
The survey results, you know, as before... there's the last question with email addresses, so I don't want to project that; I should make a data set that removes those. Anyway, there are a couple of freeform text fields. There are some fields about what other debugging tools you use, and I've been meaning to slice the "what other tools do you use to debug your cluster" one, but I haven't gotten to it yet.
B
Yeah, and there still is. It lists cloud-provider log tools, Splunk, Prometheus, Elasticsearch, Datadog, maybe.
A
All right, so yeah, I think that we may want to present a summary of this to the SIG Arch meeting. I don't know if you can do it tomorrow; I don't know what's on the agenda tomorrow, probably nothing. So if you are up for that, we can present. I don't know if we want to do the whole thing, but if there are highlights you want to present, present those and then share the two reports.
B
I can go ahead and pick out... if I were picking out highlights for a, you know, five-to-ten-minute description with five minutes of questions, I would probably choose this one, "we are more reliable," and this one, "we are stuck on 1.21." I think those are the two biggest things.
B
You've got a lot of people here, and they've sat through me talking for like 20 minutes. So, John, I believe you owe them an explanation of what is going to happen in the next two weeks.
A
Yes, okay, yeah. So there's a new process this cycle, so it's new to us as well. Let me share that project board. One second, let me find it.
A
While John's doing that, one thing I was thinking about, David, when you were talking about this, is we're going to have beta APIs turned off by default in the future.
B
So one of the reasons that we, well, the primary reason that we did it was because of difficulty during upgrade, and freeform comments about, like, "you pulled my APIs." So I think the way we would see that manifest is that we don't see as many version cliffs. In the presentation I just gave, it said people are stuck on 1.21, and it's noteworthy: it was a big difference between where we were when we did this, on 1.21, and the previous level was 1.18. I think us not seeing that in the future, in the survey for 2023, will be a good check.
A
All right, thanks. Yeah, so, I was actually talking again when I shared, so yeah, it's me. So, yes: you've probably all seen the new sheet that's being used for enhancements tracking, or rather the new project board, which replaces the sheet that we used to use. And we have a PRR tab on there. So in the PRR tab there's an assignee and there's a shadow, and everybody here except for Han will be in that list. So I will add Han to the list too.
A
What we'll be doing is going through and assigning ourselves, and you can just look over the KEPs and decide. We can decide how to do this: I was thinking people would just self-select, but I would be happy to get together with people who are, like, shadowing some of mine. Well, as I go and do PRR reviews, we can maybe do a live meeting and just go through it. It's pretty...
A
I guess I talked with some of you, but not with everybody. So in my mind, the biggest, the most important thing about the PRR review is actually simply getting the developers to shift their mindset and think through the problems that might happen from their feature.
A
A lot of the questions we ask are really just geared around forcing people to think through things that they might not otherwise think about. Developers, with all good intentions, tend to be most interested in the happy path, most interested in getting their functionality working, and they may be thinking in isolation about how their feature works. So we try to ask about, okay, at alpha...
A
The main thing we ask about at alpha is: can we turn this thing off without breaking the cluster, and when we turn it back on, will it still work and not break the cluster? That's almost all we ask at alpha; very, very minimal. And then the requirements sort of ramp up. So, I didn't prepare to go through the questionnaire; maybe I should have, and we could probably still do that. But, you know, I guess, as you shadow, we have two ways we can do this, and I'd love to hear people's thoughts. One is, we can sit together and do a review. Another is, you can look at reviews we've done in the past, sort of do a review yourself, and then we would review the review.
A
What would be most useful to people? Would it be sitting down with whoever the approver is, which is going to be mostly me and David this cycle, because I don't think Wojciech's going to participate? Would it be useful to sit side by side with us as we go through and do it, and we talk about it?
A
For me, selfishly, yeah, sitting side by side would probably help. If people are kind of time-constrained and need to do things async, I would also be happy to be on ones async, but obviously for me, being able to actually sit in...
B
We'll split it up, Eastern time and West Coast time; that'd be perfect. I mean, it might still be perfect for me.
B
I'm in Mountain, but I start pretty early as well, because we have some folks in Eastern time, so we can do both. Well, so if I get Joe and Han and Andrew, and they are interested in doing that, I'm willing to do that, because I'm going to spend the same amount of time sort of regardless.
B
I will forewarn you that I spend a fair amount of my time, when I'm reviewing these, actually going through and reading the KEP, because a lot of times it's something new for me. So if you're expecting an amazing presentation like the one I just gave, that you were all enthralled by, this will be even less interesting, for at least part of it. Because a lot of it is, especially when you go to beta, the questions become things like: if this starts failing in my cluster, how will I know, as a cluster admin, that this is failing? How would I know, as a user, that this is failing, so that I could complain to my cluster admin? And then what, as a cluster admin, should I try to do to either fix it or gather the data for the person who's going to fix it? And that takes a fair amount of time reading the KEP to understand what it's doing.
B
Yes, but I am happy to do that, and I am going to be looking to start doing this next week. I'm probably not going to start tomorrow or the day after; it'll be next week.
A
Yeah, that's a great point, David. So a lot of this is actually, yeah, familiarizing yourself with the KEP, which, sitting and doing it live with somebody, isn't probably all that interesting, but we could try it and see how it goes. When I read them, like, when I read the KEP through, sometimes it can become almost an overall design review, because it's hard not to. But there are also time constraints. So last cycle, I think, what was it, David, 80 KEPs? And so somebody's lines went up, and so, you know, at some point, yeah, there's a judgment call as to how much time and how much depth to go into on the design pieces. But often the design pieces dictate things, especially around scalability or things like that. So I guess maybe the first step... I also will probably start next week. And to give a time frame:
A
Please
try
to
get
your
your
things
ready
for
prr
to
give
us
time
to
review
them
before
the
cutoff,
but
we
make
the
best
effort
to
get
everybody's
pr
done
and
thus
far
have,
I
believe,
gotten
everybody's
done
and
not
dropped
anybody
because
of
the
soft
deadline,
but
it's
just
kind
of
to
create
a
little
extra
urgency
for
people
to
try
and
make
our
lives
easier,
not
too
crammed
in
the
last
week.
But
what
tends
to
happen
is
that
last
week
gets
pretty
hectic
and
it's
a
it's
a
pole
and
weight
cycle.
A
You go through all your assigned KEPs and you make your comments, and then you wait, which I'm sure many of you are familiar with; in general, in PR reviews that's pretty common. So I tend to keep, I'm hoping to keep, everything in the enhancements tab of the project board, so we had them add this Notes column, you'll see. In the past, what I've done is I've had just a spreadsheet where I have all of them listed, and I have a Notes column that says when I last polled that one and what the current status is, effectively what I'm waiting on, typically. And so that way it makes it easy.
B
That sounds good. I will go ahead and help split them up before the end of this week.
A
I'm not sure how much Wojciech is going to be involved, but he volunteered himself to be my PRR reviewer for my KEPs, so he is doing some stuff. I think it's just out of the goodness of his heart, because I think he's on, he's on...
A
No, he volunteered for several of them, and I think, that was not... okay, that was not mine. Yeah, he said something on Slack that he was going to go through them. I mean, you know, that'd be awesome. I don't want him to do stuff when he's supposed to be not doing stuff, but, you know, there's definitely enough work. But I'm very encouraged having all of you here to help out.
A
Obviously this cycle, I expect it to, you know... there'll be help, but you'll also be learning. But hopefully in future cycles this will ease the burden a lot.
A
Do you want to go through the questionnaire now and see what we currently do? We can just talk about each of the questions, since we have time, and how we interpret them.
A
Okay, this is the enhancements repo; everything's done in here, in the keps directory, and then there's the KEP template. And in the KEP template we have our questionnaire. I'm going to actually go to the raw view, because...
A
Here we go. This is way too small, probably, right?
A
Okay, all right. So when people are... you're obviously reviewing the KEP in Markdown, so you'll see all the comments. Each section talks about when it needs to be completed, and then the comments try to give some context around what the point of the question is. Mechanically, we handle the approvals by having a certain directory that has the prod-readiness approvers in an OWNERS file. It's a little awkward, but it works well enough.
A
Okay. So partly this serves not just, you know, like I said in the beginning, we want to make people think, but we also want to document. So partly this puts into the KEP documentation: okay, here's how you enable or disable it, and there's a feature gate to do that. We keep this in the kep.yaml. The theory at the time was that we could do some tooling around it. None of that's happened, but we're not changing the process.
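To make that concrete, here is a minimal sketch of the feature-gate pattern being discussed, using the upstream k8s.io/component-base/featuregate package. The gate name MyFeature is hypothetical, and a real component would wire the registry to its --feature-gates flag rather than calling Set directly:

```go
package main

import (
	"fmt"

	"k8s.io/component-base/featuregate"
)

// MyFeature is a hypothetical gate name; real gates live in the component's
// features package and are recorded in the KEP's kep.yaml.
const MyFeature featuregate.Feature = "MyFeature"

func main() {
	gate := featuregate.NewFeatureGate()

	// Alpha features default to off; graduating to beta usually means
	// flipping Default to true.
	if err := gate.Add(map[featuregate.Feature]featuregate.FeatureSpec{
		MyFeature: {Default: false, PreRelease: featuregate.Alpha},
	}); err != nil {
		panic(err)
	}

	// Operators toggle gates with --feature-gates=MyFeature=true; Set
	// parses the same key=value form here for illustration.
	if err := gate.Set("MyFeature=true"); err != nil {
		panic(err)
	}

	if gate.Enabled(MyFeature) {
		fmt.Println("MyFeature code path is active")
	}
}
```

The enable/disable questions in the questionnaire are exactly about what happens on either side of that Enabled check.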
A
Sometimes we get KEP reviews for things that are out of tree, so they won't have a feature gate; it'll just be like, "don't deploy it." That would be the other type of thing.
A
We don't want to change any default behavior if we can avoid it. Okay: can it be disabled once it's been enabled? This is how you do the actual rollback. Disable support should really always be true; we should almost never... right, but sometimes there are consequences that fall out of the disablement, and we want to make sure that we document them here. And this next question is like: okay, you've disabled it and then you re-enable it, and sometimes people don't think. Like, I remember looking at the IPv6 PRR review: there were these situations where, if you were enabling and disabling, you could essentially trigger a massive flood of resources being created, of endpoints being created from your services, because effectively they didn't create them for the IPv6 endpoint.
A
And then our e2e framework is not great about enabling and disabling feature gates, so the best we get is unit tests, for the most part, for the enablement and disablement pieces. With API fields, we do want to see tests around...
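A rough sketch of what such a unit test can look like, reusing the hypothetical MyFeature gate from the earlier sketch; the true, false, true sequence mirrors the enable, disable, re-enable question in the questionnaire:

```go
package myfeature_test

import (
	"testing"

	"k8s.io/component-base/featuregate"
)

const MyFeature featuregate.Feature = "MyFeature" // hypothetical gate

// newGate builds a fresh gate so each step controls enablement explicitly.
func newGate(t *testing.T, enabled bool) featuregate.FeatureGate {
	gate := featuregate.NewFeatureGate()
	if err := gate.Add(map[featuregate.Feature]featuregate.FeatureSpec{
		MyFeature: {Default: false, PreRelease: featuregate.Alpha},
	}); err != nil {
		t.Fatal(err)
	}
	if err := gate.SetFromMap(map[string]bool{string(MyFeature): enabled}); err != nil {
		t.Fatal(err)
	}
	return gate
}

func TestEnableDisableEnable(t *testing.T) {
	// enabled: write an object that uses the new field.
	// disabled: verify existing data is preserved (or dropped, per the KEP).
	// re-enabled: verify objects written earlier still round-trip.
	for _, enabled := range []bool{true, false, true} {
		gate := newGate(t, enabled)
		if gate.Enabled(MyFeature) != enabled {
			t.Fatalf("gate = %v, want %v", gate.Enabled(MyFeature), enabled)
		}
		// ...exercise the API strategy with pre-existing objects here...
	}
}
```

In-tree tests often lean on the helpers in k8s.io/component-base/featuregate/testing for this; the explicit loop above just spells out the sequence.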
A
I think we're still in the alpha section here, feature enablement; that's three hashes. Okay, so here we finally get to beta. Here we're starting to talk about: okay, how do we roll out this feature? Often these features have both an API server and, say, a kubelet component, and so, like, you know, we want to make sure people have thought through...
A
...does it work when nodes are, you know, partially rolled out, with the feature provisioned or enabled on some nodes but not others? How do the nodes which have yet to be updated respond?
A
How do we know, as we're rolling out this feature enablement, or the feature itself on upgrade... especially now that beta features are disabled by default, I guess an upgrade is not as much of an issue, except when they're going to GA. But in any case: how do you roll back? Do you know when you need to roll back? How do you know it's failing?
B
Can I comment on that real quick? Yeah. So one of the harder parts that we found is that when features that exist or run on a per-node basis on the host make it to beta, it can be difficult for developers to see how they would debug that at scale. An example that we used to get quite often was: well, you SSH into the host, and then you run...
B
...these commands, look at the proc filesystem, figure out what's broken there, go into the journals, and you can fix the problem. That doesn't work well at scale, when you are trying to decide whether you need to roll a thing back, and that's where the question about metrics comes in, and I think there are more questions as John gets to later sections. Pressing back on that is important, because even if they only update the reference implementation, maybe they only update whatever standard exporter there is to expose the data.
A
Yeah, some of that comes up in later questions, like around the monitoring, I think in the next section. It's very common to get the answer, like David said, of "SSH in"; that's at a node level. But even at a cluster level, it's very common to get the answer "run this kubectl command," and if you've got 10,000 clusters or a hundred thousand clusters... well, not that many people are around a thousand...
A
But even if you've got a thousand clusters, that's not a very easy way to determine whether the feature is enabled, for example; that's often one of them. So: is it in use? Right, "in use" and "enabled" are different things, but we'd like to see metrics for those things, because we don't want people to have to run commands against thousands of clusters.
A
So that's very common feedback we have in the PRR reviews. Do we have a generic feature-enabled metric? We, in a meeting about a month ago... someone, was it Mo? Is that who it was?
B
Yes, I would be in favor of that, yeah.
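Nothing in the meeting confirms such a metric exists yet, so purely as a hypothetical sketch: a small gauge vector, one series per gate, would answer the question with bounded cardinality:

```go
package main

import (
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// featureEnabled is a hypothetical gauge answering "is this gate on in
// this cluster?" so fleet operators can scrape it instead of running
// commands against thousands of clusters.
var featureEnabled = prometheus.NewGaugeVec(
	prometheus.GaugeOpts{
		Name: "feature_enabled", // hypothetical metric name
		Help: "1 if the named feature gate is enabled, 0 otherwise.",
	},
	[]string{"name", "stage"}, // cardinality: one series per gate
)

func main() {
	prometheus.MustRegister(featureEnabled)
	// A real component would iterate its feature-gate registry here.
	featureEnabled.WithLabelValues("MyFeature", "ALPHA").Set(1)
	http.Handle("/metrics", promhttp.Handler())
	_ = http.ListenAndServe(":9090", nil)
}
```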
A
Okay, so where was I? Yeah: upgrade and rollback, tested upgrade, downgrade, upgrade. This is similar to enable, disable, enable; those are multi-stage sequences of events that people don't always think through. And if you leave detritus, you know, after a downgrade, if you move crap around, then when you upgrade again it might cause confusion. So you just want to make sure that those things are tested. Typically that's going to be manually tested, because we don't have anything automated to do it.
A
And then we just call out, especially, you know: are you changing any interfaces in a way that's going to surprise people? All right, monitoring requirements. As I said: how can we tell if it's in use by workloads? Now, some of these questions apply to certain types of features and not other types of features, so you'll often get...
A
"This doesn't apply to my feature." And I mean, you have to make a judgment call about whether that's true or not. But this is like: just because I've enabled, you know, network policy, it doesn't mean that anybody's using network policy, so we have to have a way to tell whether somebody's using it. This is where I often get "run this kubectl command." And, David, didn't we have a discussion at one point, like, do we have metrics around counts of individual resources or anything like that?
B
So, determining whether a feature is enabled: a metric for that would be very useful, and that will let you know whether someone might be using it.
A
Like, I know we have metrics for when people, for example, use a deprecated API or whatever, right? So do we have something that just gives us demographics? And "would the cardinality be too high" is a question; I don't recall the answer.
B
For the majority of features that I've seen coming through, they end up adding things like error counts and success counts overall. So you add something for storage: they have an error count, a latency, and a success count. So the cardinality is fairly small, but if they look and see that any of those are greater than zero, someone's using it.
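A sketch of that pattern with hypothetical metric names: a result-labeled counter plus a latency histogram keeps cardinality tiny, and any nonzero counter tells an operator the feature is in use:

```go
package main

import (
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

// Hypothetical per-feature metrics: low cardinality, but enough for a
// fleet operator to see "someone is using this" and "it is failing".
var (
	opCount = prometheus.NewCounterVec(
		prometheus.CounterOpts{
			Name: "myfeature_operations_total",
			Help: "Operations handled by MyFeature, by result.",
		},
		[]string{"result"}, // just "success" or "error"
	)
	opLatency = prometheus.NewHistogram(
		prometheus.HistogramOpts{
			Name:    "myfeature_operation_duration_seconds",
			Help:    "Latency of MyFeature operations.",
			Buckets: prometheus.DefBuckets,
		},
	)
)

// instrumented wraps a feature operation with the three signals described
// above: success count, error count, and latency.
func instrumented(op func() error) error {
	start := time.Now()
	err := op()
	opLatency.Observe(time.Since(start).Seconds())
	if err != nil {
		opCount.WithLabelValues("error").Inc()
		return err
	}
	opCount.WithLabelValues("success").Inc()
	return nil
}

func main() {
	prometheus.MustRegister(opCount, opLatency)
	_ = instrumented(func() error { return nil }) // records one success sample
}
```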
A
Yeah, okay. A question here: are we looking for, like, a command that you can use with vanilla Kubernetes? Or if somebody came and said, "hey, you can use kube-state-metrics and it would give you this," is that acceptable? Because you would have to then put that extension on your cluster.
B
I have historically allowed "you have to use kube-state-metrics," "you have to use cAdvisor," or this or that exporter. The idea of, like, you put this thing on the host: I've taken the trouble of providing at least one for you; if yours doesn't do it, it's on you to fix it.
A
Okay, yeah, I think I've been okay with kube-state-metrics as well, yeah. All right: how do you know it's working for your instance? This is more user-oriented, as opposed to operator-fleet-oriented, but we added this at one point because we felt, from a documentation standpoint, it's super useful. So this is more like looking at a specific cluster: you look at the events or the status. This is one where you're not thinking at that global scale.
A
So, as a, you know, operator operating... I don't know enough about your feature to know whether it's working well or not. What metrics should I watch? You know, what's a reasonable SLO for your metrics? So please tell me. That's all this section is: what are my SLIs?
A
What are my SLOs, such that I should be able to say this feature is healthy? Sometimes there's a reason not to add a metric for a feature, but we try to document it here. Sorry, I guess I'm not reading the questions; maybe I should be. Okay, dependencies. So this is where we start to talk about, like, other things within the cluster that this might rely upon.
A
So, are there any specific services running in the cluster that are necessary in order for this feature to work? Things like, you know, metrics servers or node agents. And the idea here is to document; this is sort of more of a playbook, or a section of one. So again, more for documentation than for review.
A
But, you know, it's like: what happens if the API server goes down? If your feature is something running on the node, what happens when the API server goes down? Or if it's something running as a service within the cluster, and it depends upon some other service within the cluster?
A
How do we figure out what's happening? Scalability questions: here we're trying to get at what the impact of your change is, based upon the number of resources you create and the number of reads or writes to those resources, and to make sure that we're not causing a scalability issue.
A
Is a service going to create load balancers or whatever in the cloud provider? There may be scalability issues there. Here we go: the size or count of API objects; I guess we had it there. And then, what is the impact on other SLIs and SLOs? So, if you're adding some feature that adds a 20-minute sleep to kubelet startup, we would want to know about that.
A
We are almost out of time, so... none of these seem surprising to me; I think they're pretty straightforward. Then again we get into this runbook section: troubleshooting.
A
How do you, you know... how does it fail? We already asked about failures, but we want to make sure people specifically think about other known failure modes. And that's it. So really, like I said, the idea is: make people document and think through the failure scenarios; make people document and think through as an operator with thousands of clusters.
B
People have gotten used to answering it, so things have gotten significantly easier. One final note, John: I have gone through, and everything that was graduating, or that already had one of us assigned, either you or I, I've gone ahead and added to the assignee column (awesome), and there's a bunch of net-new stuff we're going to have to split up. So, fantastic. Okay.
A
Thank you. All right, well, thank you, everyone. Reach out on the channel if you have any questions. We'll get these feature assignments done; also, go through and, based on time zones or not, pick ones you're interested in, and we'll try and set up some live calls next week.