From YouTube: 20200708 Cluster API Office Hours
A
Okay, welcome everyone. Today is, actually I don't know, July 8th, 2020, and this is the Cluster API office hours. Cluster API is a subproject of SIG Cluster Lifecycle.

Please adhere to the CNCF code of conduct, use the raised-hand feature on Zoom if you'd like to speak, and be kind to each other. If you are attending, please add your name to the agenda. In order to get edit access, you need to join the cluster-lifecycle mailing group; the instructions are at the top of this document.
A
So, if you're new to this meeting or to this group and you'd like to introduce yourself and tell us a bit more about why you're here, now is a great time to do that. I'll go ahead and mute; feel free to unmute and speak up if you'd like to introduce yourself.
B
I'll say a quick hello, as I've not joined this meeting before. I'm James Munnelly; some of you may know me from cert-manager and a few other meetings. I'm just dropping in to chat about the cert-manager upgrade.
C
I'll jump in there as well. I'm Chris Hein; I work at Apple. I'm kind of sitting in on this just to feel out what's going on and see if there's anything we can help with or potentially look into leveraging.
A
Great, welcome! Okay, so we'll move on to PSAs. May, do you want to take the first one?
D
Yeah, sure. So Cecile and I are finalizing slides for KubeCon this week.
D
If there are things you've been working on, either in Cluster API or an infrastructure provider, that you want to highlight at the virtual KubeCon, which is on August 20th, I believe, maybe August 19th, then contact us in the community Slack. I suppose I'll drop my email in there as well, if you want to reach out that way.
A
Yep, let us know. Vince, the RC release, exciting!
E
Yeah, so we're almost there. Yesterday we cut rc0, and thank you to the Azure folks for quickly testing this out. I think maintainers and users put a PR up at the end as well, and it seems like they have been successful, which is great to hear.
E
There's a lot in this release, so feel free to take a look at the roadmap; it has a rough list of the high-level features we added, and the community and the team did great work in this release. The release notes will actually be compiled at release time. I'm hoping to do an rc1 tomorrow, which will add clusterctl move support for ClusterResourceSet, and then the actual release maybe either Friday or, probably, early next week, so that we don't rush it and we can test it a little more.
E
So that's exciting; congrats everyone on this huge milestone. After this we'll probably start talking about alpha 4 in the next few meetings. Any questions on that?
A
I don't see any hands raised, so I'm going to assume that's a no. That's awesome, great! So, moving on: I don't see any demos or POCs signed up for today, so we'll jump to general questions. By the way, if you're new-ish, this general questions section is great if you have questions throughout the meeting, so feel free to add them even after we pass this section; I'll make sure to check back in at the end. So we have one from Lauren about the node draining timeouts.
G
Yeah, so there was an issue about adding a node draining timeout to the machine controller, I believe, if you want to open that up. Basically, I did some diving in, and I found that the machine controller, by default, when it drains a node, does it by evicting the pods and then waiting for them to be deleted.
G
However, I noticed, amongst the many timeouts that are there, namely the timeout, the global timeout, and one called SkipWaitForDeleteTimeoutSeconds, that in the eviction path, after evicting, we wait for the pods to be deleted essentially forever; whereas if we disable eviction, we instead just go down the path of deleting the pods.
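[Editor's note: for context, a minimal sketch of the kind of knob being discussed here. The nodeDrainTimeout field below is hypothetical at the time of this meeting; the idea is simply that the drain timeout would be surfaced on the Machine spec rather than hard-coded in the controller, with "unset" keeping the current wait-forever default.]

```yaml
apiVersion: cluster.x-k8s.io/v1alpha3
kind: Machine
metadata:
  name: example-machine          # illustrative name
spec:
  clusterName: example
  # Hypothetical: upper bound on how long the machine controller waits
  # for the node to drain before proceeding with deletion anyway.
  nodeDrainTimeout: 10m
  bootstrap:
    dataSecretName: example-bootstrap-data
  infrastructureRef:
    apiVersion: infrastructure.cluster.x-k8s.io/v1alpha3
    kind: DockerMachine
    name: example-machine
```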
G
Another thing I wanted to bring up was whether there are any tests specific to this draining logic. If so, could you point me to them? I did a quick check and didn't really run into any. So if you know of any tests around these things, they might also be useful in understanding the behavior. Again, I just want to understand this, because if I want to make changes, I don't want to screw everyone over with some weird timeout.
H
Yeah, so I think the biggest challenge we'll have here is that we recommend that users running critical workloads use PDBs to ensure those workloads keep running as intended.
H
We need to make sure that we're honoring those PDBs as part of eviction, and even include that in our scaling workflows and all of that stuff. So I think it's probably good not to wait until max int, but at the same time we need to make sure we give a reasonable amount of time for those workloads to evict before moving on.
H
You
know
some
of
that
information,
but
right
now
we're
just
kind
of
blind
right
now,
and
I
think
the
the
draining
process
is
the
only
interaction
point
we
have
right
now.
That
would
have
any
chance
of
interacting
with
the
pdbs.
I
Yeah, as of today you could build some kind of external controller, if you wanted, and apply the skip-node-draining annotation after some timeout that you've determined.
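[Editor's note: a sketch of the escape hatch being described, assuming the v1alpha3 annotation name machine.cluster.x-k8s.io/exclude-node-draining; an external controller could stamp this onto a Machine once its own timeout expires. Names are illustrative.]

```yaml
apiVersion: cluster.x-k8s.io/v1alpha3
kind: Machine
metadata:
  name: example-machine          # illustrative name
  annotations:
    # Its presence tells the machine controller to skip node draining
    # for this machine entirely.
    machine.cluster.x-k8s.io/exclude-node-draining: "true"
spec:
  clusterName: example
  bootstrap:
    dataSecretName: example-bootstrap-data
  infrastructureRef:
    apiVersion: infrastructure.cluster.x-k8s.io/v1alpha3
    kind: DockerMachine
    name: example-machine
```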
G
Yeah, I think the origin of this issue was to make these timeouts during draining configurable, so that they could adhere to the PDBs that have been configured as well. I guess I can reach out to Michael later about the separate controller to do the annotating; that's what I was hoping for.
J
I think having the ability to have an external controller with the annotation as an escape hatch is useful, but I also think it would be nice not to require that if you just want to say: okay, I've been waiting 10 minutes for the eviction to happen, it hasn't happened, so let me just force it through.
A
Yeah, I see a few hands raised. I'm sorry, I don't know your first name: D... Thorsten? Thorson?
K
Sorry, I guess I didn't have my name set; no, it's Dane. I think he actually just answered my question. It was just going to be that we have a use case for very long drain times. So as long as that's configurable to a very large number, we would be fine; but if the default changed, we'd just need to keep track of when that default is going to change, so that we don't break things when we deploy it.
K
Relatedly, I'm also troubleshooting what appears to be a bug in the drain logic. I'm not sure if it's in here or somewhere else, but I think I'm getting closer to what's happening there. It seems like our evictions aren't actually making it to even that 20-second timeout before our pods are going down.
I
Yeah, I mean, we can make it configurable, as long as the default is to wait forever.
A
Okay, Lauren?
G
Yeah, I just wanted to clarify: I wasn't intending to change the defaults. I just wanted to expose it and make it configurable, and that's what led me down this path of figuring out which timeout to make configurable, and that's when I noticed this.
A
Sounds good. Yeah, it sounds like making it configurable and keeping the default is a reasonable change for now, and then we can pick a saner default later if we feel the current default doesn't make sense. I see some hands still raised; are those still raised, or did you forget to lower your hands?
A
Dane, okay, good. All right, let's move on for now, but yeah, thanks for writing the notes, ninja. Okay, so, discussion topics: I actually had the first one. I just wanted to ask about...
A
I know there was some discussion about the bootstrap reporting a while back. Basically, the idea is that right now it's pretty difficult for the infrastructure providers, or for Cluster API, to know what the status of bootstrapping is when we run, for example, cloud-init on a machine. If cloud-init fails because, for example, the VM size is too small and there are not enough CPUs, kubeadm will fail and we won't get that reported back, and so that's not great from an observability perspective.
D
Yeah, I paused on it for this release; we ran out of runway for 0.3.7, anyway. In any case, I think we got to the conclusion that it's quite difficult to do in an infrastructure-agnostic way, but it's definitely something to consider for v1alpha4, so I'd probably be happy to start revisiting it pretty shortly, actually, if anyone's interested.
A
Sounds good. I'm very interested, so please let me know if you have more discussions. And if we think that infrastructure-agnostic is going to be too difficult in the short term, I think that's okay too; I was holding off on doing anything infrastructure-specific because I wanted to see if we came up with a solution in CAPI first, but it is quite a blocker right now, so I would like to proceed with an Azure-specific solution if we're not going to get that done right away. Cool. Anyone else have any questions on this, or comments?
E
Yeah, I just wanted to call out, and also shout out to James for jumping on this: we are using a very old version of cert-manager, 0.11.0, and we have been talking for quite a while about upgrading the cert-manager version. We've been stuck with this mostly because we inherited the dependency from kubebuilder and controller-tools, so we're trying to move away from 0.11 to one of the latest versions.
E
If you're interested, there has been a lot of discussion on #2240, so definitely jump in; I think we have been making some progress. I'm not sure if we will be able to do any upgrades in the alpha 3 releases, but we'll definitely prioritize this for alpha 4.
A
Thanks. Do we have any questions for James or Vince about cert-manager and the upgrade?
A
All right, well, yeah, thanks James for jumping in. Cool. So I had one more question, actually. I was wondering, I don't know if Sedef is on the call or not, but if we could talk a bit more about ClusterResourceSets for everyone here, because I know that's a big new thing in the new release, and specifically how infra providers should go about adopting it, in terms of timeline.
A
I know there are a few issues open right now in the queue. Are there any that are blocking? Should we be aware of any bugs?
L
Yes, it is ready, ready to be used. Yeah.
A
Okay, great. And yeah, as Vince pointed out in the chat, would you mind just giving a brief overview, for everyone who's not familiar, of what CRS is and what it does?
L
Sure. So a ClusterResourceSet is useful to kick-start a cluster by adding initial add-ons, such as CNI and CSI, and make it operational, for example by installing add-on managers. Initially, when a cluster comes up, it is not in an operational state. Basically, you create a ClusterResourceSet and add a couple of resources, which can be Secrets or ConfigMaps, and by doing that we can install plugins into the newly created clusters by matching their labels, basically.
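[Editor's note: a minimal sketch, using the v1alpha3 addons API, of what this looks like in practice; the names and the cni label are illustrative. The ConfigMap wraps the manifests to apply, and any Cluster whose labels match the selector gets them applied.]

```yaml
apiVersion: addons.cluster.x-k8s.io/v1alpha3
kind: ClusterResourceSet
metadata:
  name: calico-crs               # illustrative name
  namespace: default
spec:
  # Applied to every Cluster in this namespace carrying this label.
  clusterSelector:
    matchLabels:
      cni: calico
  resources:
    - name: calico-addon
      kind: ConfigMap             # Secrets are also supported
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: calico-addon
  namespace: default
data:
  calico.yaml: |
    # ...full CNI manifest goes here...
```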
A
Okay, sounds good, thanks. And one last question: are we planning on removing that CNI step from the quick start as soon as we're able to, once providers adopt this and there's no more need to install the CNI manually?
L
We haven't discussed it, but it could be possible, I think.
E
So
we
might
need
to
unleash
her
because,
like
we,
it's
an
experiment
and
it's
disabled
by
default.
So
you
have
to
change
the
flag
unless
we
default
it
to
true
which,
given
it
is
the
first
first
integration
which
probably
shouldn't
but
I'll
I'll
leave
it
to
the
group
to
the
side
like
I
just
wanted
to.
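[Editor's note: a sketch of what enabling the experiment involves. In the v0.3 manifests, ClusterResourceSet sits behind a feature gate on the core controller manager (clusterctl also exposes it through the EXP_CLUSTER_RESOURCE_SET variable); the fragment below shows only the relevant container args, not a complete Deployment.]

```yaml
# Fragment of the capi-controller-manager Deployment in capi-system.
spec:
  template:
    spec:
      containers:
        - name: manager
          args:
            # Experiments are off by default; this flag opts in.
            - --feature-gates=ClusterResourceSet=true
```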
F
I wanted to give a quick answer on the end-to-end question. What we have now is a specific end-to-end test that basically exercises ClusterResourceSet. I think the question here was about whether we're already using ClusterResourceSet for spinning up all the other tests, the quick start, self-hosting and all that; that is not yet there, but we are planning it. There is an issue to do this.
E
I'll open an issue so that, let's say, 0.3.7 ships with ClusterResourceSet defaulting to false, and then, as we evaluate it, maybe in 0.3.8 or 0.3.9 or another patch release we can default it to true, because it doesn't impact the other controllers, which is nice; it's kind of separate, so it can be done, and then we can discuss on the issue.
A
Okay, all right. I think that's the end of our discussion topics, but feel free to let me know if I missed anything. I'm just going to move on to issue triage for now.
A
So those are the issues that are without a milestone right now, and I'm sure there are many other new ones that Vince or others have already triaged, but we'll look at those few for now. I think this is actually the bug that Dane was referring to earlier. Dane, do you want to talk a bit more about this?
K
Certainly. I'm in the process of reproducing the issue, which I've done multiple times in a few different ways. The simplest way, most recently, was just to create a PDB for a small deployment of three pods and delete one of the pods. I made sure it had an initial delay of, like, two minutes, to make sure it had a good long window.
K
While we were at the point where the PDB should have restricted further evictions, I deleted one of the other machines, the one that housed one of the other pods, and it just drains really quickly; it takes about 20 to 60 seconds and it's gone, and then we end up with two unavailable pods. This originated from an actual outage, where it impacted a stateful workload in one of our live environments.
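[Editor's note: a minimal sketch of the reproduction described above, with illustrative names: a three-replica workload guarded by a PDB that tolerates only one unavailable pod. Delete one pod (the readiness delay keeps its replacement unready for about two minutes), then delete a Machine hosting another pod; the drain's evictions should be blocked by the PDB.]

```yaml
apiVersion: policy/v1beta1
kind: PodDisruptionBudget
metadata:
  name: example-pdb
spec:
  # With 3 replicas and one already-deleted pod still unready,
  # any further eviction would violate this budget.
  minAvailable: 2
  selector:
    matchLabels:
      app: example
```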
K
So, the most recent thing I found was an event, for the pod that shouldn't have been evicted, from the taint-manager eviction.
K
So I'm trying to figure out what would have triggered that. There are a few other moving parts in our clusters and some other controllers, so I also need to rule out that it's not one of our other controllers doing this.
A
Thanks. So it sounds like you're still investigating. Do you think this is a regression from the current release, or is this potentially something that's always been there? Do you have any ideas around that?
K
I haven't looked at the history of the drain code path, but I suspect it's been there for a while. I guess I'm not really sure; I only recently started working on this code base, so maybe someone with more history on the drain code could speak to that.
I
Yeah, I've been working on this area extensively. Each machine is drained on a per-machine basis, and the PDBs, through the eviction API, are quite atomic, so it's unlikely to be the eviction API. One possibility is that one of these kubelets went unreachable during this time period: the first node was allowed to drain like normal, and then one of the other ones went unreachable.
I
For whatever reason, what will happen is the node lifecycle manager will taint the nodes, the pods will all get marked as deleted, and with the default options the Cluster API drain ignores pods whose deletion timestamp is more than five minutes old when the kubelet is unreachable.
I
So
that's
the
only
scenario
that
I
can
think
you
might
have
bumped
into
here,
but
I'm
definitely
interested.
We
posted
on
this.
If
you
find
anything.
K
Yeah, actually, that sounds very likely. Looking at the event, the "stopping container nginx" is coming from source component kubelet, host: the node's name. So I think you're right on with that.
A
Okay. So for now, I've seen Michael on here; it sounds like you should follow up with him. And in terms of milestones, Vince, where are we? Because we're about to release 0.3.7; do you want to start putting things in 0.3.8? I don't think this is necessarily a blocker for the release, but it should be addressed soon.
E
Yeah, I'm gonna do 0.3.x, because we don't have... okay, that got a milestone.
A
Okay, future. Add support for multiple tilt providers and tilt-provider.json.
J
I would say this is really kind of an anytime sort of thing, but I do know that there are folks who are working on trying to add additional providers to CAPA, to support things like EKS. So from their perspective it's probably important soon.
A
Got it, yeah; it sounds like that's the use case it's being asked for. And should we mark this as help wanted, or do you think this is a good candidate? Yeah? Okay, all right.
A
Kubeadm control plane resilience to machine disk space issues; sounds interesting. Ben?
M
Yeah, so I've been working on the MachineHealthCheck implementation for KCP, and I was trying to see what would happen in this scenario.
M
There's this whole thing in etcd of disk space alarms that we could look at to see if it's under disk pressure. What I saw, and I was just filling up the etcd with random keys, which is not a very realistic scenario, was that etcd didn't raise any alarms, for one, and other pods started failing; the API server pod started failing, and yeah, it just got into a bad state.
M
So it seems like something where we've probably just never investigated this failure mode before, but we should probably see what would happen and try to make sure it happens in the way that we want it to. It's definitely more of a long-term feature.
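[Editor's note: some context on the etcd alarm Ben mentions: etcd raises its NOSPACE alarm when the backend database hits its quota, not when the underlying disk fills up, so filling the disk directly can bypass the alarm entirely. A hedged sketch of bounding etcd's backend via kubeadm, with an illustrative quota value:]

```yaml
apiVersion: kubeadm.k8s.io/v1beta2
kind: ClusterConfiguration
etcd:
  local:
    extraArgs:
      # Cap the backend DB so etcd alarms (NOSPACE) before the
      # disk itself is exhausted. 2 GiB here is illustrative.
      quota-backend-bytes: "2147483648"
```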
N
Hello, thank you. Yeah, I noticed kubeadm does not request any disk space for etcd, which probably makes this a tiny bit worse, and I have an issue filed somewhere for that, which I'll post in the chat in a second.
M
Yeah, yeah. I imagine there's a category of tests that we don't have today, which kind of exercise failures, you know, kind of chaos testing or something; I was thinking that would be a really interesting area to explore.
A
Yeah, definitely, and I have run into this disk space thing before, so it's definitely important that we test it. Great. So, Brian, if you want to paste the link to that other issue in Kubernetes that you opened, that'd be great, so we can link them. And then I think we can leave this in the Next milestone for now, since this is more of a long-term thing. Any objections, or any other questions or comments?
A
Nope? Okay! Well, I still can't do milestones, so I'll let Vince do my writing for me. Okay, and then I think that's it. Are there any other recent issues that weren't discussed here that anyone would like to bring to this group's attention, or any milestone advancement requests?
A
Yeah, Brian, can you please link the related issue in the other issue? That'd be great, cool. Okay, I think that's it. Anything else anyone wanted to cover?
A
All right, if not, I think that's it for today. Thanks everyone, and congrats on the new release; really great work from a lot of people here, so I'm super excited to see this coming along. Have a good end of your Wednesday. Bye, thanks Cecile.