From YouTube: 20190826 - Cluster API Provider AWS Office Hours
A
Hello and welcome to the August 26th edition of the Cluster API Provider AWS office hours, a sub-project of both Cluster API and SIG Cluster Lifecycle. We have a relatively short agenda today, so please go ahead and add anything if you have it. I'm putting the link in the chat right now. For the first item on the agenda, I wanted to give a PSA that going forward.
A
Yeah, well, actually, this is related to these labels that we apply. Right now we assign the label with control-plane as the key and controller-manager as the value. By making it unique, it means that if, for some reason, we do deploy multiple components in the same namespace, you can actually run them alongside each other without issues with the deployments and replica sets.
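The selector-uniqueness point can be sketched with the subset rule Kubernetes label selectors use when a Deployment adopts pods; everything here (label keys, values, function names) is illustrative rather than the provider's actual code:

```go
package main

import "fmt"

// selectorMatches reports whether every key/value pair in the selector
// is present in the pod labels -- the same subset rule a Deployment's
// label selector applies when adopting ReplicaSets and Pods.
func selectorMatches(selector, podLabels map[string]string) bool {
	for k, v := range selector {
		if podLabels[k] != v {
			return false
		}
	}
	return true
}

func main() {
	// Hypothetical labels for two controllers deployed in one namespace.
	capiPods := map[string]string{"control-plane": "controller-manager"}
	capaSelector := map[string]string{"control-plane": "capa-controller-manager"}

	// Because the control-plane value differs, the selectors do not
	// overlap, so the two deployments can coexist without fighting
	// over each other's pods.
	fmt.Println(selectorMatches(capaSelector, capiPods)) // false
}
```

If both components used the same control-plane value, each deployment's selector would match the other's pods, which is exactly the collision the unique value avoids.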
B
All righty: surface all permission errors as events. This one, I imagine, would require a lot of surgery, right? Anywhere that we talk to AWS and get some error back, we'd have to parse the error and potentially generate an event or events from it. We could have a helper for it, but it means touching every single line of code where we talk to AWS, right?
B
Let me do that; make it a little bit more important. Alright: machines with deleted instances don't join the cluster. This one was yours, Liz, and, Jason, you and I had discussed it back and forth on GitHub, but I think that we basically need to mark the machine as perma-failed in the event that we see that the EC2 instance has gone away.
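A minimal sketch of the proposed behavior; the `machine` struct and its fields are stand-ins for the real Machine status fields, and the instance lookup is reduced to a map:

```go
package main

import "fmt"

// machine is a tiny stand-in for a Machine object; failureReason and
// failureMessage mirror the terminal-failure fields on Machine status.
type machine struct {
	instanceID     string
	failureReason  string
	failureMessage string
}

// reconcileInstance marks the machine as permanently failed when the
// backing EC2 instance can no longer be found -- a sketch of the fix
// discussed above, with the describe-instances call represented by a
// simple set of running instance IDs.
func reconcileInstance(m *machine, running map[string]bool) {
	if !running[m.instanceID] {
		m.failureReason = "UpdateError"
		m.failureMessage = "EC2 instance " + m.instanceID +
			" not found; it may have been terminated out of band"
	}
}

func main() {
	running := map[string]bool{"i-0abc": true}
	m := &machine{instanceID: "i-0def"}
	reconcileInstance(m, running)
	fmt.Println(m.failureReason) // UpdateError
}
```

Setting a terminal failure reason lets higher-level tooling (or a human) replace the machine instead of the controller retrying forever against an instance that will never come back.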
B
Using an unmanaged VPC with subnets in different availability zones, the ELB for the API server is not configured correctly. So: a Terraform module for an unmanaged VPC, no AZ specified, subnets created in different AZs. The ELB was only attached to one availability zone, not the one the control plane was in, and it failed. Is this something that we can identify and fix, do you think, Jason?
A
And what it is is: one was public, one was private, but we don't necessarily know which is which, and we probably just picked one to determine what AZ to use. So we probably need to enumerate the provided subnets and just make sure that the AZ for all of them is included when we create the ELB.
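The enumerate-the-subnets fix could look roughly like this; the `subnet` type and its fields are invented for illustration, not the provider's actual types:

```go
package main

import (
	"fmt"
	"sort"
)

// subnet is a minimal stand-in for a described AWS subnet.
type subnet struct {
	id, az string
	public bool
}

// elbAvailabilityZones collects the distinct AZs across all provided
// subnets, instead of deriving the AZ from whichever subnet happened
// to be picked first -- the fix suggested above.
func elbAvailabilityZones(subnets []subnet) []string {
	seen := map[string]bool{}
	var azs []string
	for _, s := range subnets {
		if !seen[s.az] {
			seen[s.az] = true
			azs = append(azs, s.az)
		}
	}
	sort.Strings(azs) // deterministic order for the API call
	return azs
}

func main() {
	subnets := []subnet{
		{id: "subnet-1", az: "us-west-2a", public: true},
		{id: "subnet-2", az: "us-west-2b", public: false},
		{id: "subnet-3", az: "us-west-2b", public: true},
	}
	fmt.Println(elbAvailabilityZones(subnets)) // [us-west-2a us-west-2b]
}
```

Attaching the ELB to every AZ that appears in the subnet list is what guarantees it can reach a control plane instance regardless of which subnet that instance landed in.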
A
I didn't have a chance to debug it much further, other than to know that, for some reason, the DNS propagation was taking an exceedingly long time, and it shouldn't be related to anything that we're doing. It seems like it's some type of a bug on the AWS side, and what's even odder is that the resolution is not working from the EC2 instances that we're creating, too. So it's not like it's cached on some type of DNS server between AWS and us.
A
The resolution is not working using AWS's DNS servers, so I didn't file an issue, just because I didn't know what we could really do about it other than say: if you hit this issue and the DNS resolution doesn't work after a reasonable amount of time, delete the cluster and recreate it, and it should work, because I just deleted it, recreated another one, and didn't run into that issue again. So, yeah.
D
Interesting. So we actually hit this from the context of pivoting on Vince's clusterctl branch. What happens there is the process kind of stops, and it's waiting at kind of a weird spot that we weren't used to it waiting at. It's usually hung on something like waiting for control-plane-0 to be ready, or something like that, but this time it was waiting to apply YAML, and in the middle of all this, we'd got security groups and we'd got the load balancer, and it just kind of chilled there for a second.
A
This one's tricky, at least right now in the current state of v1alpha1 and v1alpha2, in that, in order to clean it up, you can't just replace the load balancer. You also have to replace the control plane instance, because when you generate the new load balancer, you now need a different SAN on the certificate that you're generating on the initial control plane instance as well.
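To illustrate why the load balancer's DNS name is baked into the API server certificate, here is a stdlib sketch that issues a self-signed cert with the LB hostname as a DNS SAN (the hostname and other parameters are made up; real clusters build this chain through kubeadm):

```go
package main

import (
	"crypto/ecdsa"
	"crypto/elliptic"
	"crypto/rand"
	"crypto/x509"
	"crypto/x509/pkix"
	"fmt"
	"math/big"
	"time"
)

// newAPIServerCert issues a self-signed certificate whose DNS SANs
// include the load balancer's DNS name. Clients validate the name
// they dial against these SANs, which is why a replacement ELB (with
// a new DNS name) forces a new certificate on the control plane.
func newAPIServerCert(lbDNSName string) (*x509.Certificate, error) {
	key, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	if err != nil {
		return nil, err
	}
	tmpl := &x509.Certificate{
		SerialNumber: big.NewInt(1),
		Subject:      pkix.Name{CommonName: "kube-apiserver"},
		NotBefore:    time.Now(),
		NotAfter:     time.Now().Add(365 * 24 * time.Hour),
		DNSNames:     []string{"kubernetes", "kubernetes.default", lbDNSName},
	}
	der, err := x509.CreateCertificate(rand.Reader, tmpl, tmpl, &key.PublicKey, key)
	if err != nil {
		return nil, err
	}
	return x509.ParseCertificate(der)
}

func main() {
	cert, err := newAPIServerCert("my-cluster-apiserver-1234.us-west-2.elb.amazonaws.com")
	if err != nil {
		panic(err)
	}
	fmt.Println(cert.DNSNames)
}
```

The old certificate simply has no SAN matching the new ELB's hostname, so TLS verification fails until the control plane instance is rebuilt with a cert listing the new name.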
B
Do you want to talk about this one that I filed? So, all of our retryable AWS calls use this function to create the back-off, and I think I did the math right: if you account for jitter, then the sum of the amount of time that we would be sleeping, waiting for AWS to return a positive indication of whatever you're waiting on, is ninety billion years, which is a long time.
B
So I think that we should probably either switch to using the apimachinery wait functionality, which I believe can give you some timeouts, like hard timeouts, or we can figure out some way to do it. But it probably would be nice to have a timeout of somewhere between 50 and 60 minutes when you're waiting for AWS to do anything, and, you know, the longer-lived operations take longer, like waiting for the NAT Gateway to be available; that can take a few minutes.
A
I would probably say that we want to go closer to the shorter end of that because, right now, all of the other operations would currently be blocked on it. Longer term, we probably want to have a better strategy around reentrancy, and then we could just basically return the item and requeue it, or handle it some other way. Yeah.
B
So, in both alpha 1 and alpha 2, the cluster and machine controllers now... they're not in any released versions of CAPI or CAPA yet, but at least in the release branches, you can set the flags that I just added to the manager to increase the concurrency for the cluster and machine controllers above the previously hard-coded value of one, so if you want to reconcile 10 machines simultaneously, you can.
B
And I need to do another PR to master to add the same support for the AWSMachine and AWSCluster controllers, so that would at least help with different clusters and different machines. But certainly, if it is stuck waiting on a NAT gateway and, for whatever reason, AWS just never comes back with it, then, yeah, we need some way to deal with that. Yeah.
B
I think so. Since we're here, and there's not that many of us, and there's actually not that many issues... oops, that's the wrong milestone. So we have 22 open issues for our 0.4.0 release, and I know we don't have a full house here today, but maybe, with those of us who are here, we could at least go through these 22 and decide if we want to still try to do them for this Friday, or defer anything else.
B
That's deferrable, so I'm just gonna start at the bottom and we'll go up. So, number one: ability to customize security group rules. This was requested, I'll say, a long time ago. So, given there's no way we could QA something like this in time... you know? Yeah. So I'm gonna bump this to the patch release milestone.
B
This was Naadir's, so I don't know if he was doing that on behalf of one of our customers or just something he was interested in seeing. But yes, I totally agree: it's been a long time and it's not implemented. So, next: alright, document what you get in a cluster. I still think this is useful. It is a documentation issue, so, given that it doesn't directly impact whether or not the code works, I think we can leave it in the milestone. Alright, this one's been around for a while, I know.
A
And this goes into: any additional changes would require much larger refactoring, and that's why we basically looked at just implementing the exponential back-off to begin with. So I think it's safe to bump this to the patch or the next release. I think we should at least attempt to address it in the patch, but okay, yeah.
A
The eventual consistency of the AWS API: so that was the initial one that we were hitting. The initial tagging would try to happen, and it would fail because the resource wasn't found yet, and then we'd basically orphan the resource. We're in a better state now with the back-off, but ideally we should be using AWS client tokens wherever we can, and that would ensure that, even if we fail tagging, we should get the same result back if we make that call a second time. I see the downside.
A
The downside there is that we need to ensure that we record that client token so that we can use it again, and we can't trust the UID on the object, because once you pivot, or if you restore from backup, that UID is going to be different. So, ideally, what we would have to do is generate the client token, save that back to the API server, ensure that save is successful, and then we can go ahead and proceed. Got it.
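The generate-save-then-call ordering described here can be sketched with a map standing in for the API server; the resource names and function names are made up:

```go
package main

import (
	"crypto/rand"
	"encoding/hex"
	"errors"
	"fmt"
)

// store stands in for persisting to the API server. The token must be
// saved, and the save confirmed, *before* the AWS call is made, so a
// retry after a crash or pivot reuses the same token instead of
// minting a new one (and creating a duplicate resource).
type store map[string]string

// clientTokenFor returns the saved client token for a resource,
// generating and persisting one on first use.
func clientTokenFor(s store, resource string) (string, error) {
	if tok, ok := s[resource]; ok {
		return tok, nil // retry path: reuse the recorded token
	}
	buf := make([]byte, 16)
	if _, err := rand.Read(buf); err != nil {
		return "", err
	}
	tok := hex.EncodeToString(buf)
	s[resource] = tok // persist first; only then call AWS with the token
	if s[resource] != tok {
		return "", errors.New("failed to persist client token")
	}
	return tok, nil
}

func main() {
	s := store{}
	t1, _ := clientTokenFor(s, "natgateway/main")
	t2, _ := clientTokenFor(s, "natgateway/main") // retry: same token back
	fmt.Println(t1 == t2)                         // true
}
```

Because the token lives in the object's saved state rather than being derived from the UID, it survives pivots and backup restores, which is the concern raised above.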
A
So what we can probably do here is just go ahead and clean up any load balancers that match the tagging that's done for the integrated cloud provider stuff. Yeah, it's just a matter of adding a query for those and then deleting those resources, because if the cluster is gone, I mean, you're not using those load balancers anyway; there's no risk of actual data loss there.
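A sketch of that tag-based query; the working assumption (worth verifying against the cloud provider's actual behavior) is that Service load balancers carry a `kubernetes.io/cluster/<name>` ownership tag, and the types here are illustrative stand-ins for the ELB API:

```go
package main

import (
	"fmt"
	"strings"
)

// loadBalancer is a minimal stand-in for a described ELB and its tags.
type loadBalancer struct {
	name string
	tags map[string]string
}

// orphanedServiceLBs returns the load balancers that belong to the
// given cluster according to the kubernetes.io/cluster/<name>
// ownership tag -- the ones it would be safe to delete during cluster
// teardown, as discussed above.
func orphanedServiceLBs(lbs []loadBalancer, clusterName string) []string {
	key := "kubernetes.io/cluster/" + clusterName
	var names []string
	for _, lb := range lbs {
		for k, v := range lb.tags {
			if k == key && strings.EqualFold(v, "owned") {
				names = append(names, lb.name)
			}
		}
	}
	return names
}

func main() {
	lbs := []loadBalancer{
		{name: "a1b2", tags: map[string]string{"kubernetes.io/cluster/test-cluster": "owned"}},
		{name: "keep-me", tags: map[string]string{"team": "infra"}},
	}
	fmt.Println(orphanedServiceLBs(lbs, "test-cluster")) // [a1b2]
}
```

Filtering strictly on the cluster-scoped ownership tag is what keeps the cleanup from ever touching load balancers that belong to other clusters or to unrelated infrastructure.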
B
We can potentially just close this if we want to, given that, at some point, alpha 1 isn't going to be maintained anymore. But I'll put it in the 0.3.4 milestone right now, and maybe after I get back from San Francisco I can actually document this. Alright: add validating webhooks for alpha 2. I think this is still worth trying to do, and if it doesn't make it, it's okay; I'm good.
B
The problem... so I went through the kubebuilder v2 webhook configuration flow and coding flow, and I used cert-manager to generate certs, and it works there. At least... there was at least one gotcha that I ran into with some of the YAML that got generated, where I had to hack in a fix because of something that wasn't coded in controller-gen, and I don't know if it's been fixed yet. But yeah, if you don't use that approach, then I guess you're sort of on your own. Yeah.
B
Of them, yeah, I'd say there's probably three options. There's option one, which is: I don't want to deal with this as an end user, and so, you know, we just have a banner somewhere that says: warning, if you don't feel like dealing with certificates, then you're not going to have all of the functionality that we have coded for validation. And then the two options are the default being cert-manager, and then instructions for how to do a non-cert-manager approach.
B
Yeah, I think if there are things like a subnet, in this case, that need to get linked to other things like route tables... I'm using poor terminology, but I think you get the point... then you either need to let CAPA manage it fully, or you need to manage it fully, but it's not fifty-fifty. And that's different from saying something like, well, you know, if you could bring your own ELB, we would let you do that. But it's not like that.
C
The alternate thing to do here is a pre-admission.
C
I agreed that... I don't think that this will... I agreed that this should not block the release. I think we can move it into the point release, but I don't think there's anything that prevents us from doing this. Anyone else? There's no... there's no light fields or anything that I think we need to make this, yeah.
D
This is a tough one, only because, I guess, figuring out which is a unit test and which is an integration test is a little bit difficult in these kubebuilder repos, where there's the big setup block that basically stands up a whole cluster. And the reason why I think this is tough is because it takes almost a little bit of design to figure out what goes where. Like, for us, in our kubebuilder controllers...
D
We just put everything into that suite and just take the performance and timing hit. But it seems like in some of our repos we have some unit testing... I'm sorry, in some of the cluster API repos, the drivers specifically, there's a bunch of fake-client work that runs a bunch of tests. The issue with the fake client in a bunch of these is that it doesn't care about a lot of the deletion stuff; like, the fake client ignores finalizers. So there's a bunch of, I don't know... did I test it right? I don't know, yeah.
B
Why don't we just move it, 'cause it's not a blocker? Yeah, all right. We talked about this one already; that's a v1alpha1 one. And Jason's working on the certs; we talked about that, and we talked about that. So I think we are good, at least for having covered all of these. Anything else anybody wants to talk about?
D
Somewhat related to the Ubuntu image: I guess, how much do we care about, I don't know, security patches to those AMIs, or something like that? Like, for example, there was some portmapper stuff that's default-enabled on these Ubuntu machines totally unnecessarily, but I don't know if it's in scope or not for us to care about patching all those little things when making AMIs.