From YouTube: Kubernetes - AWS Provider - Meeting 20220708
Description
Recording of the AWS Provider subproject meeting held on 20220708
Agenda - https://docs.google.com/document/d/1-i0xQidlXnFEP9fXHWkBxqySkXwJnrGJP9OGyP2_P14/
A: We, I don't think... let's see, there were some new releases. Excuse me, there were some new releases recently; let's see, I think 1.20 through 1.24 all had a patch release. I think that's it. Yeah, go ahead and go to kOps.
C: Thank you. I don't think there are any AWS-related surprises. We are about to release a 1.24 GA, so that's good, and we have some patch releases, but yeah. Everything on the kOps side, I think, is fairly uneventful, as it were, which is good news.
A: Yeah, sounds good. Let's do Load Balancer Controller.
D: I don't have any significant update. I'm planning for the next patch release, 2.4.3, which would be the minor fixes that we have accumulated so far.
C: It's tangential, but I'll mention that we observed some e2e failures across a bunch of kOps tests, and we think it is actually because the aws-janitor is deleting resources over-eagerly. So if you see IAM roles in particular disappearing mid-test, hopefully that will get fixed, but yeah. It's not an AWS issue; it's just an issue that is happening with our test infrastructure on AWS.
A: Is this from the account that... I guess I know of one account where tests run, but I think there might be others, so do you know which account?
C: It certainly happens in the primary account, as it were, but I think the janitor runs against all the accounts. What happens is basically, once you're running enough tests, or running tests often enough, a resource exists on every run of the janitor.
C: It doesn't for some resources, which don't have a unique ID. Instances have a unique numeric or hex ID, maybe, I don't know, but they have a unique ID. IAM roles do not; they just have the name, and so the janitor can't tell them apart, and if we're unlucky and the janitor sees those every hour, it will treat it as a resource that's been there for a long time, and it will...
C: It will delete it, assuming that it's no longer... assuming that it's past the threshold. So it's unlucky, but it is very surprising when it happens. How?
C: Well, the janitor runs every hour, and it seems to clean up something every hour. If you look at the kOps test grid, it seems to vary based on luck of scheduling, as it were: I think it's now hitting the 1.23 tests pretty hard, but it was previously hitting the 1.22 tests pretty hard. It would happen to about half the tests. But the janitor is running every hour, I think currently at 45 minutes past the hour.
C: Yeah, sorry, oh sorry, it's not... it's called the aws-janitor, but it isn't that AWS wrote it. It used to be part of kubernetes/test-infra, and it has now moved to kubernetes-sigs/boskos.
C: Sorry, it runs every hour and looks at all the resources, and it's supposed to only delete them after a TTL, which I think is four hours or something of that nature. I don't actually remember the exact value, but it certainly isn't every hour. It's only that it got confused by us happening to recreate the same resources every hour, effectively.
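A minimal Go sketch of the failure mode being described, assuming a janitor that keys resources by name and tracks when it first saw each one (the resource name and TTL here are illustrative; the real aws-janitor in kubernetes-sigs/boskos persists state and handles many resource types, so details differ):

```go
package main

import (
	"fmt"
	"time"
)

// firstSeen records when a resource name was first observed by the sweeper.
// Hypothetical sketch only: names without unique IDs (like IAM roles) all
// collapse onto the same map key.
var firstSeen = map[string]time.Time{}

// sweep deletes any resource whose name has been observed for longer than ttl.
// If a test recreates a role with the same name before every run, the entry
// never leaves the map, so the brand-new role eventually looks older than the
// TTL and gets deleted mid-test.
func sweep(existing []string, ttl time.Duration, now time.Time) (deleted []string) {
	live := map[string]bool{}
	for _, name := range existing {
		live[name] = true
		if _, ok := firstSeen[name]; !ok {
			firstSeen[name] = now
		}
		if now.Sub(firstSeen[name]) > ttl {
			deleted = append(deleted, name)
			delete(firstSeen, name)
		}
	}
	// Forget resources that no longer exist.
	for name := range firstSeen {
		if !live[name] {
			delete(firstSeen, name)
		}
	}
	return deleted
}

func main() {
	start := time.Now()
	ttl := 4 * time.Hour
	// Simulate a role (hypothetical name) recreated before every hourly sweep.
	for hour := 0; hour <= 5; hour++ {
		now := start.Add(time.Duration(hour) * time.Hour)
		gone := sweep([]string{"e2e-node-role"}, ttl, now)
		fmt.Printf("hour %d: deleted %v\n", hour, gone)
	}
}
```

In this sketch the role survives the first four sweeps and is deleted on the fifth, even though each sweep actually saw a freshly recreated role.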
D: There are some instances where the cleanup doesn't happen properly. I do see that, maybe not for the kOps project, but for some other projects, and then I kind of periodically look at them, but I usually wait two hours, or at least four hours, before I delete the stack. But yeah, I do that sometimes, but not that aggressively.
C: I'll put a link in chat to the project itself. Yeah, so this is... well, I'll put it in chat and then I'll also paste it into the notes.
C: This is the project, and it is supposed to be the thing that automatically cleans up, because the challenge is that even if kOps, which we all know is absolutely perfect and never has a bug... even if there were no bugs in kOps, the e2e test runs can be interrupted, and cleanups aren't guaranteed to run. So even with a perfect kOps, resources are still going to leak.
D: Correct. Does it use CloudFormation templates or anything? Because it would be easier to clean up, right? You delete the CFN stack and the resources go away. Just thinking aloud here.
C: Right, kOps does not use CloudFormation, or doesn't by default use CloudFormation templates, and yeah, I just don't think we can guarantee for everything in the ecosystem that it does that. I didn't know whether it would clean up CloudFormation templates, but I was looking, and it looks like the janitor will clean up CloudFormation stacks. So I guess someone is using CloudFormation stacks and has added, effectively, garbage collection for them.
A: Yeah, thank you for adding that, and I've totally dropped this one. I've been meaning to review, but I've been on call, so...
B: Yeah, sure, give me one second real quick; it's been like a few days since I worked on this.
B: So yeah, the issue is that when the cloud provider makes an EC2 DescribeInstances call to check whether an instance exists, we pass in the credentials, which are assume-role credentials, and if we fail to assume the role, the cloud provider will keep trying, for every node, to make a call to assume those credentials. So we want to slow down the calls to STS when AssumeRole is failing or is not able to assume the role, whether the role is broken or there is some other issue. We don't want to slow down the DescribeInstances call itself, because we want to process the nodes as fast as possible, but when the DescribeInstances call happens it needs credentials, and for the credentials it needs to make an AssumeRole call. Once that call succeeds, the credentials are cached for 15 minutes, so that's the happy path; but if we fail to assume the role, then the loop just runs over and over again and calls STS multiple times.
A: And so why was it... why was there such a difference between the... like, why did it wait until... okay, I see. Yeah, so what do you cache on the failure, then?
B: Say it again, what do we cache on AssumeRole?
B: The AssumeRole call is what's failing, okay, but if that call succeeds and we get the credentials from STS, then in our credentials chain we already cache them, so that the next time DescribeInstances needs to make a call it will use those cached credentials for 15 minutes. So we don't see it that often, but when AssumeRole fails, DescribeInstances keeps calling AssumeRole, AssumeRole, AssumeRole, and we don't want to do that. So that's why we are trying to slow down the calls to STS.
A: Okay, and so if AssumeRole fails, what is the new behavior? Like, how do we get around it?
B: So if we have seen that the last call was made within one second, it will just return the last value, or the last error, that it has seen; and if you call it again after one second, then it will call the underlying provider that we pass to the credentials today, get the result back, and return that result, whatever it gets from there. So basically we're just adding a layer that caches for one second.
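A minimal Go sketch of that one-second result cache, assuming the change wraps the credential provider's retrieve path; the names and types here are illustrative, not the actual cloud-provider-aws code:

```go
package main

import (
	"errors"
	"fmt"
	"sync"
	"time"
)

// provider mirrors the shape of an AWS credentials provider: it returns a
// credential value or an error. (Hypothetical local interface; the real
// change wraps the SDK's credentials provider, whose exact types differ.)
type provider interface {
	Retrieve() (string, error)
}

// cachingProvider wraps another provider and remembers the last result,
// success or failure, for a short window. Repeated callers inside that
// window get the cached result instead of triggering another STS AssumeRole call.
type cachingProvider struct {
	mu       sync.Mutex
	inner    provider
	window   time.Duration
	lastTime time.Time
	lastVal  string
	lastErr  error
}

func (c *cachingProvider) Retrieve() (string, error) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if !c.lastTime.IsZero() && time.Since(c.lastTime) < c.window {
		// Serve the cached value or the cached error.
		return c.lastVal, c.lastErr
	}
	c.lastVal, c.lastErr = c.inner.Retrieve()
	c.lastTime = time.Now()
	return c.lastVal, c.lastErr
}

// failingAssumeRole simulates an AssumeRole call that always fails,
// e.g. because the role is broken.
type failingAssumeRole struct{ calls int }

func (f *failingAssumeRole) Retrieve() (string, error) {
	f.calls++
	return "", errors.New("AssumeRole failed")
}

func main() {
	inner := &failingAssumeRole{}
	p := &cachingProvider{inner: inner, window: time.Second}
	// Simulate DescribeInstances being retried for many nodes in a tight loop:
	// only the first lookup inside the one-second window reaches "STS".
	for i := 0; i < 100; i++ {
		_, _ = p.Retrieve()
	}
	fmt.Printf("100 credential lookups, %d calls to the underlying provider\n", inner.calls)
}
```

With a wrapper like this, a burst of per-node DescribeInstances retries inside the one-second window results in a single STS call instead of one per node, while the successful path still relies on the existing 15-minute credential cache.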
B: So, makes sense. I added a CloudWatch graph to show the difference between before this fix and after this fix.