Kubernetes AWS Provider Subproject, 2 Apr 2021

From YouTube: Kubernetes - AWS Provider - Meeting 20210402

Description

Recording of the AWS Provider subproject meeting held on 20210402

Discussed cloud-provider controller race - service gets deleted and there is a node sync

A

Okay, hello, everybody. It is April 2nd. Welcome to the provider AWS meeting. Looks like there is an agenda item today. Actually, Keisha, you put that in on March 19th; I'm just gonna copy it up.

B

Oh sorry, my bad.

A

So I put that first. Okay, so Keisha, do you want to tell us about this item?

B

Yeah, so what I observed was that in the service controller there are two independent threads. One is the node sync loop and the other is the work queue, which runs whenever a service gets updated or deleted. What it turns out is that the load balancer functions get invoked in a certain sequence when a service is getting deleted and, at the same time, there is a node sync event.

B

We see that both of them end up calling some common functions, the update instance security group one, for example, and that causes a race condition, with resource leakage and other undesired effects. We looked at the code.

B

We could put a lock in those functions, but a lock isn't gonna be very helpful, because they would get invoked anyway; even if serialized, the update code could potentially still get invoked.

B

Okay, let me take that back. I mean, locking is not very feasible right now because we have delays and all sorts of things there. So what we're looking at is whether, from the service controller, we could serialize all of these operations by putting them in a single queue rather than handling them independently. So that's what I wanted to discuss further.

B

You're on mute, looks like.

A

Whoops, got it. So you're talking about the work queue for node events and service events, basically.

B

Correct. What we do, even for the node events, is look up the services that are impacted and then reconcile them. So if we could combine them into a single queue, then we get the benefit of the synchronization that the work queue offers and can potentially minimize the race condition here.
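The single-queue idea described here can be sketched roughly as follows. This is a minimal, self-contained Go illustration and not the service controller's code: keyQueue, drain, and the service keys are hypothetical names, and the queue is only a small stand-in for the deduplicating work queue that client-go provides. The assumption is that both service events and node events are turned into the key of the affected service and handled by one consumer.

```go
package main

import (
	"fmt"
	"sync"
)

// keyQueue is a tiny stand-in for a deduplicating work queue: a key that is
// already waiting is not queued a second time.
type keyQueue struct {
	mu      sync.Mutex
	pending map[string]bool
	order   []string
}

func newKeyQueue() *keyQueue {
	return &keyQueue{pending: map[string]bool{}}
}

// Add enqueues key unless it is already waiting to be processed.
func (q *keyQueue) Add(key string) {
	q.mu.Lock()
	defer q.mu.Unlock()
	if q.pending[key] {
		return
	}
	q.pending[key] = true
	q.order = append(q.order, key)
}

// Get pops the oldest key; ok is false when the queue is empty.
func (q *keyQueue) Get() (key string, ok bool) {
	q.mu.Lock()
	defer q.mu.Unlock()
	if len(q.order) == 0 {
		return "", false
	}
	key = q.order[0]
	q.order = q.order[1:]
	delete(q.pending, key)
	return key, true
}

// drain reconciles every queued key one at a time and returns the order in
// which the keys were handled.
func drain(q *keyQueue, reconcile func(string)) []string {
	var handled []string
	for {
		key, ok := q.Get()
		if !ok {
			return handled
		}
		reconcile(key)
		handled = append(handled, key)
	}
}

func main() {
	q := newKeyQueue()
	q.Add("default/web") // service delete event
	q.Add("default/web") // node event touching the same service: deduplicated
	q.Add("default/api") // node event touching another service
	handled := drain(q, func(key string) { fmt.Println("reconcile", key) })
	fmt.Println("keys handled:", len(handled))
}
```

Because every event for a service collapses into one key handled by one consumer, the delete and the node-triggered update for that service cannot run at the same moment in this sketch.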

A

Yeah, that makes sense. um Is it worth taking a look at the code right now and just.

A

And so you saw this in a cluster that was running, I guess, the legacy cloud provider in kube-controller-manager, but the issue would probably show up in either one.

B

So I saw it on 1.15, but I looked at the recent code and those conditions haven't changed, so it would still be possible for it to get triggered. Yeah, I mean, it could be specific to AWS, but the way that things are, we should do something in the service controller, which would be beneficial for everybody.

B

But this definitely happens on a big cluster, say hundreds of nodes and some 200 services, when there are constant node change events, with scaling up and down happening.

A

I'm gonna share my screen really quick.

A

Can you see? Yes? Okay, cool.

A

Just so that I make sure I understand.

B

So there's a node sync loop somewhere, if you search for it.

A

Yeah, so I.

A

I see the service informer, I see the node informer, I see trigger node sync. I was just gonna see.

A

Okay, controller, node.

A

Is this what you're talking about?

B

Oh yeah, go to node sync internal.

B

That's the one that gets invoked here. Let's go there.

B

Yes, so if you look towards the end, it calls the update load balancer hosts function.

B

It looks up all the services in the cache and then it calls this, so this will end up calling update load balancer if you follow it through further.

B

So you will call, somewhere, the locked update.

A

You will call the locked update.

B

Yes, locked update load balancer hosts, and then this will call update load balancer, so that goes into the AWS code for us. Now update load balancer will do its thing, right? It will go look up the service and the nodes, and then it will update the security group and all of that stuff in there, right? And it's not protected by anything right now, this update or any of the AWS functions.

B

Now we can follow the create as well.

B

Which create? Not the create, sorry, the delete, like load balancer delete. Load balancer delete would happen from the work queue, right? So we would follow the work queue.

A

Got it, so let's find that.

A

So here's process next work item.

B

Yeah, that's the one and then it will call sync service.

B

And if we go to sync service, there's process service deletion.

B

And then process load balancer delete.

B

And process load balancer delete will call ensure load balancer deleted.

B

So what I'm saying is that it depends on how they are invoked, but what I am seeing is that, from process load balancer delete, the ensure load balancer deleted function goes ahead and modifies the security group rules. It deletes some entries from the SG rules, and then the update one gets invoked a little bit later and adds the rules back again, because they just run concurrently. So that is the reason why they're not in sync.
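The interleaving described in this turn can be replayed deterministically in a few lines. The sketch below is only an illustration under stated assumptions: securityGroup, ensureDeleted, and updateHosts are hypothetical stand-ins, not the AWS provider's functions, and the security group is modeled as a plain set with one rule per load balancer.

```go
package main

import "fmt"

// securityGroup models the instance security group as a set of rules,
// one per load balancer name. Hypothetical stand-in for the AWS resource.
type securityGroup map[string]bool

// ensureDeleted stands in for the delete path: it removes the rule.
func ensureDeleted(sg securityGroup, lb string) {
	delete(sg, lb)
}

// updateHosts stands in for the node sync path: it re-adds a rule for every
// load balancer in the list it was given.
func updateHosts(sg securityGroup, lbs []string) {
	for _, lb := range lbs {
		sg[lb] = true
	}
}

// racyInterleaving replays the problematic order: node sync captures its list
// of load balancers, the delete runs, and then the stale update is applied.
func racyInterleaving() bool {
	sg := securityGroup{"lb-web": true}
	stale := []string{"lb-web"} // captured before the delete
	ensureDeleted(sg, "lb-web")
	updateHosts(sg, stale)
	return sg["lb-web"] // true means the rule leaked
}

// serializedOrder runs the same two steps one after the other, with the
// update listing load balancers only after the delete has finished.
func serializedOrder() bool {
	sg := securityGroup{"lb-web": true}
	ensureDeleted(sg, "lb-web")
	updateHosts(sg, nil) // nothing left to update
	return sg["lb-web"]
}

func main() {
	fmt.Println("leaked with interleaving:", racyInterleaving())
	fmt.Println("leaked when serialized:", serializedOrder())
}
```

In the first order the rule that the delete removed is present again at the end, which matches the leak described; in the second it stays gone.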

A

So your proposal is to make sure that all this work happens from one queue, I guess.

B

Got it, correct.

A

Yeah, have you created an issue upstream yet?

B

Not yet. I wanted to discuss it further before creating it. Just wanna.

B

Like, we're all in the same place.

A

Yeah, I mean, it makes sense to me, even without having actually seen the issue: if you're making API calls to AWS from two potentially simultaneous threads of work, and they are modifying the same entity in AWS, you have no guarantees over what order those things happen in. So yeah, the problem makes sense to me.

A

My recommendation is definitely to move forward with making the issue and proposing the solution in the issue. Definitely do that before, you know, starting.

A

Because maybe we're not understanding something, but yeah, it seems pretty straightforward to me.

B

Okay, so the node sync period is usually 100 seconds or so, and the problem may not even be seen if the number of nodes is small or updates are not frequent. It just has to have the right set of trigger conditions. We didn't reproduce it, but we looked at the logs and then analyzed the code, and Young and I concluded that that's what should happen. So, okay, I will.

A

Go ahead and create.

B

The issue upstream yeah.

A

Network meeting about the change: what.

B

Was that? I mean, we can, we should discuss this.

A

In SIG Network. Yeah, that's a very good idea. Once you create the issue, since they are responsible for the service resource, you should definitely bring it up there. Okay.

A

We even thought.

B

About adding a lock to the AWS code, but that's difficult, because if we look at the ensure load balancer deleted function, that can take up to 10 minutes to complete; we have that timeout, right? So it would effectively slow down everything, and it may or may not work. There might still be some corner cases we haven't thought through that would not be handled there.

B

It's just a little more complicated a solution in that case, and every cloud provider would have to do that, right? So, rather than that, we suggest this way.
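For comparison, the locking alternative that is set aside here might look like the sketch below. It is a hedged illustration only: lockedProvider, call, and maxConcurrent are hypothetical names, and the operations are empty placeholders rather than real AWS calls. It shows the one property a provider-wide mutex gives, that at most one call runs at a time, which is also why a delete that takes minutes would hold up every other call.

```go
package main

import (
	"fmt"
	"sync"
)

// lockedProvider guards every provider call with a single mutex.
// Hypothetical type, used only to illustrate the locking approach.
type lockedProvider struct {
	mu       sync.Mutex
	inFlight int // calls currently inside the lock
	maxSeen  int // highest inFlight value observed
}

// call runs op while holding the lock, so callers queue up behind each other.
func (p *lockedProvider) call(op func()) {
	p.mu.Lock()
	defer p.mu.Unlock()
	p.inFlight++
	if p.inFlight > p.maxSeen {
		p.maxSeen = p.inFlight
	}
	op() // a slow operation here blocks every other caller
	p.inFlight--
}

// maxConcurrent starts n callers at once and reports how many were ever
// inside the provider at the same time.
func maxConcurrent(n int) int {
	p := &lockedProvider{}
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			p.call(func() {})
		}()
	}
	wg.Wait()
	return p.maxSeen
}

func main() {
	fmt.Println("max concurrent provider calls:", maxConcurrent(8))
}
```

Mutual exclusion alone does not decide which call runs first, so an update working from an older view of the services could still run after the delete; that ordering question is what the single-queue proposal tries to address.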

A

Yeah, I mean, there could be other reasons for having two distinct queues and not combining node and service events. Maybe it has to do with the fact that the node sync only happens every 100 seconds.

A

So how would you control that if you combined them all? I guess you would control the node sync by still only adding those events to the shared queue at that frequency.
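One way to picture this is a periodic producer feeding the shared queue. The following is a minimal sketch under assumptions: enqueueOnTick, listServices, and add are hypothetical names, and the tick channel is filled by hand so the example is deterministic. In a real controller the ticks would presumably come from a timer running at the node sync period.

```go
package main

import "fmt"

// enqueueOnTick adds every known service key to the shared queue once per
// tick. It returns when the ticks channel is closed.
func enqueueOnTick(ticks <-chan struct{}, listServices func() []string, add func(string)) {
	for range ticks {
		for _, key := range listServices() {
			add(key)
		}
	}
}

func main() {
	// Two node sync periods elapse, then the producer is stopped.
	ticks := make(chan struct{}, 2)
	ticks <- struct{}{}
	ticks <- struct{}{}
	close(ticks)

	services := func() []string { return []string{"default/web", "default/api"} }

	var queued []string
	enqueueOnTick(ticks, services, func(key string) { queued = append(queued, key) })
	fmt.Println("keys added to the shared queue:", len(queued))
}
```

The frequency of node-triggered work is then set entirely by how often a tick arrives, while service events can still be added to the same queue as they happen.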

B

Correct, perfect.

A

Yeah I mean, without being I mean.

B

What you've said made.

A

That makes sense to me, but without being more familiar with the code in the actual service controller, I don't know how much else I can really add. So yeah, go ahead and create the issue. I'm on the same page with you. Just CC me on the issue, because I'm definitely interested in the results, and I wouldn't mind hearing the conversation at the SIG Network meeting as well.

A

Cool sounds good.

B

And how does it work for cloud provider v2? Is it still gonna be this loop, or is it gonna be slightly different? Like, for the out-of-tree cloud provider that we're gonna have eventually, how are things gonna be different?

A

I think it was going to be this loop unless we needed to change it. So if, for some reason, they don't want to make this change, then, you know.

B

We have the ability to.

A

Replace all of these in v2. We don't need to use them; they just make things a lot easier.

A

So I was seeing this with the load balancer, like using custom load balancer names, the way that is implemented in v1.

A

Oh sorry, actually with node names. The way that node names work now, it's just really difficult to use custom node names with the way that the cloud provider loops work. So I was considering the same thing in v2, replacing some of these loops, but yeah, I'm not sure yet. It really just depends on whether we need to or not.

A

So I think I'm gonna push the issue triage to the next meeting, when Justin's back. So I don't have anything else, if you guys don't.

B

Sure, let's hang up, and then I will go ahead and create the issue, and we'll take discussions further in that issue thread. Awesome.

B

Cool, thanks everybody. All right, thanks.

A

You made it back, right? Yep, enjoy it. All right, see you guys.