Antrea Community, 20 Jun 2023

From YouTube: Antrea Community Meeting 06/20/2023

Description

Antrea Community Meeting, June 20th 2023

A

Perfect. Good morning, good afternoon, good evening, and thanks for joining this instance of the Antrea community meeting. Today is a Wednesday, the 21st of June, and if you live in Europe or Asia, it's already summer for you; folks in the United States have to wait a few more hours for their summer. Anyway, jokes aside, today we have a conversation with Chan on enabling HA for the Antrea Controller. This will be HA in active-standby mode with multiple controller instances. That is the only topic that we have for today. So, Chan,

A

The floor is yours.

B

All right, yes. The topic is high availability for the Antrea Controller.

B

Let's talk about the motivation first. As you know, the Antrea Controller is now responsible for many key tasks, especially NetworkPolicy and Egress computation, which affect the networking and the security of Kubernetes Pods.

B

However, it currently runs as a Deployment with a single replica and relies on the Deployment controller to ensure its availability.

B

But some recent user reports and some tests show that the Deployment controller doesn't provide very high availability. For example, when the Node that hosts the only replica of the Antrea Controller becomes unreachable, because of a panic or some other accident, it can take a long time for the Deployment controller to create another replica to replace it. During this time window, NetworkPolicies are not applied to new Pods, Egresses are not applied to new Pods, and for existing Pods, NetworkPolicy and Egress changes are not applied.

B

This also covers the case where a NetworkPolicy has already been applied to a Pod, but new Pods are created during the controller's downtime: if a new Pod should be in the "from" or "to" of the NetworkPolicy, the existing Pods will not receive that update. I think the first two issues may be serious, because the Pods will not be secured and their egress traffic

B

will not behave as the user expects. I did some tests and investigated why it takes so long to create a new replica in this situation. By the way, in other situations, such as when the Antrea Controller Pod crashes or you delete the Pod, a new replica is created very quickly. It's just this case that Kubernetes doesn't handle very well. The time is spent as follows. First, when a Node becomes unreachable,

B

it takes the Kubernetes node lifecycle controller 40 seconds to mark this Node as NotReady, and it then taints it with NoExecute. This is determined by a parameter of kube-controller-manager, and it defaults to 40 seconds.

B

We don't have control over this configuration, and we have to support the most common configuration, so we should assume that in most clusters, when a Node becomes unreachable, it takes 40 seconds before the Node is marked as NotReady and gets this taint. After the taint is added to the Node, it takes another five minutes to evict the Pods running on this Node. This is determined by another parameter, the default unreachable toleration seconds.

B

By default, we don't have any toleration; we haven't added any NoExecute toleration to our Pods, including the Antrea Controller Pod. However, Kubernetes has a plugin, enabled by default for a long time, that adds a NoExecute toleration to every Pod that doesn't already have one, and the default toleration seconds is 5 minutes. So after the Node is marked as NotReady and has the taint, it takes another five minutes to delete the Pod, and only

B

after the unresponsive Pod is deleted does the Deployment controller start creating another Pod for this Deployment. It then takes a couple of seconds to create the new Pod.

B

So that's why, in this situation, it takes 5 minutes 40 seconds or more to fail over. This is what I observed when I did some tests in kind, but it's different in some managed Kubernetes clusters, for example in EKS. When a Node becomes NotReady there, I remember seeing that the Node is deleted soon afterwards by some service in EKS, which also triggers the Pod deletions.

B

So the failover is faster than what I observed in the kind cluster; in EKS it typically takes about one minute to fail over. I guess the first 40 seconds are also spent on detecting the Node as NotReady, and then it takes about 20 seconds for the EKS service to delete the NotReady Node.
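
Editor's sketch: the arithmetic behind the failover times quoted above can be written out as follows. The function and parameter names are hypothetical, the figures are the ones stated in the talk, and the time to recreate the Pod is left at zero because the speaker only describes it as "a couple of seconds".

```python
def failover_seconds(detect_s: int, evict_delay_s: int, recreate_s: int = 0) -> int:
    """Rough time until a replacement Pod exists after a Node becomes unreachable.

    detect_s:      time until the Node is marked NotReady (and tainted)
    evict_delay_s: further wait before the old Pod is deleted
    recreate_s:    time for the Deployment controller to create the new Pod
    """
    return detect_s + evict_delay_s + recreate_s


def fmt(seconds: int) -> str:
    """Format a duration as minutes and seconds, e.g. 340 -> '5m40s'."""
    minutes, secs = divmod(seconds, 60)
    return f"{minutes}m{secs:02d}s"


# kind cluster with defaults: 40s detection + 300s default toleration.
kind_default = failover_seconds(detect_s=40, evict_delay_s=300)

# EKS as described in the talk: 40s detection + roughly 20s until the Node is deleted.
eks_observed = failover_seconds(detect_s=40, evict_delay_s=20)

print(fmt(kind_default), fmt(eks_observed))
```

The detection term is the same in both cases, which is why removing the eviction delay alone cannot bring failover below about 40 seconds.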

B

So after the investigation, I found that the failover is delayed mainly because of this NoExecute toleration added by Kubernetes by default. And because Kubernetes only adds this toleration when there is no such toleration on the Pod, we could proactively add the toleration ourselves, but set the toleration seconds to zero, so that once a Node is tainted with NoExecute, the Antrea Controller Pod is deleted immediately and a new replica is created. So with this change we could

B

reduce the unavailable time from 5 minutes 40 seconds to 40 seconds. But even 40 seconds may be too long for some users who expect Antrea to fail over faster, for example in just 10 to 20 seconds.
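
Editor's sketch: the interaction between the default toleration and an explicit zero-second toleration can be modelled like this. It is a simplified imitation of the defaulting behavior described above, not the real admission plugin logic; the function names are hypothetical, and the two taint keys are the standard `node.kubernetes.io/not-ready` and `node.kubernetes.io/unreachable` keys.

```python
import copy

NO_EXECUTE_TAINT_KEYS = (
    "node.kubernetes.io/not-ready",
    "node.kubernetes.io/unreachable",
)


def zero_second_tolerations():
    """Tolerations a Pod spec could declare so it is evicted as soon as the taint appears."""
    return [
        {"key": key, "operator": "Exists", "effect": "NoExecute", "tolerationSeconds": 0}
        for key in NO_EXECUTE_TAINT_KEYS
    ]


def apply_default_tolerations(pod_spec, default_seconds=300):
    """Simplified defaulting: add a toleration only for keys the Pod does not tolerate yet."""
    spec = copy.deepcopy(pod_spec)  # leave the caller's spec untouched
    tolerations = spec.setdefault("tolerations", [])
    already = {t.get("key") for t in tolerations if t.get("effect") == "NoExecute"}
    for key in NO_EXECUTE_TAINT_KEYS:
        if key not in already:
            tolerations.append({
                "key": key,
                "operator": "Exists",
                "effect": "NoExecute",
                "tolerationSeconds": default_seconds,
            })
    return spec


# A Pod with no tolerations gets the 300-second defaults.
defaulted = apply_default_tolerations({})

# A Pod that declares zero-second tolerations keeps them.
explicit = apply_default_tolerations({"tolerations": zero_second_tolerations()})
```

Because the defaults are only added when missing, declaring the tolerations explicitly is enough to override the five-minute wait.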

B

So I am wondering whether we could provide higher availability by running multiple replicas. There are two options. One is active-active. This is how kube-apiserver runs in high-availability mode, but it is not very realistic for the Antrea Controller because of several factors. The first is that we need to publish the certificate, which in most cases is generated by the controller itself.

B

So if we run in active-active mode, we need to find a way for all replicas to share the certificate key and the certificate data. This may be doable, but the second factor is more challenging.

B

This is because, like kube-controller-manager and kube-scheduler, we have some modules that can only run as a single instance without a big change, for example the NetworkPolicy status and the stats aggregation. We need a single place to aggregate this data, so it's not really feasible to make these features work in active-active mode. And there are some other functions, like the ExternalIPPool and IPPool IP allocation.

B

It is very complicated to get this right when multiple instances could do the same thing; this also includes the Traceflow tag allocation. And the reality is that so far we haven't seen any problem, other than high availability, with running a single replica. The performance of a single instance is good enough to support a large-scale cluster according to our tests. So, except for HA, I think a single active instance is enough from an architecture perspective.

B

So the other choice we have is running multiple replicas in active-standby mode, just like kube-controller-manager and kube-scheduler. Then there are two problems we need to solve. The first one is how we do leader election. The second is how we route the Service traffic to the active instance.

B

I will talk about the solutions on this page. For leader election, like kube-controller-manager and kube-scheduler, we could leverage the Kubernetes mechanism, the Lease API, which is designed to be used by this library to support leader election. It is very easy to integrate this library to get leader election in the Antrea Controller, and the parameters can be tuned so that failover happens in 15 seconds or less. With the default recommended parameters,

B

it normally takes around 15 to 20 seconds to elect a new leader. But the challenging part is the Service traffic routing, because we have only a single active instance and we want to make sure that the requests from clients of the Antrea Controller, including kube-apiserver itself, the Antrea Agents, and external antctl clients,

B

only reach the active instance. If we do nothing and just scale up the replica count, all of the instances could receive requests, and the standby instances do not know how to handle them.
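
Editor's sketch: lease-based leader election, as described above, comes down to one rule: a candidate may take the lease only if it already holds it or the current holder has stopped renewing for longer than the lease duration. The in-memory `Lease` class and `try_acquire_or_renew` function below are hypothetical stand-ins for illustration; they are not the Kubernetes Lease API or the library the speaker refers to, and they ignore clock skew and concurrent updates, which a real implementation has to handle.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Lease:
    """In-memory stand-in for a lease record (assumed shape, for illustration only)."""
    holder: Optional[str] = None
    renew_time: float = 0.0
    duration_s: float = 15.0  # mirrors the 15-second figure mentioned in the talk


def try_acquire_or_renew(lease: Lease, candidate: str, now: float) -> bool:
    """Return True if `candidate` holds the lease after this attempt."""
    expired = lease.holder is None or now - lease.renew_time >= lease.duration_s
    if lease.holder == candidate or expired:
        lease.holder = candidate
        lease.renew_time = now
        return True
    return False


lease = Lease()
history = [
    try_acquire_or_renew(lease, "controller-a", now=0.0),   # a becomes leader
    try_acquire_or_renew(lease, "controller-b", now=2.0),   # b stays standby
    try_acquire_or_renew(lease, "controller-a", now=10.0),  # a renews, then its Node dies
    try_acquire_or_renew(lease, "controller-b", now=24.0),  # only 14s since last renewal
    try_acquire_or_renew(lease, "controller-b", now=26.0),  # lease expired, b takes over
]
```

In this run the standby takes over 16 seconds after the last renewal, which is consistent with the 15 to 20 second range quoted by the speaker.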

B

So I investigated three ways to make this work. The first one I found on the internet; however, I think it may not really work. I will explain why.

B

The essence of this approach is to add a check to the Antrea Controller's readiness probe so that only the active instance returns true for ready, and the standby instances return false. Then, in the Kubernetes API, only the active instance is ready and the other instances are not ready, so the Service automatically selects the only ready instance as its endpoint, and traffic is directed to the active instance.
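
Editor's sketch: the readiness-probe approach can be modelled as below. All names are hypothetical, and the model is deliberately crude: a Pod's recorded Ready state only changes when its kubelet can report, so the sketch also shows the recorded state going stale when a Node stops reporting, which is the weakness the speaker goes on to describe.

```python
from dataclasses import dataclass


@dataclass
class Pod:
    ip: str
    is_leader: bool
    ready: bool = False  # last Ready condition recorded in the API


def probe(pod: Pod) -> int:
    """Status the readiness endpoint would return: 200 for the leader, 503 otherwise."""
    return 200 if pod.is_leader else 503


def kubelet_sync(pod: Pod, node_reachable: bool = True) -> None:
    """Record the probe result, but only if the Pod's Node can still report."""
    if node_reachable:
        pod.ready = probe(pod) == 200


def service_endpoints(pods):
    """A selector-based Service only uses Pods recorded as ready."""
    return [p.ip for p in pods if p.ready]


a = Pod("10.0.0.1", is_leader=True)
b = Pod("10.0.0.2", is_leader=False)
kubelet_sync(a)
kubelet_sync(b)
before = service_endpoints([a, b])

# a's Node becomes unreachable and b wins the election.
a.is_leader, b.is_leader = False, True
kubelet_sync(a, node_reachable=False)  # nothing can update a's recorded state
kubelet_sync(b)
after = service_endpoints([a, b])
```

In this model the old leader stays in the endpoint list until something other than its own kubelet marks it not ready.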

B

The benefit of this mode is that there's no extra operational cost; everything is the same regardless of the number of replicas. It works with one replica and it works with three replicas.

B

Nothing needs to be changed when we switch modes. But the drawbacks are, first, that the Deployment status will look abnormal, because its ready replicas will always be fewer than its desired replicas, which may confuse some monitoring tools, and

B

when we run kubectl rollout status on this Deployment, the command will never finish because it will always appear as progressing. But the biggest problem is that it relies on the Ready condition of the Pod. If a Pod crashes or is killed, the Ready condition changes to not ready immediately, but that's not the case when the Node becomes unreachable. This has the same issue as the Deployment controller, because

B

when a Node becomes unreachable, the Pod is only marked as not ready after this period. So if we adopt this approach, it's almost equivalent to running a single replica. Maybe it saves the couple of seconds needed to create a new Pod, and I don't think it is worth adding this logic to get that little benefit.

B

So I tried another solution, which is to change the Antrea Service to be a Service without a selector.

B

This means that Kubernetes will not be responsible for computing the endpoints of this Service; we will do it ourselves, and the active instance will proactively update itself as the only endpoint of the Service.

B

This should reduce the failover delay to about 15 seconds, or less if we want, because as soon as a new leader is elected, it can immediately set itself as the endpoint and traffic will immediately be directed to that Pod. So it is the fastest solution.
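
Editor's sketch: with a selector-less Service, the newly elected leader would write an Endpoints object that lists only itself. The helper below just builds that object as a plain dictionary in the shape of the core `v1` Endpoints resource. The helper name is hypothetical, and the Service name, namespace, IP addresses, and port in the example are placeholder assumptions; the talk does not state them.

```python
def build_leader_endpoints(service_name: str, namespace: str, leader_ip: str, port: int) -> dict:
    """Build an Endpoints object whose only address is the current leader."""
    return {
        "apiVersion": "v1",
        "kind": "Endpoints",
        # The Endpoints object must share the Service's name and namespace.
        "metadata": {"name": service_name, "namespace": namespace},
        "subsets": [
            {
                "addresses": [{"ip": leader_ip}],
                "ports": [{"port": port, "protocol": "TCP"}],
            }
        ],
    }


# Placeholder values for illustration only.
first = build_leader_endpoints("antrea", "kube-system", "10.0.0.1", 10349)

# After a failover, the new leader replaces the whole object rather than appending to it.
second = build_leader_endpoints("antrea", "kube-system", "10.0.0.2", 10349)
```

Replacing the object, instead of appending an address, is what keeps the standby instances out of the traffic path.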

B

However, my only concern is the operational cost, because by default, or before we make this feature officially supported, we will have to change the Service definition when running in different modes. For example, if a cluster has already been created and has been running in single-instance mode, it already uses the Service definition with the selector. If the user wants to change to the other mode, they will have to change the Service definition to remove the selector, and I tested that when changing the Service this way,

B

the existing EndpointSlices created previously are not deleted automatically. But they must be deleted, because when the active Antrea Controller instance creates and updates itself as the Endpoints of the Service, that is mirrored to EndpointSlices, but the existing EndpointSlices are not removed. So both sets of EndpointSlices will be used by kube-proxy or AntreaProxy to direct traffic, which will cause a problem.
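
Editor's sketch: one way to find the leftover EndpointSlices described above is to filter by label. This assumes the upstream labelling conventions, namely that slices carry `kubernetes.io/service-name` and an `endpointslice.kubernetes.io/managed-by` label whose value distinguishes the regular EndpointSlice controller from the mirroring controller; those values should be checked against the cluster before relying on them. The function name and the sample objects are hypothetical.

```python
SERVICE_NAME_LABEL = "kubernetes.io/service-name"
MANAGED_BY_LABEL = "endpointslice.kubernetes.io/managed-by"
# Assumed label values; verify on a real cluster.
SLICE_CONTROLLER = "endpointslice-controller.k8s.io"
MIRRORING_CONTROLLER = "endpointslicemirroring-controller.k8s.io"


def stale_slice_names(slices, service_name):
    """Names of slices for the Service that were generated from the old selector."""
    names = []
    for s in slices:
        labels = s["metadata"].get("labels", {})
        if (labels.get(SERVICE_NAME_LABEL) == service_name
                and labels.get(MANAGED_BY_LABEL) == SLICE_CONTROLLER):
            names.append(s["metadata"]["name"])
    return names


# Hypothetical objects as they might look after the selector is removed.
slices = [
    {"metadata": {"name": "antrea-abcde", "labels": {
        SERVICE_NAME_LABEL: "antrea", MANAGED_BY_LABEL: SLICE_CONTROLLER}}},
    {"metadata": {"name": "antrea-fghij", "labels": {
        SERVICE_NAME_LABEL: "antrea", MANAGED_BY_LABEL: MIRRORING_CONTROLLER}}},
    {"metadata": {"name": "other-klmno", "labels": {
        SERVICE_NAME_LABEL: "other", MANAGED_BY_LABEL: SLICE_CONTROLLER}}},
]
to_delete = stale_slice_names(slices, "antrea")
```

Only the slice left behind by the selector-based definition is selected; the mirrored slice that tracks the leader's Endpoints is kept.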

B

So we need to be careful when asking users to upgrade or change the HA mode in an existing cluster. There is also another approach that I saw on the internet; some people use it.

B

They add an extra label to the active instance, for example, they add this label to the active one and make the selector select only the active Pod by matching this label. But I think it is essentially not very different from the second approach, except that it is a bit more complicated, because the active instance is also responsible for removing this extra label from the previously active instance, and it is hard to guarantee that as a transaction in Kubernetes when you are operating on several objects.

B

So I think this is no better than the second one. So my plan is, first, we should consider setting a NoExecute toleration with zero toleration seconds on the Antrea Controller. This may be a big improvement for most clusters that are self-managed Kubernetes clusters, and it may be helpful for some cloud-managed clusters too. For cloud-managed clusters,

B

maybe the failover time could be reduced from one minute to 40 seconds. The other plan is to implement the second approach to support active-standby mode. But because of the operational cost, in the first stage I want to make this feature experimental, and we would just provide another manifest in which the Antrea Service selector is empty and the Deployment

B

replicas is three. We could also provide a Helm value, for example one that enables HA, by which the other values are set automatically. And if this mode turns out to work great, I think we could consider changing the Service selector to always be empty.

B

Even if the replica count is one, the only instance would still be responsible for updating itself as the endpoint of the Service, so that in the future users could switch between single-replica mode and active-standby mode more gracefully.

B

Yeah, that's what I've got for today, and I'd like to answer your questions, if there are any.

A

Thanks, Chan. It's been a very detailed presentation, as usual from you. Let's go to the questions section. If anyone has questions, comments, concerns, or ideas, please go ahead.

C

A question: have we received any requests from users who want a failover time shorter than 40 seconds?

B

Yeah, I brought this up because last week we received a user request; they observed a one-minute failover time in EKS. But I haven't checked whether 40 seconds would work for them.

C

Well, what's their expectation? They want less?

B

They didn't mention how short it should be. But because of the second delay, I assume that it has some impact, more or less, even on cloud, even though the cloud can delete an unreachable Node. So our first try is to suggest that they update the Deployment manifest to add a zero-second toleration, so that the Pod can be deleted immediately, and then see whether the new failover time meets their requirement.

A

I think the only comment that I have is about trying to work out whether the effort of supporting HA is justified by the benefit that we give to our users. I was considering one aspect: when we do things like reducing the default unreachable toleration seconds from 300 to zero, what we are going to get more often are false positives. So what is the cost of a false positive?

A

In my opinion, if I remember correctly how the Antrea Controller works, there aren't many initialization tasks that need to be performed at startup. The cost of failover is pretty much just the cost of destroying the old Pod and creating the new Pod. So even if we have some false positives, it's not going to be a big deal. Is that correct?

B

You mean that after reducing the toleration seconds, it may happen frequently that the Pod is deleted and recreated?

A

Yeah, I mean, we don't know how frequently it will happen, but I'm just wondering about false positives, where we fail over too eagerly but the Node was actually unavailable for just a couple of seconds. In that case, are we going to pay a penalty for initializing the Antrea Controller, or does the Antrea Controller come up immediately and start operating?

B

Yeah, I think it takes only a few seconds to bring up a new instance. There is not much cost to initialize it. Besides, it's not like we delete the Pod as soon as the Node becomes unreachable. There is a 40-second period during which Kubernetes tolerates the Node being unresponsive. Only after it has been unreachable for 40 seconds, during which no status update is received from that Node, is the Node marked as NotReady.

B

So if it's unresponsive for just a few seconds, I think this will not even trigger the process.

A

Okay, thanks, understood.

D

So there's one question: how will the synchronization of state take place between active and standby? Let's say I have changed something on the active controller, right?

B

Sorry, can you repeat the question?

D

My question was: suppose there are some changes on the active Antrea Controller, and we are using active-standby mode. Will there be some kind of database maintained that allows the standby controller to synchronize with the active one?

B

No, we don't need to. The only storage is Kubernetes and etcd. When the Antrea Controller launches as a standby, it basically does nothing but try to acquire the lease through the Lease API, and only after it acquires the lease successfully

B

does it promote itself to leader and start doing the other things. When the controller launches as a standby, it doesn't run any controllers or modules. And because the storage is Kubernetes, after the active instance goes down, the data is persisted in etcd anyway, so the standby instance can retrieve the same data.

D

Okay. Another thing is that we are going with active-standby, not active-active. Active-active has the advantage that you can distribute traffic over the active instances, so it can act as a load balancer. But in this case only one controller can be active, so we will not have that advantage with active-standby.

B

Yeah, as I explained on this page, there are some architecture problems that are not easy to resolve if we run in active-active mode, and according to our tests, one instance is already enough to support the load.

D

So we are going with active-standby.

E

Thanks. Hey Chan, a quick question; I may have missed something. Can you go back to the different options for the Service traffic routing? Thank you. So for the option you have in red, which is your preferred option, did you mean that you wanted to have different Service definitions for the case without HA and for the HA case? Yeah? Why don't we use the same mechanism for the regular case, I mean the non-HA case?

B

Yes, I did plan that in the future, if this HA mode turns out to work fine, we could unify them. But currently, if we make the change immediately, it could affect existing users who upgrade from a previous release, because some EndpointSlices will not be deleted by default, and

B

my major concern is backwards compatibility, and whether this is really a good solution for supporting HA. So I would prefer to have a different definition and use a different YAML to deploy Antrea in HA mode, and see whether it proves to work fine.

A

Isn't the readiness probe enough? If a Pod is not ready, it will be removed from the Service endpoints, making sure that only the active instance is a usable endpoint.

B

You mean the first approach?

A

Yeah, the first approach, with Pod readiness, where basically the only Pod which is ready is the active one, right? All the others are

B

not ready. The problem is when the Node of the previous active instance becomes unreachable. It takes this time for Kubernetes to mark that Pod as not ready; it will not be updated immediately.

A

Yeah, okay. I was under the impression, and maybe I got it wrong, that you were implementing a new readiness probe where only the active Pod returns ready. But I see your point, because before doing the transition... Okay, because

B

I think because it's not even running, so the probe will not even be called.

A

Because, you know, I was mistaken: I was thinking that the leader election would kick off immediately and select a new leader, and then the readiness probe would return ready only for the leader. So regardless of the Node's unreachable status, I was thinking that all the Pods that were not elected as leader were returning not ready. So I misunderstood, I believe. Okay.

E

And did you find another project using that? Well, the third solution... the second solution, sorry, that wasn't clear. Did you find another project using that second solution?

B

No. What I found in my search is the first and the third one. But the Kubernetes API server itself uses the second one, though for a different reason. kube-apiserver reports itself as an endpoint of the kubernetes Service, and the main reason is that before it is ready itself, I think kube-controller-manager is not even able to reach the API.

B

I'm not sure what the root cause is, but yeah, it uses this mode: it appends itself to the endpoints, whereas in our case we replace the only endpoint. I didn't find another project using this mode, because I haven't seen a project with an architecture similar to ours, where the Antrea Controller is responsible for both the API and the controllers.

B

If we divided the responsibilities, I think we could run the API separately and run the controllers in active-standby. That would be fine.

B

The problem with the first approach, which many projects adopt, is that it doesn't even solve our problem. It still takes this long to fail over, even though there are multiple instances running at the same time.

E

Yeah, I mean, the second solution sounds fine by me, as long as in the long term we can have a unified approach for HA and for the single-replica case.

B

Yeah, I think that would depend on how well this approach works.

B

Yeah, if there are no other questions, I think that is it.

A

Okay, thanks, Chan. That's been a great discussion for today. We still have some time left in today's scheduled meeting. Is there any other topic that you would like to bring up for discussion? I'll wait about 20 seconds, as usual, for you to come up with a proposal.

A

And time's up for today, then. I would like to thank everyone for your time and wish everyone a good night, good evening, or good afternoon. And to the team members in China, I would like to wish you a great long weekend for the Dragon Boat Festival. See you in two weeks' time, and have a good one.

B

Thank you, bye.

D

Thanks, cheers, bye.