From YouTube: DASH High Availability Working Group Nov 8 2022
Description
BFD & ECMP - Prince
Overview, from a DASH perspective we would need to add-on
A
Into the high availability meeting for the day, November 8th. So last week we talked about covering comments on the AMD PR here, 271, and we also possibly had Prince coming to talk about BFD and ECMP, so I'm trying to get a hold of him here. Yeah.
A
Oh, hey, Prince. Hey, how's it going? Good, good, good. Okay, and I was trying to see if I needed to let someone else in. Lisa, great, okay, so yeah, why don't we get started with Prince? We went ahead... maybe we can go over this with Gohan, when and if he comes, on the proposal 271. So I'll stop presenting here, and maybe we can talk a little bit about BFD and ECMP. Yeah.
B
So I have one diagram to share, so I can... I...
B
So it should cover both the BFD and another proposal that we have been thinking about. So let me share the screen.
B
Yeah, so can you all see my screen?
B
Okay? So basically, in this approach, right, let's say we have appliance cards. One is primary and another is backup.
B
So from the T1 layer we can initiate some sort of configuration through the controller to advertise the BGP prefix that the appliance cards are advertising, right. And then the T1 can establish BFD sessions with the cards, basically the primary and backup, and based on the BFD health between the primary and backup, the T1 can determine what is the endpoint for that packet.
B
So, one second, let me also share one more document. So that would be...
B
So this is the BFD and overlay ECMP design document that we have in SONiC. So basically the idea is the routes, or the prefixes, are programmed at the T1 layer, and the T1 will distribute the packets to different endpoints, right. So let's say these are the appliance cards, and in our case there are some primaries and some backups. The T1 can be programmed in such a way that the controller can say: okay, hey, endpoint one is the primary, and two and three are the backups.
B
Now the T1 can initiate a BFD session to each of the endpoints, and when the primary goes down, based on the BFD session, the T1 layer can reprogram the route to point to the backup endpoints, so that when the packet lands on the T1, instead of going to the previous active endpoint it will now go to the backups. So the health of the cards is determined by the BFD sessions.
B
Does it make sense? So today we are using this same idea for overlay ECMP hashing: basically, the packet lands on the T1 and it distributes the packets to all the different endpoints in an ECMP fashion. But the same thing we can extend to an active-backup, or primary-backup, scenario as well.
D
There's one question here: the BFD session and the tunnel endpoint, are they kind of related, or is there some custom logic to map which tunnel? I would imagine there is a tunnel from leaf one to tunnel endpoint one, right, and there are three tunnels here, and if one of the tunnels goes down you reprogram to go to the other two, I would guess, right, when the BFD goes down.
D
So is the mapping in the schema, right? For example, does tunnel endpoint one have to be the source IP for the BFD session, or is there a separate mapping to say this BFD session corresponds to these tunnels, whatever the identity is?
B
What we have today: so basically you can say the endpoint... so this is the prefix. Let's say this is the VIP that is advertised, okay. It can specify what is the endpoint IP, which is this endpoint IP right here, for that prefix, and then you can also say, for that IP, what is the monitoring IP, or the BFD IP. This is the BFD session. Okay.
B
So yeah, so basically it means, like over here in the same schema, when we say endpoint and endpoint monitor, there will also be one more attribute that says primary, and it can say, like, IP address one comma IP address two are primary. And say if you have IP addresses one, two, three and four, the implementation can infer that three and four would be the backups and one and two would be the primary.
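The route entry being described, endpoints plus monitoring IPs plus a "primary" attribute, with backups inferred as everything not listed as primary, might look something like this. The field names and IP values are hypothetical, chosen only to mirror the wording above; the actual schema lives in the SONiC design document.

```python
# Hypothetical rendering of the route entry under discussion:
# endpoints, their BFD monitoring IPs, and a "primary" list.
route_entry = {
    "prefix": "20.0.0.0/24",      # the VIP prefix the T1 advertises
    "endpoint": ["100.0.1.1", "100.0.2.1", "100.0.3.1", "100.0.4.1"],
    "endpoint_monitor": ["100.0.1.1", "100.0.2.1", "100.0.3.1", "100.0.4.1"],
    "primary": ["100.0.1.1", "100.0.2.1"],   # one and two primary
}

def backups(entry):
    """Infer the backups: every endpoint not marked primary."""
    return [ep for ep in entry["endpoint"] if ep not in entry["primary"]]
```

With the entry above, `backups(route_entry)` yields endpoints three and four, matching the inference the speaker describes.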
B
So it will basically start the BFD sessions with all four monitoring IPs and make sure that all four are reachable, or the BFD sessions are up. And when both the primaries, let's say one and two are primary, and if both primaries go down, that is when it falls back to three and four: the next hop will be changed in the hardware, or the ASIC, to choose three and four, which are the new backups.
B
Only if, let's say, the BFD sessions are up, okay. But now one more thing: if, say, the BFD sessions come back up with one and two, then it will revert back to one and two, the primary cards.
B
See, it's just that the hardware gets programmed with only the primaries, or the active BFDs; there's no longer-prefix or shorter-prefix trick.
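The selection rule just described, prefer primaries with BFD up, fall back to backups only when every primary is down, and revert when a primary comes back, can be sketched as one function. This is an illustrative reading of the conversation, not SONiC code; all names here are invented for the sketch.

```python
def active_next_hops(endpoints, primaries, bfd_up):
    """Return the next-hop set the hardware should be programmed with.

    endpoints: all tunnel-endpoint IPs in the route entry
    primaries: the subset marked primary
    bfd_up:    set of endpoint IPs whose BFD session is currently up
    """
    live_primaries = [ep for ep in primaries if ep in bfd_up]
    if live_primaries:
        # At least one primary is healthy: program only primaries.
        return live_primaries
    # All primaries down: fall back to whichever backups are up.
    backups = [ep for ep in endpoints if ep not in primaries]
    return [ep for ep in backups if ep in bfd_up]

endpoints = ["100.0.1.1", "100.0.2.1", "100.0.3.1", "100.0.4.1"]
primaries = ["100.0.1.1", "100.0.2.1"]
```

Because the same rule is evaluated on every BFD state change, the revert-to-primary behavior falls out for free: as soon as a primary session comes back up, `live_primaries` is non-empty again.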
E
So the switchover happens...
B
Today in SONiC we have the hardware... this is hardware-offloaded BFD. So the notification will come up to the SONiC layer that says this BFD session is down, and then the SONiC orchagent will have the logic to map that BFD session to this endpoint, and it will remove that endpoint from the route.
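The mapping logic just described for the orchagent, a hardware BFD state-change notification is mapped back to a tunnel endpoint, which is then pulled from (or restored to) the programmed set, could be sketched as below. This is a hypothetical sketch of the control flow, not the orchagent's actual API; the class and method names are invented.

```python
class RouteManager:
    """Sketch of mapping BFD notifications to endpoint membership."""

    def __init__(self, session_to_endpoint):
        # session_to_endpoint: BFD peer (monitoring IP) -> endpoint IP
        self.session_to_endpoint = session_to_endpoint
        self.down = set()

    def on_bfd_state_change(self, peer_ip, state):
        # Notification from the hardware-offloaded BFD session.
        endpoint = self.session_to_endpoint[peer_ip]
        if state == "Down":
            self.down.add(endpoint)
        else:
            self.down.discard(endpoint)

    def programmed_endpoints(self, all_endpoints):
        # The endpoints that should remain in the route's next-hop group.
        return [ep for ep in all_endpoints if ep not in self.down]
```

The key point from the discussion is that the monitoring IP need not equal the endpoint IP, which is why an explicit session-to-endpoint map is needed at all.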
F
So what happens if one of the three T1s loses its BFD session to the DPU?
B
So yeah, one thing is, in this case it kind of assumes that all the T1s should be having the same view, right? Like, for example, leaf one cannot think that tunnel endpoint one is primary while leaf two thinks, okay, this other one is primary, because it lost the BFD session, right.
B
So for that, one thing is: whenever we do such a switchover, the plan is to have some sort of alerting and monitoring, so that the SDN controller knows that these are...
B
This is the view of each of the T1 layers. But maybe, to answer your question, there's no synchronization between the T1s. Okay, so the one thing that we thought about is to have the switchover be reported to the controller, and if the controller sees that, okay, that's not right, then we need to take some actions around that.
F
So from the DPU's perspective, if it loses connectivity to even one of the leaves, or T1s, then it's supposed to switch over, although it is still connected to the other leaves?
D
Yeah, I guess, because your question is: you're looking at the problem scenario where leaf one and leaf two had one view while leaf three had another view. We are looking at that situation from the DPU's perspective, where the DPU still has connectivity to leaf two, but the leaf one connectivity goes down. Is that correct? Was that your...
F
Also, from the DPU perspective: in the active-standby mode of operation the standby is dropping packets, right? So the DPU also has to do a switchover, correct?
F
So if the active DPU has lost connectivity to just one of the leaves, while it still has connectivity to the other leaves, the leaves need to synchronize and say, okay, we actually want to switch to the standby. The standby is not yet ready to accept packets unless the DPU level also has a switchover happening, right? Right. So there has to be some kind of handshake between the T1s and the DPUs. I don't know; probably we need to explore this. Okay.
B
The leaf... and disables the BFD between... okay.
D
Okay, so currently, I guess, from the DPU side there is no linkage between the number of BFD sessions and the...
B
The session is initiated from the T1 to the... so when the controller programs the T1, or the leaf one, that is when leaf one will start the BFD session with this endpoint, or the card. So I don't see a requirement for the DPU to keep track of the BFD sessions.
D
Oh, I guess, are you referring to the heartbeat itself?
G
See, if this BFD infrastructure can be used for the heartbeat's sake, right, that we are trying to achieve for HA between active and standby: we are trying to see if we can utilize this BFD infrastructure, right. So then we will need the path from the active DPU to the standby DPU also.
D
Right. So I guess, Prince, please correct me: my view, the way I'm understanding it, is, right, there could potentially be... let's say we use BFD as a protocol for monitoring the DPU-to-DPU heartbeat itself, right. But that would be a special BFD session. I mean, what I'm meaning is, we'll just call it heartbeat.
D
And it has got a special meaning too. Then these three, for example, these BFD sessions that we have, have no bearing on the HA state, while whatever is there in the heartbeat, whether it's BFD or a different protocol, that has a bearing on the HA state of the DPUs, correct? Right, yeah. It's incidental, yeah, incidental, that BFD could be used for the heartbeat too, but not related to this scheme.
G
So previously, when Gohan was suggesting to utilize the BFD infrastructure for the heartbeat as well, I was under the impression that, you know...
B
Yeah, so that's the idea of BFD with, probably, ECMP, so...
D
Here, this thing itself, in your previous slide, or even the route that we have here, Prince... I think I have some idea, but this thing, the T1s, the leaves that advertise that out: are they kind of related to where the DPUs are placed, where the appliance is placed, or could they be...
B
They could be in a different cluster as well, because these are, like, tunneled packets, right, so it...
B
Some of it can take a direct path through the T0, if it is connected in the same cluster, but otherwise it will take the T2 path and then go to another cluster, because this is within the same availability zone, and it can be reachable from this T1 to another.
D
Yeah, my question is also from the failure-domain kind of view, because if it's too far away, then I think the case you were saying, with leaf one losing connectivity but not leaf two, becomes more probable.
D
Right, if they are completely different failure domains, too far apart? Yes, yes.
B
It's within one availability zone, but it may be spanned across one or two clusters, because we want the actual packets to land on different T1s, right.
B
Yes, so it can, like I said: in the schema it can be configurable. So if your next-hop endpoint IP is, let's say, the card's PA, right, the monitoring IP can be the same, or it can be a different one. So it depends on what the card wants to have the BFD session for.
F
So the VIP is what the T1 is advertising, right? Correct. So there is another IP that we are using for the heartbeat between the... so each...
B
No, but that is up to the implementation, right. Like, we can have the PA IP also for the BFD, but the BFD session will be established only when the card is ready, say.
F
Also, there's another IP that we advertise, with which the heartbeat... so that when we want to connect to the peer for the heartbeat, the peer DPU, that IP is also advertised by the paired DPU, in addition to the VIP, today.
B
But if there is... because with this, the VIP advertisement is done by the T1, and the cards don't need to even advertise anything to the ToR, right. And the endpoint, or the next hop, for the VIP is nothing but the PA of the card. So...
F
And then another control-network IP, which today we are... yeah, I plan to start it. Is that also over BGP?
E
There are the directly connected L3 interface PA IPs, and then another loopback, which is used for the control-network IP; that's also advertised through BGP. But if that control-network IP is in the private space, no data traffic hits it; it's only used for our flow sync. So if it's in the private space we don't need the BFD mechanism. I guess that's the question.
C
Prince, I have a follow-up question. In the current SONiC, like you mentioned earlier, it's operating in active-active mode, so basically tunnel endpoint one and tunnel endpoint two can both be receiving traffic based on the load-balancing decision at leaf one. Yes. However, with the DPU, tunnel endpoint two will be standby, right? So let's say the BFD session between leaf one and tunnel endpoint one goes down for whatever reason, but we have not triggered a failover from a DPU point of view.
B
The expected... that's not my understanding. I thought, if, let's say for any genuine reason, tunnel endpoint one goes down, right, that's when the BFD has gone down, right. Right. In that case there is an automatic synchronization between the two cards, so that the other one will switch over. I don't think a controller needs to notify.
B
That depends on the VIP set: for each VIP you have an endpoint, so if it is across different ENIs, then it will basically...
D
The controller, yeah. There is an administrative state which says that, preferably, this is the standby; but the actual operational state is controlled independently, in the sense that if the active is not there, then the configured standby can become the active. Is that what you meant? Yeah.
C
So this will have to be... this will be a new trigger for changing the role.
C
Now, let's say the T1 BFD session goes down between leaf one and the active DPU, okay. So in that case the T1 will fail over the traffic to the standby, right? Correct. Okay, so the standby... with the current behavior, if it is a standby, it will not forward any traffic; it'll drop it, no?
B
Okay, so the point is: this is not a normal scenario where the active actually went down; it is like the active is still up, but for some reason the BFD went down. Exactly, yes. So the BFD also has multiple retries, like we have 300 milliseconds by three times.
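The "300 milliseconds by three times" figure implies the standard BFD detection-time arithmetic: with a 300 ms interval and a detect multiplier of 3, the T1 declares the session down after three consecutive missed intervals. A quick worked check, using the numbers from the discussion:

```python
# BFD detection time as described: interval times detect multiplier.
tx_interval_ms = 300      # transmit interval from the discussion
detect_multiplier = 3     # "by three times"
detection_time_ms = tx_interval_ms * detect_multiplier
# roughly 900 ms elapse before the T1 makes the switchover decision
```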
B
So
so,
if
all
the
three
went
down
right,
so
all
the
three
like,
let's
say
three
retries
of
the
bft,
went
down,
and
in
that
case
yes,
of
course
the
the
T1
will
make
the
switch
over
decision,
because
the
the
T1
doesn't
know
whether
it
has
actually
actually
went
down.
Or
is
it
a
physical
networking
connection
right.
B
Let's say this one is connected to this ToR.
C
Right, you're talking about multi-path for the underlay. Multi-path for the underlay, yeah, underlay. So what you're saying is: because this BFD session is running over the VXLAN tunnel, the only way for it to go down is if all the underlay paths are down, which is unlikely, or the other reason is that the tunnel endpoint is actually down, which you interpreted as the DPU being down. Exactly. So, in which case failover will happen. Yes, okay!
C
So then, in this case, even though you're using BFD as a fast detection for failover, how much traffic loss there is will depend on the failover in this case. Yes, unplanned failover. Okay.
G
Now, if endpoint one's connectivity to leaf one completely goes down, right, but endpoint one has connectivity to endpoint two via leaf two and others, because they're all connected to the ToRs and it's a Clos topology, yeah, right. So from the point of view of endpoint one's connectivity to endpoint two, there is still connectivity.
B
There is still connectivity, yeah. Any one of the T1s going down is not a concern for us, because it's just isolated from the network. But the only concern that we discussed in the beginning of the meeting is: let's say leaf one has connectivity to endpoint one, but leaf two lost it, and if leaf two decides to switch over, that will be inconsistent between the two T1s, right. And that's when we wanted to raise it as a signal and alerting to the controller too.
B
So anytime a switchover happens... so that, if for some unknown reason some underlay network has gone down at the T1, but the T1 is still advertising, which means it's not isolated, then that needs to be acted upon. There's no automatic way to...
B
How fast we can reconverge and fall back to the standby. And also the other reason is, with this proposal we don't need the cards to send, you know, a different BGP session with AS-prepend or something, to tell the physical network that, hey, I am inactive, or that I'm in standby, right. So...
A
Thanks, Prince. I wrote down the link, so I'll put that in the notes. So did you want the community here to go ahead and read through it and look at it?
G
Prince, are we planning to tie this to the discussion that we have been having so far in HA, right, about the connectivity between the two DPUs and what would be the reason for a switchover or failover, right? Yeah?
G
Is there any text in this PR that will connect with the DPU, and the connectivity between the two DPUs?
B
This is already one that is available.
A
Great, so I'll just post the link to it then. Yeah, okay, okay, great. And then, I don't see Gohan on the call, Prince, but we did do some follow-up to the PR, or the amendments, to PR 244, and I believe Gohan will see those because those were basically his comments. So did you want to look at them here, Prince, or are we good just to let Gohan handle it?
A
Okay, all right. Anyone on the call, do you have something you want to bring up before we close for the day?
A
No? Okay. The survey responses I've received have pretty much unanimously said cancel for Thanksgiving week, so I'll probably go ahead and do that, and I'll bring that up again in the larger community meeting, just to let you guys know that's what I'm hearing. Okay, well, a stop would be good. Yeah, yeah, okay, give everybody a break, right? Yeah.