From YouTube: DASH HA High Availability Working Group 20220419
Description
April 19, 2022 HA WG Call
A
Okay, I see it's recording, so hi everyone. As Kristina mentioned, our goal here is to have a series of meetings to discuss different topics regarding high availability in DASH, and eventually come up with some standard, defined protocol, so that the DASH appliances (the DASH cards) will be able to communicate their state to one another.
A
So that's why we're all here. A little bit more about why we need this: as you can see, what we are changing with DASH is the path with which the packet is traveling through the network. If we are disaggregating our state away from the nodes that are running VMs to dedicated appliances, it means that our path has changed slightly. It's not just going from one node to another; it will have additional hops, which are our DASH devices, and it introduces new points of failure, which at the moment don't have any redundancy. We of course have redundancy at T1 and T2, and we can have it in T0 as well, as, for example, Microsoft is doing in SONiC with dual ToR. But when we are introducing our DASH devices, we are running again into the issue of what happens when one of them fails and how we can minimize the impact of such a failure.
A
I will try to keep this series of meetings as interactive as possible, because the goal here is not to come to you and present something the way I think it is supposed to look. I want to do it in a way where everyone provides input on different topics, and at the end of these meetings we will be able to converge on a protocol that everyone will be able to support, and that will be open and provide interoperability.

I've gathered a few topics here that I want to go over in the next meetings. So today we'll probably start with the requirements.
A
There
are
some
requirements
already
available
in
the
dash
repository
documentation
that
I
have
copied
in
here
and
also
I
would
want
to
get
more
input
on
every
of
these
topics
also
to
talk
about
end-to-end
flows.
How
do
we
see
this
working
in
in
the
real
network?
How
all
of
the
how
this,
how
all
of
the
scenarios
will
evolve
the
control
plane
interactions?
Of
course,
what
is
the
data?
What
are
the
inputs
from
control
plane?
A
And of course the data frames: since we are talking about something that's supposed to be an open protocol, we want to define how our data traveling from one DPU to another should look, and eventually combine it all into a formal definition and present the result of our work to the wider DASH community.
A
Yes, so this is how I see these meetings moving forward. By the way, if you have any topics that you think should be included in future meetings, you can let me know right now; I will keep this presentation in edit mode and we can add everything that we think is worth going over. Or, after the meeting (I will of course send out this presentation), you can just reply to me with things that you think should be added as well.

So, any questions about this high-level plan?
B
Yeah, I had a few things to add. Can I? Okay, yeah. Thanks, thanks for leading this too.
B
Maybe a few other goals we could consider. One is: do we want a behavioral model, or to add this to the behavioral model? Because this might be modelable partly in P4 and partly in, you know, C or something like that, and having a model of it might really help; it might even provide an implementation starting point, not just a model. I would even say a simulation: we could have a simulation of two DPUs interacting, sharing their flow state. So that's one idea, and another...
B
And another goal would be...
A
Oh yeah, of course, smart switch, and also other use cases. I think it should be extendable. We haven't talked about them in DASH yet, probably because we haven't finished with VNET, but there is SLB, there is other stuff, and it should be extendable to those as well. So I think this also falls under the end-to-end flows: smart switch, SLB, etc.
B
I made a couple of diagrams, new ones, in a pull request that I put in last week, and I was hoping for Microsoft to be able to review them at some point and see if they make sense. Do you want to review them today? Yeah, we could look at them at the right time; I'd be happy to share my screen.
A
Okay, okay. So, anything else, or does anyone else want to add something?
C
High availability, right, like in-service software upgrade, or what happens there. Just as we handle it on a regular switch, a DASH device should also be able to handle it, and that will be similar to a switch for a single SmartNIC and more complex for a smart switch.
A
You're talking specifically about smart switch, or about how the ToR interacts with the DPU appliance?
C
From the DPU standpoint, the high availability still has to work within the device itself, right? That's one thing. And then there is the flow level, which is across DPUs: how we synchronize the flows and, if one of the devices goes down, how we switch over. All of that is the flow-level part. High availability within the device itself is another: for example, if we have the control plane, we have orchagent, all of those...
C
Related to that: how all those components recover upon failure is a separate aspect that also needs to be tracked. Are we tracking that here?
A
No. With regards to SONiC, this is something that will be solved in another forum. What we are focusing on here is state replication across DASH devices, meaning probably what you mentioned: synchronization of the state from one DASH device to another.
A
And the protocol around it, the control plane around that. We're not talking about, let's say, failures in the software stack that are not related to the state that the DASH appliance keeps. So if, for example, something like the orchagent container, or some software component within the DASH container...
A
Yeah, software failures are not something we are focusing on right now, because there are mechanisms in SONiC to overcome those, and also upgrades and all those things. What we want to focus on is the definition of the way we can replicate the state of one DPU to another DPU.
C
It's totally fine to go through the flow part, because that is missing right now. And then, if the rest is handled in a separate forum, that's fine too; or we can continue this forum, once this is completed, to discuss the intra-device high availability in the DASH device, yeah.
A
Yeah, so this is inter-device: we're talking about replicating the state of one device onto another device. This is exactly our goal, but we're focusing on the state of the data plane, not what's going on in the control plane. The assumption for this working group, at least for now, is that no failure happens in the control plane, or almost no failure, to the extent that we can allow that. So we want to replicate the data plane, not the control plane.
A
Yeah, so the rest is more related to the SONiC software stack right now.
A
Yeah, right now we're not focusing on, let's say, the diagnostic stuff.
A
I took the requirements from the documents available in the DASH repository, and also, as we already started, we'll define what end-to-end flow types we are interested in (some of them were mentioned here), plus testing, observability, and other things that we want to cover, and, let's say, define how we expect them to work, so that eventually those expectations will be documented in the final protocol, along with how they will affect the protocol itself.
A
Yeah, so if you don't have anything else to add: as I said, this is kind of a presentation that will be updated on the fly. I will keep track; I will have this as the meeting notes for every one of our sessions, with the dates, attendees and so on, so that you can pick up from any place in case you missed a meeting or joined later, or something like that. So, let's jump into the requirements.
A
For now I have only one item that we briefly touched upon, but let's go over what was already in the DASH documents and try to decipher it and understand how it affects the standard, or the protocol itself. The first three are kind of from the same basket: they are talking about how the failover should happen.
A
I hope you agree with me that our main goal here is to replicate all of the connections across two devices. So, assuming we have an active and a standby device, we should be constantly keeping a hot copy of the active device on our standby, with all of the connection tracking already present, so that it can take over at any point in time.
D
A very basic question: the model we are going with is active-standby, not active-active, right?
E
Okay, so I'm basically coordinating this from Microsoft's side, so I can tell you that currently the first version is active-standby. However, the long-term vision, and we are actually going towards it, is that we would like to have active-active, mostly because active-standby right now we are kind of solving with AS-path prepends, and there is actually not enough TCAM space to advertise and carry it over across the different layers of ToRs. So instead of AS-path prepends we will be going towards active-active; preferably this should support active-active.
E
So I know that, from the source side, the connection will always go to a specific device for a specific five-tuple. This is also one of the reasons why we are switching most of the connections from NVGRE to VXLAN: we are doing this because there is just more entropy in the packets, and the ToRs can actually hash them correctly. With NVGRE there was basically not enough entropy in the outer header to do this.
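The entropy point above can be illustrated with a small sketch. The helper names and the hash function are illustrative stand-ins (real ToRs use their own ECMP hash); the one piece grounded in common practice is that a VXLAN encapsulator derives the outer UDP source port from a hash of the inner flow, which is what gives the outer header its entropy.

```python
# Illustrative sketch of why outer-header entropy matters for ToR ECMP.
# With VXLAN, the inner five-tuple is folded into the outer UDP source
# port, so hashing only the outer header still spreads inner flows.
import zlib

def vxlan_outer_src_port(inner_five_tuple):
    # Common VXLAN practice: hash the inner flow into the outer UDP
    # source port, kept inside the dynamic port range 49152-65535.
    h = zlib.crc32(repr(inner_five_tuple).encode())
    return 49152 + (h % 16384)

def tor_ecmp_pick(outer_tuple, next_hops):
    # A stand-in for the ToR's ECMP hash over the outer header only.
    return next_hops[zlib.crc32(repr(outer_tuple).encode()) % len(next_hops)]

inner = ("10.0.0.1", "20.0.0.2", 12345, 443, "tcp")
outer = ("100.0.0.1", "100.0.0.9", vxlan_outer_src_port(inner), 4789, "udp")
# The same inner flow always yields the same outer tuple, so the ToR
# consistently picks the same DPU for that flow.
assert tor_ecmp_pick(outer, ["dpu1", "dpu2"]) == tor_ecmp_pick(outer, ["dpu1", "dpu2"])
```

With NVGRE the outer header carries no such per-flow field, so an outer-header-only hash cannot distinguish inner flows, which is the problem described above.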
E
You can assume that, for a specific five-tuple, the ToRs will basically prefer hashing to one specific DPU. But, as I mentioned, this is active-active: when the connection gets established, the flow should be replicated to the other one, and the following packets can even reach the other one.
E
The network potentially may rearrange, and the packets may reach the other one. And definitely on an unplanned failover; that is the main scenario that is actually required from a customer perspective.
A
Especially in such a massive scenario, where all of the connections for the VM were served by the same DPU, yeah.
E
And then one additional thing, when you are thinking from the standards perspective. I was talking with the software load-balancing team on my side, and even though right now we are starting with two devices which will be paired with each other, we have data from the SLB team in which they mentioned that, most likely, to avoid some of the critical failures that they hit in the past...
E
At
some
point,
you
may
need
to
expand
it
into
potentially
airing
of
three
devices
they'll
be
communicating.
They
basically
mentioned
that
to
maybe
very
little
to
cover
some
of
the
scenarios,
but
right
now
we
are
basically
starting
with
two,
so
we
should
design
for
two,
but
but
the
design
should
keep
in
mind
that
potentially
in
the
future
we
may
want
to
be
able
to.
If
once
we
have
build
sufficient
capacity
or
purchase
sufficient
capacity
and
place
it
potentially
is
spreading
to
ring
of
three.
Maybe
in
some
configurations.
E
Yes,
but
this
is
this-
is
the
conversation
that's
happening
on
our
site.
This
was
basically
feedback
when
we
had
a
review
of
the
aha
at
microsoft
right,
there
was
the
feedback
from
the
lead
of
the
software
balancing
team.
That
said
that
they
also
started
from
two
instances
and
then
at
some
point
there
was
some
live
sites
incident
and
they
had
to
basically
scale
out
to
minimum
three
instances
for
them
to
handle
some
stuff,
and
I
think
like
if
we
start
from
two.
E
This
is
this
is
good
model
for
now
right,
but
it's
kind
of
keeping
in
mind
that
if
we
are
thinking
about
one,
a
grain
or
the
other
algorithm
right,
if
we
can
potentially
make
it
easy
expandable
to
three
right,
then
then,
potentially
we
have
also
solution
in
case.
We
need
three
at
some
point.
D
Okay,
just
one
more
question
on
the
same
topic
so
and
that
may
might
we
might
want
to
capture
it
as
a
requirement.
So
let's
say
the
connection
is
learned
on
one
of
the
active
and
then
it
gets
replicated
right
on
the
other
active
or
the
active
state
is
maintained
on
a
where
the,
where
the
connection
is
learned.
E
From the point of view of the flow table, in order for them to successfully fail over, they will need to maintain an exact copy, so 100%. From the point of view of actual processing of the connection, one connection will, most of the time (unless there is a failover or a switch or whatever), always end up on the same device. So from the point of view of CPS, this doesn't double the CPS.
E
It's
mostly
allow
us
to
basically
spread
the
the
load,
because,
basically,
some
of
the
connection
lines
on
the
first
active
device.
Some
of
the
connection
will
land
on
the
other
active
device
right,
but
from
the
from
the
load
perspective
right,
but
from
the
point
of
view
of
the
flow
stage,
the
flow
state
will
need
to
be
duplicated
in
order
to
support
the
failover.
D
Got
it
got
it
so,
basically,
your
requirement
is
that
the
flow
state
for
all
the
connections
need
to
be
replicated
on
all
the
active
active
nodes,
whether
it's
a
two
device
or
three
device
configuration
and
then
and
then
the
the
processing
may
be
load
balanced
across
these
active
active.
Yes
right,
okay,.
D
What about in-flight packets, when the hash changes?
E
I
mean
yes
also.
I
understand
that
this
may.
This
may
potentially
happen
right.
I'm
just
I'm
just
curious.
Do
you
have
a
strong
requirement
for
the
sequence
numbers
when
proceeding
the
packets.
D
So one is the model: active-standby, active-active, and three-device active-active-active. The flow state needs to be replicated on all the devices, whether they are in active-active or active-standby mode.
E
Well, so in that case, because the flows are getting copied: once the flow gets established and the specific decision (to use the specific port) is set up, the switches will propagate the decision. So even if the return traffic lands on a different one (it will definitely be asymmetric from that point of view), the flow will be there, so we'll be able to apply the reverse transposition and deliver the packet.
E
I
don't
believe
it's
a
requirement
to
kind
of
do
to
hope
model
right,
preferably
not
if
we
can
avoid
it
right,
because
this
will
reduce
gps
right
at
the
same
point,
the
standard
that
you
are
discussing
is
is
not
is,
is
basically
not
scenario
with
the
with
the
actual
dynamic
ports
right.
We
haven't
discussed
slb
yet
so
I'm
not
quite
sure
if
we
should
park
this
decision
for
later
we'll
be
discussing
not
balancing
this
load.
Balancing
is
a
entire
set
of
forms,
regards
to
dynamic
portal
location
ability
for
the
customer
to
kind
of
control.
E
Exactly
so
so
that's
explicitly
one
of
the
main
reasons
why
we
put
the
nut
out
of
the
equation
for
now,
and
also
most
of
the
customers
who
are
actually
in
the
need
of
the
high
cps.
They
usually
have
some
firewall
appliances
that
have
dedicated,
for
example,
ip
address
and
all
the
traffic
tools
to
these
iprs
already,
for
example,
this
firewall
appliance
and
this
kind
of
stuff
right.
They
not
many
customers
from
that.
They're
using
nvas
are
using
this
kind
of
dynamic
portal
location.
They
usually
assign
the
actual
entire
vip
to
the
appliance.
F
Okay,
the
reason
I
raised
it
because
it
has
an
impact
on
the
solution
and
if
you
intend
to
be
forward-looking,
this
problem
will
show
up
at
some
point
right.
It's
similar
to
going
to
three
dpos
in
the
ring.
E
Yeah,
so
I
would
say
the
state:
definitely
we
should
we
need
to
basically
because
like
when
the
packet
goes
out
right
for
the
first
time.
If
you
are
talking
about
the
load
balancing
right,
then
the
some
poor
gets
allocated
right,
but
even
though
we
didn't
even
want
to
even
try
talking
about
load
balancing
right
now,
but
sample
gets
located
and
this
allocation
for
the
specific
flow,
the
specific
port
that
the
traffic
will
be
snapped
to
right
needs
to
be
replicated
to
the
to
the
other
device
right.
E
So
when
the
packet
comes
back,
which
may
land
on
on,
of
course,
like
different
device
right,
we'll
know
how
to
do
this.
Reverse
transposition,
this
kind
of
stuff
right,
but
the
idea
is
no,
I
mean
you
potentially
want
to
bounce
the
first
packet,
a
kind
of
like
multi-hope
scenario.
You
know
in
a
way,
for
example,
when
the
connection
established,
for
example,
sin
sin
is
going
right.
We
can
potentially
do
the
same
thing
that
right
now
the
balancing
team
is
doing.
They
are
basically
bouncing
the
packet
across
the
ring,
so
usually
for
the
scene.
E
There,
for
example,
the
packet
lens
on
the
first
device.
Then
bouncing
this
to
the
second
device
they
bust,
the
third
device.
They
are
passing
the
packet
back
to
the
first
device
again
and
then
only
they
they
put
the
packet
back
to
the
to
the
network,
but
all
the
remaining
packets
they
never
bounce
across
all
the
devices
right,
because
this
will
greatly
reduce
the
capacity
of
the
network
and
devices
if
they
want
to
every
single
packet
bounce,
because
there
is
no
reason
to
the
state
doesn't
change
right.
F
There's a slight difference between what you're saying and the SLB case. In the SLB case, one of them is active and the other two are keeping the state; you are kind of getting an assurance that all packets of the session, forward and reverse, will end up at the same SLB. In this particular case you don't have that; I don't believe this will hold.
F
So
let
me
just
give
you
an
example.
Let
me
just
give
you
an
example
of
and
let's
see
what
the
desired
behavior
should
be
right.
Dpu
one
gets
the
syn
on
a
connection
you
established
it.
Maybe
that
is
not
happening.
F
You
copied
it
over
to
a
backup,
dpo2
or
maybe
you
reflected
the
packet
to
a
dpu2
and
it
created
a
connection.
Now
the
fin
is
from
the
destination
or
a
reset
from
a
destination
shows
up
at
the
second
dpo.
E
Yeah. So from the point of view of the SYN, as I mentioned, the SYN, the initial handshake, most likely needs to be routed through all the devices, and the same with the FIN, because the packet also needs to be terminated on all the devices. So it's similar not only for load balancing but also for normal traffic: the FIN is also a way to communicate to all the devices that the state actually got removed.
E
Also, the other device needs to be informed. So, answering your question, Nirajan: I believe, again, the beginning and the end of the connection, definitely. For the middle of the connection, where the packets are flowing and transferring data, I don't believe we need to impose any of those kinds of requirements. What do you think?
F
The other area where this matters is if you have non-TCP, like maybe UDP packets in there. And I don't know if you have a scenario of simultaneous open in your environment.
F
There is an impact on that, yes, and also just for session management. I'm focusing just on session management, to understand...
F
...you know, what the approaches are. And let's take the position that you stated for TCP: that SYNs and FINs are sent to both of the devices, and then for the individual packets, either device is free to forward them, yeah.
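The position stated above (state-changing packets are shared between the devices, mid-stream data packets are not) reduces to a simple classification rule. This is a sketch of the discussed policy, not the agreed protocol; the function name is illustrative.

```python
# Sketch of the policy discussed above: only TCP packets that change
# flow state (SYN, FIN, RST) must be communicated to the peer device(s);
# mid-connection data packets are forwarded locally by whichever DPU
# happens to receive them.

STATE_CHANGING_FLAGS = {"SYN", "FIN", "RST"}

def needs_peer_sync(tcp_flags):
    """Return True if this packet changes connection state and so must
    be replicated/bounced to the paired device(s)."""
    return bool(STATE_CHANGING_FLAGS & set(tcp_flags))

assert needs_peer_sync({"SYN"})            # handshake: sync
assert needs_peer_sync({"FIN", "ACK"})     # teardown: sync
assert not needs_peer_sync({"ACK"})        # plain ack: forward locally
assert not needs_peer_sync({"PSH", "ACK"}) # data: forward locally
```

UDP and simultaneous-open cases raised above don't fit this flag-based rule, which is exactly why they need separate treatment.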
A
Right, yeah. So metadata, of course. If we are defining this as metadata, or let's say just notifying about the state change, then yeah. But I have a problem with saying that we are bouncing the packets themselves, because this means additional logic in the DPU that will ensure that only one of them will eventually forward the packet back to the ToR.
A
For
me,
there
is
a
difference
between
bouncing
packets
and
and
sending
just
a
notification.
E
Because if we are just bouncing metadata, the thing is, you cannot positively acknowledge that the connection was established, and forward the SYN-ACK outside, until you actually replicate the flow. Otherwise you are breaking the customer guarantee that the flow was already established: there was a full SYN, SYN-ACK, ACK exchange, and then one device dies and this flow dies, so the customer connection dies. We are breaking that guarantee. So in this case we can only forward the full SYN exchange to the customer once the flow is replicated...
E
Right. And in order to positively acknowledge this, you will need to somehow wait for this acknowledgement, which means that if you're only bouncing metadata it's potentially okay, but then you need to queue the packet. And considering that it's high-CPS traffic and this kind of stuff, the question is: can you actually afford queueing the packets while you're bouncing metadata and waiting for the metadata acknowledgement?
A
For a given destination IP, the packet is supposed to be mapped to some certain public IP, and then your resulting packet is the VXLAN one with the public IP of that CA. However, when we are talking about bouncing the packet, how do we bounce it? In that case, should the PA be the PA of our DPU number two, our standby DPU? So we...
D
Yeah, a local encap can be done. But I think the main thing (sorry, Marion) which is not clear to me in what you're saying is this. Let's take a specific example: a flow got learned via a SYN on one DPU, but subsequent packets, the intermediate packets, are going to DPU2; forwarding is happening there, and the FIN appears on DPU2. Now, ideally, we want to bounce it back.
A
Sorry, I disagree with that premise. From the beginning, we cannot switch back and forth from DPU1 to DPU2. We may have only one failover: before that point in time everything was going to DPU1, after it everything is going to DPU2. How can we have (unless I don't understand something) some packets going to DPU1 and some of them going to DPU2 for the same flow?
E
So
so
that
may
happen
in
case
there
is
a
thor
failure
right
or
the
link.
Failure
to
some
specific
tour
and
packet
on
the
physical
network
gets
redirected
to
different
tour,
which
potentially
may
do
different
hashing
of
the
packet
at
the
end
because,
like
at
the
end,
those
devices
are
going
to
the
tours.
F
It doesn't matter, because the addresses will be different in the forward and reverse directions. One side may be tunneled and the other side may not be, so the outer five-tuple may be different.
E
Are
those
are
physical
network
devices
like,
for
example,
the
edge
device
which
are
getting
traffic
from
the
internet?
Like
reply
right?
Those
are
physical
network
devices,
they
will
do
their
own
hashtag,
we
don't
own
any
hashing
algorithms
there
and
it's
going
to
stop
right
and
it
will
just
bounce
the
packet
somewhere
wherever
the
vp
is
advertised.
E
So
indeed,
indeed,
the
asynchronous
asymmetrical,
yes
and
most
likely
from
the
point
of
view
of
that,
because,
like
in
case
of
v-net,
we
usually
control
both
source
and
destination
in
case
of
internet
traffic,
the
destination
is,
for
example,
internet.
So
this
may
happen
right.
This
is
this
is
when
you
involve
load
balancing
traffic,
and
this.
A
Okay-
okay,
yes,
so
there
was
a
comment
hold
on
from
chris
that
we
should
make
a
list
of
use
cases.
I
agree
this
is
this:
is
the
next
thing
that
we
will?
We
will
talk
about
so
probably
starting
next
time.
We
will
try
to
summarize
all
of
these
cases
that
we
have
asymmetrical
flows,
because
this
is.
This
is
really
something
new
to
me.
I
was
focusing
mostly
on
v-net
to
v-net,
where
flows
are
symmetrical
and
landing
on
the
same
dpu.
D
Yeah,
I
think
we
should
capture
it,
but
I
don't
see
a
problem
if
the
flow
state
is
replicated
right
across
all
active
nodes,
then
when
then,
eventually,
where
the
failover
results
in
flow
landing
right,
it
still
will
be
forwarded.
The
only
thing
which
I
am
little
bit
not
clear,
michael,
you
said
that
if
there
is
a
fin
packet
right
and
then
fin
packet
should
be
bounced
back
to
the
node
where
the
syn
packet
came
in
right,
that's
the
requirement
you
you
mentioned.
Is
that
correct.
D
Yeah, so, and there Marion earlier made a point which is valid, that flow forwarding is based on, you know, some five-tuple. Now, if we have to bounce back the traffic, then we may have to create another rule to handle the FIN, right?
D
Correct, yeah. So that becomes a little bit tricky, so that's something, Marion, that we probably want to capture as a separate line item too.
E
Yeah, and there is interesting stuff that we can revisit: do we really need to capture the entire SYN, FIN, ACK and this kind of stuff, or is the FIN-ACK the only packet that needs to be bounced? Because the question is, maybe we don't need to bounce the entire SYN/SYN-ACK connection handshake, and the same thing for the FIN-ACK: maybe just the FIN-ACK.
C
From the forwarding point of view, the active-active pair can continue to forward from each side, but there is this aspect of control messages that need to be synced. We invariably have all these intermediate hops, where the ToRs or other switches or links could fail, etc., and the traffic will become asymmetric. So the control-message sync has to be handled as a separate task, while the forwarding can definitely be done active-active across both of the devices in each pair.
D
I hope there is no control-plane synchronization; everything is in the data plane, yeah.
C
Whatever defines the state of the flow has to be synchronized between the two in a pair, whether they're in active-active or active-standby. The forwarding side of things will fall into place and both can forward, but the control messages have to be relayed if the two devices are acting in active-active mode for a certain set of flows.
C
The
forwarding
part,
I
think
we
understand
that
both
need
to
be
done
in
one
way
or
another,
either
it's
active,
active
or
active
standby,
but
the
control
messages.
If
one
receives
some
control
message
that
affects
the
state
of
a
given
flow
that
both
the
devices
are
managing.
It
is
required
to
sort
of
forward
that
control
message
to
the
other
side,
so
both
are
in
sync,
so
I
I
just
felt
that
the
control
messages
need
to
be
synchronized.
G
And all this stuff is not, you know, at least that's not considered a control packet. I mean, when you migrate a flow, to me, between two DPUs this is equivalent to live migration of a large...
G
...you know, VM; it's equivalent to host live migration, as I see it. But when we migrate a flow, the state of the flow is migrated with it, so there is no reason to forward any SYN packets or whatever to get the state aligned, or anything like that. And there was one more comment I had here: the window is two seconds of downtime.
G
Isn't
that
too
much
I
mean?
Typically,
you
know
you
want
to
close
your
blackout
face
within
200
milliseconds.
E
Well
so
so
the
requirement
is
like
less
than
two
seconds
like
I'm,
I'm
all
for
for
basically
much
much
faster
right
for
that
plan,
but
the
unclan
basically
means
that
basically,
the
device
dies
and
somehow
we
maybe
discovered
this
too
late-
there's,
maybe
some
bgb
propagations
problem
and
this
kind
of
stuff
right.
So
the
recommends
to
second,
I
don't
want
to
impose
like
one
like
10
milliseconds
requirement,
like
you
guys
want.
I
can
definitely
tell
it
right,
but
but
but
right
now
the
requirement
is
like
two
seconds.
E
If
we
can
do,
of
course,
the
customer
wants
to
have
completely
not
noticeable
right,
but
it's
unplanned
fade
over,
which
means
the
device
files
right.
If
during
the
two
seconds
there
will
be
some
retransmits,
but
but
the
connection
doesn't
die
during
the
time
right.
That's
that
we
will
be
noticing.
Practice
is
right
now
kind
of
actionable
from
the
customer.
This
potentially
wants
some
rca
on
average.
The
99
is
much
much
faster
right,
but
that's
that's
kind
of
unplanned
failover.
G
Yeah,
so
would
you
want
to
put
like
two
different
requirements
then
unplanned
and
in
the
zero
down
downtime
plan
failure
I
mean.
E
The
platform
over
is
completely
zero.
It
must
be
zero
right.
So
in
a
way,
for
example,
if
we
need
to
service
one
device
like
in
a
way
update
field
murder-
and
there
is
no
way
to
do
this-
like
issu
updates
so
so
kind
of
like
there
is
no
way
to
fix
the
highway
as
the
highway
is
running.
But
there
is
some
big
big,
basically
maintenance,
that
we
need
to
actually
close
entire
highway
right
and
and
kind
of
like
redirect
traffic
the
other
way
right.
E
Then
this
will
be
handled
through
control
plane
from
the
point
of
view
that
control
plane
will
say:
hey
for
example,
drain
all
the
connections
like
stop
advertising
bgp,
wait
for
the
connection
to
drain
right
and
then,
for
example,
give
us
give
us
a
is
information
when
you
will
see
serial
connections
arriving
on
your
device
and
then
this
connection
will
be
training
right.
It
should
be
completely
zero
fa
downtime
for
the
plan
fade
over
right
and
then
you
service
device.
E
Then
after
we
service
the
device,
the
control
play
will
say:
okay,
you
can
pair
back
and
transition
all
the
states,
so
the
state
will
should
be
starting
replicating
in
the
meantime,
we'll
see.
Basically,
when
we
will
get
acknowledgement
that
the
the
state
it
was
fully
replicated
then
and
then
at
some
point
it
will
be
okay,
then
let's
start
advertising
the
vip
back
right.
So
we
can.
We
can
receive
the
packets
right
so
plan
failover
needs
to
be
fully
zero
downtime
because
we
can
actually
orchestrate
it
fully.
A
Okay, we have five minutes left, so there are two things: one that I want to go over a little bit more, and another one, which is Chris's. Chris?
B
We
can
schedule
for
the
next
time
I'll
just
ask
people
to
look
at
that
pull
request,
and
actually
I
can
just
take
one
minute
to
show
the
diagram,
so
people
will
be
and
then
we
don't
have
to
have
a
discussion
about
them
now.
If
we
can
restrain
our
enthusiasm.
B
So, people are familiar with this diagram from the spec, but it doesn't really show the smart switch version, which I'm trying to infer from the smart switch RFI. So I thought I should do a new diagram. Also, we don't have the source for this diagram; no one can find it, it's just a PNG, and we want an editable diagram. So I redrew it with draw.io, so now it's an editable SVG and we can use it as a starting point for other diagrams.
B
I didn't show the third appliance over here; I didn't quite understand it, but after this discussion I understand we might want a third one. And then I added a smart ToR, or smart switch, variation where the DPUs are embedded in there. So this is just a starting point; hopefully it can aid further discussion and, you know, further specifications. So take a look at that, let me know if you like them, and we can merge it into the repo and then have it as a starting point for other diagrams.
B
We'll probably want that for this working group. So, especially Microsoft: could you, you know, let me know if this diagram is approximately correct for the smart ToR?
E
Yeah, awesome, thanks Chris. So, the third one on the initial diagram: I can tell you that it's kind of like a backup. We were thinking to always have some spare capacity, some percentage, because devices will physically die. When one device physically dies, the connections on the left side are still maintained, but if, for example, there is a condition on the left side, that card is kind of orphaned, it's a single one, so when it dies, everything dies. So we're...
E
Yeah, so we actually noticed, from the point of view of how the data centers are constructed, that most likely the traffic will need to go through T2, not through T1. So if the third device does not connect to the same T1, we need to go through the T2.
A
Okay, yeah, thanks Chris. So the only thing that I probably want to do, before we move on with the requirements, is to ask Michael if you can, probably next time, present more data on how the asymmetrical flows should look, because I think we also didn't incorporate that.
E
I can definitely spend some time on this, but I think in the agenda it would probably be not more than 15 minutes, because, theoretically, it is relatively simple; I think one diagram will show everything. So yeah, we can set aside 15 minutes for this.
E
Yeah, awesome. Thank you guys, thanks for driving this, and the interesting stuff, the requirements: as you are doing this in a presentation, it will be good to later, potentially, amend the GitHub with it, so everything will be in one common location.
A
Yeah, of course: all the input that we gather in here, the goal is to eventually put it back in our DASH repository. Awesome.