From YouTube: DASH High Availability Working Group June 28 2022
Description
June 28, 2022
@Marian sharing OpenCompute PR for SAI APIs in the experimental section
https://github.com/opencomputeproject/SAI saiexperimentaldashha.h
We did not get through the entire set of APIs
Administrative State up/down (add enum w/more states such as starting syncing, syncing progress, sync completed, etc…)
Peer ID (used to communicate state)
IP for the session
Role (optional)
@Michal presented an HA Deep Dive slide
A
So if we could get started, I guess I'll document the answer about the IP address for everyone else to hear. The IP address is needed to sync information between two peers, and it looks like we'll need to use the unique IP address; theoretically it should be provided by BGP.
B
I want to go over all the attributes and then, if anyone has comments... I saw one for now.
B
So here we define a set of attributes for an HA session, so that the two devices will be able to synchronize their state.
B
Let's start with the first one: the administrative state, up or down. Quite simple: we decide when we actually want to start syncing; it's up to the controller.
B
Second is the one that I had a question about: the peer IP. This is the IP address that we will communicate our state with. It should be unique.
C
Yeah, okay. The IP to initiate the peering: there will always be two IPs, at least, for this kind of thing. Each card will have a unique IP, which uniquely identifies the card, and this is the IP that will be used to initiate the peering.
C
So, for example, card A will have one IP and card B will have a second IP, and those are the unique IPs used to initiate the peering. On top of that, there is also the IP that is announced over BGP, which is the highly available IP: one card will be announcing the same IP as the other card, potentially, and those are a completely different set of IPs.
C
And then the BGP side, the highly available IP: this needs to be both IPv4 and IPv6. Underneath, in the underlay, we have some standard where the underlay is using IPv4, for which the card needs to announce this highly available IPv4 address, just like a virtual IP; and for those that are using IPv6, the card needs to announce over BGP, in addition, the IPv6 highly available IP. This is what the data path will be using.
C
So the data path will be using these two highly available IPs that the cards advertise, but then the cards also need to talk with each other using the unique IPs, and this is a separate set of IPs. We call them management IPs: the addresses on which the card can be accessed, so the cards can talk with each other on their own kind of private range.
B
Ah, sorry, I was confusing it with the switch's out-of-band management network. Okay, thanks for clarifying. So this is the unique IP that will be used for establishing the session; this was the missing information that we had up till now. I assumed only the virtual IP. Okay, thanks.
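The attribute set just recapped (the controller-driven admin state, plus the unique peer IP as distinct from the BGP-advertised VIP) can be sketched in C. This is a minimal illustration only: the type and field names below are our own assumptions, and the real definitions live in saiexperimentaldashha.h.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative sketch only: real attribute names are defined in
 * saiexperimentaldashha.h and may differ. */
typedef struct {
    bool     admin_state; /* up/down; the controller decides when syncing may start */
    uint32_t peer_ip;     /* unique per-card IP used to establish the session */
    uint32_t vip;         /* highly available IP advertised over BGP (separate set) */
} dash_ha_session_cfg_t;

/* A session may start syncing only once the controller sets the admin state
 * up and a unique peer IP has been configured. */
static bool ha_can_start_sync(const dash_ha_session_cfg_t *s)
{
    return s->admin_state && s->peer_ip != 0;
}
```
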
D
Yeah, Marian, a clarifying question: this header file here is a hand-generated set of, let's say, administrative attributes, I would call them. So these are going to be manually written, and then in addition we'll have the auto-generated, let's say, data-plane headers that get added to this, so they're additive. Is that correct?
D
I think we ought to explain that somewhere in our documentation, because it's not obvious to someone walking in here how this all merges together. So at some point we'll want to capture: hey, there are some hand-generated APIs that are administrative, and then there are the auto-generated ones, and they merge together into one composite set of DASH APIs.
B
Right
right,
yeah,
we
will,
as
soon
as
we
will
have
an
agreement.
We
can
merge
this
this
api.
Then
I
will
have
a
pointer
to
it,
explaining
that
this
was
done
manually,
you're
crack.
Okay,
thank
you
all
right,
yeah.
So,
just
to
recap,
what
we
have
is
the
admin
state
which
is
create
and
set
so
we'll
be
able
to
change
the
state.
E
A question, or comment, about the admin state: is the admin state the one that we discussed was related to the role, that is, the role of master and slave, in that sense?
B
Role
is
a
separate
attribute
which
is
also
optional,
I'll
get
to
it
next,
but
admin
state
is
just
up
or
down
either
we
we
are
okay
with
starting
the
sync
or
maybe
we
are
pending
some
some
other
input
from
the
control
plane
and
waiting
for
for
the
session
to
be
able
to
start
the
sync
up,
and
the
role
is
the
next
attribute,
which
is.
B
So we have two roles; this is an enum, let me go to it. We have two roles defined: active and backup. Active is the default one, meaning you can be working independently from your peer: because you are active, you can accept new connections even if the operational state of the session is down, even if your peer is not available. You can still accept new connections, and as soon as the peer becomes available, they will be synced up.
B
However,
if
someone
desires
to
take
high
availability
as
higher
priority
than
new
connections,
then
they
can
declare
one
of
the
gears
as
a
backup,
so
that
if
the
session
goes
down,
backup
will
start
receiving
all
of
the
existing
flows
that
it
will
still
forward
as
before.
However,
it
will
not
accept
new
connections
until
there
will
be
there
will
be
a
new
year
and
the
session
will
go
up
again.
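The two roles described above can be condensed into a small C sketch. The enum and helper names are illustrative assumptions, not the SAI definitions.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical names; the SAI enum in saiexperimentaldashha.h may differ. */
typedef enum {
    DASH_HA_ROLE_ACTIVE, /* default: accepts new connections even when the session is down */
    DASH_HA_ROLE_BACKUP, /* keeps forwarding existing flows, refuses new ones until the session is up */
} dash_ha_role_t;

static bool ha_accepts_new_connections(dash_ha_role_t role, bool session_up)
{
    /* A backup only admits new connections while its peer session is up. */
    return role == DASH_HA_ROLE_ACTIVE || session_up;
}
```
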
B
But again, this is up to the implementation and the underlying protocol of choice. The implementation decides whether the peer is available or not based on criteria of its own choosing: for example heartbeat messages, or, if it is a reliable protocol, the feedback of acknowledgements from your peer that everything was received.
B
HA session, meaning the peer-to-peer session.
C
So I have a clarification question. This is the operational state, right? Because one thing is the state of the peering: whether the peering is established, is starting to sync, is in the process of syncing, or the sync is fully completed and it's ready. And the second thing is whether the peer is announcing BGP. The idea is that, for example, one peer goes down and then the peer comes back up.
C
Okay
now
the
spirit
is
operational
from
the
point
of
view
of
c
completed,
and
now
I'm,
let's
say,
okay
to
start
announcing,
bgp
back
right,
because
one
thing
one
thing
is
that,
for
example,
bgp
should
not
be
announced
before
we
complete
the
sync,
because
if
we
start
announcing
bgp
that
some
of
the
connection
in
this
active
active
mode
right
may
land
on
this
new
peer,
the
connections
are
not
it
sync
right.
So
the
peer
will
not
be
would
not
know
how
to
process
this
right.
So
I
want
to
basically
separate
two
things.
C
I
want
us
to
be
able
to
separate,
bringing
bgp
up
and
down
operational
versus
versus
hs
session
that
basically
syncs.
The
connections
right
because,
like
I
would
like
to,
for
example,
be
able
to
send
the
connections
even
if
the
bgp
is
not
up
right,
and
this
should
be
completely
unrelated
from
bringing
bgp
up
or
down
right.
C
Yeah, I would say that the states of the state machine will be more than that, because there will be something like: the peer gets connected, or the peer can be completely disconnected, or the peer can be connected and then the state machine goes through, let's say, starting syncing, sync in progress, then sync completed and ready, and then ongoing sync, this kind of thing. Otherwise, just down versus up...
C
It's
not
sufficient
from
the
point
of
view
of
down
means
like
that
may
mean
that
I'm
not
disconnected,
but
that
may
also
mean
that
I'm
not
fully
ready,
because
not
everything
is
sync
right.
So
definitely
we
need
to
differentiate
down.
From
the
point
of
view,
the
connection
can
be
established
from
something
like
there
is
a
syncing
in
progress,
but
I
don't
have
yet
full
state
fully.
C
Sync
it
to
can
bring
basically
the
session
fully
up
from
the
bgp
and
then
the
third
state
that,
for
example,
right
now,
the
the
state
is
fully
sync
and
I'm
ready
and
everything
is
syncing
real
time.
So
I'm,
basically
that
the
period
is
fully
ready
with
everything
is
synced
from
one
side
to
the
another,
and
this
basically
only
means
that
this
is
the
time
when
I
can
raise
the
pg
position
up.
So
definitely
more
states
here
should
be
needed.
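The richer operational state machine requested here (matching the "starting syncing, syncing progress, sync completed" enum suggestion in the notes at the top) might look like the following sketch; all names are assumptions.

```c
#include <assert.h>
#include <stdbool.h>

/* Candidate operational states; illustrative only. */
typedef enum {
    HA_PEER_DISCONNECTED,     /* no peering established                     */
    HA_PEER_CONNECTED,        /* peering established, bulk sync not started */
    HA_PEER_SYNC_IN_PROGRESS, /* bulk sync of existing flows underway       */
    HA_PEER_SYNC_COMPLETED,   /* fully synced; ongoing real-time sync       */
} ha_peer_state_t;

/* Per the discussion: the VIP must not be announced over BGP before the
 * sync completes, or new connections may land on a peer that cannot
 * process them. */
static bool ha_may_announce_bgp(ha_peer_state_t st)
{
    return st == HA_PEER_SYNC_COMPLETED;
}
```
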
C
So I would say, because one DPU will be handling, I think, multiple ENIs: if we, for example, peer based on the ENIs, that means that one device will need to have multiple peers. For example, potentially one ENI would be paired with an ENI on one DPU and another with an ENI on a different DPU, this kind of thing. I really think that we should establish something else.
C
Let's
say
that
I
that
basically
there
will
be
something
like
appearing
groups
or
peering
sessions,
this
kind
of
stuff,
and
we
can
say
that,
for
example,
that
that
at
the
beginning
that,
for
example,
device
can
only
have
let's
say
one
periodic
session
with
with
those
other
cards.
Let's
say
right
and
those
e9s
belongs
to
the
appearing
session
right.
So,
for
example,
this
can
stop,
because
if
we,
if
we
do
the
appearing
granularity
based
on
the
eni,
I
think
it
may
be
too
granular
from
the
point
of
view
right.
C
So
I
because,
like
if
every
single
eni
can
potentially
land
on
com
like
peer
with
completely
different
card,
potentially
yes,
but
at
the
same
point,
I'm
more
like
freaking
like
like
in
this
design.
Here,
that's
kind
of
like
pinning
session,
and
this
spinning
session
has
basically
the
list
of
eni
object
to
sync
right.
Yeah.
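The "peering session owns a list of ENI objects" idea could be modeled roughly as below. The struct layout and the 64-ENI cap are assumptions for illustration, not an agreed data model.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define HA_MAX_ENIS_PER_SESSION 64 /* assumed cap, not from the spec */

typedef struct {
    uint32_t peer_ip;                       /* unique IP of the peer card */
    size_t   eni_count;
    uint32_t enis[HA_MAX_ENIS_PER_SESSION]; /* ENI object IDs synced over this session */
} ha_peering_session_t;

/* Attach an ENI to the session; returns 0 on success, -1 when full. */
static int ha_session_add_eni(ha_peering_session_t *s, uint32_t eni_id)
{
    if (s->eni_count >= HA_MAX_ENIS_PER_SESSION)
        return -1;
    s->enis[s->eni_count++] = eni_id;
    return 0;
}
```
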
D
Well, one could build the data model or the API to be generalized, to accommodate such a sophisticated approach in the future, and then we all agree we're going to implement peering one-to-one for the foreseeable future; but the data model won't prevent us from becoming more advanced if we choose to. That way we limit the scope of work, but we don't create backwards-compatibility issues or API funkiness in the future. But that takes more work, to even build in that road map.
D
I'm wondering, and Christina actually mentioned this to me once in the past: aren't there existing high-availability architectures and things that we use every day, like Redis or other databases, other places where people have already kind of worked this out and agreed on the states and the synchronization mechanism? Can we draw inspiration from some of those things, so we don't reinvent all the high-level administrative things?
D
Been
you
know
around
since
telecom
days,
and
people
have
solved
this
problem
about
a
thousand
times.
Then,
if
we
reinvent
everything
from
scratch,
we
might
be
making
more
work
for
ourselves.
C
Yeah,
I
want
to
add
to
it:
yeah.
We
should
definitely
put
as
much
as
existing
states
impo
it's
possible
as
possible
right
to
do
this,
and
for
me,
the
kind
of
like
sinking
mechanism
right,
because
the
eni
is
is
too
false
right.
The
eni
has
like
a
ghost,
which
is
the
configuration
right
which
is,
for
example,
what
kind
how
acos
looks
like
how
routes
looks
like
for
the
specific
eni
and
the
running
policies?
Right.
C
That's
one
thing,
and
the
second
thing
is
from
the
point
of
view
of
actual
flows
right
and
and
for
me,
the
most
important
here
would
be
really
to
design
the
synchronization
to
not
have
to
sync
the
eni
state,
which
is
their
policy
that
that
our
control
plane
can
sync
and
our
control
plane
already
has
a
way
to
basically
send
it
to
multiple
peers
right
in
normal
discipline.
Decisions
kind
of
like
eventual
consistency
right,
but
the
but
the
most
important
part
would
be
to
actually
sing
the
flaws
right.
C
So
I
would
like
to
use
this
has
session
to
releasing
the
flaws
and
make
sure
the
flaws
get
seeing,
because
that's
that's,
I
would
say
the
heavy
part
because
should
be
seeing
almost
like
really
real
time
with
watching
right,
but
from
the
point
of
view
of
the
eni
state,
which
is
the
which
is
the
basic,
let's
say:
entire
uni
policy
list
of
lists
of
kind
of
mappings
to
translate
this
kind
of
stuff.
C
That
is
that
we
can
sync
through
control
plane
through
eventual
consistency,
because
even
if
the
customer,
for
example,
update
the
accounts
right
so,
for
example,
customer,
let's
say
switches,
one
accola
and
ads,
for
example,
deny
all
this
kind
of
stuff
right.
They
do
expect
that
they
will
they
like
once
they
click
on
portal,
for
example,
to
add
samako
right.
They
do
expect
that
hey
this,
this
kind
of
new
deny
rule
will
happen
in
let's
say
one
seconds
few
seconds,
this
kind
of
stuff
right.
They
don't
expect
it
in
fully
in
real
time.
C
So,
in
this
case,
like
eventual
consistency
when
we,
when
you
basically
plump
this
in
in
specific
location
at
some
point
right
that
that
normally
happens
right
but
but
the
flow
this
is,
I
would
like
to
concentrate
this
kind
of
thinking
mostly
about
the
flows.
If,
because
like
once,
the
connection
is
already
established
right,
because
the
aha
literally
the
main
thing
is
like.
C
If
there
is
one
device
right,
if
this
device
never
dies
right,
then
we
don't
need
to
have
a
cha
right,
but
the
device
may
have
power,
outages,
kind
of
stuff
right,
and
this
is
the
one
and
the
main
reason
why
you
need
to
have
that
if
one
device
dies
for
some
reason
or
another,
either
networking
problem
or
the
power
plan
or
basically
just
hardware
dies
right,
then
the
other
device
can
can
take
over
right
and
and
for
the
other
device
to
take
override
the
important
part
is
yeah.
C
Definitely
the
acls
should
be
synced
right,
but
at
the
same
point
they
are
being
synced
with
eventual
consistency
which,
which
means
that
basically,
within
a
few
seconds,
all
that
all
the
cards
will
have
the
same
state
right
but,
most
importantly
from
the
point
of
view
ability
right.
If
I
initiate
the
connection
and
one
device
dies,
then
the
connection
should
be
other
part
on
the
other
part.
B
Yeah
right,
so
this
h.a
session
is
all
about
sinking
flows.
We
are
not.
We
are
not
looking
at
thinking
that
policies,
because
this
is
what
controller
will
do.
E
I see. What Gerald mentioned last time was that the BGP cost advertised for the two peers will be different: the higher cost will become the backup and the lower cost will become the active.
E
So
I
just
feel
that
the
role
has
a
lot
to
do
with
the
bgp
configuration
as
well,
and
if
a
dpu
comes
up
and
and
the
role
is
already
active
and
then
if
it's
an
if
then
another
dpu
you
know,
partner
comes
up
right,
which
is
also
sort
of
equidistant
from
the
vm.
Then
the
cost
will
have
to
be
manipulated.
So
you
know:
how
does
that
work?
How
what
is
your
thought
on
that.
C
Without
a
ace
and
prepends,
the
other
announcement
with
asn
prepends
right,
this
is
the
current
implementation
that
we
are
trying
in
azure
right,
but
at
the
same
point
no
one
else
in
azure,
there's
only
one
other
social
that
is
using
this,
and
this
is
huge.
It
causes
huge
impact
on
the
control
plane.
Sorry
in
the
control.
C
...the physical network, because all the ASN prepends need to be propagated with community strings, and that uses up the TCAM space on the ToRs, so it's basically not scalable to work with the active-backup design. That's why what we are proposing here, considering that we are making a new standard, is to create this the way all the other services on our platform work, like, for example, the software load balancer and similar: they actually have multiple instances which are announcing active-active, so there are no prepends or anything like that.
C
In this case the physical network doesn't need to propagate communities and doesn't need to use the TCAM space. The cards just announce the VIPs, which can be aggregated, and the ToRs will spread the traffic using a normal hashing algorithm, this kind of thing. Some of the traffic will land, for example, on this device, and if this traffic lands on this device, this device will sync...
C
This
flow
to
the
per
device,
if
some
other
traffic
lights
on
the
per
device,
this
per
device
is
saying
back
to
the
first
device.
So
there
is
availability,
but
there
is
no
distinction
from
the
point
of
view
that
one
is
strictly
active
and
one
is
strictly
passive
right,
so
we
don't
want
to
have.
We
potentially
may
want
to
have
this
distinction,
maybe
as
a
in
case
it's
needed,
but
this
is
not
really
the
primary
mode
of
operation
that
you
would
like
to
use
so
primary
mode
of
operation.
B
So this is the intent: the attribute is optional, and by default the peers should be considered active.
E
And
when
it
is
back
up,
it
is
still
active,
active
right.
It's
just
not.
What
is
the
exact
definition
when
we
call
it
backup.
B
Either we want another device to back us up, so that we are not left unsynced for some period of time, or we don't really care: we prefer to accept new connections and have the second device come up in the meantime, and that's why we would choose active. So backup is just: not accepting new connections.
C
Yes,
so
there's
no
accepting
new
connections
right.
I
wonder
from
the
implementation
point
of
view
right
because,
like
we
cannot
advertise
the
vip
during
this
time
right
because
like
if
we
advertise
the
vip
on
the
bgp
on
this
on
this
card
that
has
set
up
as
a
backup
right,
then
all
the
packets
will
go
to
let's
say
to
this
backup
device
right
and
it
does
not
accept
the
connections
we
kind
of
like
black
holy
customer
traffic.
C
Right
so
so
that's
why
I'm
more
thinking
that
not
accepting
new
connections
should
be
more
like,
like
separate
stuff,
that
if
I
don't
access
the
connection,
I
just
don't
advertise
the
vip
on
the
pgp
right.
This
means
basically
not
not
accepting
connections
right,
because
I,
if
I
will
be
advertising
the
vip,
but
at
the
same
point
they're
all
with
backup.
This
means
I'm
basically
black
calling
the
customer
traffic
in
general.
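The invariant being argued here (a backup that keeps advertising the VIP black-holes traffic) can be stated as a one-line check; again a sketch with assumed names.

```c
#include <assert.h>
#include <stdbool.h>

typedef enum { ROLE_ACTIVE, ROLE_BACKUP } role_t;

/* A card black-holes customer traffic when it still attracts traffic
 * (advertises the VIP over BGP) but refuses new connections (backup role
 * while the HA session is down). */
static bool black_holes_traffic(role_t role, bool session_up, bool advertises_vip)
{
    bool accepts_new = (role == ROLE_ACTIVE) || session_up;
    return advertises_vip && !accepts_new;
}
```
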
C
That's correct, yeah, we need to update it, because this was created over a year ago, I think, at a time when we always wanted to have active-active but it was deemed very, very hard to do at the beginning. That's why active-passive was proposed. But yeah, it should be updated.
G
Quick question: for any given card, I mean DPU, at any given time, isn't it true that it will act as active for some ENIs and act as a backup for some other set of ENIs?
G
When we made that point at that time, my basic understanding was that the card will be behaving as active for those 32 ENIs and behaving as the backup for the rest of the 32 ENIs, thereby supporting 64 ENIs, and then you multiply the flows and connections and so forth accordingly. But now you're saying something different?
C
Have
some
have
some
opportunities
like
like
in
a
way,
for
example,
the
active
passive
design
is
actually
working
right
now,
and
it's
kind
of
a
interesting,
interesting
way
of
basically
saying
that
one
card
is,
for
example,
not
receiving
any
traffic.
The
other
car
basically
is
receiving
full
traffic.
That
is
basically
doing
this
right
and
the
the
reason
why
we
have
basically
two
set
of
vnis
per
card
is
mostly
from
the
point
of
view
that
we
do.
C
...the active-active versus backup design. The proposal is this: there are some constraints with regard to flow replication, because replication in one direction is theoretically easier than flow replication in both directions, where, for example, one packet lands here and the other packet lands there. So active-active is slightly more complex than active-passive, because active-passive always has flow replication in one direction only. At the same time, we don't want to have one card fully dormant.
C
That's
why
we
propose
that
if
we
are
to
stick
with
active
passive
design,
then
at
least
have
two
groups
per
card,
and
one
one
group
basically
will
be
active
one
on
the
device
and
passive
for
the
other
device.
But
the
other
group
of
enis
will
be
active
for
the
another
device
and
pass
it
for
the
first
device.
C
So
so
both
cards
are
active
at
this,
like
from
the
point
of
view
of
receiving
traffic,
but
it
should
receive
traffic
as
the
primary
for
a
different
set
of
cards
versus
is
a
standby
for
a
send
off
in
eyes.
C
But
it's
not
going
for
differences
to
be
nice
right,
and
this
this
is
mostly
the
two
groups
is
mostly
to
to
make
sure
we
don't
have
like
a
card
that
is
completely
not
utilized,
so
this
was
the
only
reason
why
we
decided
to
have
two
groups
instead
of
one
group,
because
because
one
ineffective,
passive
kind
of
you
think
about
this
is
always
kind
of
like
a
one.
Car
is
active
for
a
nice.
The
other
is
standby,
but
in
this
case
the
other
card
is
completely
utilized.
C
So
that's
why
we
decide
to
have
two
groups,
so
both
cards
are
utilized
and
in
case
one
car
dies,
then
the
sure
the
capacity
of
one
cars
is
halved,
because
this
card
is
only
serving
traffic
for
both
the
the
ones
that
were
active,
plus
the
one
that
will
potentially
backup
which
are
right
now
is
active,
and
but
it
is
still
using
the
asm
prepared
right.
C
So
once
you
have
this
replication
right,
then
the
active
active
is
really
the
preferred
one,
because
the
bootcad
are
utilized
this
kind
of
stuff
right,
and
you
can
think
that
it
is
very
easy.
Once
you
have
this
flower
application
working
both
direction
in
the
irrespective
one
packet
lands
right
to
switch
the
in
effective
passive.
If
you
want,
because
switching
to
active
passes
is
really
the
the
bgp
configuration
right.
C
So
so,
if
we
consider
bgp
configuration
to
be
completely
separate
from
the
configuration
of
the
flow
sync
right,
then
if
the
flossing
happens
right,
irrespective
of
how
bgp
is
advertised
and
redirects
the
traffic
right,
if
the
flossing
happens,
when
the
packet
lands,
for
example
on
device,
I
and
it
gets
replicated
device
b
or
some
other
packets
lands
on
device
b
and
grades
duplicate
two
to
basic
device.
A
and
and
this
works
in
both
direction
right
then,
we
have
active
active,
which
means
the
flow
application
is
fully
working
right
and.
C
Moving from this to active-passive doesn't require any changes to the sync algorithm; it just requires potentially some adjustment to BGP. So I want to make sure that the BGP logic, which is, for example, a completely separate microservice that is advertising BGP, is capable of advertising...
C
For
example,
one
card,
I'm
advertising,
let's
say
without
asn
people
the
same
web,
the
other
card
would
accent,
and
this
will
allow
control
plane
of
of
whoever
will
be
using
this
to
decide
if
they
want
to
have
active
active,
which
is
the
preferred
way
right
or
they
want
to
play
with
active,
passive
or
those
kind
of
stuff
right.
So
that's
why?
Basically,
those
are
two
kind
of
whether
we
later
do
like
active
passive
versus
a
versus
active
active
will
be
only
depending
on
the
bgp
right,
but
this
is
only
on
the
condition.
C
This
will
be
only
possible
on
the
condition
that
the
flossing
algorithm
actually
works
in
both
direction
right,
so
it
because,
like
if
we
design
from
day
one
to
only
support
active
passive
design,
then
there
will
be
lots
of,
for
example,
for
example
like
shortcuts.
That
will
only
sync
flows
in
one
direction,
and
this
kind
of
stuff
right
and
we'll
never
be
able
to
do
active,
active
but
really
active,
active
is
kind
of
like
a
long-term
vision,
because
this
will
allow
us
to
save
the
camp
space,
and
this
is
also
how
other
aj
solutions
are
working.
G
Yeah
no
thank
you
michael.
I
I.
G
It's based on... again, we are not precluding active-active, and it's a more flexible model. Even if we want to start with the active-passive part, let's not preclude active-active by saying we are never going to do it. The understanding is always to have more flexibility around it, and that's why my initial question was: do we always think in those terms?
G
You
know
they'd
be
appearing
in
e
and
I
so
as
to
not
basically
say
that
hey
you
know
there
will
always
be
one
to
one
mapping
so
to
speak.
You
know
one
card
may
be
protecting
ens
from
multiple
different
cards.
So
if
that
flexibility
is
there
in
the
design
or
in
the
data
model,
then
we
are
essentially
are.
You
know
really
designing
it
for
the
more
flexible
and
and
more
scalable
sort
of
like
more
expandable
design.
Yes,
yeah
and.
C
But
one
thing
that
I
want
to:
I
want
to
tell
basically
that,
like
because
we
have
taken
space
issues,
because
we
know
that,
like
we
are
advertising
lots
of
prefixes
and
this
kind
of
stuff
in
data
centers
right.
We
know
for
sure
that
we
will
will
not
be
able
to
put
in
a
data
center
any
new
solution
that
is
basically
active
passive
right
only
because,
like
just
asn
prepares
scale
in
very
limited
way,
because
because
they
just
need
to
be
propagated
everywhere,
they
can
be
aggregated
this
kind
of
stuff
right.
H
So, Michal, just to clarify: the summary of what you're saying is that whether a particular appliance is in active or passive mode is completely dependent on the BGP reachability to it, right? Only the active would always receive traffic, because it will have the lowest cost in the network, while the other peers within that peering group will keep receiving the flow sync but will never be responding, since they have higher costs. Only one of them will be effectively active; there is no active-passive otherwise.
C
Yes,
that's
correct
and
one
clarification
on
this
yeah,
so
you're
right
that,
basically
it's
just
the
bgp
configuration
right
and
just
clarification
on
this
that
that,
in
those
car,
those
cars
or
those
devices
that
have
been
data
center
right,
we
want
to
actually
set
the
bgp
without
asm
prepaid
right.
So
so
the
scenario
that
we
are
discussing
right
now
that
will
be
advertising.
For
example,
something
could
be
gps
active
like
lower
cost
and
the
other
as
passive
as
a
as
smaller
cause,
like
bigger
causes
kind
of
stuff
right.
C
This
is
not
how
we
configure
bgp
like
this
is
not
how
we
will
configure
bgp
right.
We
may
decide
to
maybe
configure
this
in
some,
like
maybe
edge
site
for
the
customer,
for
some
crazy
reason.
I
don't
know
right,
but
but
how
will
basically
configure
those
cards?
I
can
tell
you
that
you
will
get
those
cards
established
peering
and
once
the
appearing
is
sync
right,
we
will
basically
both
cards
will
advertise
advertise
basically
bgp
with
the
same
cost.
C
Yeah, and let me maybe share quickly; I'm just checking whether the slides contain anything...
C
...confidential or not. Probably not at this moment, so I can quickly share it. This is roughly what we designed; the project on our side is called Sirius. You guys should see that now.
C
That
basically
indicates
one
eni
right,
so
so
in
this
case,
ignore
at
the
beginning
the
flow
splitter,
this
kind
of
like
additional
stuff
that
I
can
discuss
in
a
second
right,
but
if
you
guys
consider
just
only,
for
example,
the
the
most
inner
part
which
is
kind
of
the
card
that
is
basically
pairwise
replication
with
the
with
the
red
arrow
right.
C
I
think
it's
like
teal
and
yellow,
and
the
same
ghost
state
is
basically
on
the
other
card
stuff
right
and
this
card
is
basically
has
a
pairwise
replication,
which
means
that
the
flow
can
here,
it's
kind
of
like
higher
asn
and
lower
sense.
So
it's
kind
of
like
slightly
outdated,
but
imagine
there
will
be
no
higher
s
lower
asl
a
send.
Basically,
those
will
be
advertising.
C
Let's
say
the
same
beeps
like
let's
say
this
23001
will
be
advertised
by
both,
for
example,
right
and
then
the
traffic
can
actually
land
on
any
of
those
cards
right
and
because
these
cars
are
appearing
they
should
they
should
basically
replicate
the
flow
to
another
card
right.
So
this
normal
replication
and
and
the
reason
why-
and
this
is
basically
you
wanted
this-
to
be
working
active,
active
mode
right.
C
So,
for
example,
if
some
vm1
will
send
some
traffic
right,
it
should
not
matter
if
it
lands
on
the
on
the
card
on
the
left
series
appliance
on
the
card
on
the
right
series
appliance
right.
Wherever
this
rapid
glance
from
door
perspective,
it
should
be
basically
replicated
with
the
other
card
right,
and
this
is-
and
we
are
thinking
from
the
aha
of
kind
of
like
enabling
this
in
a
few
steps.
C
But
at
the
end,
don't
have
like
a
step
three
potentially
right,
but
the
but
the
kind
of
step
one
and
then
and
the
master
one
will
be
kind
of
like
have
eni
on
on
two
cards,
so
basically
program.
The
seminar
on
card
one
and
card
two
from
the
ghostly
perspective
right
and
don't
establish
yet
the
floor,
application
that
we
are
talking
about
right,
so
the
floor
application,
for
example,
there's
no
floor
application
right.
So,
for
example,
just
the
ghostly
on
both
cards
right
and
one
would
be.
C
Let's
say
on
the
bgp-
will
configure
that
one
is,
for
example,
with
a
higher
asl
like
lower
asn,
prep
and
higher
same
people.
This
kind
of
stuff-
and
it's
kind
of
this
kind
of
backup
active,
active
backup
right
when
we
can
potentially
start
testing
those
cards
kind
of
stuff.
But
this
also
only
guarantees
for
the
customer.
C
The
availability
from
the
point
of
view
of
the
goal
state,
but
not
from
the
point
of
the
flow
right,
so
here
you
guys,
can
imagine
that
like
if
one
because
there
is
no
flow
replication,
they
can
do
mice
as
well
micelle
one
if
one
car
dies,
then
the
other
card
can
pick
up
new
connections
because
the
the
eni
was
configured
so
the
gaussian
was
replicated
by
our
control
plane
right,
but
because
there
was
no
fraud
application.
Yet
all
the
connections
will
die
for
the
customer
right
in
case
one
car
dies
right.
C
So
that's
why,
for
example,
myself
one
is
kind
of
good
for
testing,
but
not
only
for
production
right,
so
we
can
test.
If
two
cards
can
can
pro
can
basically
support
supported
traffic
with
this
kind
of
like
asm
failover
right,
but
there
is
no
forward
application
yet
right
the
floor
application.
Actually,
what
we
need
here
is
is
what
we
are
discussing
is
basically
the
step
two,
which
is
the
two
cards,
the
same
unites
program,
two
cards
and
we
set
up
the
spareware
for
application
right
and
we
have
this
automatic
bgp
fill
over.
C
In
this
case.
The
step
two
is
kind
of
this
automatic
bgp
failover
and
here's
like
a
active,
active,
active
and
no
asn
prepaids
right
in
this
case
right.
So
so,
in
this
case,
it's
active
active.
So
so
both
cards
have
the
goal
state
and
they
set
up
the
pervs
application
right
and
if
something
fails,
like
collection
fails,
the
bgp
will
take
over
and
automatically.
Basically
all
the
connections
will
be
going
to
only
one
card
and
it's
active,
but
because
it
was
perfect
peris
for
all
the
ph
was
happening.
That's
why?
C
Basically,
this
card
has
all
the
connections
so
there'll
be
some
few
packets
that
transmit
from
the
customer
perspective
to
wait
for
wgg
converges,
which
is
like
5
or
10
seconds,
this
kind
of
stuff
sure
unfortunate,
even
event
and
during,
however,
no
already
established
connection
will
get
dropped.
So,
for
example,
custom
will
experience
some
particular
transmissions
right
if
they
were
like
transferring
some
bigger
data
from
one
vm
to
another
or
from
storage
right,
but
no
connection
will
be
dropped
because
because
tcp
expiration
window
will
not
hit
right.
C
So in this case, the second solution is the one that we want to have, and it's the one that we are designing. Basically, you want to have the same ENI configured on both cards, flow replication established between them, and BGP that will automatically cover failover, with this active-active mode and no ASN prepends. This is the one that we want to have and the one that we are discussing, and in the long term,
C
we also want to potentially spread the traffic. We call it kind of infinite scale, but not really fully infinite scale. For example, let's say one card will give, I don't know, like 4 million CPS, which means that if we put an ENI on this card, the customer can only reach, let's say, 4 million connections per second, and this is the upper limit of the card.
C
We actually want to put the splitter, which is on the left side, and do more intelligent hashing, in a way that, based on this consistent hashing of the connections that it gets, for example from VM1, we will actually send this traffic through this kind of tunnel, not to the specific card that is in pairwise replication with the red one, but, for example, to one of the other cards, which is also paired with another card.
C
That's why there are additional cards which are, for example, pairwise with a card that is paired with blue or green, and in this case we'll program the ENI on more than two cards: we'll program it, for example, on all six cards, and the cards are paired. So basically one card, for example, is paired with another card,
C
the middle card is paired with some other card, the same or a different one. There are just two rows of cards shown here, but you can imagine that those cards can be anywhere in the data center. So, for example, in total we'll have, let's say, six cards, and each of them will be paired.
C
So, for example, in this case we have like three cards here and three cards paired, and imagine that the same ENI is not only programmed on the main card and the paired card, but we program it on all the cards. So, from the goal state perspective, all the cards will have the same goal state, so we'll be able to serve new connections, but all the cards will be advertising a different set of VIPs. So, for example, the middle card will be advertising
C
a VIP closer to the VMs, which means that, based on the five-tuple hash of the connection, we will decide: okay, let's say one third of the connections that this VM originates go to the specific VIP, which means, for example, to the first set of cards, and the other part of the connections go to other VIPs. This would be kind of the same ECMP that is happening on the physical network, just more intelligent from our point of view.
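The five-tuple-to-VIP split described here can be sketched in Python. This is a minimal illustration, not the actual splitter: the VIP addresses are made up, and a stable hash stands in for whatever consistent-hashing scheme the real splitter would use.

```python
import hashlib

# Hypothetical VIPs, one per paired card group (illustrative values only).
VIPS = ["10.0.0.1", "10.0.0.2", "10.0.0.3"]

def pick_vip(src_ip, dst_ip, src_port, dst_port, proto, vips=VIPS):
    """Deterministically map a flow's five-tuple to one VIP.

    Every packet of the same flow hashes to the same VIP, so the
    splitter stays stateless; with three VIPs, roughly one third of
    the flows land on each card pair.
    """
    key = f"{src_ip}|{dst_ip}|{src_port}|{dst_port}|{proto}".encode()
    digest = int.from_bytes(hashlib.sha256(key).digest()[:8], "big")
    return vips[digest % len(vips)]
```

Because the mapping depends only on the five-tuple, all packets of an established connection keep going to the same card pair, which is what makes the split compatible with pairwise flow replication.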
C
It's not only doing ECMP on layer 2, which is basically this link versus this link; it'll be doing it on layer 3, which is basically IP-based ECMP, so more intelligent. So this will allow us to reach even more capacity than the 4 million. For example, if one card gives us 4 million, then theoretically we can evenly split the flows the customer is experiencing, so one card will be handling, let's say, one third of the flows, the other card another
C
third, so we can have like 12 million CPS, or even more, if we spread the traffic evenly. This assumes this active-active design between the cards, and it also decreases the blast radius, because if one card fails, only some part of the traffic gets affected; the other card needs to handle all of that traffic,
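The scale and blast-radius arithmetic here can be written out as a small sketch (4 million CPS per card and a three-way split are the figures used in the discussion; the function names are mine):

```python
def aggregate_cps(per_card_cps: int, vip_groups: int) -> int:
    """Total connections-per-second when the splitter spreads flows
    evenly across `vip_groups` card pairs, each serving one VIP."""
    return per_card_cps * vip_groups

def blast_radius(vip_groups: int) -> float:
    """Fraction of customer flows affected if one card fails: only
    the share hashed to that card's VIP, picked up by its peer."""
    return 1 / vip_groups

# With the figures from the discussion: 4M CPS per card, split 3 ways,
# aggregate_cps(4_000_000, 3) gives 12M CPS, and a single card failure
# touches blast_radius(3), i.e. one third of the flows.
```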
C
but only the one third of it. So basically we'll be doing stuff on our side which is outside of this engagement, because this engagement, from the hardware perspective, is to have flow replication between two cards. But on our side we will also have the splitter, which will basically evenly split this ECMP based on the VIPs, and this is kind of the next step on our side. But for this we need to have step number two, which is the flow replication, to make sure we provide availability between two cards,
C
so the connections are not getting lost. So that's kind of how we are thinking about this, and this is how we can reach higher connections per second than the 4 million per card, and the customer will be super happy about it,
C
I think. And this is kind of how the IP addresses look from the point of view of allocation: basically each card will have a unique address, which is, for IPv4, maybe, let's say, a /27. So each card, if we have, for example, six cards per chassis in the risers (because this is how much space there is in some of the devices),
C
This
is
what
we
will
do
and
we
will
allocate
this
unique
addresses
right,
and
this
will
each
card
will
have
this
like
unique
veep.
This
is
the
one
that
is
basically
being
advertised
on
the
on
the
bgp
right
and
potentially
unique,
really
128,
but
this
really
advertises
64
on
tours
right.
So
this
can
staff
plus
plus
some
additional
management
ip
for
the
management
network
right.
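The addressing scheme just described (a unique IPv4 peering address per card out of a small per-chassis block, plus a unique per-card /128 VIP covered by one /64 advertised to the ToRs) can be sketched with Python's ipaddress module. The prefixes below are documentation ranges, not the real allocations:

```python
import ipaddress

# Illustrative prefixes only (RFC 5737 / RFC 3849 documentation ranges).
CHASSIS_V4 = ipaddress.ip_network("192.0.2.0/27")    # one /27 per six-card chassis
TOR_V6 = ipaddress.ip_network("2001:db8:0:1::/64")   # advertised as a single /64 to the ToR

def card_peering_ips(n_cards: int = 6):
    """One unique IPv4 address per card, used to initiate HA peering."""
    hosts = list(CHASSIS_V4.hosts())  # a /27 has 30 usable hosts, plenty for 6 cards
    return [hosts[i] for i in range(n_cards)]

def card_vips(n_cards: int = 6):
    """One unique /128 VIP per card, all falling inside the ToR's /64."""
    base = int(TOR_V6.network_address)
    return [ipaddress.ip_address(base + i + 1) for i in range(n_cards)]
```

The point of the split is that the per-card addresses identify the card for peering, while the ToR only needs the covering /64 plus the shared highly-available VIPs in its routing table.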
E
Thank you, Michal. This is really useful and very good to see.
C
So, to be honest, the splitting is doing basic splitting, kind of like ECMP, so we will not be doing any intelligent monitoring of the DPU capacity and this kind of stuff. We are assuming that the flows the splitting element sees will be uniformly distributed, so we'll basically uniformly split the flows: let's say one third of the flows, or one fourth, depending on how many VIPs you decide to split across.
C
So definitely not intelligent; maybe in the future, but I don't see this happening right now. At the same time, the splitting, I would say, is outside of our discussion here, because this, plus even more fancy logic that we have on our side, we can do once we have the basic functionality of the flow replication between two cards, and this is basically the main idea of this work stream:
C
to make sure of this point number two, which is basically that we can put the same ENI on two cards, have flow replication, and then, on top of this, use BGP for failover. That's the main goal. Anything else, like the first thing, is mostly internal testing we are doing, not really related to this kind of stuff. Then there is the second step,
C
and the second step is really where we need this work stream: to give the cards the ability to do flow replication, so the customer connections are not being affected. The ECMP is really internal stuff; right now we are doing just normal ECMP using layer 3, but the splitter can be even more complex or whatever, and that's outside of this work stream.
A
I don't think we got through the full set of APIs. We may have to pick up where we left off next week. Did anyone have a comment before we drop off?