From YouTube: DASH High Availability Working Group Aug 30 2022
Description
AMD provided an overview of their HA proposal
B: What we're presenting is a proposal based on the implementation that we have. I'll just give a quick overview of what is to come, so we all know what to expect next. We have some design goals. We have a functional description that includes how the network topology looks and what we have designed it for, and we also go through what the IPs are that we use — what we expect from the network in terms of the VIPs and in terms of the loopbacks — and then how the active-active works. Then we have a small section about the heartbeat, and then we get into the state synchronization: both the bulk sync and the data path synchronization, and how each works. I think Bulky had presented some of this already; we have taken it and tried to put in more detail and abstract it some more.
B: We describe, at a high level, how the sync message looks, and then, wherever we can, we have put down the challenges that we see — gotchas that we ran into when we were implementing — whenever those gotchas are not implementation-specific but more generic.
B: We have tried to bring those out in each of these sections. Then we go through the state machine itself — the HA state machine — with some explanation of the states. After that, to illustrate what we mean, there are a lot of messages going across in each of the states, so there are two sections for these messages. The messages themselves are illustrated to show what we think should go in them.
B: There's one set of SAI definitions: this is the interaction between the SONiC layer and the SDK, and the DPU hardware and software underneath that. Then, finally, we have some control plane messages — these are the messages that we are proposing go between the two SONiC instances — and then there are some message flows. We have taken some important procedures, like node pairing and bulk sync, and shown the message flows between them.
B: Starting from the design goals: what we are saying is that on switchover, the intent is that all connections set up before the switchover have to reliably keep working. Whenever there is a switchover, either planned or unplanned, we try to avoid drops for any flow that was allowed before and was functioning.
B: We try never to drop that flow or create a disruption in it. For planned switchovers, the design attempts zero downtime, and less than two seconds of downtime for unplanned switchovers. Also, because of flow replication, we don't want any data packets to be dropped — that is a goal we try to achieve — and of course there are the high CPS rates that are a requirement here.
B: We want to keep up that CPS rate and sync up the flows while sustaining the same CPS rate — and I think we have talked about this quite a bit — we only sync what we need to.
B: One goal is to not overuse the sync path and to keep the data traffic on it to a minimum — sync only the required packets. At a very high level, just to clear up the terminology in terms of how we are using these words: there is the SONiC stack — this is syncd, swss, and everything else. Then there's the DASH SDK — this is the implementer's SDK layer. Then there's the DPU.
B: The DPU could be hardware plus any associated software components. For example, we might have some software components for the CPS path, and things like that. So the DPU is one glob which covers both the hardware and the software components — everything that is under the DASH SDK is the DPU.
B: Then the network topology itself — I think this is pretty well known, so I'll go through it very briefly. There are two links coming from each DPU going to the ToRs, spread across the two ToRs for redundancy.
B: Apart from that, because we don't want the DPU's identity to be tied to any of the physical links, we have a loopback IP, separate from these two link IPs, which we call the control plane IP, or the control network IP. And then there are two data path VIPs that are shared between the DPUs, and both DPUs advertise them. The first one, the control network IP, is owned by each DPU individually.
B: It's unique to each DPU, and the reason we need the control plane IP is so that if any one of the links goes down, we don't bring down HA — there is still an identity that can be reached as long as some link is up. So the control plane IP — the CNIP, as we use it underneath — is used for all the control plane traffic: for example, all the synchronization, both bulk and data path.
B: That sync, any heartbeat messages that go between the two, or any control messages in the proposal to move the state machine from one state to the other. Everything that we want for the control plane, or any DPU-to-DPU traffic, is originated and terminated using the CNIPs.
Each VIP has a primary and a secondary node, and this is striped across the pair. So in steady state, when both DPUs are active:
B: One of the VIPs will be primary and the other secondary on one DPU, and it will be a mirror image on the other side — the other VIP will be primary there and secondary here. The idea is that if one of the DPUs fails, or if there is a switchover, the surviving DPU becomes primary for both VIPs and continues to service all traffic for them.
B: Because of this, we are able to do active-active: each ENI is assigned two VIPs. This is what we call ENI-based active-active, and the data paths on both DPUs are always forwarding, always active in terms of the forwarding path, because each DPU is active for at least one of the VIPs.
B: Then there is the data path heartbeat. A heartbeat message is periodically sent between the two DPUs. There is a configurable interval, and a configurable number of losses we can tolerate: if X consecutive heartbeats are lost, we declare the peer lost.
B: This is configurable, and it can be as aggressive as you want — a few hundred milliseconds, or tens of milliseconds. Any questions so far?
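The loss-tolerance logic just described can be sketched as follows (a minimal illustration only; the class, method names, and default values are assumptions, not part of the proposal):

```python
# Minimal sketch of the data-path heartbeat peer-loss detection described
# above: a peer is declared lost after `max_losses` consecutive missed
# heartbeats. Names and defaults here are illustrative assumptions.
class HeartbeatMonitor:
    def __init__(self, interval_ms=100, max_losses=3):
        self.interval_ms = interval_ms   # configurable send interval
        self.max_losses = max_losses     # configurable loss tolerance
        self.missed = 0
        self.peer_lost = False

    def on_heartbeat_received(self):
        # Any heartbeat from the peer resets the loss counter.
        self.missed = 0
        self.peer_lost = False

    def on_interval_expired(self):
        # Called once per interval in which no heartbeat arrived in time.
        self.missed += 1
        if self.missed >= self.max_losses:
            self.peer_lost = True
        return self.peer_lost
```

With a 100 ms interval and a tolerance of 3, peer loss would be declared within a few hundred milliseconds, matching the aggressiveness mentioned above.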
A: Hey Sanjay, I have a question about these data plane VIPs. Theoretically, each DPU can handle multiple ENIs, and these ENIs can be replicated across different sets of DPU pairs. If we have these DPU pairs, do you expect each DPU to have multiple of these VIPs, or will it pair with just one other DPU, and together they provide for the ENIs?
A: Sure — if you have to shard the ENIs across DPUs, then we need multiple of those VIPs.
B: Each ENI is always associated with one VIP. I'm probably not getting the question fully, so I'll just explain: each ENI is associated with one VIP, and that VIP is associated with a pair of DPUs.
C: I think what he's asking is: the way we do that with ECMP, it's done on the VM side, so the ENI could appear on multiple DPUs, but they don't know about each other.
C: All they know is that those multiple DPUs have identical policies, and different connections land on each — but that's done by spreading them through some ECMP algorithm at the VM itself. So the DPUs don't actually know there's another DPU handling some of the connections for that ENI; they have no idea. From their perspective, it's just two VIPs.
D: I believe the question was perhaps about the sharing of the VIP across multiple ENIs, the way I understood it. If you have primary and secondary VIPs, are they primary and secondary per ENI? Because if you think about it, any DPU could be active for some ENIs but standby for others, and vice versa. In that sense, are we really controlling this at the VIP level, or where are we controlling it?
D: Correct, but what happens is that the entire DPU is not really acting as a standby for all the ENIs that it is serving — it could be half and half. You can say that's however the ENIs are distributed — right, right, correct. So then the question here is: how are we really distributing them?
E: Yeah — I think it's VIPs per card, because you cannot really advertise too many VIPs; that causes other issues for the underlying network. But I would say this part is not directly relevant to the HA part — different setups may advertise the VIPs differently.
A: That implicitly means a DPU will get paired with just one other DPU, right? We will not be pairing with multiple other DPUs — is that my correct understanding?
E: I think maybe this is a little bit confusing; we should focus on the HA part, because the setup could be different. Maybe we should simplify the doc a little and focus on the HA part, saying that for the HA we focus on per-ENI-based HA — per ENI, we should be able to do the pairing.
E: That is the requirement for the HA. How you set up those VIPs — it seems like people have more questions on that front — but the real requirement for the HA is to be able to do the per-ENI-based HA.
D: I have a question. Say you have a simple configuration of two DPUs and 64 ENIs. Are all 64 ENIs going to be active on one DPU, with the other DPU basically acting as a standby for all 64? Or are you going to split them across both DPUs, saying 32 are active on one and 32 on the other?
D: This is exactly my point. Now, if that were the case — and that's how I understood it as well — then your VIP advertisements are different for different ENIs. In one case, one DPU advertises itself as the secondary and the peer as the primary, and vice versa, so that the network knows how it is working. Isn't that the case?
B: Okay — it is mirrored. We haven't gone into too much detail on the actual advertisement — on how we control the BGP advertisements, or whatever the protocol advertisements are — but we can. The way it would be is mirrored: the standby, or secondary, announces the two VIPs, one VIP with a good metric and the second VIP with a less desirable metric, and it will be the mirror image on the other side.
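The mirrored advertisement just described might be sketched like this (a hypothetical illustration; the metric values, names, and striping rule are assumptions — the proposal does not fix specific protocol metrics here):

```python
# Sketch of the mirrored VIP advertisement described above: each DPU in a
# pair announces both shared VIPs, one with a good (low) metric and one
# with a less desirable (high) metric, mirrored across the pair.
GOOD_METRIC = 100     # illustrative "preferred" metric
BACKUP_METRIC = 200   # illustrative "less desirable" metric

def advertisements(vips, dpus):
    """Return {dpu: {vip: metric}} for a DPU pair and its shared VIPs."""
    ads = {dpu: {} for dpu in dpus}
    for i, vip in enumerate(vips):
        primary = dpus[i % len(dpus)]  # stripe primaries across the pair
        for dpu in dpus:
            ads[dpu][vip] = GOOD_METRIC if dpu == primary else BACKUP_METRIC
    return ads
```

In steady state each DPU is preferred for one VIP; if one DPU stops advertising, the surviving DPU's backup-metric routes make it effective primary for both VIPs, which is the switchover behavior described earlier.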
B: Then, if there is nothing more, we can go to the synchronization itself. The state synchronization, as we said before, uses the CNIPs. There are two pieces to the state synchronization: there is a bulk sync, and there is the data path sync, which is the flow-by-flow sync.
B
I
think
there
are
two
channels
or
two
pipes
that
we
have
to
communicate
between
the
two
one
is
the
control
plane
channel
and
one
one
is
the
dpsing
channel.
As
the
name
indicates,
the
control
plane
channel
is
used
for
all
the
control
pane
messages
between
the
two
tpus
dps.
I
mean
the
whole
units,
and
then
there
is
a
datapath
sync
channel,
which
is
which
connects
the
two
tpus.
The
actual
forwarding
this
thing
to
synchronize
the
flow
by
flow
I
mean
flow
by
flow.
Sync
happens
there
right
so
the
control
plane
channel
in
this
proposal.
B: It is represented as a bidirectional gRPC stream that goes between the two instances, and this stream can carry messages between the SONiC stacks. These might be messages for controlling the state machine, or messages going DPU to DPU: if there are messages that need to go between the DPUs, they get relayed through the SONiC stack via the control plane channel. Effectively, it is a streaming channel. What bidirectional streaming gives us:
B: It gives us more efficiency. If it were a unary, request/response kind of thing, it would be slower, so the stream gives us an advantage there. And the DP sync channel is another stream, between the two DPUs, which carries the flow-by-flow sync messages between them.
B: Any questions here on this?
E: What are this purple line and the yellowish line?
B: Yes — there might be messages on the control plane channel which are originated by the SONiC stack itself; that is the orange line. Then there is the purple line, for when there is some control plane information — some signaling — that needs to go between the DPUs themselves.
B
I
think
the
dpu
also
could,
via
this
eye,
there
might
be
notifications
that
come
up
get
relayed
across
this
control,
plane
channel
and
to
the
other
side.
So
these
are
essentially
control,
plane
messages,
but
not
between
the
dpu,
but
not
the
data
path.
Sync
messages
if
everything
else
goes
via
the
same
control,
plane
channel.
A: Sanjay, do you cover that in more detail in a subsequent section?
B: Okay. The bulk sync itself: we need it when one DPU boots up and, when it comes up, it sees that there is no peer. We wait for a timeout, and then it goes into what we call primary standalone mode. It is working as primary, but it is not syncing anywhere; it is standalone. This would be for both the VIPs.
B
It
would
go
to
primary
standards
and
it
would
because
there
is
no
peer,
it
would
kind
of
like
we
were
saying
it
would
advertise
the
whips,
attract
all
the
traffic
to
itself
and
to
the
dpu
and
then
forward
those
traffic
make
connection
flow
entries
and
forward
as
knob
right,
and
it
would
build
up
some
flow
state.
B
Let's
say
at
some
later
point
in
time.
The
second
dpu
comes
online
when
the
second,
the
pr
dpu
comes
online,
the
two
would
detect
each
other.
The
two
dps
would
detect
each
other.
Now
the
the
the
new
dpu
that
is
coming
up
would
need
to
sync
up
all
the
state
from
all
the
accumulated
state
from
the
primary
right,
whatever
was
in
the
primary
standalone
mode.
B: This is when we do the bulk sync. At this point we use the control plane channel that we have, and on this channel we sync all this accumulated state. This is a bidirectional stream, and the key difference here — why it is a stream — comes from our experience.
B: What we have discovered is that indiscriminately dumping all the state to the other DPU can start affecting things — one of the VIPs might already have come online, and the dump starts affecting the CPS path of whatever is already being serviced. What the bidirectional stream affords us is that we can stream the state in a way that populates the other side more efficiently.
B: We are following the same perfect sync mechanism: we take a snapshot of what is in the current flow table, and all the new connections that come up are marked with a different color. Then, per the perfect sync mechanism, we sync everything in that bulk-sync snapshot.
B: That goes to the peer. And while we are syncing this — the number of flows we have could be in the millions, and it takes quite a bit of time to sync. I think last time we were talking about tens of seconds — 60 to 80 seconds — to sync all the flows to the other side.
B: While this is happening, we want to sync in the most efficient way, because it's a scale problem, so we stream all the flow sync messages. Each implementation — the DPU on each side — might have efficient ways of selecting the flows that it needs to sync. For example, we could have a slice —
B
You
know
a
control
plane,
a
cps
path,
slice
which
we
want
to
populate
it
at
once,
right
in
one
and
populate
the
next
slice
in
the
next
batch,
etc.
B
Right,
so
that
is
how
we
choose
this
thing.
I
mean
how
the
bulk
sync
happens.
There
is
a
little
more
detail
in
the
message
flow
when
we
get
to
the
bulk
sync
but
effectively.
This
is
the
pulsing
mechanics.
B
So
I
think
the
key
thing
is
it's
not
like.
There
is
a
list
of
all
the
flows
and
we
copy
over
all
the
flows
in
one
go
to
the
other
side.
Rather,
it
is
more
of
a
stream
where
the
tpu
can
choose
which,
which
set
of
flows
to
pull
from
the
primary
so
that
it
can
be
more
efficient
when
we
sync
up
from
that.
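The snapshot-and-color scheme described above can be sketched roughly as follows (illustrative only; the class and method names are assumptions):

```python
# Sketch of the perfect-sync coloring described above: take a snapshot of
# the flow table under the current color, flip the color so connections
# created afterwards are distinguishable, and bulk-sync only the
# snapshot-colored flows. New-color flows travel over the data-path sync
# channel instead.
class FlowTable:
    def __init__(self):
        self.color = 0
        self.flows = {}  # flow_key -> color it was inserted under

    def insert(self, key):
        self.flows[key] = self.color

    def start_bulk_sync(self):
        # Everything at the current color belongs to the snapshot; flows
        # inserted after this point carry the flipped color.
        snapshot_color = self.color
        self.color ^= 1
        return [k for k, c in self.flows.items() if c == snapshot_color]
```

This matches the behavior discussed later: a flow not in the snapshot is handled purely by the data-path sync, while snapshot flows are carried by the bulk sync.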
D: But you do color them, right — as you mentioned previously, you color them, and that's how... But I have a quick question on the IP addresses you mentioned before. There's a control plane IP address —
B: Yes, the CNIP that we were describing — that is the identity for each DPU. The VIPs, because they're shared, can't be used for any of the DPU-to-DPU traffic.
D: So, in other words, all the packets that we are syncing are essentially encapsulated in some way, so that they travel in some sort of encapsulation which, I guess, does not overlap with the provider space. Correct?
B: Correct. If you look at the outer SIP and DIP, it would be CNIP to CNIP, so the provider just routes them; it's not in the tenant space.
B: Okay — so essentially, one thing to note here is that during the bulk sync we are not actually carrying any user traffic itself. This is our own format that we are sending; it's there in the defined messages below.
B
This
is
actually
essentially
a
grpc
channel
between
the
two,
but
what
you
are
saying
about,
encapsulation,
I
think,
makes
more
sense
in
the
dpsync
channel,
because
the
way
we
are
doing
incremental
sync
we'll
talk
a
little
bit
in
a
couple
of
sections
below.
We
would.
B
E: I think I get that part, but I don't quite understand the comments about the bulk sync. I understand the back pressure piece, but I don't understand the selecting part — what is the criteria for selecting the flows to sync? In the initial bulk sync, one side has all the data and the other side has zero flows, so you could just dump all the flows from one side to the other.
E: So what is this selecting part? I don't quite understand it.
B: Even for the bulk sync, we have multiple of these. I'm taking this from Pensando — it is our implementation, but we expect it might be very similar in other implementations too. There are multiple control plane threads, so there is a choice in how we select flows and push them to the other side — this is effectively the purple line going to each of them.
B
It
could
be
we
want
to
for
one
thread.
We
want
to
get
n
number
of
flows
from
the
from
for
one
one
of
the
cps
threads
and
then
we
also
because
it's
all
the
buffers
are
also
limited.
So
from
one
thread
we
don't
want
to
block
any
of
the
threads
in
trying
to
populate
the
the
buffer
to
pull
push
to
the
other
side.
So
there's
some
back
pressure
there
too
right
on
how
much
we
can.
B
B
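The per-thread batching with backpressure just described might look roughly like this sketch (the batch size, buffer limit, and function names are assumptions, not the proposal's API):

```python
# Sketch of pulling bounded batches of snapshot flows from several
# CPS-path threads so no single thread blocks while filling the shared
# outgoing sync buffer (the backpressure point).
from collections import deque

def pull_batches(threads, batch_size, buffer_limit):
    """threads: {thread_id: deque of pending flows}. Take at most
    `batch_size` flows per thread per round-robin pass, never exceeding
    `buffer_limit` flows in the outgoing buffer."""
    out = []
    while len(out) < buffer_limit and any(threads.values()):
        for pending in threads.values():
            take = min(batch_size, buffer_limit - len(out), len(pending))
            for _ in range(take):
                out.append(pending.popleft())
            if len(out) >= buffer_limit:
                break
    return out
```

The bounded buffer is what keeps the bulk sync from starving the CPS path of connections that are already being serviced, per the earlier discussion.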
B: The second thing is — for each of these threads, and I think we noted some of this down in the interaction between the two — as we are picking from these flows, the same flow table is being acted upon by ongoing activity from the network.
B
There
could
be
other
traffic
which
is
changing
the
state
of
the
flow
like
terminating
the
flow,
or
there
could
be
some
policy
changes
that
are
happening
where
we
are
resimulating
or
we
are
changing
the
flow
itself
right.
So
there
is
some
because
that
is
happening
on
the
same
thread
that
is
operating
on
the
bulk
sink
too.
So
there
could
be
some
locks
right
so
and
we
don't.
E: That's fine — I haven't seen the document yet, so basically all I care about is the API between SONiC and the SAI. For example, in your SDK you may have threads, you may have locks, you may have other things that decide which flow you want to sync first — those kinds of things.
B: Right — the way SONiC controls it is to say: start the bulk sync now, pull the flows and push them. The messages themselves are defined messages for the flows. On this stream, the only thing the DPU controls is how we push these flow sync messages — in what order — but it is controlled by SONiC. When we come to the messages, maybe that will be a little clearer.
C: Yeah — I just wanted to add one thing and make sure people understand: there are two things going on at the same time. You are doing a bulk sync, and you are also doing your regular data path sync, where new connections come in and are sent over in parallel. They are independent, and that's why you can take a snapshot and just work on that — the new connections behave as normal; they go through what he's calling the DP sync channel.
A: That was my question: is there any race condition that might happen while we are doing the control plane sync — the flows are queued up there, while the DP sync channel has the latest and has already synced it, and then the bulk sync goes through? Do we have a means to avoid that?
B: Yes — you're right, there are two conditions here. One is: while we have a snapshot, a new connection might be created. That is the easier one: we look at its color, see that it's a new connection and not in the snapshot, and we just sync it across to the other side.
B: The DP sync carries that flow, and we know the control plane sync is never going to interact with it, because it is a completely new flow. The second scenario is: we have a flow which is in the snapshot — it has the snapshot color, and we are going to sync it via the CP sync channel.
B
At
the
same,
when
we
are
doing
the
sync,
there
is
a
there
is
some
change
that
affects
the
flow
either
a
config.
I
mean
a
network
event
or
it
could
be
a
configuration
event
right,
something
that
comes
and
affects
the
flow.
So
now
we
have
to
be
mindful
of.
B: If the flow has already been synced across, it is okay for DP sync to go ahead and send an incremental update. But if that flow has not yet been synced across, then you hold off on syncing it on the DP sync, and the control plane syncs over the latest state when it pushes the flow across.
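The ordering rule for the two race conditions can be sketched as follows (illustrative pseudologic; the names are assumptions, not the proposal's API):

```python
# Sketch of the rule described above: an incremental data-path update for
# a flow may be sent immediately only if the flow is new (not in the
# bulk-sync snapshot) or its snapshot entry has already been synced;
# otherwise the update is held so the bulk sync carries the latest state.
def may_send_incremental(flow, snapshot, already_synced):
    if flow not in snapshot:
        return True                  # new flow: DP sync owns it entirely
    return flow in already_synced    # snapshot flow: wait until bulk-synced
```

This captures why the new-connection case is "the easier one": it can never collide with the control-plane sync.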
B: Yeah — in the implementation, these are cases listed below that we need to take care of. It's a very implementation-specific thing, but we actually go and mark the flow, so that during the bulk sync, the DP sync can also keep happening in parallel for all the new flows or any new changes.
D: This is one of the things that we wanted to achieve, and I think this is what it is doing.
A: For example, when we have all the flows included in the snapshot for bulk sync, and at the same time there is a change in the state of a flow that's in the snapshot but not yet sent or processed — if there's a way to notify the DP sync channel to hold off on that one, then we are good. Yeah.
A: Right — that's the delete case, and yeah, there could be. A flag, Sanjay? One — yeah, a flag.
A: Hey Sanjay, one question in terms of partitioning of the work. I expect there will be a SONiC module which is actually streaming these connections — the flow state — and it goes via the SAI APIs to stream the flows from the DPU and streams them across to the paired DPU. So the question is:
A: I assume there is no state maintained within this SONiC module for each of these flow states. Correct?
B: Correct. The SONiC here is just relaying, in effect, between the DPUs through the control plane channel; there's no state maintained.
B: Then we have the data path synchronization itself. In this proposal we follow the inline replication that Bulky explained in the last meeting, or a couple of meetings back. In this, effectively, the two DPUs are treated as one logical unit, and whenever a flow is inserted, we make sure the peer also inserts the flow before the packet gets forwarded onward.
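The inline replication just described — the primary evaluates policy, and its result is honored verbatim on the secondary before the packet is forwarded — can be sketched as (illustrative; the class and method names are assumptions):

```python
# Sketch of inline flow replication: the primary DPU is the source of
# truth for policy. Its evaluation result is inserted on both DPUs, and
# the secondary never re-evaluates against its own (possibly stale)
# config; forwarding proceeds only after both inserts.
class Dpu:
    def __init__(self):
        self.flows = {}  # flow_key -> policy result

    def evaluate_policy(self, flow_key):
        return ("allow", flow_key)  # placeholder policy decision

    def insert_flow(self, flow_key, result):
        self.flows[flow_key] = result

def handle_first_packet(flow_key, primary, secondary):
    result = primary.evaluate_policy(flow_key)  # primary decides
    secondary.insert_flow(flow_key, result)     # honored as-is on the peer
    primary.insert_flow(flow_key, result)
    return result                               # now safe to forward
```

Because the secondary stores the primary's result rather than computing its own, a config-level skew between the two DPUs cannot produce divergent flow state — which is exactly the rationale given next.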
B: When we are syncing it to the other side, the primary DPU — the DSC is the DPU; I think DSC is our own terminology — evaluates all the policies and then inserts the flow on the secondary, and whatever the policy results were from the primary are honored on the secondary. The flow insert happens with exactly the same result; there's no difference between the two.
B
So
the
reason
I
think
this
is
critical
is
because
we
can
never
guarantee
that
the
primary
and
the
dsc
will
be
at
exactly
the
same
configuration
level
at
any
given
point
in
time.
There
might
be
differences
between
the
two,
where
the
controller
has
put
some
config
new
config
to
one
of
the
dps,
but
not
the
other,
yet
right
or
we
haven't
applied
on
both
the
debuts
at
exactly
the
same
time.
B
So
it's
a
critical
that
we
have
some
source
of
truth
and
the
source
of
truth
for
policy
is
the
whatever
is
the
primary
gpu
in
this
case
and
then
on
switch
over
there
is
a
reconciliation
or
the
flow
recimination
that
we've
been
talking
about.
That
kind
of
updates
to
whatever
is
the
config
level
on
on
the
secondary,
it
kind
of
updates
the
flows,
and
it
gets
it
back
to
that
level
on
the
secondary,
but
during
the
sync
it
is
always
the
policy
results
from
the
primary
dpu
that
that
stays
another
takes.
B: In this mechanism, for whatever control packet we get, effectively we have an outer header which is the primary CNIP.
On the secondary DPU, the flow state may not match the policy state that is on the secondary, so we wait for the controller to trigger the re-simulation, and then we bring it up to date.
C: It's all the layers. You could have a change — re-simulation is there to fix all the layers. You can come down and change the transformation of a policy, or you can change the rules (they're not ACLs; they're called rules) — any of the rules, in any of the transformations.
C: So it's all of it. But that re-simulation can happen after the fact — it can just be initiated, and it happens per the normal re-simulation rules. It has nothing to do with HA, really. It's just something that, after you do an HA switchover, is obviously a good thing to do, but it's independent of the HA.
E: So basically what I'm hearing is: you first do the sync; then, before you do the switchover — actually, the control plane is doing the switchover — right before you switch to the second DSC, at that moment the second DSC is getting all the up-to-date policy and doing the re-simulation on the second DSC. That's what you're talking about.
E: Yeah, okay — that makes sense. So it's not really related to the HA, just that before you make this switch...
D: Why aren't we making this a requirement as part of this proposal? I mean, one thing we are saying is that the secondary DPU might not have the policy — the policy changes, or the policy itself, forget about the changes.
C: They may be out of sync, because software can't update them both simultaneously — there's no multicast here — so they could be slightly out of sync at any period of time, and that's why we do re-simulations. But you don't need to do it as part of the HA; it's just highly advisable to do it after HA is done, or before you do the actual switchover. It's independent.
D: So then the question is: when policy changes happen, they are reflected on both DPUs. Do both DPUs run their flow re-simulation independently, or does the primary run the re-simulation and the changes get synced across to the secondary afterwards? How does that actually work?
C: The thing is, it's independent, because the DPUs know that they've received new policies. You could be doing a bulk sync and the data path sync while the DPU is still getting policy updates from SDN somewhere — they're asynchronous, and don't necessarily know what's going on — so a DPU can decide to do a re-simulation at any time if it has received new policy.
A: So after the switchover, what happens is basically that the controller will trigger an API call to do the re-simulation. The reason is that, after the switchover, if you let the new standalone do the re-simulation immediately, there is no guarantee that the policy it has is current with respect to what was on the primary.
A: So that means there will be a notification from the newly active DPU back to the controller, indicating it is ready for re-simulation? — No, not a notification: after the unplanned switchover, the controller sees that there was a switchover event, and then it invokes the API to start the re-simulation.
B: Correct — the trigger coming from the controller indicates to the DPU that the policy levels are synchronized to the latest, and then we can actually go and apply this.
A: Yeah — once the DPU goes from standby to active and is done with some init tasks, it has to inform the controller that it's ready. Or, if the RX channel is already open and it can receive the command from the controller, then it can queue it and do the re-simulation — that's true. It probably doesn't need the "DPU is ready" signal to the controller, then.
B: Yeah — from a network perspective, on switchover, without any of the control plane messages we are talking about, without any intervention, the network will actually start forwarding traffic to the secondary, and we will continue to forward traffic for any of the flows according to whatever state was synced. The flow re-simulation then updates things again.
E: Maybe it would be better to get this into Markdown so that we can comment on the document more easily — though you might want to get the PDF out first. There are some details that we can look at, and people can comment and then read more.
Okay,
thank
you
so
much
sanjay
and
bulky,
and
everyone
bj
on
I'm
going
to
stop
the
recording
and
post
it,
of
course,
and
then
we'll
move
this
to
mark
down
and
pick
it
up
next
week.
Thank
you.
Thank
you.