Description
Kubernetes Data Protection Group Bi-Weekly Meeting - 01 December 2021
Meeting Notes/Agenda: -
Find out more about the K8s DP WG here:
https://github.com/kubernetes/community/tree/master/wg-data-protection
Moderator: Xiangqian Yu (Google)
D
Talking to myself, then: okay, all right, I'll get started. Today is December 1st, 2021. This is the Data Protection Working Group meeting, and this meeting will be recorded. Today we have two main agenda items. The first one is: we welcome Sean.
D
He'll give us a talk, a discussion of the DR topic in the Kubernetes context; and then there's an open issue in the snapshot controller that we can quickly go through.
D
Yeah, we're going to quickly go through the bug over there. With that, I think we need to hand it over to Sean.
F
We work on Kubernetes in OpenShift and also with Ceph. We had a brief introduction to the disaster recovery topic, using storage-assisted volume replication, about a month back now, maybe three weeks back. This is a follow-up to that, to try and present what we've done and take the discussion forward. So, with that, the biggest disclaimer is that this is presented just to serve as a discussion.
F
This is not presented as a solution that we should go ahead with and do, and things like that. So we're going to split it into three parts: the first part is the problem, which kind of revisits what we already talked about; then what we understand as commonality across storage systems in the volume replication space, both synchronous and asynchronous; and then a sketch of the solution that we are developing, presented at KubeCon, so that we can then make it something cross-storage, which is the desire.
F
That's the whole desire of talking to the working group. Let's go ahead. I'll run through this a little quickly, because most of this is understood or already discussed; but please do interrupt. Like I said, this is to serve as a discussion, so do interrupt us if there are questions or comments and so on. So: why volume replication?
F
This still leaves out the application resources. We do have some thoughts on that, but more on that later. It does leave out the application resources, like the pods, secrets, config maps and whatnot. It also leaves out that you actually need an alternate Kubernetes cluster, which needs to be ready or created, so that the workload can be moved to an alternate topology segment from a recovery perspective. So, again, quickly: how is data replicated?
F
What is storage-based data replication? We do have a slew of tools for backup, and like I said, yes, they can be leveraged; but the intention here is to leverage storage-vendor-supported volume replication features, for example RBD mirroring in Ceph, SnapMirror in NetApp, and various other such technologies that exist. There are various reasons why storage vendors provide this.
F
Probably in this group we could add ten more reasons as to why this is done, but in general it offloads replication to the storage system. Delta transfer efficiencies are built into storage systems, based on changed blocks or snapshots and journals and whatnot; it's basically a feature of the storage system, how they optimize their transfers. It's also historically a readily available feature from some storage vendors. So: how do we make it cloud native, how do we leverage that in Kube and provide it to users?
F
The DR use case is basically to recover from the loss of a Kube cluster, and possibly the loss of a storage cluster, in a given topology segment, a region or a zone. Hence the replicated volumes are typically not intended to be reused in the same Kubernetes or storage cluster; and, in general, to retain low RTO and better utilization of resources in the topology segments where the data is being replicated.
F
There are two or more Kubernetes clusters with volumes being provisioned on a local storage system; the storage systems are paired for replication, and some or all volumes are paired for replication. So not all volumes are necessarily replicated; only some are, based on which volumes are chosen for replication. So, shamelessly pulling out a slide from the talk.
F
It basically looks like so: there is an east Kube cluster and a west Kube cluster, and there are two storage systems, east and west, which have replication peering established, based on whatever the storage vendor needs, which is out of the scope of this talk, or this discussion.
F
What we even want to do with this: local PVs are provisioned with the local storage system, and they are replicated to the remote system; and the intention is to see how we can recover this workload on the west cluster.
F
Which basically brings us to the gaps that we have in this area: we do not have any APIs to manage the replication desire or state per PVC (and later per volume group, when we have volume groups in play). The desire really is to enable users to manage replication for their volumes, not a system administrator or a cluster administrator dealing with this. We'll touch on that a little later, because that defines how we want to design the APIs when we get there.
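As a rough sketch of the kind of per-PVC API being described here (the group, kind, and field names below are invented for illustration, not a settled design), the replication desire might be expressed as a namespaced resource that an application user creates next to the PVC:

```yaml
# Hypothetical sketch only: a per-PVC replication intent.
# API group, version, kind, and field names are illustrative.
apiVersion: replication.storage.k8s.io/v1alpha1
kind: VolumeReplication
metadata:
  name: mysql-data-replication
  namespace: app-team            # namespaced, so the app user (not a cluster admin) owns it
spec:
  dataSource:
    kind: PersistentVolumeClaim  # later possibly a volume group, once groups exist
    name: mysql-data
  replicationState: primary      # desired role: primary | secondary
```

Making the resource namespaced is what would let application users, rather than cluster administrators, manage replication for their own volumes, per the desire stated above.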
F
So, if you just reapply the application to an alternate cluster, dynamic provisioning will kick in and deploy a new volume, whereas we really want to attach the PVC to a replicated copy of the data, not get into dynamic provisioning. So that's another gap, though of course this has certain solutions with the data source and other such enhancements that are already provided and can be leveraged; but it's a gap.
F
Nonetheless, other notable gaps, which again we're not dealing with in the discussion today: it would be good to actually have application-quiesced replication as well, just like application-quiesced snapshots. This would be no different from application-quiesced snapshots, and once that is in play, I think it will automatically enable application-quiesced replication as well. Also, beyond the two Kubernetes clusters that we're talking about, external traffic also has to be rerouted to the appropriate Kube cluster on a disaster and recovery, which again is something we're not talking about in this scope.
F
All right, with that, moving on to our understanding of what's common across storage systems that provide such replication, which hence probably provides a stepping stone to design an API, or to set up the needs for such an API. Again, a disclaimer: the landscape is wide, there are many storage vendors, so what we talk about here is probably not going to cover everybody. We hope to at least cover most, but we also want to be educated on what we're missing, so that we can design this better.
F
We are also not interested in how storage systems are peered for replication; we are only interested in how volumes created on such systems are managed for replication, because, again, various storage vendors have different ways of peering clusters.
F
Also, some commonality that we notice across systems is that volumes are individually assigned roles: primarily something called a primary, and a secondary or multiple secondaries, with the primary being the active source for writes and IO, and the source of replication, which is replicated to the secondaries, the targets of the replication. Replication can be two-way or n-way, but with only one volume designated as the primary; it's not an active-active scenario where the volume is being used at two sites.
G
Then you have problems that unavoidably bleed through into the API, or the user interface, and that's usually one of the big sticking points: there are some storage systems that literally do it on a per-volume basis, where you have fine-grained control, and other ones that insist on pretending to have individual volume pairings, when in fact it's done at some coarser granularity, and you can't fail over one volume without failing over all of the volumes.
F
I forget where we were; sort of going back: instead, represent the replication relationship at a volume group granularity, rather than at a volume granularity. Does that help?
G
Yeah, I mean, you've got to be careful, though, to have different types of volume groups: there are volume groups for the purposes of replication, volume groups for the purposes of consistency, volume groups for the purposes of asset management. There are all kinds of different ways to slice and dice your volumes, and I would just be careful about assuming that every volume group is a unit of replication. But yes, you could have a type of volume group which is specifically for replication, design your APIs around it, and get things to work.
G
It's just more complicated then, because you probably have to do both that and something for the individual volumes that are not grouped, and then you have to have a way of communicating to the user that sometimes you have to use groups and sometimes you don't, and it's a detail of the implementation. It just makes everything more complicated and ugly. It's not insurmountable, but it makes your job harder.
C
Yeah, so, Ben, actually, I think volume groups will complicate things, mainly going back to your comment, the first comment: a lot of storage systems do replication, you know, or at least they establish a peering relationship, let's say at the storage server level or the storage virtual machine level; and if volume groups span different storage backends, or storage endpoints, or peering endpoints, it gets even uglier.
H
So, Ben Sezzala here; hey, nice to meet everybody. So one question is: in the last decade, VMware has had these vVols, and all enterprise storage systems have adopted vVols to some degree or the other. So I'm wondering if vVols kind of forced everybody to actually have this per-volume replication topology. I do agree that open-source, software-defined storage probably doesn't have those features, but I'm wondering if enterprise software has these features.
G
I mean, yeah, NetApp has SnapMirror, which replicates individual volumes, but our Trident CSI implementation has a mode where you put multiple volumes in the same FlexVol, and in that situation we can't mirror the individual volumes; you can only mirror the whole FlexVol, which is the de facto pool of volumes. So we could just say, well, we don't support replication in that case. Maybe that's good enough, but I'm sure there will be other cases where someone will say otherwise.
A
Have capabilities, right? That's what CSI has: a lot of capabilities. There are really just one or two things that are required for every driver, only the mount and unmount; everything else is optional, including CreateVolume. Those are all controlled by capabilities, so we can do the same thing here.
G
We do, right. I don't want to discourage any particular approach; I'm just saying that, having tried to do it in a way that pleases everyone, you end up with an unavoidably complicated user interface, because you can't set the expectations properly up front on what will work and what won't work.
E
Tom from Dell; agree with Ben. The main reason to have the groups is for consistent failover, at a single point in time, of a group of volumes.
F
Yeah, I think I get it, in the sense that, you know, I think group support is needed; I do not deny that. The question, or observation, seems to be...
H
The other question I have is: when you have a group of volumes you want to replicate, let's say individually, if they're not consistently snapshotted together, atomically, then there's a problem with DR. For example, Cassandra has, let's say, ten different volumes; if they replicate individually, that's not going to help. You need to make them all snapshot atomically and send the data.
F
Absolutely. I'll just tell you: this is why I put "later, per volume group" here. My intention was that replication is per PVC or per volume group. PVC is not a grouping, but these are the two levels of abstraction that we have: we have PVCs, and we're getting to volume groups, because even now snapshots have the same issue.
F
Cassandra
example
right
and
the
snapshots
are
not
going
to
be
consistent
across
these
these
cassandra
volumes
because
we
don't
have
a
group
that
we
can
snapshot
against
and-
and
I
know,
jim's
working
on
the
proposal
for
the
volume
groups.
My
answer
is:
yes,
we
absolutely
need
groups.
I
was
literally
thinking
when
we
have
volume
groups.
We
extend
slash,
use
that
as
a
replication
boundary.
D
So I have a quick question here; I want to take a step back. When will the replication relationship be broken? I mean, in the DR case your original cluster is completely gone. Where do the API operations happen, like deleting the... let's say we expose some API where there's this relationship, whatever it is.
F
Let's continue on roles, which we talked about, primary and multiple secondaries; this probably answers the next question. There are two other operations that are supported: failover and failback. Failover is typically when the server is down, where a chosen secondary volume (if you're just talking about two-way replication, the second volume) is forced to become a primary, which has to happen at the secondary, at the peer storage cluster.
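In terms of the earlier sketch (all names still hypothetical), failover would then be expressed on the surviving peer cluster by creating the replication resource for the replicated volume with the primary role:

```yaml
# On the peer (west) cluster: force-promote the replicated volume.
# Same illustrative API as the earlier sketch; not a defined Kubernetes API.
apiVersion: replication.storage.k8s.io/v1alpha1
kind: VolumeReplication
metadata:
  name: mysql-data-replication
  namespace: app-team
spec:
  dataSource:
    kind: PersistentVolumeClaim
    name: mysql-data             # the recovered PVC on the west cluster
  replicationState: primary      # was secondary; promotion is forced, since the old primary is unreachable
```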
I
So, basically, I would say we need some fencing API to force the disconnection; and of course, it's up to the storage system how it enforces that.
I
Yeah, yes. It can force a secondary to become a primary, and make sure that, if the old primary is running, it still won't be able to write the data. That's one option; I think most storage systems do that, and if not, then we may extend an API to do some fencing: actually tell the first cluster "you cannot write", like at the Kubernetes level, to make sure that your PV is not accessed.
F
So, although I said the action of forcing it to primary has to happen on the secondary cluster, or on the peer cluster, fencing is something... and there are two forms of replication here, right: there are synchronous cases and there are asynchronous cases.
F
The synchronous cases definitely need, well, not definitely, but at least in my experience with storage systems, need fencing. You don't want to corrupt the data, because there could be a spurious node which is still alive in the primary cluster accessing this endpoint. So you need to very clearly break the relationship, or redefine the replica relationship, before starting to use it; but that has to happen on the surviving end.
D
Got it. The interesting thing is that it's really hard to tell without human or manual intervention, because if it's just unreachable, it's possible there are temporary network issues, whatever, in the primary storage system, where the associated stuff is not there, and after five minutes it comes back. Then what would happen? The original association is still there in the original cluster, right?
G
Right, but if you break it on the secondary, it would block any additional writes. So, basically, once they come back in sync, the surviving one would tell the original one: "hey, I cut you off, we've diverged now and I'm the new primary, and you have to discard all your old data," or something like that.
F
If a volume is marked for a different role, like secondary, it starts resyncing from the known primary; and if the known primary is divergent because it was force-promoted, it probably resyncs from the last known-good snapshot, or however the storage system implements it, which is what Ben was also alluding to, if I'm not wrong.
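That resync step could be modeled in the same hypothetical API as a third state alongside primary and secondary, applied to the demoted side during failback (again, an assumption for illustration, not a defined API):

```yaml
# Failback, step one (illustrative): demote the old, divergent primary and ask
# the storage system to resync it from the current primary before flipping back.
apiVersion: replication.storage.k8s.io/v1alpha1
kind: VolumeReplication
metadata:
  name: mysql-data-replication
  namespace: app-team
spec:
  dataSource:
    kind: PersistentVolumeClaim
    name: mysql-data
  replicationState: resync     # hypothetical third state: discard divergent writes, sync from the known primary
```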
I
Yeah, and another comment: when we decide the cluster is down, it's not decided by the storage layer. It should be decided much higher up; it could be a human that decides that your primary site is down, it could be some automatic detection software, but it will be at a higher level. It won't be the storage system; the storage system doesn't really decide.
F
Also, and this is probably getting into the details, but the storage system could also report on availability. The Kube cluster could still be available, but the storage cluster has issues, for whatever reasons, and that could be another disaster scenario that requires failover.
H
Also, the most common thing people do is not failover; they do DR testing, which is kind of not breaking the application, but just trying to see if the other side works. They set up a test bubble and try to run their workloads on the other side, to see if it all comes up. I'm sure you may have covered this; I just wanted to share that.
F
No, no, that's perfect. Actually, I have not covered it in the slides, but yes, DR testing, without loss of data, is something that is required. It's not covered in this slide, but yes, we are also thinking about that, trying to figure out what to do in that case.
F
All right, sorry, moving on.
C
I think one thing you didn't capture on your slide was: if the secondary endpoint has a different IP address, or a different ID, than the primary, it would be a disruptive failover; and how is that taken care of at the Kubernetes level?
F
If I understood you right: basically, if the alternate Kubernetes cluster has a completely different IP range, let's assume, then the application, when it's going to run on that alternate Kube cluster, will be...
G
We were talking about the scenario where the whole site, the Kubernetes cluster and its storage, is all gone, so you're going to a new Kubernetes cluster with new storage that has replicated copies of your volumes; but they're different volumes, right? So it's not the same PV; it's a different PV, in a different cluster, with the correct information for the replica.
J
In terms of the network connectivity, I think that was being alluded to. I believe, Sean, you mentioned earlier that that is not being dealt with, right? That's outside the scope: getting the clients from east to west.
F
That's why I referred to that point. That was about external application, sorry, external clients to applications running in the Kube cluster. But the question was clarified, stating it's about connectivity to the storage system, not external connectivity to the applications. Am I right?
C
If it's external to the Kubernetes cluster, then you need something like, you know, global load balancers to make it seamless, which is not always possible, right? But even in the easiest, simplest-case scenario, where applications are running in the same cluster: let's say cluster east is down, and now, if you're doing failover to cluster west, you're spinning up a whole bunch of new pods, PVCs and PVs. And, you know, that kind of goes back to whether these PVCs and PVs are created in advance or not. Obviously, you cannot run pods in standby, because...
F
Yes, yeah, the pods are actually transferred over once the failover is complete, that's correct. And I would assume services within the Kubernetes cluster will have to auto-adjust to the new service endpoints.
C
I guess for software-defined storage it's possible, but again, that should be okay, as long as the SDS instance on the destination cluster can work with the new PVCs and PVs. But do you have a working prototype, or is this more like just thinking about it? For example, do you have an implementation or a prototype based on Ceph, or something else?
F
Yes, we do. So let's go down a few more slides. Okay, so I think we understand failover. We talked about recovery a bit; there are going to be use cases where the entire cluster is gone and the peering relationship has to be re-established. And then there is the failback: getting it back to the original cluster, which involves more coordination, when you need to mark the current primary as secondary and then move it back to the alternate cluster.
F
So let's quickly jump in here. There are some caveats to the working prototype that we have, let's call it that. It does assume a two-way environment, and what I mean is that you can actually flip roles and the replication direction would change. It does not deal with pairing pre-created volumes, or pairing after creating a volume on the target; it deals with the storage system being capable of creating its pair when replication is enabled on the local instance. And the unfortunate last part of it is...
F
So a user deploys the pod, deploys the PVC, and creates a VolumeReplication resource pointing to the PVC. The storage systems are already replication-peered, so this particular image goes down to the CSI drivers and then reaches the storage control plane, to start the replication process and also designate the volume as primary on the east, right here. We also extended this with a VolumeReplicationClass.
F
This helps us carry secrets; this helps us carry vendor-specific parameters; it helps us carry replication schedules, giving the user the flexibility of saying "replicate every minute" or "every three minutes" if it's asynchronous replication (it's primarily focused on asynchronous replication, initially). Now, coming to the more interesting part: when east goes down, the process of recovery is basically to recreate the PVC, point the VolumeReplication back to the PVC, and set it as primary. East is still primary, but that's fine.
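A sketch of such a class object, carrying the vendor parameters, secrets, and schedule just described (the parameter keys are illustrative, loosely in the style of an RBD-mirroring driver, and are assumptions rather than a defined API):

```yaml
# Illustrative VolumeReplicationClass: admin-created and referenced from each
# VolumeReplication (via a spec.volumeReplicationClass field, omitted from the
# earlier minimal sketches), much like StorageClass or VolumeSnapshotClass.
apiVersion: replication.storage.k8s.io/v1alpha1
kind: VolumeReplicationClass
metadata:
  name: async-replication-1m
spec:
  provisioner: rbd.csi.ceph.com             # CSI driver that services the replication calls (example)
  parameters:
    mirroringMode: snapshot                 # vendor-specific knob (example)
    schedulingInterval: "1m"                # the per-minute schedule mentioned above
    replication-secret-name: storage-creds  # credentials for the storage control plane
    replication-secret-namespace: csi-system
```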
F
The biggest problem here is what is created first, because the VolumeReplication points to the PVC, and it would hence refer to the PV and send a request to CSI; whereas, when you recover, if you're going to create the PVC first, dynamic provisioning is going to take over. Which is where what we actually do comes in.
F
So we are thinking of using a data source as an alternative to this whole thing; we'll come to that later. But, assuming you allow PV copies for the moment: you copy the PV, you recreate the PVC so it will bind back to the same PV, you create a VolumeReplication pointing to the PVC and flip its role to primary or secondary, however you want to manage it, and then you deploy the pods. The ordering would be that way.
F
The dotted lines, one, two and three, are simply objects that are being restored onto the west cluster; data is not flowing through the Kube clusters. That's just the whole point.
F
I mean, the whole VolumeReplication is just a desire to say: for which PVC, what do I want to state as primary or secondary, and what's my VolumeReplicationClass? Correct.
F
No, no, you didn't. So, all the way up, I said we're not yet dealing with the application resources, like pods, PVCs, config maps, in this talk; but, having said that... So let's go back: copying this PV is bad, right? So what we probably really want to do is...
F
We really want something like a transferable replication handle being reported back in the VolumeReplication resource, which the user can then take and use to recreate the VolumeReplication resource on the secondary cluster; instead of pointing to a PVC, it could instead have the replication handle as its source.
F
I know we're getting a little bit into the design, and I don't intend to, but what I want is: we definitely have to break the PV linkage here, and, you know, the PVC linkage, because we have the any data source now. Thanks to that, I think we can actually create a PVC from any data source, which points to the VolumeReplication instead; that way we can break the dynamic provisioning problem, as well as give the user the ability to transfer the replication handle across these clusters.
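Concretely, with the alpha any-volume-data-source support, the recovered PVC on the west cluster could reference the replication resource through dataSourceRef, so a populator binds it to the replicated volume rather than dynamic provisioning creating a fresh, empty one. A sketch, reusing the illustrative names from the earlier examples:

```yaml
# Recovered PVC on the west cluster, populated from the replication resource
# instead of being dynamically provisioned.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: mysql-data
  namespace: app-team
spec:
  accessModes: ["ReadWriteOnce"]
  resources:
    requests:
      storage: 10Gi
  dataSourceRef:                          # AnyVolumeDataSource (alpha): handled by a populator
    apiGroup: replication.storage.k8s.io  # illustrative group from the earlier sketches
    kind: VolumeReplication
    name: mysql-data-replication
```

Because the PVC stays Pending until the populator binds it, this also removes the strict PV-before-PVC restore ordering discussed below.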
F
Now, getting back to your question: what we are looking at is a GitOps-like structure to transfer application resources, like pods, PVCs, and such.
G
But there are sequencing concerns when you're trying to restore something, depending on exactly how the destination PVC needs to get created, and when, relative to everything else that should happen. So, yeah, there are other solutions that do it, but you have to have an opinion about timing and sequencing.
F
Yes, yes, definitely. So, like I said, back in this sequence, in this setup where we're actually transferring PVs, the sequencing is obviously strict: the PV is restored first, and then the PVC is restored, or created based on the application resources, so that it can bind to the PV, and later on...
F
At this particular point, even if the VolumeReplication is not created, the volume is actually secondary on west; until the VolumeReplication resource comes in and changes its state to primary, that's not going to change. So, technically, the storage driver would disallow mounts, maps, or usage of that volume.
F
All the NodeStage requests would, probably should, fail, even if a pod comes in out of order relative to the VolumeReplication resource; and when the VolumeReplication resource comes in, that's what trickles down to the storage layer, to either force-promote this to a primary, or whatnot. So, yes, there is a strict ordering.
K
If I understand this proposal, I think maybe what you guys are saying is: whatever Ben is saying here, the application logic, that's not in the scope of this proposal. You are assuming that a higher-level layer will coordinate some of that, the sequencing, etc., but you are kind of abstracting away the definition of VolumeReplication. Is that kind of true, so that it can be agnostic and, you know, like a Kubernetes standard?
F
Yes, that is true. But, to Ben's point, the PV copy model is very prescriptive on order, right; we cannot break this order. Whereas, sorry, this is not moving, oops, okay; whereas what we really probably want to do is always create a PVC using the VolumeReplication as its data source, so that these ordering constraints are not there. The moment the PVC is created, there is a populator, and, you know, the source is a VolumeReplication resource.
F
Until it does its job, the PVC is not going to be in a bound state, so the pod cannot use it anyway, and sequencing works transparently.
D
Thanks, Sean, thank you. This is very interesting. I think we can follow up offline, or, if the group is interested in this, we can certainly have another session, a shorter session, to discuss the remaining things.
D
It's Grant, here; I think he created the issue.
L
I added the issue in the doc, and I think Xing also replied to the issue, but I'm not exactly sure if that issue has been fixed yet or not.
A
Yeah, so I can open that. Is Grant here today? Let's see.
A
So I guess that was my question; I was reading this one, right. So, basically, this is a bug report saying that CreateSnapshot is being retried without the backoff, but the error that I see in these logs should not be there anymore if you test the latest code. So that's why I was asking.
A
Yeah, I think Grant is looking at it. I will ping him and see if he has got any update. Another...
A
I'm thinking, maybe, Xiang, we can talk about this offline, because, you know, if it's dynamic provisioning, we actually call CreateSnapshot multiple times to get the status of the snapshot, but then we end up calling CreateSnapshot more times. So I'm wondering, if the CSI driver supports ListSnapshots with a snapshot ID, we could actually use that to get the particular snapshot's status, instead of calling CreateSnapshot. I don't know if that would help. So maybe we can talk about that.
L
These logs: where did you pull them from? Where did you get these logs?
L
Yeah, this one. So, if you see, right, what even Griffiths had mentioned: the calls were happening very frequently, and, like the fact that I mentioned, right, the API server was getting restarted in the cluster itself, so that logging is probably not accurate as per what...
A
Yeah, I think Grant says he thinks this is probably still happening. I think he's still trying to figure it out; he added some notes earlier, but I think he still has not got to the bottom of it.
D
Yeah, I haven't gone through the whole thread yet, but are there any reproduction steps?
L
You just try to create, like, a snapshot which does not have the right permissions, or has some missing permissions, some error scenarios; in a few of those we were able to replicate that the retries are constantly happening. If you go and see the control plane logs, the retry timestamps, like the ones Griffiths mentioned at the top, were very close to each other; it was not waiting, and there was no terminal state. So I had a cluster in which I had a failed snapshot.
A
The problem is why this does not wait, right? We do have that logic there, but somehow it did not go there for some reason. So that's something that we need to figure out. I'll ping Grant and get the update on that; yeah, we'll take a look.
D
No, we're all good; we're right at time. I'm going to stop recording.