Description
In this OpenShift Commons Briefing, Raffaele Spazzoli (Red Hat) introduces a new way of thinking about disaster recovery in a cloud native setting. He introduces the concept of Cloud Native Disaster Recovery, the characteristics it should have, and the problems that need to be addressed when designing a disaster recovery strategy for stateful applications in a cloud native setting.
A: I'm always excited about these briefings. It's so important that we work upstream with the CNCF, and today we have members from TAG Storage talking about a topic that is very important to pretty much all of us, especially enterprise installations, and anybody who is concerned with how to get your data or your applications back if something happens. Today we have Alex Chircop from StorageOS. He is the CEO, but he is also the CNCF co-chair of TAG Storage, and TAG stands for Technical Advisory Group; it has been changed from SIG. We also have Raffaele, who is a tech lead for TAG Storage. I'm very excited to have you both; thank you so much for joining us. Raffaele and Alex, if you'd like to tell us a little bit more about what you do, especially with TAG Storage, and then dive right in.
B: Thank you so much, Karina, and I'm so glad to be here with the OpenShift community. The CNCF created SIGs a couple of years back, and SIG Storage was in fact one of the first SIGs created. The purpose was to help the CNCF evaluate projects and to create content and educational material for the end users and the community.

B: Since then the SIG has been renamed to TAG, because the SIGs were getting confused with the Kubernetes SIGs, and, as Karina mentioned, TAG stands for Technical Advisory Group. I'm the co-chair of the group together with Xing Yang and Quinton Hoole, and we also have a number of tech leads, of which Raffaele is one of our newest members. So I'd like to pass the baton to Raffaele to introduce himself.
C: Thank you, Alex. Yes, I work at Red Hat as an architect in consulting, so I help customers build their cloud native solutions, and I recently joined TAG Storage to work together on the topic that you see on the screen today, which is cloud native disaster recovery.

C: This is a collaboration that started about nine months ago; I think it was fall of last year when we started talking about whether this was a good idea to explore in this particular TAG, or whether it should belong somewhere else. Then we started sharing some ideas, and we are now creating a white paper that I think we're going to publish soon. Today we're going to present some of the results of that white paper.
B: Indeed. This is particularly exciting for the TAG, because we've had a lot of demand and, in fact, a lot of feedback in the different TAG calls, which Raffaele has been very patient with, and it has meant that we've been able to iterate on the documents. Just for everybody's benefit: when we talk about storage and persistence in the CNCF, we're not just talking about a volume or a file system. We cover any type of persistence layer, including databases, object stores, and key-value stores, for example, as well as traditional volumes and file systems.

B: I think the key thing here, and keep me honest, Raffaele, is that what we're trying to do is make sure that developers, DevOps teams, SREs, and so on have the information, the tools, and examples of how to adopt some of these different technologies in a cloud native way.

B: Because for the first time in a long time, it's no longer infrastructure teams that are making these decisions; it's developers and their deployments. Understanding the storage subsystems and the different technologies that are available with cloud native technologies is extremely exciting and enables so many new use cases, which Raffaele will cover shortly.
C: Yeah, that's exactly right; I totally agree. What we would like to do today is present an approach to disaster recovery that we call cloud native. Obviously it focuses on stateful workloads, and that means all of the storage workloads that Alex was mentioning. It turns out that the hard problems you have to solve when you distribute a stateful workload are always the same, regardless of the kind of interface that you expose.

C: So it doesn't really matter if you are exposing block storage through a block storage interface, or message queuing, or a SQL database: the internal synchronization is really the hard problem that you need to solve, and that's what we explore in this white paper, but from a user perspective.
C: The concept that we would like to make people aware of is this new concept of cloud native disaster recovery. I just said "new concept", but the approach is something that you could have done even in the past. Our point is really that with cloud native approaches it now becomes less complex, easier, and probably less expensive to create these architectures and deployments.

C: The way we define cloud native disaster recovery is by contrasting it with traditional disaster recovery. We had an internal discussion at the time about whether calling this "traditional" disaster recovery was correct or not. By traditional disaster recovery we mean what you would normally find in many enterprise customers: not the big web scalers, not the newer startups, but the enterprise customers that many of us work with.
C: So let's go down these columns and rows one by one. I'm going to try to keep it brief, because we don't have a lot of time, and create this contrast and talk about what cloud native DR is. First, the type of deployment: are we deploying active-active or active-passive? In most traditional DR scenarios, what we see is an active-passive deployment, especially for the stateful workload. You may have stateless workloads that are active-active, but they all point to a single active site, and then there is a passive site for the stateful part. That's often what you see. For cloud native DR, we are proposing that it should be active-active. Then, obviously, we're talking about disasters, so there is a disaster situation: how do we detect it?
C: With traditional DR, in most cases it's a human decision: somebody says, okay, this is really a disaster, we need to trigger the disaster recovery procedure. In cloud native DR we want it to be autonomous. We want the system to realize something is going wrong and react to it, and that includes the recovery procedure itself: whatever the stateful workload needs to do to reset or reorganize itself, its replicas, and its partitions to continue servicing.

C: And then there are the two metrics of DR: RTO, the recovery time objective, which is how long the system is down, how soon we can get the service up and running again; and RPO, the recovery point objective, which is a measure of how much data is lost because of the disaster. It can also be a measure of how much inconsistency is created if I have multiple copies of the data.
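The two metrics can be made concrete with a small sketch (hypothetical helper and timestamps, not from the talk): RTO runs from the moment of the disaster to the moment service is restored, RPO from the last successfully replicated write to the disaster.

```python
from datetime import datetime, timedelta

def recovery_metrics(last_replicated_write, disaster_time, service_restored_time):
    # RTO: elapsed time from the disaster until service is restored.
    rto = service_restored_time - disaster_time
    # RPO: writes made after the last successful replication are lost,
    # so the loss window runs from that replication to the disaster.
    rpo = disaster_time - last_replicated_write
    return rto, rpo

# A traditional-DR scenario: hourly replication, manual failover.
rto, rpo = recovery_metrics(
    last_replicated_write=datetime(2021, 6, 1, 11, 0),
    disaster_time=datetime(2021, 6, 1, 12, 0),
    service_restored_time=datetime(2021, 6, 1, 16, 0),
)
# rto == timedelta(hours=4): four hours of downtime
# rpo == timedelta(hours=1): up to an hour of lost writes
```

Cloud native DR aims to drive both values toward zero by making detection and failover automatic rather than manual.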
C: For RTO with traditional DR, you could have good enterprises that are close to zero, but normally we see hours to recover from a disaster. In cloud native DR we want this to be close to zero: essentially, a couple of health checks have to fail, then we realize the situation is a disaster, and the system starts bringing the traffic to the healthy locations.

C: For RPO, traditionally you could again have anywhere from exactly zero to hours of loss; minutes to hours of data loss is what I usually see. In cloud native DR, you have two options.
C: You can do strongly consistent deployments, which will have exactly zero data loss, and you can do eventually consistent deployments, which may have close to zero data loss in normal circumstances. But with eventual consistency you're not guaranteed that the data, once it's reconciled, is exactly correct: it's going to be consistent, but it may not be correct for your application. That's a caveat of eventually consistent systems. And then there is the organizational perspective.
C: What we notice is that in traditional enterprises, application teams are normally formally responsible for the business continuity plan of their service, but what they do is turn around to the storage team and ask: what are the RTO and RPO that you can guarantee for the storage that we use? And then they adopt that as their standard. So, de facto, the storage team is the driver of the enterprise DR strategy.

C: Instead, we are arguing here that with cloud native DR it's going to be the application team that will have to choose the correct piece of middleware, the correct product, to handle their state, and then they will use that product to organize the DR procedure around it.

C: One other observation that came up while working with this new technology is that for traditional enterprises the DR capabilities come mostly from the storage layer. For cloud native DR, we notice that these capabilities come more from networking: we see the need for east-west communication between your failure domains, which could be your regions or your data centers, so that the instances of this new generation of middleware can find each other and cluster up.
B: I was just going to say that one of the key things here, what we're talking about, and again keep me honest, Raffaele, is that we're in a cloud native world where applications are effectively composable and the infrastructure is declarative.

B: What we're saying is that cloud native gives you a lot of the tools to automate and manage disaster recovery, just like any other healing process that would take place in a standard cloud native environment. We acknowledge the fact that this isn't necessarily straightforward, and some of these technologies are fairly advanced. But the point is that these new cloud native architectures, and Raffaele will talk about a reference architecture with some of the options available, actually enable organizations to have automated failover and automated disaster recovery processes, with better metrics in terms of RTO and RPO than you'd get with the manual failovers and manual tasks that you have in a traditional system.

B: I think that is extremely exciting, because what we're effectively saying is that we've made applications composable, developers can declare what their applications need from an environment, and now we're taking a step further and saying that that also applies to the disaster recovery process.
C: Right, exactly, and it is exciting; I find it very exciting. Maybe some of you doubt that it's even possible, or may tell the story that I've been told many times: I can get RTO and RPO as close to zero as I want, but the cost increases exponentially.

C: It's actually a matter of composing the architecture in the right way. The cost does not increase exponentially, and you can actually reach these numbers with a relatively inexpensive deployment; I use the word "relatively", but it's certainly not something that grows exponentially. So, talking about how we can build these architectures: here we are showing a strongly consistent deployment, where we have a stateful workload that is capable of handling the horizontal state synchronization in the proper way, guaranteeing that there are correct replicas in the correct regions or data centers.
C: We need three failure domains, in this case data centers, because otherwise we wouldn't reach quorum. As I said, in front of the stateful workload we normally have a front end, probably a stateless front end, and in front of that we will have a global load balancer. This is a very generic blueprint that you can then reuse in several situations.

C: I failed to mention: the stateful workload has storage that comes from the local data centers, but, as you can see, we don't rely on storage capabilities to replicate across data centers. All of the synchronization is handled by the middleware, by the stateful workload.

C: So how can we build these stateful workloads? Because obviously we are now relying more on the middleware side. I have a couple of slides to talk about this from a conceptual standpoint.
C: A failure domain is an area of our system that can go down, or can fail, due to a single event. Nodes can be failure domains; racks, clusters, network zones, availability zones, regions, and data centers are all failure domains, and if you look at them, they contain each other.

C: For this discussion, our failure domain of reference is the data center. By disaster we mean that we lose an entire failure domain, in particular an entire data center if we don't specify otherwise, and disaster recovery is what happens when that happens: what is my strategy?

C: High availability is a slightly different concept. High availability is about what happens when something breaks within a failure domain: maybe I have one fault inside the failure domain; does the service continue or not?

C: And then we have the concept of consistency, which is the property of a distributed stateful workload that all the instances observe the same state: the state is consistent across the instances.
B: And I think, and it's fine if you go to the next slide on that point, that getting strong consistency is probably the single biggest architectural challenge. Trying to ensure you have strong consistency, which is one of the key attributes in any storage layer or database layer, is effectively a balancing act between reducing latency and preserving availability. There's a very convenient theorem here, the CAP theorem: you have three things, consistency, availability, and partition tolerance, and you basically need to pick two. But I don't want to steal Raffaele's thunder, so I'll let him explain some of the details and how it applies to some of the different systems we're talking about.
C: Yeah, thank you, Alex, and you stated the theorem correctly: you usually pick two of these three, consistency, availability, and partition tolerance. I should improve the picture here. I like to state the theorem slightly differently, because I think it helps you understand the kind of choices that are made in today's software: network partitioning is not something that you control; it's a fault that happens.

C: So a way of reasoning about the CAP theorem is: assuming a network partition, what do you want your software to do? Do you want it to be consistent, or do you want it to be available? You can only pick one of those two, because the network partitioning is not something that you pick; it just happens. In this little table I'm showing some common examples.
C: In today's software the choice is usually very clear. If you choose consistency, it means you're building a strongly consistent system; if you choose availability, it means you're building an eventually consistent system. Both have pros and cons, and they both have their usages, but that's a very clear design choice that you have to make when you build software. In reality, some of these systems can even be tuned, and depending on how you tune them they can change behavior from being available to being consistent, but they are all built around this CAP theorem.

C: There is a corollary to be kept in mind about the CAP theorem, which is the PACELC corollary, I hope I pronounced that correctly. It essentially says: in the absence of a network partition, so when the network partition is not happening, you still have to choose between latency and consistency.
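The partition-time choice can be sketched in a few lines (a toy illustration, not any particular product): under a partition, a consistency-first (CP) system refuses writes it cannot commit to a majority of replicas, while an availability-first (AP) system accepts them locally and reconciles later.

```python
def write(reachable_replicas, total_replicas, mode):
    # A strict majority of replicas is needed for a CP system to commit.
    majority = total_replicas // 2 + 1
    if mode == "CP":
        # Consistency over availability: refuse writes without quorum.
        return "committed" if reachable_replicas >= majority else "unavailable"
    # Availability over consistency: accept locally, reconcile later.
    return "accepted-locally"

# A partition splits a 3-replica system 1/2:
assert write(1, 3, "CP") == "unavailable"       # CP minority side refuses writes
assert write(2, 3, "CP") == "committed"         # CP majority side still commits
assert write(1, 3, "AP") == "accepted-locally"  # AP side keeps serving, may diverge
```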
C: I was recently doing some experiments where it was very clear what the PACELC corollary means, in particular with Kafka. Kafka is one of those systems that is tunable: if you set Kafka to be consistent and then you spread the cluster across regions that have high latency between them, you get a very high latency on the response for a single communication, whether you're reading a message or producing a message.

C: Reading is always easier; that's just the nature of how this software works. It doesn't mean that with Kafka you can't still have a significant amount of throughput, but each individual transaction will have a significantly higher latency, because you have told Kafka you want it to be consistent.
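For illustration, the tuning described here maps onto standard Apache Kafka configuration keys; the concrete values below are an assumed typical setup, not the demo's actual configuration.

```python
# Consistency-leaning setup: a produce is acknowledged only after the
# in-sync replicas have the write, so each call pays the cross-region trip.
consistent_producer = {
    "acks": "all",               # wait for the in-sync replica set
    "enable.idempotence": True,  # safe retries without duplicates
}
consistent_topic = {
    "replication.factor": 3,     # one replica per failure domain
    "min.insync.replicas": 2,    # a quorum of replicas must acknowledge
}

# Availability/latency-leaning setup: leader-only acknowledgement.
fast_producer = {
    "acks": "1",  # low latency, but the write is lost if the leader dies first
}
```

With `acks": "all"` the per-message latency tracks inter-region latency, while overall throughput can still be high thanks to batching and many partitions in flight.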
C: Okay, so that's how this works, and it's incredibly convenient to have a theorem to think about these things, because you can take a piece of software that you don't understand and ask the vendor, or whoever the expert is, to talk about that software in light of the theorem and explain the choices that were made by that piece of software. Just that way, you will understand a lot about how that piece of software behaves.
B: I was just going to say, for further information: the TAG has also created a more generic storage landscape white paper, and in that we define all of the different attributes, like availability, scalability, performance, and durability, as well as consistency.

B: It's interesting to me because different systems will have different use cases, and it's probably worth noting that no one system will handle all of these cases: very strong consistency typically has, or might have, scalability or performance implications, and vice versa, for example.
C: Exactly, so ask that of your product. Now, we said we have several instances of this software running and clustering up, creating one logical instance: how do they sync? We need consensus protocols. I'm going to go quickly through this slide, but one kind of consensus protocol, which I call shared state, is used to agree on a state that all of the instances need to reflect. For this kind of protocol you can have an approach based on a leader election.

C: The leader is the one that accepts all the writes, and the other instances are just followers. The algorithms in this space that have been validated by academia are Paxos and Raft, with Raft becoming more and more popular; most of the software that we have mentioned so far uses it.
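The leader-based protocols just mentioned rest on one simple invariant, sketched below (toy code, not Raft itself): a node may lead only with votes from a strict majority of all members, so two leaders can never be elected for the same term.

```python
def has_majority(votes, cluster_size):
    # Strict majority: more than half of ALL members, not just the live ones.
    return votes > cluster_size // 2

def elect(ballots, cluster_size):
    # Count ballots (one candidate name per voter) and return the winner, if any.
    tally = {}
    for candidate in ballots:
        tally[candidate] = tally.get(candidate, 0) + 1
    for candidate, votes in tally.items():
        if has_majority(votes, cluster_size):
            return candidate
    return None  # split vote: no leader this term, a new election is needed

# 5-node cluster with one node unreachable: a leader can still be elected.
assert elect(["n1", "n1", "n1", "n2"], 5) == "n1"
# 5-node cluster partitioned 2/3: the minority side can never elect a leader.
assert elect(["n1", "n1"], 5) is None
```

This same majority rule is why the strongly consistent deployments discussed later need an odd number of failure domains: a partitioned minority simply stops accepting writes.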
C: Another thing to know, as I was saying at the beginning, is that the hardest problem that this stateful software needs to solve is really always the same: I need to sync with the other peers, and then I need to persist that data. I need to have a consensus protocol, a log of the operations that have happened, and then I have to store that information to persist it.
C: There is an interesting chapter in the SRE book from Google where they explain how you could theoretically build one piece of software that can be reused across all of these stateful workload products, because at the core it's always the same. Naturally, if you did it that way you wouldn't get optimized performance.
C: That is a theoretical approach; you still have to make your own optimizations on top of it. But in reality there are several companies, for example, that are using RocksDB, or another storage layer from Apache, with some level of consensus protocol on top to coordinate the instances.
C: So if you put everything together: you need these reliable replicated state machines, and then you can create partitions, where you separate your data so that you can scale horizontally, and replicas, where each partition is replicated in other instances so that you don't lose data when something goes down. If you look at this picture on the right, this is the anatomy of a stateful application, at least a modern one.
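The partition-plus-replica idea can be sketched in a few lines (illustrative only; real systems use their own partitioners and placement rules): keys are hashed to partitions for horizontal scale, and each partition's replicas are spread across distinct failure domains for survival.

```python
import hashlib

def partition_for(key, num_partitions):
    # Stable hash so the same key always lands on the same partition.
    digest = int(hashlib.sha256(key.encode()).hexdigest(), 16)
    return digest % num_partitions

def replica_domains(partition, domains, replication_factor):
    # Place each replica of a partition in a different failure domain.
    return [domains[(partition + i) % len(domains)] for i in range(replication_factor)]

domains = ["dc-east", "dc-central", "dc-west"]
p = partition_for("order-42", 6)
assert 0 <= p < 6
# With replication factor 3, every data center holds one copy of the partition,
# so losing any single data center loses no data.
assert sorted(replica_domains(p, domains, 3)) == sorted(domains)
```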
C: Here you have two replicas, and each replica has several partitions, so we end up with six instances. Each instance has its own storage, and then we have the coordination layer between the instances.

C: So this gives you an idea of the anatomy of these stateful workloads that we can use to build what we call cloud native disaster recovery deployments.
B: Just on that point, Raffaele, I think one of the exciting things here is that what is effectively happening is that we're layering different proven technologies.

B: You might have sharding for performance and you might have Raft protocols for consistency, but you might also have a variety of different layers in that stack, where, for example, you might have a SQL layer that's using a key-value store, that's using a sharding process, that's using a file system, and so on. So, more than ever before, it's important to understand the different layers, because at the end of the day the attributes of your system, your failover capabilities, and your DR capabilities are going to be an amalgamation of all of those different attributes.
C: Yeah, I couldn't agree more, especially on the observations you made about the interface layer, what is called the API layer here. We see more and more products now that offer, for example, a SQL interface and then also a key-value store, and it's clear what's happening: they are just adding a new API layer but reusing everything that is below it, so it's relatively easy for them to do that.
C: I would highly recommend, if you are considering a new stateful workload, asking your vendor or your experts what choices they made in this space, because that already tells you a lot about the software that you're about to purchase. Now, some considerations around strong consistency versus eventual consistency: they both can be approaches for cloud native DR, but they behave differently, for example in terms of RPO.
C: Strong consistency is about consistency, obviously, so once we create a well-done deployment we don't lose any data: the RPO is exactly zero. I've had people who couldn't believe it, but it's exactly zero; I never lose data with this, assuming, obviously, that only one disaster happens at a time. With eventual consistency, you may lose some data, theoretically an unbounded amount of data.
C: In practice, if the system is not overloaded, it's something close to zero, because it's just what was in the local cache that the system didn't have time to replicate. Another thing that you should consider is what happens when you lose one data center, one failure domain, in an eventually consistent system: like I said before, the rest will keep serving, so the two sides of the deployment may diverge in terms of data, and when they come back up they don't necessarily agree on the state.
C: So there is a reconciliation algorithm that will decide who is right, but this reconciliation algorithm may not reason the same way that your application reasons from a business perspective. What I like to say is that eventual consistency does not mean eventual correctness in business logic terms, and eventual consistency may pose additional design considerations to your developers.
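A concrete illustration of "consistent but not correct" (a toy last-writer-wins merge, not any specific product's algorithm): during a partition, both sides sell one item from the same stock of five; reconciliation converges cleanly, but the merged counter records only one of the two sales.

```python
def last_writer_wins(replica_a, replica_b):
    # Keep, per key, the (timestamp, value) pair with the newest timestamp.
    # The result is consistent, but it has no idea what the values mean
    # to the business.
    merged = dict(replica_a)
    for key, (ts, value) in replica_b.items():
        if key not in merged or ts > merged[key][0]:
            merged[key] = (ts, value)
    return merged

# During the partition, both sides decremented the same stock counter.
side_a = {"stock": (10, 4)}  # at t=10, side A recorded 5 -> 4
side_b = {"stock": (11, 4)}  # at t=11, side B recorded 5 -> 4
merged = last_writer_wins(side_a, side_b)
# Reconciled state says 4, but two sales happened: the correct value is 3.
assert merged["stock"] == (11, 4)
```

This is exactly the gap between eventual consistency and eventual business-level correctness that applications must design around.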
C: So personally, if it is at all possible, I prefer to keep things simple for the developers and choose a strongly consistent deployment.
B: I think strongly consistent is the most predictable, and one of the points here about the minimum number of failure domains is that with strongly consistent systems you effectively have an odd number of copies. Remember when we were talking about the CAP theorem and about partitions: if a node is partitioned or unavailable, the remainder of the system still has a majority of the copies of the data and can therefore make an automated decision as to who is up, who is down, and which systems are authoritative. Whereas in eventually consistent environments it can be a little bit more complicated, because effectively some of those decisions can be delegated to the application, or to reconciliation processes, which are not perfect.
C: Right. In terms of RTO, both will react in a few seconds, depending especially on the health checks that you set on the global load balancers, but also on some internal checks that the system has. In terms of latency, strongly consistent workloads have a strong sensitivity to the latency between these failure domains, which could be regions or data centers, and by and large your write latency will always be greater than the worst latency between your regions multiplied by two, because it's always a round trip, back and forth.
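That rule of thumb is simple enough to write down (illustrative arithmetic using the talk's "worst latency times two" round-trip rule; a real quorum write may do slightly better by excluding the slowest peer):

```python
def commit_latency_floor_ms(one_way_latencies_ms):
    # Rough lower bound for a strongly consistent write acknowledged across
    # regions: the worst one-way latency, paid outbound and inbound.
    return 2 * max(one_way_latencies_ms)

# One-way latencies (ms) from the coordinating region to its two peers.
assert commit_latency_floor_ms([12, 33]) == 66
```

So with a 33 ms worst one-way hop, no write can complete in under roughly 66 ms, regardless of how fast the disks are.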
C: That sets your expectation, and to me it says that I cannot use strong consistency for all use cases. I will have use cases that need really fast responses, where I can't use this kind of system and I need to find different solutions. But if the latency I get is acceptable, like I said, a strongly consistent system keeps things predictable and simple for the developers.
C: On the other side, eventually consistent systems are not affected by latency, because essentially the system first writes locally and returns, and then it tries to synchronize with the rest of the instances, so inter-failure-domain latency does not affect the client latency. That is a simplification, but it's a way to explain why they are not really affected by it. Throughput, instead, can scale linearly for both, as long as we have workloads that touch all the partitions.

C: So if the requests are normally distributed and all the partitions are involved in more or less the same way, these systems scale horizontally, essentially linearly with the number of instances. If you want more throughput, just add more instances, increase the number of partitions, and you get the throughput that you want. Then, as Alex said, strong consistency has another constraint that some of our customers find taxing, which is that you need three failure domains.
C: But they don't have a third one, so what can they do? Eventually consistent workloads, by contrast, only require two. There are solutions to get the third data center; one option is to go to the cloud, but this is certainly a constraint of strongly consistent systems.
C: We also wanted to share reference architectures for Kubernetes deployments. The first one that we shared was very generic; you could build it with VMs, but obviously we are looking at Kubernetes with special attention here. It's not very different: we still rely on the stateful workload to do the horizontal sync, we still rely on persistent volumes, provided by Kubernetes in this case, and then we have some ingress and a global load balancer in front of it that decides where the traffic goes.

C: So here, when we lose one site, essentially nothing happens: the global load balancer should realize it and just send the traffic to the other ones.
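The "nothing happens" failover can be sketched as a health-check-driven routing decision (a minimal sketch; real global load balancers add DNS TTLs, connection draining, and flap damping):

```python
def route(endpoints, health):
    # Send traffic only to failure domains whose health checks pass;
    # the failover decision is automatic, not a human one.
    healthy = [e for e in endpoints if health[e]]
    if not healthy:
        raise RuntimeError("no healthy failure domain left")
    return healthy

endpoints = ["us-east", "us-central", "us-west"]
health = {"us-east": False, "us-central": True, "us-west": True}  # us-east lost
assert route(endpoints, health) == ["us-central", "us-west"]
```

A couple of consecutive failed health checks mark a site unhealthy, and traffic flows to the survivors; this is the autonomous detection contrasted earlier with the human decision in traditional DR.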
C: Another thing that you should ask yourself when you build this architecture is: what happens if the clients can connect to my workloads, but there is a network partition between some of the data centers, so one is isolated? For strongly consistent workloads this is essentially equivalent to losing that data center, because all of its instances will become inactive, since they don't have quorum. So the global load balancer should realize that this site is not responding and send all the traffic to the remaining two; the behavior should be exactly the same.
C: We also have a reference architecture for eventual consistency. It's similar, except that you only need two failure domains, or two data centers. Here the conversation is slightly different. If you lose an entire data center, the global load balancer can only send you to the remaining one, and there is no real state divergence, because there are no writes on the data center that you have lost. It's a different story when you lose connectivity between the stateful workloads but the clients can still write.
C: In that case you can have divergence of the state, and that's the conversation we were having before: this is a situation where, at some point, the fault will be corrected, the connectivity will be reestablished, and the reconciliation algorithm will kick in, but the result that you get as the final reconciled state is not necessarily aligned with what your application needs.
C: Okay, we have some reference material here if you want to go a little bit deeper: our white paper and some blog posts about building these architectures in practice.
C: We don't focus on cost, because we try to be product agnostic. My only consideration is that you have a third data center, and that may be a significant cost, depending on how you decide to implement that data center. If you go to the cloud, it's not necessarily a huge cost; it depends on how much you deploy.
C
It's a pay-as-you-go model. But if you actually build a physical third data center, that's a huge investment up front. The other consideration is that some companies may be running on software that is not capable of being deployed this way, and so they may be facing a migration.
A
C
Yeah, no, keep asking questions; in the meantime I'm going to describe this environment. Here I have three clusters. These represent my three regions, or data centers. They are in Google Cloud, and they are in different regions: as you can see, us-east, us-central and us-west.
C
Here, for example, I have one of these clusters. Obviously I'm from Red Hat, so I'm using OpenShift because it's easy for me, but there's nothing involved that implies OpenShift; you can do everything with Kubernetes. So here I have Kafka deployed this way: you see three instances of Kafka in this cluster; this is the second cluster, three more instances; this is the third cluster, three more instances.
C
These are all talking to each other, and I have, for example, a Kafka console here in which I can see that, if I go here, I really have a nine-node Kafka cluster.
C
You can see from the names of the instances that I have cluster-three, cluster-one, cluster-two; so all of the instances that are distributed in the different OpenShift clusters come together to create a single logical Kafka instance.
C
It uses this cluster-set notation, and it also has the name of the cluster in which the pod is, so these are actually the pods that we were seeing before. And I think I have a topic defined here; let's look at this topic.
C
Look at the partitions: it has nine partitions, so it's well balanced on the available nodes of these clusters. I configured Kafka to be strongly consistent, so each partition has three replicas, and each of these replicas is in a different region. So if I lose a region, one of these will become red, but I still have two replicas and I can continue working.
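The placement just described, one replica of every partition per region, can be sketched as follows. This is a hypothetical round-robin assignment for illustration (the broker names and region labels are invented), similar in spirit to Kafka's rack-aware replica assignment:

```python
# Nine brokers, three per region, mirroring the demo topology.
REGIONS = ["us-east", "us-central", "us-west"]
BROKERS = {f"kafka-{r}-{i}": r for r in REGIONS for i in range(3)}

def assign_replicas(partition: int, replication_factor: int = 3) -> list:
    """Pick one broker from each region, rotating so load stays balanced."""
    replicas = []
    for offset, region in enumerate(REGIONS):
        candidates = [b for b, r in BROKERS.items() if r == region]
        replicas.append(candidates[(partition + offset) % len(candidates)])
    return replicas

# Every partition ends up with one replica in each region, so losing a
# whole region still leaves two in-sync replicas out of three.
assignment = {p: assign_replicas(p) for p in range(9)}
```

With this spread, a setting such as `min.insync.replicas=2` (an assumed configuration, consistent with the strong-consistency setup described) keeps the topic writable after the loss of any single region.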
C
So that's how you can set up your workload. In this experiment that I'm running, I also have a CockroachDB deployment.
C
So CockroachDB gives you a distributed SQL database.
C
These are the nodes. Also in this case I'm using nine nodes, three in each region, and each node within a region is in a different AZ, so we're trying to get both local availability and global, geographical availability. And we can see another nice feature of CockroachDB that I like: for example, it calculates the latency between the instances and then uses this information to do some internal optimization.
C
But, as you can see, this is obviously a symmetric matrix, or mostly symmetric, and from west to east is where we have the highest latency, around 60 milliseconds.
C
So that tells us, based on what I said before, that once we have distributed the data across the three regions, the best latency that we can get from a transaction here is going to be around 120 milliseconds, or something above 120: you'll pay 120 milliseconds just for the network, and then there is processing, writing, persisting and all of that. So you can immediately start reasoning about what kind of workloads you can run on this database. It's a trade-off.
C
That's it. I think we don't have time to simulate a disaster and see how the system reacts. We would see that both Kafka and Cockroach can auto-detect failures, and they will start reacting to them: we would see that we lose some nodes, we would see that some of the ranges, which are essentially the partitions, get moved around, and the system, like I said, readjusts to the new situation. Kafka does the same thing.
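The auto-detection mentioned here boils down to some form of heartbeat-and-timeout check. The sketch below is an assumed, generic model, not the actual implementation of either Kafka or CockroachDB:

```python
HEARTBEAT_TIMEOUT_S = 5.0  # assumed threshold for declaring a node dead

def alive(last_heartbeat: float, now: float) -> bool:
    """A node is considered alive if it was heard from within the timeout."""
    return (now - last_heartbeat) <= HEARTBEAT_TIMEOUT_S

# Hypothetical last-heartbeat timestamps (seconds) for three nodes:
last_seen = {"node-1": 100.0, "node-2": 100.0, "node-3": 96.0}
now = 102.0

dead = [n for n, t in last_seen.items() if not alive(t, now)]
# node-3 has been silent for 6 s, beyond the timeout, so its partitions
# (or ranges, in CockroachDB terms) would be reassigned to the survivors.
```

Once a node is declared dead, the rebalancing the speaker describes (moving ranges or partition leadership to surviving nodes) kicks in automatically, with no operator intervention.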
B
I think one of the key takeaways here is that in a cloud native world we now have the capability of implementing disaster recovery with different storage systems and different databases and different tools, and in fact you get an order of magnitude better automation and better flexibility than you do with your traditional systems.
B
This is sort of the next logical step for many enterprises as they adopt cloud native technologies, Kubernetes, OpenShift and cloud native storage solutions, and as they look to migrate more mature and more mission-critical workloads that require disaster recovery. So my key takeaway here is to understand.
B
Understand the different layers in your system; understand the different attributes, like the latency, performance and consistency requirements of your applications; and then absolutely take advantage of the composability and the declarative nature of cloud native disaster recovery and all the advantages that brings to your application. And, you know, touch wood, sleep better at night.
A
Thank you. And we had a question come in from the LinkedIn event that I think we'll have to take in no longer than a minute, and it's a great question: what considerations are required for cloud native disaster recovery in a heterogeneous environment? If either one of you wants to take it in one minute; otherwise we can push that to the LinkedIn chat.
A
C
Yeah, I assume by heterogeneous we mean that we don't have a homogeneous cloud provider or infrastructure underneath. Like I said, this architecture relies on capabilities in the networking space.
C
So as long as we can do that east-west communication, and we can discover the instances of our stateful workload on the remote failure domain, and as long as we can set up some level of global load balancing, we will be able to create these architectures. In fact we are doing it, and in collaborating with some of these vendors, what we're noticing is a question of predictability.
C
In these deployments you would like all of these instances to behave the same. But how do you, for example, provision the same IOPS across different cloud providers? They all give this capability, or this SLA, in a different way. Or how do you provision the same computing power? They are slightly different. So those are the things that you may encounter, but there is actually no blocker to building these architectures across a heterogeneous environment.
B
In fact, and I'll just take one more second here, I would argue that tools like Kubernetes are actually designed to abstract your infrastructure and to give developers the capability of getting the same services from different systems. And I think the glue that holds that together, then, is the east-west networking and the load balancing services and things like that on top of it. That's it.