From YouTube: Lightning Talk Disaster Recovery for OpenShift Workloads - Annette Clewett - OpenShift Commons 2022
Description
Lightning Talk Disaster Recovery for OpenShift Workloads
Red Hat OpenShift Commons 2022 @ KubeCon NA
Detroit, Michigan
October 25, 2022
Speakers:
Annette Clewett (Red Hat)
https://commons.openshift.org/gatherings/kubecon-22-oct-25/
Okay, all right. So we'll start off with a problem statement, look at what the Red Hat solutions are, and then talk about some technical assets. Right after that we'll go quickly to a demo; it's a demo that I did as a video.
So the first thing: disaster recovery for applications is not new. If you've been with a financial institution, healthcare, or any of those other companies, you know that you have to have disaster recovery planning, and in some cases you have to show that you have that plan in some amount of detail even to have the application available.
So it's not new, but when we move it onto containerized platforms, there's not a good solution right now. The CNCF, from my vantage point, doesn't have a good solution yet, and we need it. We need it today.
Another concern is: how do we trust it? It's one thing to test it; it's another to know that it's going to be there when you need it. The third thing is that it's a combination of products. Red Hat has different release cycles, and all these products have to come together. So the benefits, obviously, if we can get it to work: we can get something that's easy, automated, and can make it happen with very little human intervention. I'm going to move on, because it takes a while to do the demo. In disaster recovery, and again this is not new, we have two measurements. Sometimes we come up with these measurements with no idea whether they can be met.
So this is actually bringing the two ideas together: you should be able to test that you can meet both your recovery point objective and your recovery time objective. One of them is a measure of how much data you're willing to lose, at a per-application level, and the other one is how long an application can be unavailable.
In the past this has been measured in hours, or sometimes days. We want to measure it in minutes, maybe single-digit minutes. So again, these are not new, but this is how, for this solution, we're going to be able to measure it. Red Hat, along with IBM, over the last two years has been developing two solutions. One we call Regional Disaster Recovery.
The other is Metro Disaster Recovery. Regional Disaster Recovery is the idea that we're doing asynchronous replication of the persistent data, so it has no requirements about how close or how far apart the sites are. Metro is meant to be a synchronous solution; therefore you could have a recovery point objective, meaning data loss, equal to zero.
The way that we get there on these two solutions is using components from Red Hat Advanced Cluster Management; the upstream is called Open Cluster Management (OCM). Then there's Red Hat OpenShift Data Foundation, the product that I've been involved with for the last five or six years. It's going to bring along all of the disaster recovery operators that I'll go through in a minute. And then at the center of this is Red Hat Ceph Storage, which is the software-defined storage that's going to actually do the replication, store the data, and keep track of it.
Continuing with the components, and again these are brought along with OpenShift Data Foundation: we have three new operators. One is the DR Hub operator, and the DR Hub operator's job means it lives on a hub. Conceptually, and I'll show you in a minute, this is a three-cluster solution, or a three-location solution. It can be a two-location solution, but the Hub operator is really the one that has the custom resources to actually do the DR placement and fail over an application.
This does require, from a subscription point of view, an Advanced entitlement for OpenShift Data Foundation. Architecturally, if we look at it, we have a global traffic manager. That is not part of the solution; you do need to have your own geo load balancing, and plugging your load balancing into this is not any different from any other load balancing. If the application is active on cluster one and we want to fail over to cluster two, then once the application is live on cluster two, the geo load balancing needs to redirect the inbound connections. So you see, there's no distance limitation, and we've got asynchronous replication. I show it here going in one direction, but the way to think of it is that it's per application. So we could have an application on the left-hand side that is being replicated and has a failover cluster on the right-hand side, but we could have another application on cluster two whose failover cluster is cluster one.
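As a sketch of that per-application idea (the cluster, application, and policy names here are illustrative, not from the talk), two DRPlacementControls can simply point in opposite directions, so each cluster is primary for one application and the failover target for the other:

# Illustrative only: two per-app DRPlacementControls with the
# preferred/failover roles reversed, so both clusters stay active.
apiVersion: ramendr.openshift.io/v1alpha1
kind: DRPlacementControl
metadata:
  name: app-a-drpc
  namespace: app-a
spec:
  preferredCluster: bos1        # app-a normally runs here
  failoverCluster: bos2         # and fails over here
  drPolicyRef:
    name: example-dr-policy
  placementRef:
    kind: PlacementRule
    name: app-a-placement
  pvcSelector:
    matchLabels:
      app: app-a
---
apiVersion: ramendr.openshift.io/v1alpha1
kind: DRPlacementControl
metadata:
  name: app-b-drpc
  namespace: app-b
spec:
  preferredCluster: bos2        # app-b runs the other way around
  failoverCluster: bos1
  drPolicyRef:
    name: example-dr-policy
  placementRef:
    kind: PlacementRule
    name: app-b-placement
  pvcSelector:
    matchLabels:
      app: app-b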
So it really allows you, as long as you keep enough headroom, to use both clusters and not have just one sitting there idle. So again, no distance limitation. Contrast that with Metro. Metro still has two OpenShift clusters, and you can have more; all of this is done in pairs. So if I had a hundred clusters whose applications I wanted to protect, I could divide them into basically fifty pairs, each with two clusters. Right now everything has a peer cluster, so you're either on the preferred cluster or you're on the failover cluster.
Really important to this solution is an external Ceph storage cluster that is stretched; it's called stretch mode. It will basically provide the storage so that you have two replicas of the data on one side and two on the other side. So if you lose a site, you have the ability to recover the data synchronously; you get no data loss. You still have to move the application over. Also important here is the idea of a monitor node; a monitor is a Ceph service. Somewhere you need to have a fifth monitor, so that if you need to make quorum, there's one that is not going to go down with data center one or two.
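The talk uses an external Red Hat Ceph Storage cluster for this; purely as an illustration of the same topology, Rook's CephCluster resource can express a stretch layout with five monitors, an arbiter zone, and two data zones (the zone names and image version here are assumptions):

# Illustration only: the talk's Metro-DR setup uses an external RHCS
# cluster, but Rook's stretchCluster settings express the same idea:
# five mons, two per data center plus an arbiter tiebreaker, with data
# replicated across both sites.
apiVersion: ceph.rook.io/v1
kind: CephCluster
metadata:
  name: rook-ceph
  namespace: rook-ceph
spec:
  cephVersion:
    image: quay.io/ceph/ceph:v17   # hypothetical version
  dataDirHostPath: /var/lib/rook
  mon:
    count: 5
    stretchCluster:
      failureDomainLabel: topology.kubernetes.io/zone
      zones:
      - name: arbiter              # fifth monitor, third location
        arbiter: true
      - name: datacenter-1         # two mons plus OSDs per data site
      - name: datacenter-2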
So, some technical assets that we have here, and maybe some of you have seen them: I've done quite a few Red Hat office hours. The three there, part one, part two, and part three, are by my colleague Daniel Parkes. He did a great job this last month explaining in great detail how to set up Ceph in stretch mode, how to connect it into two OpenShift clusters, and all the details of that, so that's a really good set. I've also done multiple videos over time; these are links to a few recent ones. And then, if you want to get into the details, there's the actual documentation, which I personally helped with, so I can vouch for it.
There we go. So we're going to do a little video action here. I would have liked to do it live, but it's three clusters and the chances of everything working out are not good. So, getting started here, what I've done, if I can get this thing to go away, is I've installed Pac-Man using Advanced Cluster Management, and right now it's the Pac-Man application. I'm going to play the game, and the reason I want to play it is so that I can create some persistent storage.
So in this case I want to lose super quick, and I'm able to do that just by putting myself in the right position. As soon as I lose, I get a high score, and I'm going to save that high score to the persistent data that is on what we'll call the preferred cluster. It showed it in the other window; I think it was bos1. So now we've installed an application.
We created some persistent data, and now what we want to do is look at failing over to the failover cluster. So this is your ACM console, if you've seen it; this is actually the multicluster console, and we have a new thing called Create DRPolicy. Again, this DR policy is backed by the operators I showed you and their custom resources. So I'm going to give my DR policy an informative name, because if I had a whole bunch of clusters, I'd need to know what this policy applies to.
After that I'm going to go down, and as soon as I choose the two clusters, it goes out and looks to see: are these two separate OpenShift Data Foundation (ODF) storage clusters, or are they the same cluster? The synchronous option is grayed out, so it has actually figured out that these are two different storage clusters and this is going to be an asynchronous relationship. Now, the default sync interval is five minutes, but I'm going to change mine, which means all my persistent data, the delta data, will be replicated every two minutes.
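Under the hood, what the console builds is roughly this DRPolicy (a sketch; the names are illustrative):

# Sketch of the DRPolicy behind the console form; names illustrative.
apiVersion: ramendr.openshift.io/v1alpha1
kind: DRPolicy
metadata:
  name: pacman-2m-policy
spec:
  drClusters:
  - bos1                    # the demo's preferred cluster
  - bos2                    # the demo's failover cluster
  schedulingInterval: 2m    # delta data replicated every two minutes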
So now I have a data policy, but I don't have any applications using it, so I'm going to apply it to my Pac-Man application. As soon as I do that, it's going to create the disaster recovery resources in the namespace for Pac-Man: there's a DR placement control, the placement rule, and a placement decision.
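The DR placement control that lands in the Pac-Man namespace looks roughly like this (a sketch; the selector labels and the placement rule name are assumptions):

# Sketch: the DRPC created in the application namespace. The pvcSelector
# labels and the placement rule name are assumptions.
apiVersion: ramendr.openshift.io/v1alpha1
kind: DRPlacementControl
metadata:
  name: pacman-drpc
  namespace: pacman
spec:
  drPolicyRef:
    name: pacman-2m-policy    # the policy created above
  placementRef:
    kind: PlacementRule
    name: pacman-placement
  preferredCluster: bos1
  pvcSelector:
    matchLabels:
      app: pacman             # selects the PVCs to protect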
The initiation can actually be done by a developer, because it is namespace-scoped. The actual creation of the data policies, though, would need to be done by a cluster admin right now. So I'm going to go into the DR hub. What's important about this is that I'm doing the failover on the hub cluster, so if I had lost communication with one of my clusters, I would still be able to fail over the application.
The hub cluster currently is on a third OpenShift cluster. In the future we're going to be able to do hub recovery, so we'll be able to do two locations and recover ACM. So we're going to add a few parameters; once you add these, they stick in this DRPC. Again, this is namespace-scoped, so this DRPC, the Disaster Recovery Placement Control, is specific to this namespace, and actually to this volume; we could have multiple DRPCs based on different volumes.
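Triggering the failover amounts to setting two fields on that same DRPC; a sketch of the edited resource, with the added fields marked:

# Sketch: the same DRPC as above with the failover fields set.
apiVersion: ramendr.openshift.io/v1alpha1
kind: DRPlacementControl
metadata:
  name: pacman-drpc
  namespace: pacman
spec:
  drPolicyRef:
    name: pacman-2m-policy
  placementRef:
    kind: PlacementRule
    name: pacman-placement
  preferredCluster: bos1
  failoverCluster: bos2       # added: where the app should run now
  action: Failover            # added: initiates the failover
  pvcSelector:
    matchLabels:
      app: pacman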
A volume would be, you know, an image of the storage that you're replicating. So I'm going to go ahead: as soon as I hit Failover, I've changed the status, and now I can go and look in the events, and we'll see that things are happening, failing over. A VRG is a VolumeReplicationGroup, another custom resource, so we can also watch it there. Look closely in the middle: it says bos1, and shortly thereafter it switched over. And no, I didn't video-magic this.
It actually switched that fast. Again, in my test environment I don't have a lot of latency, but we can see now that it's on bos2, so that's the big thing. Basically what we've seen here is an example. Now what we want to see is: is the application still working? Also, here's my global traffic manager proxy, and you can see on the bottom that it switched over to bos2. The inbound connections now are coming into the second cluster, the failover cluster.