From YouTube: OpenShift Commons Briefing: Data Protection and Disaster Recovery Solutions, Venkat Kolli (Red Hat)

Description:
OpenShift Commons Briefing
Data Protection and Disaster Recovery Solutions for OpenShift
Venkat Kolli (Red Hat)
2020-02-20
A: Before we get started, I just wanted to mention that we have now opened registration for the next OpenShift Commons Gathering in Amsterdam. It's co-located on day zero with KubeCon Europe, on March 30th. And the fun news about this: if you are a container storage person, or interested in container storage at all, we are hosting a half-day hands-on workshop at the OpenShift Commons Gathering, called the OpenShift Container Storage for Admins hands-on workshop, and you can sign up for it as part of your registration for the OpenShift Commons Gathering.
A: There are lots of other wonderful things going on as well, on the main stage and elsewhere. So we hope you'll join us in Amsterdam. You can register today; it's open, and it's only 49 euros, which is probably your best deal of the week to get a whole lot of information about what's going on in the Kubernetes and OpenShift ecosystems. So with that, I'm going to hand it over today to Venkat Kolli.
B: Thank you, Diane. Hello, everyone. Let me introduce myself: I'm Venkat Kolli, a product management director for OpenShift storage. OpenShift Container Storage, OCS as it's called, is the native storage solution for OpenShift. Specifically, in this session we're going to be talking about the data protection and disaster recovery solutions that you should consider when you're deploying OpenShift.
B: ...how the functions work, but for this session we're going to give a general portfolio view. Okay, with that said, let's dive right into it. So when you think about backup and DR, these are obviously the solutions that are meant to protect you from failures. (Hello? There seems to be some crosstalk.)
B
So
not
all
failures
are
the
same
right.
So
if
you
generally
look
into
the
failures
that
are
that
impacts
your
applications
and
data
right,
you
can
broadly
categorize
them
into
two
broad
categories,
all
right,
so
one
is
logical
failures
or
software
failures.
These
could
be.
You
know
you
user
errors,
where
somebody
accidentally
deletes
a
file
or
someone
intentionally
deletes
a
file.
It
deletes
some
data
right
and
your
software
bugs
or
some
virus
and
the
other
malicious
software
impact.
B
So
in
these,
these
failures
are
the
ones
where
there's
something
wrong
with
the
application
logic
or
the
application
data
right.
But
your
hardware
is
intact
right.
Your
your
data
center
and
hardware
is
running
okay,
but
you
lose
either
a
piece
of
data
or
an
entire
application
itself
and
got
corrupted.
So,
in
this
case,
the
most
logical
thing
to
do
is
for
you
to
go
back
in
point
in
time
right
before
the
failure
has
occurred
and
take
a
good
copy
of
the
data
and
restore
it
back
to
your
primary
cluster
right.
B
So
because
your
hardware
is
running
fine,
your
cluster
is
not
impacted
from
hardware
standpoint,
so
the
quick
recovery
mechanism
is,
you
know,
going
back
in
point
in
time
and
recovering
from
there.
Yes,
it
does
involve
in
data
loss,
at
least
you
or
you
have
a
good
copy
to
start
off
with
right.
So
this
is
where
the
backup
solutions
typically
dwell
all
right.
So
this
is
how
the
backup
solutions
work,
so
they
basically
are
built,
for.
You
know
for
software
failures
and
maintains
multiple
copies
to
recover
to
recover
from.
B
In
fact,
the
right
word
to
use
is
restore
right,
restoring
the
data-
and
you
have
this
other
class
of
failures
generally,
which
are
less
than
less
common,
but
have
much
more
devastating
effect
right.
So
this
is
where
your
hardware
failed
or
or
could
be
you
to
hold
the
center
has
failed.
The
hardware
fails
either
based
on
some
components.
B
I
mean
it
describes,
or
you
know
some
other
part
of
your
hardware
that
fails
and
takes
away
the
node
of
the
cluster
right
or
you
could
have
an
HVAC
issue
or
a
power
grid
issue
where
your
data
center
is
down
right.
So
in
this
case,
obviously,
you
cannot
wait
for
this
to
be
repaired
and
recovered
right.
So
the
recovery
mechanism
is
that
you
failover
to
a
remote
site
that
you
previously
set
up
and
where
you
have
been
copying
the
data
to
all
right
and
move
your
applications
running
off
from
your
dr
side
right.
B
So
this
is
your
failover
to
either
a
standby
or
a
hot
side
and
we'll
get
into
the
details
about
the
different
dr
sites
that
you
know
that
mechanism
to
have.
So
these
are
the
dr
solutions.
These
are
typically
a
function
of
a
storage
at
the
underlying
storage
and
right.
So
that
basically
protects
you
from
from
me,
or
you
know
physical
and
datacenter
failures.
Now
the
common
tools
and
technologies
that
are
used
for
for
data
protection.
These
are
like
very
common
set
of
tools
that
are
used.
B
So,
let's
start
off
with
say
data
mirroring,
so
data
mirroring
is
essentially
a
synchronous
mirror
copies
that
most
modern
applications
and
the
storage
systems
do
right.
So
whenever
you're
writing
an
application,
is
writing
a
IO
or
any
transaction
being
complete
before
the
transaction
gets
complete?
It
synchronously
makes
sure
there
are
more
than
one
copy
it's
right
to
more
than
one
place
and
they're
always
consistently
and
minaret.
So
you
always
have
I.
You
know
a
full
consistent
copy
before
the
transaction
or
IO
is
complete
right.
B
So
this
way,
even
if
you
you
know,
have
a
node
or
a
single
copy
failure
without
application
without
a
beat,
can
can
continue
running
on
the
other
good
copies
right.
So
there
is
no
data
loss
here
or
any
application
downtime
with
the
data
mirroring.
Obviously
it
comes
with
some
limitations
and
we'll
go
through
this
in
detail
later,
but
that's
essentially
what
the
data
mirroring
is.
So
just
as
an
example,
the
OCS,
which
is
the
native
storage
of
OCP,
does
have
a
native
data.
Mirroring
replicate
Marine
Corps
built-in
into
it.
B
So
by
default
it
always
writes
three
copies
and
consistently
keeps
them
copies.
Synchronously
mirrored
right.
So
that's
the
data
mirroring
and
snapshot.
You
know
most
of
you
probably
already
know
and
heard
about
snapshots
all
right,
so
this
is
basically
where
you
consistently
take
point-in-time
copies
right
and
you
you
know
you
keep
a
set
of
them.
So
when
you
need
to
go
back
to
the
previous
copy,
because
your
latest
copy
either
your
latest
data
is
either
corrupted
or
are
lost
right.
You
can
restore
from
one
of
this
point
in
time
copies,
so
this
step
shows.
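As a concrete sketch of taking a point-in-time copy and restoring from it on OpenShift, here is roughly what a CSI volume snapshot and a restore request look like. This is an illustrative fragment, assuming a CSI-capable storage backend; the names (`mysql-data`, `my-app`, `csi-snapclass`) are hypothetical placeholders, not from the talk.

```yaml
# Take a point-in-time copy of an existing PVC
# (CSI snapshot API, v1beta1 at the time of this talk).
apiVersion: snapshot.storage.k8s.io/v1beta1
kind: VolumeSnapshot
metadata:
  name: mysql-data-snap        # hypothetical snapshot name
  namespace: my-app            # hypothetical application namespace
spec:
  volumeSnapshotClassName: csi-snapclass   # assumed CSI snapshot class
  source:
    persistentVolumeClaimName: mysql-data  # the volume being protected
---
# Restore: a new PVC whose contents come from the snapshot, via dataSource.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: mysql-data-restored
  namespace: my-app
spec:
  dataSource:
    name: mysql-data-snap
    kind: VolumeSnapshot
    apiGroup: snapshot.storage.k8s.io
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi
```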
B
Typically,
it
makes
the
core
foundation
of
any
backup
solution,
or
sometimes
even
the
dr
solutions
right
and
building
on
snapshots
are
the
backup
applications,
so
the
backup
applications
efficiently.
You
know
take
this
point
in
time
coffees
now
this
could
dual
them
right
or
you
can
set
up.
You
know
certain.
You
know
some
mechanism
where
you
can
have
certain
policies
where
you
can.
Actually
you
know,
keep
those
copies
at
the
locally
or
a
remote,
so
that
yeah
and
also
restore
from
them
when
when
the
time
is
so,
this
is
essentially
an
application.
B
Typically,
in
you
know,
in
most
traditional
enterprises,
you
have
a
full-blown
backup
applications
that
are
built
with
very
rich
backup
policies
all
built
into
them
right.
So
that's
basically
a
backup,
a
solution
that-
and
that
is
typically
built
on
snapshots.
And
lastly,
the
other
common
mechanism
to
use
for
data
production
is
data
replication
and
again
this
is
a
storage
function
where
you're
now
asynchronously
copying
the
data
to
a
remote
destination
right.
B
So
these
being
the
common
tools.
So
how
do
they
come
together?
So
when
you
think
about
it
approaching
solutions
right?
So
you
basically
are
driven
by
what
is
something
called
SLO
right,
service
level
objective,
and
so
you
take
an
application.
You
got
to
determine
the
business
value
of
that
application
and
how
much
service
are
how
much
critical
is
that
application
to
your
business
right
and
based
on
that?
The
most
common
two
metrics
that
are
used
are
called
RPO
and
RTO,
and
they
constitute
what
a
service
level
objective
is
right.
B: The RPO, the recovery point objective, defines how much data loss you can tolerate. Obviously you'd want none, but protecting an application without any kind of data loss gets more expensive, so you have to rely on more expensive solutions. If the business or the application calls for it, yes, you go to the most critical zero-data-loss solutions, but not all applications might require that level of guarantee, so you have the option of going to more loosely defined solutions. The other metric is the RTO. This defines how quickly you need to bring the application back up.
B
So
if
your
application
in
credit
downtime,
how
quickly
do
you
need
to
bring
that
back
up
and
make
it
make
it
running
and
just
as
before,
right?
Obviously,
you
don't
want
to
have
any
application
downtime,
but
again
you
know
for
that
to
happen
right.
It
requires
expensive
solutions,
so
you
have
to
basically
make
a
judgment
on.
You
know
where
your
will
do.
B
You
lie
on
the
spectrum
of
RPO
and
RTO
right
and
if
you
look
into
solutions
that
we
just
are
talking
about
so,
for
example,
the
mirroring
is
the
one
that
basically
provides
you
the
most
consistent
copy
without
any
data
loss
and
also
helps
you
to
recover
automatically
right.
So
this
is
getting
close
to
our
0
RP.
U
+,
0
RTO
with
that.
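Stated a bit more formally (a standard formulation, not from the slides): if a failure happens at time $t_f$, the last good copy was taken at $t_c$, and service is restored at $t_r$, then the two objectives bound

```latex
\underbrace{t_f - t_c}_{\text{data lost}} \;\le\; \mathrm{RPO},
\qquad
\underbrace{t_r - t_f}_{\text{downtime}} \;\le\; \mathrm{RTO}.
```

Synchronous mirroring drives $t_f - t_c$ toward zero; automated failover drives $t_r - t_f$ toward zero.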
B
Obviously,
in
a
solution
gets
expensive
right
and
especially
if
you
are
trying
to
protect
from
many
other
disasters
that
are
more
Geographic
in
nature,
but
we'll
get
into
that
later
and
getting
on
that
scale
a
little
bit
more.
You
know
loser
you,
application
of
that
is
replication
right,
where
you
are
replicating
the
data
to
a
longer
distance,
so
that
obviously
might
involve
asynchronously.
That
would
involve
you
know
some
data
loss
but
you're
protected
against
more
types
of
disasters
with
this
solution
right
and
both
replication
and
mirroring
are
basically
are
built
for
hardware
failures.
B: ...you choose the right solution for that. So, getting into a little more detail on each one of them, let's start off with data protection with the snapshots and backup. This is what I said before: based on snapshots, you build a backup solution on top. Snapshots are always point-in-time copies, and most modern storage systems provide you incremental copies, so it's not a full copy
B
Every
time
you
take
a
snapshot,
you
have
a
full
base
copy
and
then
all
the
other
copies
later
on
are
basically
incrementals
of
those
copies.
So
you
basically
from
a
story
standpoint
and
also
from
time
it
takes
a
snapshot,
is
a
lot
more,
a
lot
more
quicker
and
the
local
snapshot
is
your
first
defense
of
you
know
prediction
right.
So
these
are
located
on
your
the
same
cluster
and
is
taking
up
the
same
space
as
your
primary
cluster
right
and
you
could
quickly
recover
because
it's
local
on
the
same
node,
also
on
the
same
cluster.
B
So
whenever
you
have
a
failure
or
if
you
lose
a
file
or
have
some
data
loss
right
or
sorry,
if
you
have
any,
you
know
failure
incident,
you
can
quickly
go
back
to
your
prior
in
point
in
time
and
you
know
get
that
restored
from
there
locally.
Now.
Why
wouldn't
in
any
one
use
this
right
as
their
as
their
only
backup
mechanism?
One
is
obviously
because
you're
using
your
primary
storage,
so,
as
you
increase
in
a
number
of
snapshots
one,
it
is
eating
up
your
primary
storage
capacity.
B
For
for
your
backup,
copies
and
also
the
other
thing
to
consider
is
that
as
the
depth
of
the
snapshot
so
and
when
you
say
def,
it's
the
number
of
snapshots
that
you're
retaining
right
as
the
number
of
snapshots
that
you're
retaining
goes
higher,
there
will
have
an
impact
on
the
on
the
application,
because
these
are
point
in
time
incremental
copies.
You
know
it's
primarily
a
snapshots,
it's
created
by
a
pointer
mechanism,
so
you
have
more
processing
that
needs
to
be
done.
B
So
the
other
downside,
obviously,
is
that
if
you
lose
your
primary
cluster-
or
you
know
you
know
any
nodes,
you
know
you
lose
your
price
snapshots
along
with
it,
so
that
calls
for
the
snapshots
to
be
stored
and
in
a
backed
up
to
a
remote
destination.
So
that
is
where
the
backup
solutions
come
in
right,
so
the
backup
solutions
essentially
are
taking
the
snapshots
and
copying
into
a
more
cheaper
or
an
external
storage
in
a
typically
s3
object.
B
Store
is
a
more
common
way
to
are
more
getting
more
common
for
this
backup
targets
right
and
these
scale
well-
and
also
you
know-
is
that
much
lower-cost
so
right,
so
you
can
have
more
backup
copies
retained
on
this,
and
this
object
store
could
be
located
far
out
for
more
disaster
like
scenarios
or
could
be
within
the
same
campus
or
in
a
metro
network.
So
you
can
have
a
quick
recovery
happening
from
there
right
and
by
the
way,
the
same
mechanism
where
you're
backing
up
to
a
remote.
B
You
know
this
nation
can
also
be
used
for
my
migration
migrating
of
the
workloads.
So
in
fact,
OCP
has
a
migration
tool.
You
probably
might
have
come
across
called
camp.
This
is
actually
built
on
the
same
backup
mechanism
that
you
know
that
that
we
are
built
so-
and
you
know
in
one
of
the
follow-on
presentations
we
can
go
in
more
details
and
how
the
how
that
works.
But
this
is
also
a
common
way
to
use
for
migration,
and
the
other
aspect
to
consider
for
the
backups
is
that
so
these
are
managed
by
policies.
B
So
you
can
actually
see
it.
You
know
set
up
a
policy
on
whenever
you're,
creating
an
application
or
volume
for
that
application
right.
You
can
basically
define
your
scheduling
and
retention
policies.
Rights.
Oh
I,
want
this
to
be
backed
up
every
15
minutes
or
every
hour
and
I
want
to
be
retained
for
seven
days
or
for
for
a
month
right.
So
these
rescheduling
intentions
are
basically
specified
in
the
backup
policy,
and
these
backup
policies
could
be
codified
into
storage
classes,
so
you
could
actually
have
a
predefined
storage
class.
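As a sketch of such a scheduling-and-retention policy, here is roughly what it can look like with Velero, the open-source backup tool mentioned later in the talk. The names and cadence are hypothetical; `schedule` is a cron expression and `ttl` is the retention period.

```yaml
# Velero Schedule: back up the my-app namespace every hour, retain for 7 days.
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: my-app-hourly          # hypothetical name
  namespace: velero
spec:
  schedule: "0 * * * *"        # cron: at the top of every hour
  template:
    includedNamespaces:
      - my-app                 # hypothetical application namespace
    ttl: 168h0m0s              # retain each backup for 7 days
```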
B
So
storage
class
is
a
way
where
you
can.
You
know
they
define
your
dynamic
volume
right,
your
PVC
volume
right,
so
you
can
basically
create
a
storage
class
and
define
these
policies
into
the
storage
class.
So
if
somebody
just
shows
in
a
gold
class
storage
right,
they
come
up
with
a
very
frequent.
You
know
snapshots
and
right
and
have
a
longer
retention,
for
example,
right
so
so
on
and
so
forth.
So
that
is
how
the
you
know
the
backup
policies
are
defined
and
and
used
in
the
OCP.
So
we
just
talked
about
the
storage
here.
B
Storage
backups,
but
you
know,
backup,
is
not
just
about
the
storage
but
about
your
entire
application
and
kubernetes
allows
and
specifically
OpenShift
allows
you
to
actually
have
your
backup
defined
at
an
application
level
right.
So
you'd
want
to
have
your
protection
for
a
consistent
protection
for
your
whole
application,
rather
than
a
volume
by
volume
basis
right.
So
what
is
an
application
level?
Backup
comes
this
stuff,
so
start
off,
we
get
at
the
core.
Obviously
it
is
storage
application
data
right.
B
This
CSI
interface
is
also
being
expanded
to
cover
snapshot
functionality
right,
so
the
snapshot
interface
is
now
you
know
getting
standardized
with
the
CSI
interface
right
and-
and
this
is
currently
in
beta,
but
it's
going
to
go
into
GA
Suen
in
in
few
releases,
so
at
a
core
storage
level,
so
you
have
the
application
volumes
getting
snapshot
at
through
the
CSI
interfaces.
Now
coupling
the
with
this
application
data,
our
application
volumes,
are
the
cluster
resources
right
that
are
associated
to
these
this
application.
So
when
we
say
application,
what
does
the
application
mean
in
kuben?
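For reference, the standardized CSI snapshot interface mentioned here binds snapshots to a particular CSI driver through a snapshot class. A minimal sketch (the driver string is an assumption for a Ceph-RBD-based deployment such as OCS; substitute your own driver):

```yaml
apiVersion: snapshot.storage.k8s.io/v1beta1    # beta API at the time of this talk
kind: VolumeSnapshotClass
metadata:
  name: csi-rbd-snapclass                      # hypothetical name
driver: openshift-storage.rbd.csi.ceph.com     # assumed CSI driver for OCS/Ceph RBD
deletionPolicy: Retain    # keep the storage-side snapshot if the object is deleted
```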
B: You can label the pods and containers for the application of interest with one label, and you could back up based on that label, or you could say, back up this entire namespace: every application component of that namespace needs to be backed up. And when we talk about namespace-level backup, we're not just talking about the application data, the application volumes we talked about before, the PVCs, but also the cluster resources that belong to this namespace. These are the namespaces, or projects:
B
You
know
your
deployment
stateful
sets
right
your
config
map
secrets
and
you
know,
amble
files,
everything
that
is
associated
to
the
namespace.
So
when
you
are
able
to
back
up
write
your
application
data
along
with
the
cluster
metadata
right
now,
you
have
a
full
name,
space
or
application
level,
consistent
backup
right
so
for
you
to
be
able
to
either
migrate
this
application
or
restore
this
entire
application.
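A namespace- or label-scoped backup of this kind can be expressed roughly like this (a Velero sketch; the namespace and label values are hypothetical):

```yaml
# Back up everything in the my-app namespace carrying the app=my-app label:
# cluster resources (Deployments, ConfigMaps, Secrets, ...) plus volume snapshots.
apiVersion: velero.io/v1
kind: Backup
metadata:
  name: my-app-backup          # hypothetical name
  namespace: velero
spec:
  includedNamespaces:
    - my-app                   # hypothetical application namespace
  labelSelector:
    matchLabels:
      app: my-app              # scope by label as well as by namespace
  snapshotVolumes: true        # also snapshot the PVCs backing the application
  ttl: 720h0m0s                # retain for 30 days
```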
B
It
is
much
much
easier
if
you're
able
to
have
one
consistent
copy
that
covers
both
the
configuration
data,
cluster
configuration
data
and
the
application
data
right,
and
that
is
what
kubernetes
allows
and
specifically,
we
are
building
into
the
openshift.
Again,
as
I
said,
you
know
more
details,
more
technical
details,
we'll
cover
in
the
follow-on
sessions,
but
I
just
want
to
kind
of
quickly
quick,
give
on
how
openshift
allows
you
to
basically
provide
application
level
backups
now
building
on
so
now
you
covered
up
to
the
cluster
right,
so
you
have
a
whole
cluster
covered
now.
B
Sometimes
you
know
for
you
to
be
able
to
get
a
crash,
consistent
copy
crash,
consistent
copy,
meaning
that
you
know,
if
there's
a
system
that
is
there's
going
down,
you
have
a
copy
that
is
good
just
before
that
crashed
right,
so,
which
means
that
a
group
of
the
volumes
are
all
consistent
together
right
before
the
time
of
the
crash
right.
So
that
requires
quieting
of
that
application.
B
That
requires
the
application
to
flush
all
its
caches
and
all
its
data
to
the
persistent
storage
right,
so
that
a
snapshot
can
be
taken
so
that
that's
a
quiet
operation
of
their
application.
So
all
of
these
procedures
are
can
be
automated
with
this
application.
Operators
right
and
the
backup
solutions
typically
provide
specific
application
agents
to
do
that
right.
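One common way such agents hook in is via pre/post backup hooks that quiesce and unquiesce the application around the snapshot. For example, Velero supports hook annotations on the pod template; this sketch freezes a filesystem during the backup (the container name and mount path are hypothetical, and it assumes `fsfreeze` is available inside the container):

```yaml
# Pod template annotations (e.g. on a Deployment or StatefulSet):
# Velero runs the pre hook before snapshotting the pod's volumes,
# and the post hook afterwards.
metadata:
  annotations:
    pre.hook.backup.velero.io/container: app   # hypothetical container name
    pre.hook.backup.velero.io/command: '["/sbin/fsfreeze", "--freeze", "/var/lib/app-data"]'
    post.hook.backup.velero.io/command: '["/sbin/fsfreeze", "--unfreeze", "/var/lib/app-data"]'
```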
B
So,
combined
with
this
application
agents
and
with
the
clustered
metadata
and
the
application
data
forms
the
full
application
level,
backup
solution
so
specifically
for
OCP
and
we'll
see,
s-so
OCS
is
built
with
the
incremental
snapshots
at
they're,
built
with
CSI
that
are
going
to
be
introduced
soon
and
Rho
CP
Works
uses
Valero,
which
is
other
in
an
open
source.
You
know
backup
solution
to
be
able
to
provide
a
consistent
cluster,
consistent
backup
at
a
namespace
level.
B
Right
and
working
along
with
our
backup
partners,
will
be
able
to
offer
a
full
fledged
backup
solution
at
an
application
level
you
know
for
for
for
OCP
users
right.
So
that's
a
quick
overview
of
what
the
backup
solution
is
so
quickly.
Moving
on
to
the
dr.
So
this
is
the
basically
dr,
as
I
said,
is
build
a
solution
built
on
your
application
right.
B
It
is
where
you're
a
synchronously
copying
the
data
to
anymore
side
two
and
these
remote
sites
are
scheduled,
are
set
up
at
far
enough
distances
where
you
know
you're
not
affected
by
you're,
protected
against
in
the
geographical
failures,
floods,
fire
or
power
grids,
or
any
of
that
right.
So
that's
what
typically,
you
define
your
blast
radius,
which
is
basically
what
defines
your
protection
failure
you're
protecting
from
what
gets
impacted.
B
You
know
during
that
you
know
in
the
distance,
so
you
schedule
your
dr
site
and
connected
by
van,
for
you
know
that
that
that
is
beyond
the
blast,
radius
right
and
because
you
are
relying
on
an
asynchronous
replication
to
protect
against
these
failures
right.
There
will
be
a
data
loss
involved
right,
so
there
will
be
not
a
solution
where
you
can.
You
know
predict
from
you
know,
with
without
any
complete
data
loss
right
to
protect
against
this
long
distance.
B
You
know,
Vandy,
you
know
in
a
Vandy
are
so
typically
when
we
talk
about
the
asynchronous
volume
replication
again.
This
is
done
at
a
storage
level.
All
the
volumes
belong
to
the
applications
are
consistently
replicated
to
the
dr
side
right
and
the
schedule
that
you
scale
that
you
set
up
for
this
replication.
It
really
depends
upon
what
your
RPO
needs
are
right
again.
How
far
do
you
want
to?
Are
you
willing
to
lose
the
data
now,
if
you
want
to
schedule
at
a
very
frequent
interval
like
even
seconds
it
is
possible.
B
However,
your
network
also
defines
you
know
the
bandwidth.
You
need
to
have
enough
bandwidth
network
bandwidth
to
handle
the
data
right.
So
if
you
are
doing
a
very
frequent
replication,
which
means
every
change
that
is
happening
in
that
very
short
interval
needs
to
be
transferred
over
the
network
to
the
remote
site
and
and
if
your
network
cannot
handle
that
you
get
conditioned
there
and
obviously
you
will
fail
and
to
keep
up
with
you
know
with
the
changes
and
that
impacts
your
applications.
B
So
both
your
RPO
target
and
your
available
network
bandwidth
defines
what
your
replication
interval
is
would
be
right.
But
one
positive
aspect
of
this
is
that,
because
this
replication
is
async,
your
application
is
a
latency.
Doesn't
get
impacted
right
so
because
you
write
your
data
to
your
primary
side
and
your
replication
takes
over
later
right
to
make
the
you
know
to
keep
this
car.
You
know
changes
transferred
over
to
the
remote
side.
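The sizing constraint above can be written down directly. With illustrative numbers (not from the talk): if $\Delta D$ bytes change during each replication interval $T$, the WAN link must sustain at least

```latex
B_{\min} \;\ge\; \frac{\Delta D}{T},
\qquad\text{e.g.}\qquad
\frac{9\ \mathrm{GB} \times 8\ \mathrm{bit/byte}}{300\ \mathrm{s}} = 240\ \mathrm{Mbit/s}
```

for 9 GB of changes on a 5-minute interval. If sustained throughput falls below $B_{\min}$, the replica falls progressively further behind and the effective RPO grows.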
B
So
that's
basically
on
the
asynchronous
volume,
replication
and
I
said
this
is
a
storage
function,
mostly
a
tentative
volume,
but
there
are
also
some
applications
that
handle
you
know
the
changes
and
replication
at
the
application
level,
but
those
are
less
common
now,
so
that
is
making
the
data
available
to
the
remote
site
is
one
aspect
of
the
solution,
so
the
other
key
aspect
of
the
our
solution
is
a
failover
itself
right.
So
it's
the
process
of
where
you're
transferring
your
application
from
your
primary
site.
B
We
just
failed
to
your
remote
site
and
make
the
application
better.
You
know
being
and
bringing
up
the
application
there
and
you
know
going
or
the
recovery
process
there
now.
One
thing
good
about
openshift
and
kubernetes
is
that
this
failover
mechanism,
which
typically
is
mostly
manual
in
the
traditional
world,
can
be
fully
operated.
We
couldn't
fully
automated
with
the
operators
right,
so
the
same
operators
today
that
we
use
to
bring
up
your
applications
on
your
primary
site.
B
We
use
the
same
mechanism
to
have
an
automated
failover
management
done
and
even
fail
back
fail
back
is
something
there.
Once
you
fail
over
to
your
remote
site,
you
repair
your
primary
site
and
then
once
your
primary
site
is
repaired,
you
want
to
get
back
the
original
state
right.
So
that's
a
fail
back
to
your
repaired
side,
so
both
this
failover
and
failback
can
be
automated,
and
since
these
are
automated,
you
know
you
can
do
this.
B
You
can
achieve
a
very
quick
failover
so
which
results
in
a
very
low
RTO
right,
recovery
time
objective,
if
it's
a
manual
set
of
operations
where
you
how
to
follow
ten
steps
for
this
application
to
be
brought
up,
which
is
typical
in
them
in
the
traditional
world
right
that
takes
a
much
longer
time.
But
with
this
this
allows
you
to
have
a
query
quick.
You
know
recovery
of
the
applications.
B
This
not
only
helps
you
during
your
failover
failed
back
operations,
but
also
the
more
frequent
testing
that
you
do
so
once
you
set
up
with
the
our
site.
Most
companies
require
you
to
you
know
to
do
test
for
readiness
of
this.
Dr,
so
you
actually
have
you
know
dr
testing
mechanism.
That
also
can
be
done
with
through
the
opera,
the
automated
operations
with.
So
that
also
helps
you
with
the
quick
thing.
So
one
other
thing
to
note
is
a
room
where
we
keep
using
the
word
automated,
not
automatic
right,
because
it's
always
triggered
manually.
B
No
one
wants
to
have
an
automatic,
which
means
a
you
know.
An
on
manually
intervened
feel
over
because
closes
an
asynchronous
dr
solution,
there's
always
a
data
loss
involved.
So
there's
always
a
judgment
hat
that
has
to
be
made
whether
you
want
to
failover
to
the
remote
side.
Once
you
declare,
you
know
your
primary
site
inoperable
right
or
you
want
to
invest
in
bringing
your
primary
site
back
up,
which
could
be
quicker
and
where
you
cannot
really
don't
need
to
incur
it
at
a
loss
right.
So
there's
always
a
judgment
that
needs
to
be
made.
B
So
more
so,
it
is
always
required
for
a
manual
intervention
to
be
made,
but
once
you
decide
to
failover,
you
need
to
be
able
to
just
click
one
button
and
do
an
entire
failover
very
quickly.
So
that
is
automation
right
and
in
the
case
of
OCP
and
OCS
right.
So
we
do
have.
The
self
is
synchronous,
mirroring
that
does
the
data
mirroring
a
data
replication
solution
at
a
volume
level,
and
we
have
the
operators,
storage,
operators
that
helps
you
to
restore
your
volumes
and
your
applications
backup
right.
B
So
this
is
basically
using
or
OCS
fluke
operators.
So
now
you
have
both
an
automated
failover
management
and
an
asynchronous
mirroring
built
into
the
you
know:
open
shipped
storage
right
and,
as
I
mentioned
before,
we'll
get
into
more
details
in
the
following
sessions,
but
that
is
how
the
solution
is
defined
for
for
asynchronous
replication
now
getting
into
a
more
critical
applications.
B: these also kind of fall into your HA category. HA and DR go hand in hand, and this is especially the case where some call this an HA solution, a high-availability solution, that can also work as a DR solution if you use it effectively. So here, what we're really looking at is that your DR sites are basically defined as availability zones, and an availability zone is defined as something that has a full,
basically, defined failure domain. Typically, when customers define availability zones, they set them up in such a way that your HVAC or your power grids are not crossing over the zones: the different power grids or HVAC systems have redundancy for each of these availability zones. This could be as simple as a rack, a rack-level zone, or it could be a data center building,
B
You
know
a
building
in
campus
as
an
available
to
zone,
and
you
have
your
next
zone
in
the
in
a
different
building
in
the
campus
or
they
could
be
spread
in
a
metro
distance
right.
So
you
could
have
it.
You
know
in
one
across
the
town
from
each
other
right.
So
this
is
a
place
where
you
need
to
have
three
availability
zones,
because
it's
always
an
odd
number
is
required
because
of
the
quorum
issues
right.
B: The other aspect to remember is the latency between these zones; the network running between the zones is very important, because these copies are made synchronously, so the more latency the network incurs, the more it impacts your application. Typically, you should not exceed more than ten milliseconds of latency.
B
So
it
knows
essentially
where
all
these
the
copies
are,
and
even
if
a
single
zone
or
a
copy
that
is
failed,
it
distributes
the
load
across
those
two
other
remaining
zones
right
and
there's
a
solution
that
does
not
incur
any
data
loss
and
you
have
a
full.
You
know
you
know
solution
that
has
an
automatic
recovery
and
no
data
loss
right,
but
obviously
this
can
be
used
for
solutions
that
has
the
protect
against
a
big
blast
radius,
which
means
that
any
base,
Geographic
failures,
events
like
earthquakes
or
fires-
you
know
you
get
impacted
by
all.
B
Three
zones
gets
impacted
by
that
right,
so
it
does
not
protect
you
from
those
those
failure.
Events.
So,
given
these
three
solutions,
your
backup
and
your
a
synchronous,
dr
and
your
data,
mirroring
solutions
right,
so,
as
you
could
see,
each
of
them
covers
different
failure
scenarios
and
provide
you
options.
Different
recovery
options
for
four
against
those
failure
scenarios,
and
now
the
more
common
way
is.
You
know
the
users
tend
to
use
all
three
of
them
or
a
combination
of
them
together
to
come
up
with
their.
B
You
know
their
prediction
mechanism,
for
you
know,
for
you
know
for
your
for
your
application
right.
So
you
start
off
with
hardening
the
primary
site
with
you
know,
with
this
multi
zone
solution
that
I
talked
about
the
stretch
clustering
right.
So
this
is
where
you
can
actually
have
your
primary
site.
Your
primary
data
center,
you
know
defined
with
three
different
zones,
and
you
have
your
cluster.
B
You
know
spread
across
these
three
zones
and
hardened
accordingly
right,
so
any
hardware
nodes,
you
know
you
still
can
run,
keep
running
on
your
primary
side
without
having
to
have
a
failover
and
your
application
runs
without
missing
a
beat
right
now
combine
that
with
the
DR
solution.
Now
you
have
a
solution
that
pretends
to
Geographic
failures
right,
so
you
have
a
nice
increased
data,
mirroring
that
is
done
from
your
primary
cluster
to
a
remote
cluster
right
now
you
have
Geographic
protection
against
Geographic.
You
know
failures
right.
B
Obviously,
this
could
incurred
at
a
loss,
but
again
you
have
failure
against
most
calm
most.
You
know
major
disasters
with
this
right
and
along
with
that,
so
that
that
handles
your
physical
or
datacenter
failures,
and
you
combine
that
with
your
backup
solution
that
handles
the
logical
failures
as
well
right
now,
you
have
previous
point
in
time,
coffees
that
are
stored
in
a
remote
site,
right
and
or
in
your
chip
or
object
storage,
and
that
helps
you
to
go
back
point
in
time
when
your
latest
copies
no
longer
good
right.
B
So
the
key
difference
you
might
have
noticed
here
is
that
with
a
dr
site,
you
always
have
your
most
recent
copy
available
and
replicate
to
the
DR
side
right
so
you're,
trying
to
recover
to
your
most
recent
copy
with
the
backup
solution.
You're
choosing
your
previous
point
in
time
copy
when
the
most
recent
copy
is
corrupted
right.
So
if
you
have
a
corruption
on
your
primary
site,
the
same
corruption
gets
replicated
to
your
site
as
well,
so
it
doesn't
protect
you
from
a
logical
failure.
B: That's a very good question, Rob. So clearly your data protection policies get defined by that, and in fact there are multiple factors that drive your cluster design: whether to have one large cluster or multiple clusters. And if you look into how the community is moving and where most of the thinking is going,
B
Most
people
are
leaning
towards
having
multiple
clusters,
multiple
small
clusters
that
are
defined
for
each
specific
workload.
Right
because
there
are
a
lot
of
you
know:
innovation
happening
in
this
multi
cluster
management
right
there,
the
clusters,
so
you
have
tools
and
technologies
that
are
being
developed
to
handle
the
management.
B
Yeah,
so
backup
of
all
these
tools
actually
has
the
least
impact
on
your
primary
application
performance
right
so
because
the
backups
are
built
on
the
snapshots
technology
and
typically,
what
happens
is
that
when
a
backup
needs
to
happen
right,
it
takes
a
quick
snapshot.
First,
it
needs
to
cause
the
application
to
get
a
consistent
snapshot
right.
B
That's
the
primary
impact
of
the
performance,
but
these
days,
these
most
modern
applications
have
a
quick
wise
mechanism
where
it
flushes
the
cache
onto
the
disk
right
and
the
moment,
early
freezes,
the
transactions
and
and
because
these
are
point
in
time
copies,
especially
the
worst
years
of
you
know.
The
storage
of
OCP
supports
the
point
in
time.
Copy
snapshots
are
instantaneous,
almost
instantaneous
right
once
the
snapshot
is
taken,
your
application
is
free
to
move
on
and
the
backup
process
happens
thereafter
right.
A: If people have more questions, you can always check out the OpenShift products page for container storage; there are links there to all the different pieces of the container storage solutions, and these backup and disaster recovery pieces, and you can always reach out to Venkat Kolli directly and find him on the Internet as well. I'll
A
Just
reiterate
again
that
we
are
there
the
make
sure
we
are
hosting
a
deep
dive
workshop
in
Amsterdam
in
a
couple
of
weeks,
a
couple
of
what
a
little
more
than
a
couple
weeks
on
March
30th
at
KU
Con,
and
you
can
register
for
that
at
on
Commons
at
OpenShift,
org
and
sign
up
for
the
the
container
storage
workshop
and
we'll
happily
dive
in
and
answer
any
of
your
questions
at
that
event
as
well,
so
that
that
is
the
container
storage
for
admins
workshop
on
March
30th.
So
space
is
limited.