From YouTube: OCB: Geographically Distributed CockroachDB with OpenShift - Keith McClellan (Cockroach Labs)
Description
As your OpenShift practice matures, it is likely that you will be asked to support stateful workloads. Multicluster deployment of stateful workloads can become complex, especially when considering disaster recovery strategies.
In this briefing, Raffaele Spazzoli (Red Hat) and Keith McClellan (Cockroach Labs) will discuss how to deploy CockroachDB on OpenShift across three AWS regions to achieve a zero downtime, zero data loss disaster recovery strategy.
Keith: My name is Keith McClellan. I'm on the solutions engineering team at Cockroach Labs, so I help our customers implement solutions like the one we're going to demonstrate here today in their own environments. I'm excited; Raffaele and I have been working hard on this demo for quite a while and we've used it a number of times, so I'm glad we get to show it off to a broader audience. Thank you for inviting me.
Raffaele: So I started about two years ago thinking about how we can manage disasters for stateful workloads, and my line of thought was: okay, as a community we have figured out stateless for OpenShift and Kubernetes in general; it's time to think about stateful workloads. Obviously, stateful workloads bring more problems. In particular, they bring state, and they need to sync state across instances, so obviously there is storage involved.
Raffaele: But today we are narrowly focused on disaster recovery, and one of the things I was trying to do when thinking about this was to define a new approach to disaster recovery, which I call cloud-native disaster recovery. Here is how I define it and how it differs from traditional disaster recovery. In traditional disaster recovery, there is usually a human who decides when a disaster has occurred, so a human triggers some disaster recovery procedure.
Raffaele: It's not that the situation is not detected by the system. But in cloud native we cannot wait for a human; we need faster reaction times, so the trigger has to be autonomous: the system has to identify the situation and react.
Raffaele: When you have a human reacting to a disaster, what typically happens is that a long time passes before you realize it.
Raffaele: One or two hours, that's what I see at my customers, and that is just to start the recovery procedure. Then the recovery procedure itself, in traditional disaster recovery, is usually a mix of automation and human actions. If you're good, you probably have it all automated; if you're not very good, you probably have a lot of human actions. In cloud native, it has to be all automated.
Raffaele: [The recovery point objective, RPO, measures] the length of time of transactions that you have missed. So the first one, RTO, essentially measures availability; the second one, RPO, essentially measures the consistency of your data. In traditional disaster recovery you can have a fast RTO, minutes, but it can go up to hours, and we have seen why: one of the reasons is the human component, in the detection and in the recovery procedure. In cloud native, we want near-zero RTO.
D
So
so
it
could
be
theoretically
close
to
zero,
but
there
are
some
things
like
load,
balancers
and
l
checks
that
need
to
react
to
the
new
situation
and
start
diverting
traffic.
So
we
have
a
near
zero
in
in
the
order
of
magnitude
of
six
seconds
outage
and
then
for
recovery
point
objective
we
have
in
in
traditional
disaster
recovery.
It
could
be
between
zero
and
hours,
depending
on
how
you
sync
the
state,
but
in
cloud
native
disaster
recovery
we
want
it
to
be
exactly
zero
and
then,
when
it
comes
to
ownership
of
the
process.
Raffaele: What I usually see is that ownership formally sits with the application team, which is supposed to design a disaster recovery process. But what the application team usually does is turn to the storage team and ask: what SLA can you give me? And that becomes their SLA for disaster recovery, so basically they rely completely on the storage team. In cloud native, it's going to be all on the application team to find the right kind of middleware or software that can deal with a disaster.
Raffaele: In cloud native there is really no single storage team anymore, especially if you have a hybrid cloud. There is AWS storage, Google storage, maybe your internal storage, but there is no single team you can go to and ask for an SLA. And then, from a technical capability standpoint, there is another interesting difference.
Raffaele: In traditional disaster recovery, we usually build these recovery procedures using capabilities that come from storage and storage products: backups, volume sync, that kind of capability. For cloud native instead, and this was an interesting finding for me, the capabilities that we need come from the networking space. In particular, we need the ability to communicate east-west between these geographies, so that all the instances of our workload can find each other, and we need a good global load balancer.
Keith: You know, if you don't mind me jumping in here, I'd like to talk a little bit more about why all of this is important from a cloud-native perspective. I've been dealing with these types of problems for a long time, and the reality is that we are becoming more and more abstracted away from the infrastructure.
Keith: As you mentioned, with hybrid workloads we're potentially running across on-premise data centers and cloud data centers, or even moving towards full cloud deployments. In a lot of cases, what we're seeing is that silent disasters happen a lot more frequently: we can't rely on our own processes to guarantee that we don't have a data center outage, an availability zone outage, or a network partition. And because some of these scenarios become more likely as we move to a cloud-native ecosystem, we need to start treating them like any other issue that comes up on any given day. Fundamentally, I think that's something we're going to show as part of the demo later today.
Raffaele: Okay. So we have prepared a demo for you to see this in action. Let me talk a little bit about the infrastructure that we set up for this demo. We have three OpenShift clusters in three AWS regions: two in the eastern United States, in North America I should say, and one in the west. On these OpenShift clusters we have...
Raffaele: So we used that tool to bring up these three clusters, and then what we did was deploy a tool called Submariner. Submariner helps you establish a tunnel between the OpenShift SDNs; the SDN is the software-defined network that is established inside an OpenShift cluster for the pods to run on. With this tunnel between the SDNs, we are now able to open a connection from one pod running in one cluster to another pod running in another cluster, without having to egress and ingress.
Raffaele: Another thing that we did was deploy Vault for the secrets and certificate distribution we needed.
Raffaele: This is all preparation that we needed in order to deploy CockroachDB. The last piece is a global load balancer. Since we are on AWS, we are using Route 53, which is a DNS, but a very powerful DNS, and we have a global load balancer operator, here on the right, running on the administration cluster; it observes the other clusters from the control cluster and automatically programs Route 53.
Keith: So I have a couple of questions for you, Raffaele, specifically on the infrastructure setup; I get the opportunity to pick your brain a little bit, and I think that's fun. Obviously I'm a fan of Submariner, but how exactly is Submariner different from some of the other ways we could peer the networks between these different OpenShift clusters?
Raffaele: It creates a tunnel between the SDNs, like I said, and it's a very efficient tunnel, because IPsec is an established technology and it encapsulates layer 3 on top of UDP, which is layer 4.
Raffaele: We have seen other solutions that use a higher level of encapsulation, so slightly less efficient. And I should also say that this is one of those problems where we need to keep the latency as short as possible to enable this distributed workload to be efficient, and so we need to solve the problem as close as possible to the network.
Raffaele: As close as possible to the physical network space, and Submariner does a good job there. There is also an upcoming way of running Submariner that will make it even more efficient, which is going to use WireGuard, as opposed to IPsec, to establish the tunnel; WireGuard is a more lightweight protocol.
Raffaele: No, I think that's it, that and the fact that it's going to be deployed by your administrator and it's going to serve the entire cluster, not just individual namespaces. It's a piece of the infrastructure: once it's there, it almost disappears, and it just works.
Keith: So I personally haven't used Advanced Cluster Manager before. Can you talk me through why you chose to use it to orchestrate these Kubernetes clusters? And I'm curious to know whether that administrative cluster is running all the time, or whether it's something more ephemeral in nature.
Raffaele: So on this page here I can manage my clusters; it's a single pane of glass for my whole fleet of clusters. Customers are starting to have tons and tons of clusters at this point, so this makes for an easy entry point to manage all of them. There are some administrative capabilities you can use from here: for example, I can upgrade all of them at once.
Raffaele: I can set up monitoring capabilities where all the metrics are collected in a single spot, and I can even deploy applications through RHACM and spread them across multiple clusters, or enforce policies. In this particular demo we just used RHACM to spin up the clusters where our workloads are going to run.

Keith: Got it, makes sense.
Keith: So it's safe to say, then, that while we have this administrative cluster and we use it for administering the distributed multi-cluster configuration here, it's not a single point of failure: if the administrative cluster were to go down because of a failure, the infrastructure it has already provisioned is independent of it.
Keith: So, as you mentioned, CockroachDB does mTLS between our pods; we're going to talk about that in the next few minutes. But to get a single certificate authority across all three clusters, you chose Vault, which is great; it's the same thing that I would recommend to customers going into production. What specifically did you do to make Vault work as a single CA across all three of these clusters?
Raffaele: Right. So first of all, I think it's important to discuss why I decided to deploy Vault this way, because you could say: well, I need a CA, a common CA, across these three clusters, but the CA could be running anywhere. Why run it in the clusters, and across the clusters? Here is the reasoning: with these three clusters we are trying to build the most available infrastructure in our data centers, distributed across multiple geographies.
Raffaele: So in our idea, I should say, it's going to be the most available thing that we have. And a CA, a PKI, a cert and secret management tool, is one of those pieces of infrastructure that sits in the critical path for applications. It used to be something that needed to be available maybe only at boot time.
Raffaele: But now, if it's not available, things stop working, so it needs to be available all the time, and it should therefore also benefit from the most available infrastructure that we're building. So I was looking for a way to have a PKI slash cert and secret management that never goes down, and Vault can do that, because Vault supports Raft as a storage protocol, which is the same thing that Cockroach has, in terms of syncing the state and managing availability.
Raffaele: [In answer to a question about running this outside AWS:] In case you have data centers across multiple on-premise locations, across multiple geographies, I recommend you build a global load balancer with a DNS. You could use something like an F5 BIG-IP as your DNS, which has kind of the same capabilities as Route 53; you still need the health checks, which is the thing that we need here.
Raffaele: Now, on to CockroachDB. Keith, would you like to describe it, or should I?
Keith: No, absolutely, I will. So CockroachDB is a distributed SQL database that is cloud native. The vast majority of our installs run in Kubernetes or OpenShift, and we run our database-as-a-service product on Kubernetes. Fundamentally, we function a lot like the other technologies we've already been talking about: Raffaele mentioned Vault and how it uses Raft to do consensus-based replication across sites, and that's the same way that etcd in the Kubernetes ecosystem replicates the state that Kubernetes is supposed to maintain across different...
Keith: ...pod hosts and whatnot. CockroachDB also implements the Raft protocol for doing consensus-based replication of our data layer. Under the covers we use a KV store; it used to be RocksDB, if you've heard of that, and we've since re-implemented a KV store under the covers that we call Pebble, which was more purpose-built for what we were trying to do. It's a single-binary deployment, written almost completely in Go. On the front end of that...
Keith: What we're creating is a mesh where every single node has the authority to act on some portion of the data in the database, is a follower for some other portion, and then potentially is not involved in some third portion of the data. So every node is active as the leader for some portion of the data in the database, and we create a global logical cluster that allows you to talk to any given node; we will route your queries to wherever the authority lives. There's a lot of great stuff here.
Keith: But one of the prerequisites, as Raffaele mentioned, is that those nodes talk to each other over mTLS; those are encrypted communications. So CockroachDB is going to communicate with, in this case, cert-manager to get certificates to enable that encrypted communication, and the same goes for our back-end communications.
Keith: We require that all the nodes be able to route to all the other nodes. This allows us to do things like deal with losing a pod or a site. To do that, all the nodes have to be able to talk to all the other nodes, so we're using Submariner here to allow all the pods to talk to each other across the sites, so that they can act as a single global database cluster.
Keith: I've been at Cockroach Labs for about two years now, and it is easily the easiest database, particularly OLTP database, that I've ever had the privilege of supporting. One of the great things about designing the database to be cloud native from the very beginning is that a lot of the operational challenges that you would have with a traditional OLTP system, particularly if you were trying to run it in Kubernetes, we simply don't have. We could talk about how we manage data replication...
Keith: We talked about query performance; there are a lot of great topics we could go into, but I'll pause there as the high-level description of the database.
Raffaele: So you said it's a SQL database. As a developer, let's say I already have an application running on a SQL database and I want to start using CockroachDB. I can probably reuse my SQL skills, because it should feel the same, but is there anything that changes, or anything you want to highlight?
Keith: Yeah. So we've implemented the Postgres wire protocol, so you can connect to us using Postgres drivers, and in a lot of cases you can use your existing Postgres tooling to interact with us; there are some CockroachDB variants of the ORMs that are out there. The one thing that you need to know, and this is true for any distributed system, is that the data has a location attached to it.
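Editor's sketch: because CockroachDB speaks the Postgres wire protocol, application code connects to it exactly as it would to Postgres. A minimal illustration follows; the host name, database, user, and certificate path are hypothetical (in this demo the host would be the Route 53 name in front of the three clusters, with the CA certificate distributed through Vault).

```python
def build_dsn(host, port, dbname, user, sslrootcert):
    """Assemble a libpq-style connection URL; any Postgres driver
    (psycopg2, JDBC, pgx, ...) accepts this format for CockroachDB."""
    return (f"postgresql://{user}@{host}:{port}/{dbname}"
            f"?sslmode=verify-full&sslrootcert={sslrootcert}")

# Hypothetical values: 26257 is CockroachDB's default SQL port.
dsn = build_dsn("cockroachdb.example.com", 26257, "tpcc",
                "app_user", "/certs/ca.crt")
print(dsn)
# With a Postgres driver installed, connecting is simply e.g.:
#   conn = psycopg2.connect(dsn)
```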
Keith: It may have an intrinsic location, like an address does, or it may not, but then we need to consider where it's going to be accessed from and what that access pattern looks like. So the one piece that you have to add to your DBA bag of tricks, when you're moving towards a distributed SQL environment, is thinking about how we want to distribute this data across the cluster and, inversely, how we're going to get it back out.
Keith: If you ever take a data modeling class in college, they'll talk about the physical data model as opposed to the logical data model. When you move towards a distributed system, you have to think a lot more about the physical data model. In CockroachDB we make this super easy: we have a couple of what are called DDL extensions, so basically, when we define the table, we define how we want to distribute the data across that table, or that set of tables. By default, we're going to...
Keith: ...do something called follow-the-workload, which is where we basically move the authority to act for any particular segment of the data to wherever it's most likely to be used from. But we also have the concepts of regional tables and global tables.
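Editor's sketch of what those DDL extensions look like in recent multi-region CockroachDB releases. The database, table, and region names below are invented, and exact syntax may vary by version; each statement is plain SQL that any Postgres driver could execute.

```python
# Hypothetical multi-region DDL for CockroachDB (names invented).
statements = [
    # Declare the regions the database spans.
    'ALTER DATABASE tpcc SET PRIMARY REGION "us-east-1"',
    'ALTER DATABASE tpcc ADD REGION "us-east-2"',
    'ALTER DATABASE tpcc ADD REGION "us-west-2"',
    # Regional-by-row: each row is homed in the region it is used from.
    'ALTER TABLE warehouse SET LOCALITY REGIONAL BY ROW',
    # Global: read-mostly reference data, fast reads from every region.
    'ALTER TABLE item SET LOCALITY GLOBAL',
]
for stmt in statements:
    print(stmt + ";")
```

Regional tables trade slower cross-region writes for fast local access; global tables trade slower writes for fast reads everywhere, which is the tradeoff discussed next.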
Keith: All of these things have different tradeoffs on read and write performance, and they also affect what types of scenarios we're going to survive without impact to users. One of the big philosophical things that we talk about is designing to survive, as opposed to designing to fail. Traditional DR means designing a system that can pick up when your primary system fails; that's why you have two-site solutions.
Keith: You have failovers, and you have backups, and all of that. When we're designing to survive, we're going to have three or more sites, because if we lose an entire site, as we're about to demonstrate, we want the system to continue to operate and function as if it were any other day.
Keith: So the data center, if you will, is the new rack. If you've ever set up a distributed system in a physical data center and you wanted to make sure that it survived, say, a PDU failure or a top-of-rack switch failure, something along those lines, you didn't want your application to go down.
Keith: In that scenario, we're now treating the data center as the new abstraction layer that needs to be survivable without any noticeable outage. I hope that makes sense.
Raffaele: Yeah, it does, and I think this is a perfect segue to my next question. I see customers now that are considering migrating their SQL farms, it could be any product, any database, but they want to migrate their SQL farms to OpenShift. They may have maybe 1,000 instances of databases running on VMs, which are essentially treated like pets, with a team of DBAs that tend to and care for these pets. And what I feel is...
Raffaele: I feel there is a risk that we're going to migrate these databases inside OpenShift, and they can certainly run there, but we're still going to treat them like pets. Instead, the philosophy of Kubernetes and OpenShift is to treat everything as cattle: things that can die and will respawn somewhere else.
Raffaele: So here is the question for you, and sorry, just to conclude the thought: this can be difficult for stateful workloads, obviously much more difficult than for stateless ones. How does Cockroach help in that space?
Keith: So fundamentally, state is what makes things special from an IT perspective. If you remove the state from almost any system, you can probably genericize it pretty easily. We have fundamentally taken the same approach to this problem as Vault has and as etcd does, in that we make sure that all of our data lives in more than one place; that's the guarantee.
Keith: So then, rather than having a single point of failure, we have effectively configurable availability while guaranteeing existence. One of the things that I didn't talk about yet is the replication factor: by default, everything that gets written to CockroachDB gets written to at least three places, and that goes...
Keith: ...up from there. There are scenarios where there might be five, there might be seven; you can even theoretically go higher than that, although mathematically, if you've lost 51 percent of your replicas and you had seven of them, you probably have bigger problems than the database. The intent is that you say: hey...
Keith: ...these are the everyday occurrences that I want to survive. In AWS it might be an availability zone outage, which happens in AWS a couple of times a year. Maybe it's a region failure, which happens once every three years or so; you want to make sure you're surviving those. In some cases it might be a full cloud outage.
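Editor's sketch: the replica arithmetic behind those choices is Raft's majority rule. A range keeps serving as long as a strict majority of its replicas survive, so n replicas tolerate floor((n-1)/2) failures:

```python
def failures_tolerated(replicas: int) -> int:
    """Consensus replication needs a strict majority of replicas to make
    progress, so n replicas tolerate floor((n-1)/2) failures."""
    return (replicas - 1) // 2

for n in (3, 5, 7):
    print(f"{n} replicas tolerate {failures_tolerated(n)} failures")

# Losing a majority (the "51 percent" above) halts writes regardless of n,
# which is why higher replica counts buy availability only up to a point.
assert failures_tolerated(3) == 1
assert failures_tolerated(5) == 2
assert failures_tolerated(7) == 3
```

This is why the default of three replicas survives one site loss, and why surviving a full region outage means spreading replicas across at least three sites.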
Keith: We've had scenarios where cloud providers have had cascading problems, or problems caused by user error, because almost all disasters are caused by user error at the end of the day, where we've lost multiple regions, in AWS and in Azure and GCP. So you may actually want to spread your workload across multiple clouds. This is very much...
Keith: ...in the neighborhood of the hybrid workload: I've got two physical data centers, and my third data center is in Google, or it's in AWS. Or maybe I only want to maintain one data center now, and I want my other two data centers to be in two different cloud providers, because I don't want to put all my eggs in one basket. So fundamentally, we use Raft to do this, and we have some enhancements to Raft that make it function for SQL databases.
Keith: If you ever go and read our "life of a distributed transaction" documentation, or any of the blog posts about how we guarantee serializable isolation across transactions, when you may have two transactions that come into two completely different nodes in two completely different data centers, we have a lot of very interesting writing on that topic that I won't go into today.
Audience member: Hey Keith, can I ask a couple of quick questions? You mentioned that most of the time we design for failure and not for survival. Can you explain what exactly you mean by that? And can you also help us understand what CockroachDB really is? How does it differ from, say, Redis or Infinispan?
Keith: Yeah, so I'll answer the second question first, because it's pretty short. Those other databases are NoSQL databases. If you look at the CAP theorem, as soon as you have to manage for network partitions, you can either guarantee consistency or you can guarantee availability. Generally speaking, NoSQL databases lean towards availability, and so they don't guarantee consistency.
Keith: In all cases... I'm not going to go into the specific databases, because the nuances there get really specific. We're a CP database that can increase its availability by increasing the replica count, because we're using consensus-based replication. Fundamentally, that makes us more valuable for system-of-record-like workloads.
Keith: So things like inventory management and financial transactions; we're used by a number of large financial institutions in the United States and Europe, for example, because we can run in a cloud-native environment like this and have extremely high availability as well as guaranteed consistency for transactions.
Keith: Things like Redis and Cassandra and MongoDB are much better at workloads that are kind of write-once-read-many. That's a broad generalization; I know there are people on the call who could give me specific examples where something like Cassandra or MongoDB would be a better fit for a problem than CockroachDB, and as always: use the right tool for the job. We are specifically very focused on transactional workloads that require guaranteed consistency.
Keith: If you look at ACID, we're a fully ACID-compliant database, and all of our transactions are serializably isolated. Which brings me to your earlier question, designing to survive versus designing to fail: in my mind, this is the difference between high availability and fast recovery. When people put together a disaster recovery plan, they're expecting that things are bad enough that they're willing to accept that things aren't going to operate as they normally would.
Keith: The challenge is that we've moved to these newer cloud-native technologies. Running in the cloud is just us running on other people's computers; we have less control, so it's more likely that something outside of anything we've done could cause one of these failure events to happen. So what we want to do is be able to treat them as high-availability events.
Keith: Philosophically, it's basically like coming at it from a glass-half-full versus a glass-half-empty perspective. By going at it saying: hey, I need to be able to continue to operate if I lose an entire region of AWS, and I'm going to design a system to solve for that; then, if a region in AWS fails, I shouldn't need to get a page in the middle of the night and get up to fix my systems.
C
I
should
be
able
to
kind
of
kind
of
deal
with
it
in
the
normal
order
of
of
things
rather
than
treating
it
like
a
disaster
and-
and
if
you
soon,
as
you
start
to
look
at
it
as
I
want
to
be
able
to
continue
to
operate
as
normal
during
these
scenarios
versus,
I
need
to
be
able
to
get
back
up
and
running
at
some
point
in
the
future.
If
something
like
this
happens,
that's
all
of
a
sudden
you're
designing
to
survive
as
opposed
to
designing
to
fail.
Hopefully,
that
answered
your
question.
Raffaele: Can I add a consideration on eventual consistency, just building on this conversation?
Raffaele: When I started looking at these architectures, I could have chosen an eventually consistent database, a database that chooses to be available rather than consistent in the event of a network partition. If that had been the case, we would probably see only two OpenShift clusters in this picture, because in that case you just need two to continue working.
Raffaele: Eventual consistency means that when the network partition goes away and all the instances can talk again, they will converge to a state, but there is no guarantee that that state is the state that is logically correct for your business problem, and so it's very hard. I didn't like that situation: as a developer, I don't want to think about that situation and how my code would have to handle it.
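Editor's sketch of that concern, in purely invented Python (not how any particular database resolves conflicts): under eventual consistency with last-writer-wins, replicas that accepted conflicting writes during a partition do converge, but the converged value can be logically wrong for the business.

```python
# Toy last-writer-wins register: state is a (timestamp, value) pair.
def merge(a, b):
    """Converge two replica states by keeping the later write."""
    return a if a[0] >= b[0] else b

# During a partition, both sides accept a withdrawal against balance 100.
replica_east = (2, 100 - 70)   # t=2: withdraw 70 -> balance 30
replica_west = (3, 100 - 80)   # t=3: withdraw 80 -> balance 20

# After the partition heals, both replicas converge to the same state...
converged = merge(replica_east, replica_west)
assert merge(replica_west, replica_east) == converged

# ...but the t=2 withdrawal is silently lost: correct business logic
# would have refused the second withdrawal rather than keep balance 20.
print(converged)  # (3, 20)
```

A consistency-first (CP) system avoids this by refusing writes on the minority side of the partition, which is exactly the tradeoff discussed above.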
Keith: So you have to think about all of those potential outcomes that you just mentioned, Raffaele, in your application, and you need to handle them at the app layer. There are good, valid reasons why you might need to do that in certain scenarios, but in a lot of auditable, audit-type situations, particularly the ones you mentioned, financial management and inventory tracking, where correctness is of the utmost importance...
Keith: ...that risk is unacceptable, at least in my opinion, which is why I'm at Cockroach Labs and not currently working at a NoSQL database vendor.
Keith: Sorry, yes, it is application data. And this is a great transition to the next slide, actually. So there's an industry-standard OLTP benchmark called TPC-C; it's been around since the 90s. It simulates literal warehouses and how packages, or things, might flow into or out of those warehouses, as well as a kind of point-of-sale system where those products are getting manufactured and then shipped out to customers. It's very much a transactional use case.
Keith: It's a good generic benchmark because it's one of the benchmarks that's most available for SQL databases; there are published results and guidance on how to run this benchmark on pretty much every SQL database I've ever seen, going back to about 1996.
Keith: So it gives you a good, wide swath of what that looks like. What we're going to show here today is what happens when one of these sites goes away while we're running TPC-C against CockroachDB.
Raffaele: Right. And on the infrastructure, I should say one thing: what you said before is very correct. Today many customers, many enterprises, are considering building a multi-cloud solution, where they deploy on different clouds. There is nothing in this demo that cannot be deployed across multiple clouds; it's just that the account I have is only on AWS, and so we are using AWS only for that reason. But you could do this across multiple clouds.
Raffaele: So here we have the CockroachDB console. We can see on this nice map where the data centers are and where the CockroachDB nodes are. We have nine nodes, three, three, and three of course, and we have some ranges. These are the data spaces that Cockroach manages, and sorry if I'm not using the right word here, but essentially these are the partitions and the replicas that are being managed by Cockroach.
Raffaele: So we have this TPC-C workload, and this TPC-C workload, as Keith said, is generating a bunch of OLTP transactions, so they're typically fast inserts, fast updates, or...
Keith: Yeah, exactly. The majority of the workload is going to be individual item updates: as a particular widget moves around a warehouse, or a set of warehouses, that record is going to get updated. And then a portion, I think it's six percent, although don't quote me on that, maybe I shouldn't have said that on a broadcast forum, are aggregate queries that look at the current state of the inventory for that warehouse.
Raffaele: Okay. And as you may have seen, one of the databases was orange; that's what happens when a node goes down. I don't know, it must have been just a little glitch, but everything is up now. I want to show you that we are generating load: you see, these little processes are generating load. They are pods running inside the cluster, so they are near the database.
Raffaele: This is simulating traffic coming from different sources, and we direct a portion of the traffic to the database that is close to the source, so the traffic stays local; they are generating traffic on the local cluster. Obviously, Cockroach will dynamically spread the data where it needs to. I'm just redirecting the output here, and we can see all the transactions that are being generated; and if we go to the metrics, we should see that we have some queries.
Raffaele: So now, what we're going to do to simulate a disaster is take down one of the regions. I am going to take down the west region, and the way I'm going to do it is by completely isolating the VPC in which OpenShift is running, so nothing can go out and nothing can come in. This is the perfect disaster simulation, because it's a network partition: you don't know, you're sending a packet, but nothing answers.
D
So you don't know if the packet has been received or not. It's way more difficult to manage than sending a packet and receiving a response that says there is an error, right. So that's exactly what happens when there is a disaster.
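Why isolating one of the three regions doesn't take the database down comes down to quorum: with three replicas per range, one per region, Raft only needs a majority (two of three) reachable to keep serving. A minimal sketch of that rule, with invented region names:

```python
# Sketch of Raft-style majority availability under a region outage.
# Illustrative only; region names are invented.

def range_available(replica_regions, down_regions):
    """A range stays available if a majority of its replicas are reachable."""
    up = [r for r in replica_regions if r not in down_regions]
    return len(up) > len(replica_regions) // 2

replicas = ["east", "central", "west"]  # one replica per region
```

Losing one region leaves 2 of 3 replicas up, so every range stays available; losing two regions drops below majority, which is why the demo only partitions one.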
D
Oh wow, like I said, there can be some glitches when this happens. Remember, this is our CockroachDB console, so the traffic that comes from my browser is load-balanced by the global load balancer that I was describing before, so it could go to any of the regions. So maybe we took down the pod that was serving this.
D
This console, in case you may want to describe it. But as you can see, after a few seconds we were able to connect again, and the console is already aware: you see that three nodes of the nine are suspected of having a problem.
C
Yeah, so what happened there was: because each node has all of the services of every node in the cluster, the load balancer was originally routing you to one of the pods that we just segregated from the network. So as soon as the load balancer realized that was happening, it routed around to the pods that were still available. That's what we would expect in this type of scenario.
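The load balancer behavior Keith describes can be sketched very simply: route to any backend whose health check passes, and skip the ones that went dark. This is a toy model, not the actual global load balancer configuration from the demo; backend names are invented.

```python
# Toy sketch of health-check-based routing: return the first backend
# currently marked healthy. Names are invented for illustration.

def pick_backend(backends, healthy):
    """Pick the first healthy backend, or None if everything is down."""
    for b in backends:
        if healthy.get(b, False):
            return b
    return None

backends = ["west-pod", "east-pod", "central-pod"]
```

When the west region is partitioned away, its health checks fail and traffic simply lands on the remaining regions, which is the few-second "glitch" seen in the browser.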
C
Right, there's, you know, a couple-of-seconds service glitch for certain operations, but queries that had come into nodes that weren't impacted by this will continue to operate, and we'll be able to continue to process queries against the database.
D
And in fact, I want to show you that we are still processing: see, the metrics did not go down, and our client number one is still working, although you see it had some glitches. So this client didn't have a connection problem, but CockroachDB was adjusting itself, and there were two errors, which this particular client manages with retries, and that's a best practice that developers should also follow in their code.
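The retry best practice mentioned here can be sketched as a small loop: CockroachDB signals retryable transaction conflicts with SQLSTATE 40001, and a client should retry those with backoff rather than surface them as failures. This is a minimal illustration; `run_txn` stands in for real database code, and the error class is a placeholder for whatever your driver raises.

```python
import time

# Sketch of client-side retries for transient/retryable transaction errors.
# RetryableError is a stand-in for a driver exception carrying SQLSTATE 40001.

class RetryableError(Exception):
    sqlstate = "40001"

def with_retries(run_txn, max_attempts=5, base_delay=0.01):
    """Run a transaction, retrying retryable errors with exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return run_txn()
        except RetryableError:
            if attempt == max_attempts:
                raise  # give up after max_attempts
            time.sleep(base_delay * (2 ** (attempt - 1)))
```

A client written this way rides out the brief window where the cluster is rebalancing after a region loss, which is exactly what clients one and two did in the demo.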
D
But you see, the client didn't break and continued to work. Same thing for the second client; it only got one error in this case. And obviously the third client died, because, well, we severed the connection, even to the tail of the log here, right. Okay, so what we have done so far: we have simulated the disaster, and we have demonstrated that we didn't have to do anything; the system reacted by itself and continued to work.
D
Now we are going to restore connectivity, and we are going to show that, again, we don't have to do anything, and the system resumes working with all of the capacity that is available. Because another problem with disaster recovery failures is that, you know, usually you have a disaster recovery procedure to recover from a disaster, but when the system that was down comes back up, it's usually just as painful to restore the workload to where it usually was; it's the same kind of process.
C
I was just going to say: right now those three nodes are still listed as suspect. We don't evict them from the cluster until, I think it's five minutes; then we assume that they're dead. The only difference in recovery, if we were to wait for five minutes, is just which path we take for re-replicating the data to the nodes as they come back. Under five minutes, we assume that the nodes aren't that far behind and we can get them caught up using the Raft logs. After five minutes,
C
We assume that they're too far behind and we're going to re-replicate the ranges there, which is a slightly more expensive operation. But still, both of those paths are invisible to the application, aside from a slightly different performance impact after we bring those instances back online.
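The two recovery paths Keith describes reduce to a single threshold decision: a node back within the roughly five-minute suspect window is caught up from the Raft log, while a node out longer is assumed too far behind and its ranges are re-replicated via snapshot, the more expensive path. A sketch of that decision, with the window length taken from Keith's "I think it's five minutes":

```python
# Sketch of the recovery-path choice for a rejoining node. The exact
# window is configurable in a real cluster; 5 minutes follows the talk.

SUSPECT_WINDOW_SECONDS = 5 * 60

def recovery_path(downtime_seconds):
    """Cheap Raft-log catch-up inside the window, snapshot rebuild after it."""
    if downtime_seconds <= SUSPECT_WINDOW_SECONDS:
        return "raft-log-catch-up"
    return "snapshot-re-replication"
```

Either way the choice is internal to the database; the application only sees a difference in background replication load, not in availability.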
A
Okay, so what is the pitch? While Raffaele goes and saves the world, maybe I can ask another question. So what is the pitch, like: my customers are using Oracle Database, and they're moving applications to OpenShift, and the applications connect to this Oracle database, which is outside OpenShift. So are we saying, instead of that, use CockroachDB now?
C
Well, this will allow you to move the database into OpenShift as well. So, one of the things, as a recovering operator, right, like I used to run systems like this in production: one of the things that is really frustrating is when you have to treat something as special. So right now, and that's what you're talking about, the infrastructure for Oracle is special, the tooling for Oracle is special.
C
If you have an Oracle disaster, you have a completely different runbook for resolving that disaster than if your application fails. By using CockroachDB and moving that into OpenShift, all of a sudden you're handling a database failure just like you were handling an HAProxy failure or an app tier failure, right.
C
It drastically reduces the scope of the types of disasters that you might have to manage, the types of availability events you might have to manage. On top of that, getting all the great self-healing capabilities that Raffaele is showing here today, just by reducing the administrative burden of having to understand multiple different ways that multiple different applications in your stack are running, can drastically reduce how difficult it is to do that work, right. So there's a ton of other things.
C
We do use StatefulSets in Kubernetes, so those StatefulSets are presenting a file system to the database. For sure, we use a KV engine to act as our storage layer, so you have a decent amount of flexibility there. You generally use something like a persistent volume claim to get a persistent volume from whatever storage layer happens to be available to you in your various OpenShift clusters.
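Concretely, a StatefulSet requests its storage through a `volumeClaimTemplate`, so each database pod gets its own persistent volume from whatever storage class the cluster offers. Here is a sketch of such a template, expressed as a Python dict for illustration; the claim name, size, and omitted storage class are assumptions, not values taken from this demo.

```python
# Sketch of a StatefulSet volumeClaimTemplate for a database pod,
# as a Python dict. Field names follow the Kubernetes API; the claim
# name and size are invented for illustration.

volume_claim_template = {
    "metadata": {"name": "datadir"},
    "spec": {
        # ReadWriteOnce: each pod mounts its own dedicated volume.
        "accessModes": ["ReadWriteOnce"],
        "resources": {"requests": {"storage": "100Gi"}},
        # storageClassName omitted: the cluster default is used
        # (an EBS-backed class when running on AWS, as in this demo).
    },
}
```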
C
Here, because we're in Amazon, we're using EBS volumes to access the backing store.
A
Yeah, before Karina kicks me out, just one last question. So when this data center came back up, you know, it had to replicate all the ranges which had issues. So while that is happening, if at that point in time a request comes in to this OpenShift cluster that just came back up, what happens to that request? Is the CockroachDB database locked at that point in time, so it cannot handle it?
C
Every node in CockroachDB is a gateway to the entire cluster. So as soon as those nodes had connectivity to the rest of the cluster again, they could act as what we call a query coordinator. They aren't necessarily the query responders; they're not the ones doing the work on the data, but they can still act as a client gateway immediately. So you don't get a scenario where your database is locked up while we're re-replicating, or any of that kind of stuff.
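The gateway/coordinator split can be sketched in a few lines: any node accepts the query, then forwards the actual work to whichever node holds the lease for the data, so a freshly rejoined node is immediately useful as a gateway even before its own replicas are caught up. Node and key names are invented for illustration.

```python
# Toy sketch of the gateway/coordinator idea: the node a client connects
# to coordinates the query; the leaseholder for the key does the work.
# All names are invented.

def execute(gateway, query_key, leaseholders):
    """Return which node coordinated and which node executed the query."""
    worker = leaseholders[query_key]
    return {"gateway": gateway, "executed_on": worker}

leaseholders = {"k1": "east-node2", "k2": "west-node1"}
```

So a request hitting a just-recovered west node for data whose lease currently lives in the east still succeeds: the west node coordinates, the east node answers.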
D
The first thing that needed to heal was the network tunnel, right: Submariner needed to heal and reestablish all of those connections, and then CockroachDB, see, it's healing right now: it's re-replicating the ranges, and then every node will be back at full capacity and serving traffic.
D
Here we go, now it did it. So again, I think the point to take home here, besides the inner workings of CockroachDB, is that as an administrator I didn't have to do anything, right. It managed the disaster, it reacted to the disaster, and also, when we fixed the disaster, it recovered to full capacity all by itself.
D
Okay, I just want to add, before we close: this demo is completely scripted, and anyone should be able to reproduce it if you are interested, and everything is here.
C
We also have an awesome two-part blog post that Raffaele and I co-authored, walking through exactly all the underlying steps we did here; we can share that link as well.
B
So, let's not do that just yet! Please, Raffaele, could you put the links to the blog posts in the references? And then we will post the link to this. All right, post it out.
B
We're a bit over time, so I want to make sure that, as we wrap up, you can ping Raffaele offline too, right.