From YouTube: CNCF SIG Storage 2021-03-10
A: All right, we'll just wait a minute or two, and then we can move into the DR discussion.
B: No, you should be able to set it to anyone with the link, view only, and then after this presentation it's probably best to keep it as anyone with the link, view only.
A: Yes. Raffaele, if it's sort of on your internal side and you can't share it, I can make a copy and share that instead, if you want.
C: I could probably do that too, but yeah, it's basically what Alex is doing. So after today I'm going to use his copy if there are any updates. Cool.
A: All righty then, I think it's five past, so we should probably start. I'll hand over to Raffaele, who has put a few slides together to summarize the documents we've been working on, to see if it makes it easier to get some of the concepts across and maybe gather some feedback. Over to you, Raffaele.
C: Thank you. Yes, as you know, there is a much more detailed document that we are considering, maybe, to publish, where my intention (and hopefully it becomes the group's intention) is to talk about a cloud-native approach to disaster recovery.
C: I created this presentation just to make it easier to communicate, but that document has a ton more detail. The idea is that you can still use all of your traditional DR approaches, but we think there is maybe a new way to do things with cloud native, and so we're going to talk about that. Again, it's not to say that the other approaches cannot be used.
C: Typically those are the active-passive approaches. We're going to try to give a definition of what cloud-native disaster recovery might mean, and this is my slide comparing the two approaches, so I'm going to go through it line by line. As far as disaster detection and the DR procedure go: traditionally there is a human decision.
C: Something goes wrong and somebody triggers the DR procedure. Maybe the DR procedure itself is automated, but very often there is a human decision. What we want to get to with cloud native, what we define as cloud-native disaster detection, is that it's going to be autonomous: the system automatically understands the disaster and triggers whatever reaction needs to be triggered.
C: As for the recovery time objective: it cannot be exactly zero, because there are caches and load balancers that need to switch, but it can be very close to zero, whereas in modern traditional data centers we normally see it being around minutes to hours. The recovery point objective, that is, how much data I lose or how much inconsistency I have between copies of my data, in traditional disaster recovery I see being between zero and hours, depending on how we do the sync or the backup and restore. In cloud native...
D: When you say zero, or almost zero seconds, is that because the assumption is that it's the same cloud, or the same region?
D: When you do DR, not across far regions or across clouds?
C: No, that's not the assumption. The assumption is we have geographically distributed workloads, possibly across different clouds, and we still get near zero.
D: Okay, interesting. Thank you.
C: And going back to ownership: in cloud native it's an application responsibility.
C: The other observation I made is that, in terms of technical capabilities, in traditional disaster recovery we leverage capabilities mostly from the storage side, so backups, volume syncs, and those kinds of capabilities. But to build this cloud-native disaster recovery infrastructure or architecture...
C: ...we need capabilities from networking. In particular, we're going to see that we need the ability to communicate east-west. If I am in two different clouds, these clouds have to be able to communicate horizontally, east-west, and we need a good global load balancer capability; that's where the switch happens.
A: I think we may need to differentiate between what the high-level objectives are and what happens in reality, because a recovery point objective of zero is certainly doable and plausible, but it also implies that every transaction, every database action, every file action, whatever the application is using, is going to happen synchronously across multiple sites, which may or may not be the case.
C: We need to get to zero. The point I'm trying to make is that now you can get to zero and it's not that complicated. The narrative used to be that you can make DR as good as you want, as long as you're willing to spend a lot of money. I think with a cloud-native approach that narrative changes a little bit.
C: These architectures are not that much more expensive than the traditional active-passive ones, so much so that in an article I wrote about this I called it the democratization of zero downtime during a disaster, because I think anyone who can swipe a credit card and start deploying on different clouds can achieve this.
G: I think it's not just about cost, to Alex's point. We had the presentation last week, I think it was called Full FS or something, where they were talking about performance trade-offs and how they were not POSIX compliant. So I think it's not just about cost; it's also about performance, because if you want to have zero RPO you have to write to every single zone and get back acknowledgements. So it's not all about cost.
C: These could be regions, if you're thinking about the cloud, or they could be your three data centers in different geographical localities if you think about on-premise; it doesn't matter, the architecture works either way. There is a stateful workload, imagine a database or a queue, that is distributed across these data centers to form a single logical entity, but obviously there are different instances, and these instances communicate with each other via this horizontal, east-west ability to communicate.
C: We don't need to know, in this reference architecture, how that is implemented, but they need to be able to communicate east-west: find each other, look each other up, discover each other, and communicate. That's how they achieve data sync, or state sync. Each of them will have a volume.
C: So we need storage, of course; storage doesn't go away. But we don't ask that volume, that storage implementation, to have any particular capabilities besides the ability to store data. Then we can imagine that there is a front end (or maybe just direct connections, but probably there's going to be a stateless front end), and then there is a global load balancer. The idea is that one of these regions goes down because of a disaster.
C: First, the stateful workload adjusts itself, because it has some kind of leader election and state-sync protocol (we can analyze those in detail); it adjusts itself almost instantaneously and there is no data loss. Then the global load balancer has some level of health checks, so clients will start going only to the regions that are active. So we reacted to a disaster in a completely autonomous way and the clients keep working.
C: Maybe they get a glitch of a few seconds. I work with a database where the glitch can be up to nine seconds, but then everything continues to function normally.
C: So that's the idea; that's the general model. The trick is to find stateful workloads that can actually work that way, and there are some prerequisites they have to implement in order to do this. I just mentioned queues and databases, but obviously it could also be distributed storage.
F: Sorry, I was wondering: in this reference architecture we're basically saying that for DR, the state sync is always going to be done by the application, right? And the application has to have the ability to operate replicas across different data centers, which might potentially have very high latency. Is there any other model we consider, or is this going to be our only reference architecture for DR?
C: Well, for this model that's how the application needs to work. Like I said, and like the slide says, you can still do your active-passive or master-slave models, the ones that have always worked, but you don't get all the automation that I described.
A: Maybe I'd like to suggest a slight refinement here, mostly to do with terminology. When we put the storage landscape paper together, we talked about different ways of persisting data: that could be some sort of volume, but it could also be app-level things like a database, or key-value stores, or object stores, for example, which are also...
A
You
know
valid
ways
of
of
persisting
data,
and
you
know
whether
it's
distributed
storage,
that's
providing
volumes
or
a
distributed
database
or
a
distributed
key
value
store
right.
I
think
what
we're
kind
of
saying
here
is
the
stateful
workload
needs
to
have
a
distributed
way
of
persisting
the
data
and
that
that
could
be.
You
know
distributed
volumes,
it
could
be.
You
know
like
like
a
distributed
file
system
or
a
distributed
storage
system,
it
could
be
a
distributed
database.
A: Like a CockroachDB or a YugabyteDB or Vitess or something, or it could be a distributed object store, and in that case you have that sort of functionality available to the application. So I think it would be useful to frame the stateful workload as some sort of distributed storage layer, where the volume is ultimately where the data is persisted.
C: Right, I agree. I didn't specify what service this stateful workload offers to the green layer in the diagram.
C: It could be anything, but yes, I think I can improve this slide by adding that piece of information. This volume here, like I said, could be the disk on which the stateful workload is running, or it could be another layer of software-defined storage. It doesn't really matter, because the state sync is managed at this layer, the blue layer.
C: Okay, so as I was saying, the document that I wrote tries to explain why this is technically feasible, because you might say, "I don't believe this can be done; we haven't done it for many years, so why is it possible now?" So I try to explain why, and I just have to remind you of a few concepts.
C: I think everybody knows what high availability and disaster recovery are; I just want to define them in relation to what a failure domain is. A failure domain is an area of our IT system where a single event could make everything running in that area fail. It could be a node, for example, which means all the processes running on the node fail.
C: It could be a rack, which means all the nodes in the rack fail, or a cabinet, a cluster, a network zone, a data center. So a failure domain is sort of a fractal concept that looks similar at different scales. What we need to remember for this discussion is that when we talk about high availability relative to a failure domain, we are really asking the question: what happens when one component fails within this failure domain?
C: What happens to my system when one component fails, assuming an HA of one, a fault tolerance of one? I'm assuming that is what we mean by high availability. When we talk about disaster recovery, we are really asking the question: what happens if everything in this failure domain is lost? The failure domain itself fails; what happens to my system? Obviously I need to have other failure domains somewhere, and normally in this case the failure domain is the data center.
C
But
conventionally,
when
we
talk
about
disaster
recovery,
we
talk
about
an
entire
data
center
going
down.
Okay.
So,
with
this
in
mind,
I'll
continue
because
I'm
sure
everybody
knows
this-
this
concept
this
consistently
here
we
we
mean
we
mean
that
all
instances
are
observed
in
the
same
state
and
are
reporting
on
the
same
state.
A: Yeah, I think we define consistency in the white paper pretty well. I really like this slide. What we should pop out of it is that high availability is about recovery from a single point of failure, or something like that, whereas with disaster recovery we're talking about the failure of an entire failure domain. That's a really useful differentiation to have.
C: Right, and I felt the need to make this differentiation because these two concepts, when you talk to customers these days, are starting to overlap. Rightfully so, they would like to treat a disaster recovery event as if it were an HA event, and in the theoretical model I just described that is exactly what happens. Unfortunately that also brings confusion between the two concepts, so it's important to understand what the difference is. Continuing on.
C: The other thing we need to remember is the CAP theorem; I'm pretty sure you all know what it is. The common way to explain it is that, thinking about consistency, availability, and partition tolerance, you can pick two but not all three.
C: I like to tell it in a slightly different way that I think helps in this discussion, which is that you don't choose partitioning: network partitioning is something that happens.
C: Network errors will happen. So, assuming you need to be partition tolerant, how do you design your workload: do you design it to be available, or do you design it to be consistent? That's really, in my mind, the choice you have, and I have here a table showing some of these choices made by some products.
C
You
should
obviously
every
state
for
workload
that
attempts
to
solve
that
attempts
to
be
distributed
has
to
deal
with
this
theorem
and
that's
to
make
a
choice
here.
A: Hey Raffaele, sorry, just one step back. When you say the CAP choice for those examples, for example that MongoDB is "consistency", is that as in it allows deferred consistency, or is it optimizing for uptime? What did you mean by that?
C: When you choose consistency in the CAP theorem, it means that when the system goes into network partitioning, which is when one piece of the system is working but can no longer establish what the other piece of the system is doing, it puts itself into a not-available state. That could mean read-only, or maybe just rejecting calls, because the objective is to keep the state consistent.
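To make that "choose consistency" behaviour concrete, here is a minimal hypothetical sketch (not from the meeting; the class and values are illustrative): a replica group accepts writes only while it can reach a majority of its peers, and otherwise degrades to read-only and rejects calls.

```python
# Hypothetical sketch: a CP-style replica group that rejects writes when it
# cannot reach a majority of peers, i.e. it chooses consistency over
# availability during a network partition.

class QuorumStore:
    def __init__(self, my_id, peer_ids):
        self.my_id = my_id
        self.peer_ids = peer_ids          # other replicas in the group
        self.reachable = set(peer_ids)    # updated by a failure detector
        self.data = {}

    def _have_majority(self):
        # Count ourselves plus every peer we can currently reach.
        cluster_size = len(self.peer_ids) + 1
        return (len(self.reachable) + 1) > cluster_size // 2

    def write(self, key, value):
        if not self._have_majority():
            # Minority side of a partition: refuse the call rather than
            # accept a write that could later conflict.
            raise RuntimeError("not available: no quorum, rejecting write")
        self.data[key] = value            # a real system also replicates to peers
        return "ok"

    def read(self, key):
        # Reads may still be served (read-only mode) even without quorum.
        return self.data.get(key)


# Example: a 3-replica group that loses contact with both peers.
store = QuorumStore("a", ["b", "c"])
store.write("balance", 100)               # quorum present, write accepted
store.reachable = set()                   # partition: peers unreachable
print(store.read("balance"))              # reads still work: 100
try:
    store.write("balance", 200)
except RuntimeError as err:
    print(err)                            # write rejected to stay consistent
```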
C: Eventual consistency is an appealing approach, and I think it has been explored a lot, but there is now an emerging line of thought, which you may or may not agree with, that eventual consistency is kind of a dangerous path, because eventual consistency does not imply eventual correctness.
C
It
just
implies
that
at
some
point
in
the
future-
and
there
is
really
no
sla
that
you
can
put
on
that
statement,
but
just
at
some
point
in
the
future,
all
the
instances
will
agree
on
the
state.
It
doesn't
mean
that
the
state
is
what
you
would
have
expected
from
a
business
stand
to
business
logic,
state
point
of
view
right,
so
so
the
developers
now
have
to
take
extra
care
to
make
sure
that
they
catch
these
these
incorrect
inc.
C: ...consistency decisions, because there is a conflict-resolution algorithm in the stateful workload that, when there is an inconsistency, decides who is right, maybe with a timestamp or something else. So now the developers have to take care of that, and there are some papers, from Google and others, where they discussed...
C
How
painful
was
to
remedy
those
kind
of
things,
and
so
at
least
this,
this
line
of
thought-
and
I
I
like
that
line
of
thought-
is:
let's
keep
everything
consistent
consistent
with
the
risk
of
taking
an
outage,
but
it's
it's
simpler
for
from
a
developer
point
of
view
to
in
many
in
many
cases,
right
it's
simpler
to
to
operate
that
way.
There
are
situations
where,
in
consensus,
it
doesn't
matter
too
much,
and
so
in
those
cases
it's
fine
to
use
to
use
those
databases,
but
I
work
a
lot
in
financial
institutions.
C
Consistency
is
important,
it's
very
important
there,
so
I'm
going
guide,
but
but
this
this
was
to
explain
why
I
focused
here
on
consistency,
so
the
and
and
that's
that
is
really
what
we
mean
when
we
say
zero,
rpo
right,
it's
there
is
no.
There
is
no
inconsistency.
There
is
no
data
loss,
so
consensus
protocol.
C
I
invented
these
two
definitions,
share
state
and
then
share
state.
This
is
really
my
terminology
and
we
can
change
it,
but
the
concept
is
like
well,
let's,
let's
find
consensus.
Protocol
first
is
the
idea
that
I
have
d.
I
have
distributed
workload
that
is,
needs
to
act
as
a
single
logical
entity,
so
they,
the
various
instances,
need
to
agree
on
actions
to
be
taken
right
and
there
are
two
kind
of
protocols
to
agree
on
actions.
The
one
in
the
first
one
here
share
state
is
when
we
have
to
agree
on
all
of
us.
C: The major algorithms in this area are Paxos and Raft, and Raft is gaining popularity because it's much easier to understand. I couldn't even understand Paxos; it's just magic...
C
If
you
try
to
read
it
and
then
there
is
a
shared
state
consensus
protocol,
where
the
participants
to
these
orchestrations
really
are
can
potentially
do
different
actions,
so
maybe
I'm
writing
to
a
database,
and
then
another
participant
is
sending
a
message
in
a
queue
right.
So
in
this
case
we
have
the
this
historically
well-known,
two-phase,
commit
and
three-phase
commit
algorithm,
but
notice
that
these
algorithms
require
all
of
their
instances
to
be
online.
C: We cannot ask them later. So there are a couple of papers from Google showing that, based on these shared-state consensus algorithms, you can build a reliable replicated state machine, which means there is a generic way of agreeing on a state. Raft then gives a generic way of agreeing not just on a state but on a series of actions to be taken, with the concept of an operation log; that is really the state being shared between the instances, and every instance has to apply the operations written in the operation log. Then, building on top of this concept...
C
There
is
the
concept
of
reliable,
replicated
data
store.
Where
now
the
action
here
is
I
have
I
was.
I
have
a
series
of
operations.
I
have
a
log
of
operation
to
to
do,
but
really
the
operation
is
to
write
something
on
on
on
at
that
store.
So
this
is
a
concept
very
highly
reusable
concept
that
could
be
implemented
generically
and
then,
on
top
of
this
I
could.
I
could
put
an
api
to
serve
some
kind
of
storage
service
right,
so
it
could
be
an
api
to
do
q.
C: ...because if you look at the Apache BookKeeper project, here in the node on the left, that is exactly a reliable replicated data store. The operation they abstract is really what Kafka does: append-only operations to a sort of file-system file. In fact, Apache BookKeeper is being used to implement a highly distributed, geographically distributed queue system, Pulsar.
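As a rough, hypothetical illustration of the replicated-state-machine idea described above (consensus itself is stubbed out here; a real system would replicate the log with Raft or Paxos), every replica applies the same agreed operation log in order and converges on the same state:

```python
# Hypothetical sketch of a replicated state machine built on an agreed,
# append-only operation log: every replica applies the same operations in
# the same order, so all replicas converge on the same state.

class Replica:
    def __init__(self, name):
        self.name = name
        self.state = {}       # the materialized state
        self.applied = 0      # index of the last applied log entry

    def apply_from(self, log):
        # Apply, in order, every log entry this replica has not seen yet.
        for op, key, value in log[self.applied:]:
            if op == "set":
                self.state[key] = value
            elif op == "delete":
                self.state.pop(key, None)
        self.applied = len(log)


# In a real system the log is replicated by a consensus protocol; an entry
# is only appended once a majority of replicas has acknowledged it.
operation_log = [("set", "x", 1), ("set", "y", 2), ("delete", "x", None)]

replicas = [Replica("r1"), Replica("r2"), Replica("r3")]
for r in replicas:
    r.apply_from(operation_log)

# All replicas end up with the same state: {'y': 2}
assert all(r.state == {"y": 2} for r in replicas)
```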
C: So, putting it all together: a stateful workload can have replicas, and we have just seen how we can coordinate those replicas with Paxos or Raft. Then we can have partitions, meaning I partition the data set so that each group of replicas manages a subset of the data set, and I do that to be able to scale horizontally. And between partitions...
C: ...I can use the unshared-state protocol, and that is how I can create a highly scalable, distributed stateful workload. Here I have collected some examples of these workloads, because there are starting to be many.
C: Some of them don't support partitions. Some of them don't have inter-partition operations: they support partitions, but you can only work with a single partition at any given time. In general I thought it was a good exercise, and these are actually the right questions to ask if you are examining a stateful workload and deciding whether you want to use it or not.
C: Okay, the other thing I looked at... sorry, go ahead.
C: Terminology, right. So a client may try to do an operation that needs to touch multiple partitions; so far so good. For example, I think in Elasticsearch each index is a different partition, or something similar, so if I try to add a document to two indexes in a single transaction, I need to do that operation across these two partitions.
G: Maybe I wasn't being very clear. I guess it kind of depends on how you look at it, because replicas can also denote partitions, or at least... but here basically you're meaning, if you're doing...
C: Okay, let's say my data set goes from A to Z. I could say that I want partition A to deal with A to M, say, and partition B to deal with N to Z. So I divide my entire data set into different ranges, and each partition is essentially a standalone stateful workload operating on a shorter interval of that data set, and one partition doesn't have to know anything about the other partitions except...
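A minimal sketch of the key-range sharding just described, using the A-to-M and N-to-Z ranges from the example (the class names here are illustrative, not any particular product's API):

```python
# Hypothetical sketch of range-based sharding: the key space A-Z is split
# into two ranges, and each shard (itself a replicated stateful workload)
# only stores keys that fall inside its own range.

class Shard:
    def __init__(self, low, high):
        self.low, self.high = low, high   # inclusive key range, e.g. 'a'..'m'
        self.data = {}

    def owns(self, key):
        return self.low <= key[0].lower() <= self.high

    def put(self, key, value):
        assert self.owns(key), f"key {key!r} does not belong to this shard"
        self.data[key] = value


class RangeRouter:
    """Routes each key to the single shard that owns its range."""
    def __init__(self, shards):
        self.shards = shards

    def put(self, key, value):
        for shard in self.shards:
            if shard.owns(key):
                shard.put(key, value)
                return
        raise KeyError(f"no shard owns key {key!r}")


router = RangeRouter([Shard("a", "m"), Shard("n", "z")])
router.put("apple", 1)     # lands on the a-m shard
router.put("zebra", 2)     # lands on the n-z shard
# A transaction touching keys in both ranges needs cross-shard coordination
# (e.g. two-phase commit), which is the point raised in the discussion above.
```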
A: Yeah, all right, maybe I see what you're pointing out. I think the use of "partitions" here is possibly unhelpful from a terminology point of view, and maybe it would be easier if we just called them shards, simply because partitioning, as in the verb applied in the CAP theorem, is different from partitions when we're referring to a shard.
A: Yeah, in fact this was one of the things we debated when we were putting the landscape together, and we ended up putting in a table to describe shards, replicated shards, and sharded replicas as well, because different storage systems apply them in different ways. But yeah, I think it would just make it easier for everyone if we called them shards on this slide.
C: I can do that, cool, taking notes. All right. Sorry to the gentleman that was asking the question.
C: Okay, cool. So these are just some databases, some stateful workloads, that I have classified along those parameters.
C: What I have explained so far is really generic and would work anywhere, with any deployment, but I thought we could take a closer look at Kubernetes and how this would work there. It's essentially the same slide as before, except that now there is a Kubernetes cluster in which our workload is running.
C
So
we
can,
we
can
translate
it
to
more
close
more
closely
to
kubernetes
concepts.
So
we
have
a
persistent.
We
will
have
a
persistent
volume,
we
will
have
ingresses
the
global
load.
Balancer
has
to
load
balance
for
you
know
to
these
ingress,
or
you
know,
ingress
is
using
generic
terms.
This
could
be
a
load
balancer
service
or
it
could
be
an
ingress
object,
and
this
is
where
you
see
better
well.
What
I
meant
by
I
need
to
have
this
east-west
cap.
C: ...networking capability, because building that across clusters is not necessarily straightforward today with Kubernetes; it can depend on the CNI implementation you're using or the cloud where you're running.
C: I didn't set up the demo. I mean, I have it set up, but I wasn't planning to run it today, and I don't know how much time we have; I could certainly run it in one of the next meetings. But just to explain what the demo is about: we would have this Cockroach database distributed across clusters in three different regions. Right now my setup is on AWS, but it could be...
C
It
could
be
anything
we
deploy
a
network
channel
in
the
case
of
I'm
running
an
openshift.
So
in
the
case
of
network
in
the
case
of
openshift,
we
need
to
deploy
a
network
channel
to
make
this
cluster
be
able
to
talk
to
each
other
in
a
horizontal
way.
So
without
doing
eagers
and
ingress
like
this,
we
are
essentially
merging
the
sdns
into
a
single,
larger.
C
Network,
so
that
everything
is
routable
and
discoverable
to
do
that,
we
use
a
pro.
We
use
an
operator
and
a
product
called
submariner,
which
was
initially
developed
by
a
rancher,
but
now,
I
think,
is
joining
the
cncf
as
a
product
and
that
basically,
it
establishes
a
ipsec
based
vpn
across
the
across
the
sdns
of
the
clusters
and
then
with
I
deploy
a
global
load
balancer
with
health
checks
on
route
53,
using
using
an
operator
that
talks
to
route
53
and
and
makes
this
configuration.
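As a simplified, hypothetical illustration of what those global-load-balancer health checks accomplish (this is plain Python, not the actual Route 53 operator configuration, and the endpoints are made up): each region's ingress is probed and only healthy regions are handed out to clients.

```python
# Hypothetical sketch of global-load-balancer behaviour: probe the ingress
# endpoint of each region and only return healthy regions to clients.
# A real setup (e.g. Route 53 health checks) does this in DNS; here it is
# simulated with plain HTTP probes.

import urllib.request
import urllib.error

# Illustrative endpoints; in the demo these would be the ingress hostnames
# of the three clusters.
REGION_ENDPOINTS = {
    "us-east-1": "https://db.us-east-1.example.com/health",
    "us-west-2": "https://db.us-west-2.example.com/health",
    "eu-west-1": "https://db.eu-west-1.example.com/health",
}

def is_healthy(url, timeout=2):
    """Return True if the endpoint answers the health probe with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False

def healthy_regions():
    """The set of regions a client should be directed to right now."""
    return [region for region, url in REGION_ENDPOINTS.items() if is_healthy(url)]

# When a region is taken down in the demo, it simply disappears from this
# list and clients keep talking to the remaining regions.
print(healthy_regions())
```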
C: What is it called, local majority: so local leader election and then global leader election. We have nine instances, but they behave like a single cross-regional entity. The way the demo works is that I take down one region and we see that the clients just keep working normally. We set up some clients here that run the TPC-C test, which is a standard SQL test for highly transactional workloads.
A: Hey Raffaele, so is some sort of network handling, like Submariner, mandatory for this sort of architecture, or...?
C: All of these stateful workloads work the same way: each instance needs to discover and establish a peer-to-peer connection with all the other ones. That's necessary for the Raft coordination to work, so discovery and connectivity are needed.
C: How you implement it is up to you. For example, I know that if you use the Google Kubernetes service, you can build the cluster and switch a flag so that, if all the other clusters are in Google regions, they will just be able to talk directly.
C: They give it to you. But other distributions of Kubernetes may not have this capability, so you have to provide it somehow. I can only speak for OpenShift on this particular capability, and that's how we're doing it.
A: Right, and in this example the database has nine nodes total, three in each region. Does that behave like a single logical database? Is that the gist of this?
C: Yes, that's exactly what happens, and it's actually nice to see. If you want to see the demo, I'll be very happy to show it to you next time. But yes, it behaves like a single database from the client's perspective.
E: Could you highlight on this slide, just to follow the previous slides, where you're doing your replicas and where you're doing your shards? I mean, it's pretty obvious that you want your replicas in the regions and then you're sharding within that, but since you have three and three here, it's not clear.
C: So Cockroach, based on how you use the data, can reshard and can decide how to shard. You can hint how to shard tables when you create them, but you don't have to; it knows what to do. I think they use the name "tablets" for shards, so that's yet another name. It creates its own tablets, you don't have to decide, and these are nine replicas.
C
So
all
the
database
is
re
is,
is
fully
replicated
everywhere,
except
we
don't
have
to
have
all
of
these
instances
agree
to
in
order
to
proceed
with
that
transaction,
and
that's
that's
how
they
can
make
it
efficient.
We
I
did.
I
did
this
with
the
cockroach
guys
and
you
really
for
this
question.
You
need
to
talk
to
them,
but
we
run
a
performance
test.
C
So
keep
in
mind
in
in
amazon
between
is
the
u.s
east
and
u.s
west
region.
There
is
about
70
milliseconds
of
latency.
So
that's
that's.
Just
physics,
there
is
nothing
you
can
do
around
that,
but
with
that
kind
of
latency
we
were
still
able
to
run
the
tpcc
test
with
97
efficiency,
which
the
tpcc
1000
sorry,
so
that
emulating
1000
databases
doing
oltp
so
highly
highly
transactional
kind
of
operation.
So
not
it's
not
data
warehousing
or
you
know.
Big
queries
is
more
insert
insert
selections
to
select
these
kind
of
things.
C
So,
with
that
kind
of
traffic
pattern
emulating
one
thousand
instances
we
did
64,
we
sorry
96,
which
is,
which
is
almost
the
same-
that
you
would
get
from
an
analytical
database.
Probably
more
analytical
database
can
do
a
little
bit
more,
but
it's
it's
close
to
the
theoretical
limit
100.
A: I guess from a concept point of view this applies to just about any distributed storage, right? If you have a logical instance that combines sharding and replicas across multiple cluster instances, and you have some sort of network tunneling, then this can potentially apply to distributed file systems, key-value stores, and object stores. So we can probably make this a fairly generic play as well.
C
Right
that
that's
my
objective
here,
I
I
don't
think
it
matters
what
the
stateful
workload
does.
What
what
we
are
finding
a
solution
for
here
is
replicate
state
across
regions
right
and
or
keep
stating
sync
across
the
region
better.
So
I
think
it
can
be
done
with
other
ap.
You
know
interfaces
because
this
is
a
sql
interface
right.
It's
a
sql
service.
C
In
fact,
I
would
like
to
be
able
to
showcase
this
this
same
architecture
with
other
kind
of
workloads,
because
it
proves
the
point
right,
the
the
point
right
now,
one
might
say:
okay,
it
works
with
coco
cb,
but
it's
not
a
general
solution,
but
if
I
can
make
it
work
with
other
products,
then
it
starts
to
be
a
generic
statement.
More.
G: I guess the part that can vary across different distributed databases or file systems is how they consume this topology. For this demo, how did you convey the topology, that there are three different availability zones? How did you make Cockroach aware of it, so that proper sharding happens across AZs as opposed to within the same AZ?
C: Cockroach has some parameters that you need to pass to the process when you run it to make it topology aware. Using the downward API and other approaches, I make the pods aware of where they run, and that's how it decides how to do the sharding, because, like I said, it's a nice property of it that it does all the sharding itself.
C
Right,
yeah,
yeah
cockroach
understands
one
level
of
topology,
I'm
working
with
another
database
now
gigabyte,
which
understand
multiple
layers
of
topology
potentially
so
it
understands
cloud
region
and
and
az
passing
these
parameters.
You
make
it
aware
of
where
each
instance
runs,
and
then
they
can
make
a
decision
on
how
to
distribute
the
data.
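As a toy sketch of how topology awareness can drive placement (a hypothetical illustration, not CockroachDB's or YugabyteDB's actual logic; the instance names and labels are made up), replicas of a shard are spread across distinct regions based on the labels each instance advertises:

```python
# Hypothetical sketch of topology-aware replica placement: each database
# instance advertises the region/zone it runs in (for example via the
# downward API), and replicas of a shard are spread across distinct
# regions, so losing one failure domain never loses all copies.

from itertools import cycle

INSTANCES = [
    {"name": "db-0", "region": "us-east-1", "zone": "a"},
    {"name": "db-1", "region": "us-east-1", "zone": "b"},
    {"name": "db-2", "region": "us-west-2", "zone": "a"},
    {"name": "db-3", "region": "us-west-2", "zone": "b"},
    {"name": "db-4", "region": "eu-west-1", "zone": "a"},
    {"name": "db-5", "region": "eu-west-1", "zone": "b"},
]

def place_replicas(instances, replication_factor=3):
    """Pick one instance per region, round-robin, until the factor is met."""
    by_region = {}
    for inst in instances:
        by_region.setdefault(inst["region"], []).append(inst)
    placement = []
    pools = cycle(list(by_region.values()))
    while len(placement) < replication_factor:
        pool = next(pools)
        if pool:
            placement.append(pool.pop(0)["name"])
    return placement

# With three regions and a replication factor of 3, each replica lands in a
# different region, e.g. ['db-0', 'db-2', 'db-4'].
print(place_replicas(INSTANCES))
```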
A
And,
and
how
do
you
kind
of
like
define
the
topology
somehow
somewhere
because
it
could
just
be?
It
could
just
be
labels,
but
but
just
as
equally
right,
they
could
be.
They
could
be
looking
that
data
up
in
a
discovery
service
as
well.
C: Yeah, discovery and topology are fundamental. In the case of Submariner, it comes with a discovery service, so if I know what to look up, if I know the name of the server (these are StatefulSets, so I know the names of the individual instances), I can look them up from this cluster, just because I have a globally distributed discovery service. But yes...
C
If
you
don't
use
submariner,
you
will
have
will
have
a
way
you
need.
You
need
a
way
to
do
that
right.
For
example,
celium,
if
you
know
psyllium
is,
is
another
cni
that
you
can.
You
can
configure
in
your
kubernetes
cluster
psilium
support
network
tunnel
out
of
the
box,
so
it's
a
switch
that
you
can
turn
on.
I
think
what
is
the
other
famous
one,
the
other
famous
cni
calico.
A: Interesting. Maybe it's worth pinging SIG Network and seeing if they have any information about those product capabilities.
C: Yeah, we can do that. I think it's the Multi-Cluster SIG, but there is some SIG that has defined a standard spec for cross-cluster discovery. They don't define the tunneling, but they define the cross-cluster discovery, and Submariner implements that spec.
A: Very cool. We're actually a minute over, so I think we're going to have to call time, but this was brilliant, Raffaele, and I think we've got something solid to work on.
A: Thanks everyone, and we'll see you all in a couple of weeks. Bye-bye. Thank you, Raffaele.