From YouTube: Data consistency across cloud native systems
Why is it important? Then I'll walk you through an example, and then we're going to look at a couple of different systems that are ubiquitous components in cloud native architectures. We'll talk about some aspects of them that are interesting, some that may be surprising, the history behind the ideas they embody, and what they've contributed to the overall landscape of architecting software, with an eye towards the consistency of the data being used.

Jumping into that: I am Jimmy Zelinskie. I am the co-founder of a company called AuthZed. We are the creators of a database system called SpiceDB. SpiceDB is a data store specifically designed for storing and computing authorization data, so it is basically the core engine that your business would use to compute whether a person has access to perform an action or not. I like to use the term permission system rather than authorization system; I think it's far more approachable and defines the problem much better.
You can see how consistency is a core part of this: fundamentally, permission systems for software have to be correct, or else there is a security flaw. If some software allows someone to perform an action that they otherwise shouldn't, that is absolutely mission critical in most software. And this work has to happen at scale and at low latency, because absolutely every action that takes place in your software system has to check whether access is allowed to perform that action.

That puts us in the critical path, so data consistency is super important for SpiceDB. Before this, I was working in the cloud native space. I worked at a company called Red Hat by way of the CoreOS acquisition, so I've been working in this space since before the CNCF was even created; CoreOS was building distributed systems and container technologies.
That is basically the foundation of cloud native software, since before the CNCF and this whole Kubernetes ecosystem emerged, and in that time I have contributed to a bunch of cloud native projects.

I've co-created some. I am also a maintainer of OCI, which is the standards body for container images. This all folds back into my passion for distributed systems. Even before working at CoreOS I always had an eye towards distributed systems, as an early adopter of a project called etcd, which ultimately became the data store used by Kubernetes. I'm going to talk a bit about etcd later in this talk. As part of building large-scale SaaS systems on cloud native software, I have also run MySQL and Postgres, these types of relational databases, at scale. I've seen where they fall over; I know the sharp edges.
When you build enterprise software, for example, you try to do things without introducing new dependencies on other systems, so your customers don't have to set up yet another software dependency, and you start to bend a lot of these products to your will, in ways they should not bend. You end up trying to get database properties out of databases that were never designed to do certain things.
I also left my contact information on this slide, so if at any point you want to reach out to me, feel free to just shoot me an email with a question. On the final slide I'm also going to link to a Discord community that you can join to discuss distributed systems in general, or data consistency, but I prefer email.

You might also see me around on Twitter or GitHub under these handles. But enough about me; it's time to move on to the actual primary subject: you may have seen these terms thrown around in your software development career.
Fundamentally, these concepts are not unique to databases. So many software systems all store data; eventually they punt it off to a database that is then responsible for maintaining it, but the systems themselves are still offering views of data, and potentially modifying data, before they pass it all around. So whether we're talking about databases or microservices, the consistency of the data you're working with is always going to be relevant.
It's actually really interesting: ACID, one of the super popular acronyms thrown around in this space, stands for atomicity, consistency, isolation, and durability, and that acronym has a story around it. Some folks believe the C was just made up to make the acronym work. That C is consistency, which is the topic of this whole presentation. So I hope by the end of it I can explain, at least somewhat formally, how I think about consistency, and why, even if that was the initial intent, it is definitely not the case that consistency is a made-up concept invented just to make an acronym work; you'll find it used in many other places and discussions outside of just the term ACID.
So it clearly has relevance on its own. I'm going to use some of these terms here, in different contexts, so that their definitions become more clear, rather than just trying to describe them abstractly in a vacuum.
I've talked a lot around these things, but I still haven't covered the very basics yet: what actually is consistency? I didn't use the Wikipedia definition that you'd find by Googling; instead I defined it the way I like to think about it, and the way I feel most engineers colloquially use it. I think that's really important, because you can go look up a lot of this terminology and read a very dense article, or read research papers that talk about these concepts, but that doesn't matter if you're just trying to communicate something to your fellow engineer. What actually matters is that you have an effective communication tool and you both have a shared understanding of the topic.
So I tried to define it in my own words, rather than the mathematical terms you might find elsewhere. How I define it is strictly around the contract of how data can be observed in a system. I often talk about freshness here: how fresh the data you're working with is becomes part of that equation. But fundamentally, the core concept, and the way people most often use this term, is largely around quote-unquote correctness.

And correctness is context dependent, which makes it tricky. It depends on what type of system you're trying to build, and when you build these systems, you're going to first talk about the problem and then work backwards to find the consistency model that is going to work for your solution.
So why does consistency matter, and why are we working backwards to arrive at it? Because when you're building applications, you fundamentally have a contract between your expectations of the data you're going to use in the application, the data in the database, and the data that users will see in the application you've built. If that contract is broken, systems can explode in catastrophic ways.

Silent errors can occur, data corruption can occur, and if you want to fix these inconsistency problems later, certain things will actually just be impossible for you to do without totally re-architecting your software around something more consistent. The door closes behind you when you move to a less consistent system: you don't have the capability of adding consistency in retroactively, and that's the scary part about consistency.
You really need to understand your problem and your domain first, because if you pick something that is not going to jive with the system in the future, you are going to be in a world of pain, probably re-architecting, or carving out some subsection of your application that has to be treated specially, with completely isolated data that works at a higher consistency level. All of that might not mean much yet, but I'm going to go through a concrete example now, and then eventually we'll talk about some real-world systems and how this all plays out in those components.
So here I've got this example. It's hypothetical, but it's a real problem that's actually faced by everyone designing e-commerce systems in the world, unless they're building on top of a pre-existing system from someone who has already solved this problem for them. But even then, as you extend those systems with your own, you still have to continually think about consistency and how it plays out. But I digress.

Here is the hypothetical scenario. There are two humans involved: a child and a parent. The parent is supervising the purchase of an item by the child online.
The child is going to review the orders on their account and see if the item has been purchased yet; then they're going to purchase the item; and then the parent is going to double check, just to make sure the child did the correct thing. So we have this flow over time: the child first reads the orders and sees that an order hasn't taken place yet, so they purchase the item, and then the parent reads the orders and finds that the child successfully purchased the item. The child was good; everything acted accordingly.

I just want to make this concrete one more time: nowhere have I mentioned servers, databases, microservices, none of that. This is all purely from the external-facing side of the system, the user. At the end of the day, sometimes your users are real people.
Sometimes your users are other services in your microservice architecture; sometimes they are the actual service and you are the database. But the point is that the types of problems I'm going to describe in this scenario play out regardless. It does not matter what those actors actually are; this is equally capable of happening in all of these cases. So here is the problem. This is another way the order of events can take place.
The child reads the orders, the parent reads the orders, the child buys the item, and then the parent buys the item. Why did this happen? Because the parent checked the orders in the window right before the child actually purchased the item. At that point in time the parent looked at the order list and said: oh, my child didn't purchase this, so now, as the failover, I have to go and purchase this item, because the shopping trip was not successful. But actually the child was successful.

The parent just checked too early, and these events got interleaved. This is a problem because the parent fundamentally made their decision based on stale data.
By the time the parent made their purchase, the read they had made was technically invalid, because the child had already purchased the item. They would have had to re-read before finally making that purchase to do this correctly, but they had absolutely no signal to tell them that they needed to re-read. Now, computer scientists love to sound really smart, and they like to use words from math and physics, so there's actually a term for the relationship between these two events: causality, or causal ordering, or causal dependency, because the purchase of the item is dependent on the read.
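The interleaving above can be sketched as a toy simulation. This is not any real e-commerce API; the item name and function are purely illustrative stand-ins for "read the orders, then decide whether to buy":

```python
# Toy simulation of the parent/child race: each actor decides whether
# to buy based on a snapshot of the shared order list, but nothing
# forces the parent's read to happen after the child's write.
orders = []  # shared state: the account's order history

def buy_if_missing(who, snapshot):
    # The decision is made against a possibly stale snapshot.
    if "toy" not in snapshot:
        orders.append("toy")
        return f"{who} bought the item"
    return f"{who} saw it was already bought"

# Bad interleaving: both reads happen before either write.
child_view = list(orders)   # child reads: no order yet
parent_view = list(orders)  # parent reads: still no order (stale!)

print(buy_if_missing("child", child_view))
print(buy_if_missing("parent", parent_view))
print(orders)  # the item ends up purchased twice
```

Because the parent's read is causally ordered before the child's write, its later purchase is based on data that was already invalid.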
Moving on from here, there is a really obvious way a lot of people think about solving this problem, and it truly does solve it: just combine the things with causal dependencies. Why can't they happen at one point in time? When folks think about this, they typically think about transactions in relational databases, or atomic operations in the sync libraries of their programming languages.

This does solve the problem, and it leads to a pretty good segue, which is that so far I've really been describing atomicity in this example. That's the A in ACID, if you'll recall, but not the C.
That's because, while we have grouped all these things together, we've been working under the assumption that every single time an action takes place, if we follow the flow of time, it is immediately visible to all of the outside actors in the system. This is where we start to really get deep into the physics and relativity analogies that exist in distributed systems.

We can imagine scenarios where, when you actually perform these actions, the visibility happens later. So, a child and a parent: the exact same story plays out, atomically still. The child performs the atomic operation, but the parent's read happens before the atomic operation is visible to all participants in the system.
This happens far more in distributed systems because, for example, you might have a read replica that is getting changes streamed to it asynchronously, making a best effort to stay up to date with the most recent information, so that folks in another geographic region get fast performance on stale data, or at least data not at the most consistent level. That's quite a common real-life scenario, largely in read-heavy systems.

This is the difference between atomicity and consistency: this visibility aspect that shows up in relativistic systems, and it is what will play out time after time in any distributed system.
It may seem far-fetched looking at this example: how can visibility actually be delayed if it's not a distributed system? But even in a normal database, running on a single machine, there is a window between the time at which a transaction commits and the time at which it is actually propagated to the visible data being queried, and in that window you can race against visibility.
So I have this consistency spectrum, where I lay out the problem in two different dimensions for talking about the types of consistency systems can have. Across the bottom, on the x-axis, I have immediate consistency, which is what I was describing for most of the example when I was talking about atomicity: once a change happens, it is immediately visible to everyone in the system. Then, on the right-hand side of the x-axis, I have eventual consistency: time passes and eventually folks will receive the update; they'll eventually see the change that has occurred, and it could be a basically arbitrary amount of time until that happens. This is what's most commonly discussed, I feel, when folks talk about consistency: do we need immediate or eventual consistency?
The other dimension is ordering: if something occurs first, it is guaranteed to be ordered before the thing that happens after it. That sounds a little bit specious, but I'll get into why it's relevant, and the systems that benefit from it, later. I've dropped in some examples of these types of systems; for instance, linearizability is the most immediate and the most strictly ordered.

That is the strongest guarantee you can find in a system. What it actually means is that there is a total global ordering across all of the nodes for each change in the system, and when those changes are applied, they are immediately visible to everyone.
This is the strongest guarantee you could possibly get. On the polar opposite end, I have eventual consistency that is also weakly ordered. There is an interesting bit of technology there called a conflict-free replicated data type. CRDTs are a building block that a lot of systems are exploring right now, and when you have a scenario where it doesn't matter in what order changes are applied, they let you propagate changes as a very effective way to synchronize data. They basically rely on the property of commutativity.
You might remember this from an abstract algebra class, or maybe just from learning addition as a child. Addition is commutative: one plus two equals three, and two plus one equals three. It doesn't matter in what order you receive the changes; when you sum together a bunch of numbers, they're always going to converge to the same result.

So if you're performing operations on your data that, regardless of the order you apply them in, always converge to the same result, that's great. That means you're going to be fine: you can use one of these systems and still get correct answers.
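As a minimal sketch of that commutativity property in action, here is a grow-only counter (a G-Counter, one of the simplest CRDTs). The replica names are illustrative; the point is that merging replica states in either order converges to the same result:

```python
# Grow-only counter CRDT sketch: each replica only increments its own
# slot, and merging takes the element-wise max, which is commutative
# and idempotent, so merge order never changes the converged value.
def increment(state, replica):
    new = dict(state)
    new[replica] = new.get(replica, 0) + 1
    return new

def merge(a, b):
    # Element-wise max over the union of keys.
    return {k: max(a.get(k, 0), b.get(k, 0)) for k in a.keys() | b.keys()}

def value(state):
    return sum(state.values())

r1 = increment({}, "replica-1")                          # one event on replica 1
r2 = increment(increment({}, "replica-2"), "replica-2")  # two events on replica 2

# Merging in either order converges to the same state and total.
assert merge(r1, r2) == merge(r2, r1)
assert value(merge(r1, r2)) == 3
```

Note the counter can only grow; supporting decrements or arbitrary writes requires richer CRDT designs, which is exactly why this technique only fits data whose operations commute.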
There are other variants in the space. I use serializability as an example: there is an order, but unlike an immediately consistent system, there's no total global ordering; it just means everything happens independently and isolated, in some order. And then we also have eventual consistency that is still strongly ordered, which is more similar to what you see out of a lot of the NewSQL systems. I just wanted to show that there are varying levels along the spectrum; it's not just the opposite corners and a middle ground. But I want to dig a little deeper into this, because there are super important properties to understand here.
If you look at the bottom, on the x-axis I've replaced immediate and eventual with slow and fast, because in the real world this is the implication of choosing one of these. It is way, way less performant to choose something that is immediately consistent, because before you write something, you have to make sure it is going to be visible to everyone. That means it probably has to be replicated everywhere before it becomes visible and is accepted as a write. With CRDTs, by contrast, you can basically dump out a stream of changes, hope that eventually someone gets all of them, and then they're good.

What is really interesting here is that we have this middle ground, this box, and I call this box cleverness, because this is where you're going to find a lot of the stuff in the real world: compromising, and trying to make a lot of systems viable.
If you're stuck in one of those camps, you're stuck in either corner of the spectrum, but actually most systems will not have those extreme problems; instead they're going to live somewhere in this middle ground. And this middle ground is where there are a lot of interesting tricks that you can partially apply to gain a lot of benefits in your system, without necessarily paying the costs globally across all of the data you're working with. Apologies for some typos in those slides.
S3. I wanted a really good example, and this is actually kind of funny. I wanted a good example of a ubiquitous, eventually consistent system that a lot of folks are using, and I immediately thought of S3. I built a product in the past on top of S3, and you would submit blobs to S3; it would tell you, hey, I wrote it, that's great. But if someone else immediately tried to pull down that blob, it would not be available yet.

So there was no guarantee that after you had written something, it was immediately viewable to external actors. And actually, as I went to make this presentation, I found an article: AWS fixed this. They changed it a couple of years ago. So I don't know if this still holds across all the implementations of blob storage.
Eventual consistency was the status quo there, and what is interesting, and I'm going to dive into it deeper later, is that the system backing S3's metadata storage was given additional consistency capabilities, which is what made it possible for the developers of S3 to change this and make it more consistent.

That is very typical for systems, and it's actually an example rebutting my conjecture that it's impossible, or prohibitively hard, to add consistency after you've designed a system that doesn't have it. But S3 is sufficiently simple that it was actually not much of a hurdle once the underlying dependency offered that capability. So remember to take everything I say with a grain of salt.
You know the system you're working with, and I have to speak in generalities about the systems I think are most commonplace, the ones I've seen most often; maybe you're building something that's not exactly that.
So here's one that's going to be really fun: relational databases. This is a system a lot of people are familiar with, and I think in the general case a lot of developers believe that by simply adopting transactions in their usage of a relational database, they have solved their consistency problems. I'm here to tell you that transactions are not a silver bullet, not nearly. What dictates your consistency in these systems, far more than transactions, is basically the isolation level set in the database. Even if you don't use transactions at all, the statements you send the server are implicitly wrapped in transactions. Everything is a transaction, basically an atomic unit, inside a relational database, whether you use the keyword in the SQL you're writing or not.
So what are isolation levels? Isolation levels are an aspect of the database, usually set at the session or transaction level, and what an isolation level effectively says is: this is the consistency of the data that you're working with in that context. There's no strict standard for this, but MySQL and Postgres have roughly agreed on the same basic isolation levels. In MySQL, the default isolation level is REPEATABLE READ; you'll see that's second from the top. In Postgres it's actually READ COMMITTED, third from the top. So Postgres is by default more lenient: it can give you less consistent responses by default, if you don't go in and clarify what isolation level you need in the SQL you're writing. So I want to run through the different types of scenarios outlined here.
Dirty reads are when, in your transaction, you read a row and see changes from another transaction that has not been committed yet. It hasn't been written to the database, but you'll see that data; you'll see changes that have occurred. Basically: I open a transaction and try to modify something; you open a transaction and go to read that thing; you'll see my change. That is a dirty read, and unless you're really explicitly choosing to go inconsistent, it's unlikely to ever be a scenario you'll see when using a relational database, unless you specifically ask for READ UNCOMMITTED.

So that's not a super common problem, but it's interesting to note that it's even possible. Allowing it effectively eliminates some of the benefits you get from MVCC, multi-version concurrency control, so it is super atypical unless you're working with a database that is not MVCC.
Then we have the non-repeatable read, which is when you re-read data and find it has been modified by another committed transaction. This means that within your transaction, another transaction modifies some data you've already read, and if you go to read that data again, you'll see it updated inside your transaction. This breaks the atomicity that a lot of people consider transactions to provide them, but you'll find that non-repeatable reads are actually possible under READ COMMITTED.
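To make the anomaly concrete, here is a toy in-memory sketch, not a real database engine, of READ COMMITTED-style behavior: reads always return the latest committed value rather than a transaction-start snapshot, so a concurrent commit changes the answer between two reads in the same transaction:

```python
# Toy model of a non-repeatable read under READ COMMITTED-like rules.
committed = {"balance": 100}  # the committed store every read consults

class Txn:
    def __init__(self, store):
        self.store = store
        self.pending = {}  # this transaction's uncommitted writes
    def read(self, key):
        # No snapshot: reads see the latest committed value.
        return self.pending.get(key, self.store[key])
    def write(self, key, value):
        self.pending[key] = value
    def commit(self):
        self.store.update(self.pending)

t1 = Txn(committed)
t2 = Txn(committed)

first = t1.read("balance")   # t1 sees 100
t2.write("balance", 50)
t2.commit()                  # t2 commits concurrently
second = t1.read("balance")  # t1 re-reads and now sees 50

assert first == 100 and second == 50  # same txn, two different answers
```

Under a snapshot-based level like REPEATABLE READ, `second` would still be 100, which is exactly the guarantee the stricter level buys you.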
So by default in Postgres, this is totally a scenario that can happen to you, even though that's probably shocking to people who believe: hey, I'm supposed to just be reading from one particular snapshot; I'm not supposed to see these types of changes. But it's quite possible. And then finally we have the phantom read, which is more commonly the quietest one, because you're not going to see it, but it's also the thing that's going to actually corrupt your data. Well...
It's not going to corrupt your data by itself, but it is definitely going to be the most surprising way your data could get corrupted. Basically, you read some data in your transaction, another transaction modifies it, and when the database goes to apply them, it just happily applies both. It doesn't try to rerun your transaction. Your transaction performs a read and then, depending on the value of that read, performs its write; if another transaction comes in before it and totally swaps that value out, it doesn't matter, yours is just going to proceed anyway.

If you had re-read the value, as in the non-repeatable read scenario, it would not have mattered: it would have lied to you and said the value hadn't changed. But when these transactions get committed, the value will have changed, and you're going to be out of luck. Phantom reads are actually possible even at the default isolation level in MySQL.
So unless you have explicitly configured your data store to be SERIALIZABLE, there is no way to avoid all of these anomalies. Now, the interesting part comes in that cleverness box I showed off in the diagram earlier.
There is a SELECT FOR UPDATE clause that you can write that says: I'm going to read this row, because I am going to causally update some other row based on its value. That lets you describe this causal dependency, this causal ordering, in terms the database can understand. What actually happens internally when you use SELECT FOR UPDATE is a row-level lock on that data, which prevents any other transactions from modifying it for the duration of your transaction, until you can actually commit. This is what gives you the guarantee that no one else changed that value out from underneath you, and it is kind of having your cake and eating it too.
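To sketch why pinning the row fixes the parent/child race from earlier, here is a pure-Python stand-in where an ordinary mutex plays the role of the row-level lock that SELECT ... FOR UPDATE takes; it is a simulation of the locking idea, not real SQL:

```python
# Hold a per-row lock across the read AND the causally dependent
# write, the way SELECT ... FOR UPDATE pins a row until commit.
import threading

orders = []
row_lock = threading.Lock()  # stand-in for the database's row-level lock

def buy_if_missing(who):
    with row_lock:                # "SELECT ... FOR UPDATE": read is pinned
        if "toy" not in orders:   # the read
            orders.append("toy")  # the write that depends on the read
            return f"{who} bought the item"
    return f"{who} saw it was already bought"

threads = [threading.Thread(target=buy_if_missing, args=(who,))
           for who in ("child", "parent")]
for t in threads:
    t.start()
for t in threads:
    t.join()

assert orders == ["toy"]  # whichever actor runs second sees the purchase
```

Because the read and the dependent write happen inside one critical section, the stale-read interleaving is impossible, and only for this one row; nothing else in the "database" pays for the lock.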
You don't have to turn on full serializability to get that guarantee; you can do it in any of these modes. That is the cleverness box again: you're selectively choosing, I need consistency for this operation right here, locally, but I don't need consistency across the board everywhere.
There's a lot more, and there are a lot of other tricks deep inside these relational databases for working with this data, but I think this is the high-level, most important thing. If I had to teach someone about consistency and relational databases, if I had ten minutes to tell you something, this is it. Walk away knowing that isolation levels are a thing you should constantly make sure you're reminded of and familiar with when you're writing schemas for relational databases, and also that if you need to read then write in a relational database, you should most likely be using SELECT FOR UPDATE. Cool.
So let's talk about a less commonly used, but equally interesting and very relevant, class of system: lock services. That is what I call this class of software; although they were originally designed to be lock services, they have larger scopes these days. These are projects like etcd and ZooKeeper. So what are the guarantees here?
Or actually, let's first talk a little bit about what lock services are. In probably the mid-to-late 2000s, Google wrote a paper about a system they had built internally called Chubby, a distributed lock service. The point of the system was: we have distributed systems, we have a bunch of different applications, and they all need to coordinate, so they need some mechanism by which one of them can safely acquire exclusive access to some resource.

They need a lock, a distributed lock. That's very tricky, and it turns out, formally, that to solve that problem you need a linearizable system. So they wrote a paper on how they designed this distributed lock service, and ultimately we see projects inspired by it. ZooKeeper, I believe, is definitely inspired by that paper, though it doesn't implement it directly.
ZooKeeper implements its own unique algorithm called Zab, but then we also see systems like etcd, which is inspired by later computer science research around the same consensus problems in distributed systems. Ultimately these are linearizable systems, but as they got more and more mature, their authors decided: hey, linearizability is a really useful property, so we always have guarantees about whatever critical data we have in our distributed system. Let's save it all over there. We don't just need it for locks; we need it for more things.
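The core primitive a lock service exposes to clients can be sketched very small: a linearizable compare-and-swap on a key, out of which mutual exclusion falls out naturally. This is a toy, single-process stand-in for the replicated state machine that etcd or ZooKeeper actually maintain; the key and client names are illustrative:

```python
# Toy lock-service primitive: atomic compare-and-swap on a key.
import threading

class LockService:
    def __init__(self):
        self._kv = {}
        self._mu = threading.Lock()  # models the single serialized log

    def compare_and_swap(self, key, expect, new):
        # Atomically write `new` only if the current value equals `expect`.
        with self._mu:
            if self._kv.get(key) == expect:
                self._kv[key] = new
                return True
            return False

svc = LockService()

# Two clients race to acquire the same named lock.
first = svc.compare_and_swap("locks/leader", None, "client-a")
second = svc.compare_and_swap("locks/leader", None, "client-b")

assert first is True    # client-a wins the lock
assert second is False  # client-b must retry or watch for release
```

Real services add leases, watches, and replication via consensus, but the reason they must be linearizable is visible even here: if two clients could both observe the key as unset, both CAS operations would succeed and mutual exclusion would be broken.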
etcd is actually the core data store used for Kubernetes, and what's really interesting is that, while it is linearizable, there are lots of tricks you can use in the protocol and under the hood to optimize the consistency and the performance of such a system without impacting the external, user-facing appearance, the freshness of the data users are seeing. There are a lot of really interesting distributed systems tricks here; I'm just going to call them tricks because I don't want to dive too deep into them, and there are lots of variations of algorithms under the hood that are shortcutting things to make this really fast. There are also capabilities in a lot of these systems to relax
A
The
consistency
that
you
can
actually
use
with
the
system.
If
you
would
like
to
trade
that
in
order
to
get
higher
performance-
and
the
important
thing
to
note
here-
is
that
when
you
are
kind
of
like
relaxing
this
consistency
and
like
really
kind
of
playing
with
these
things,
these
are
the
authors
of
these
systems
and
they're
doing
it
kind
of
at
the
protocol
level
and
like
in
the
API
level.
A
So it's not really exposed to anyone on the outside consuming the system, so much as used to optimize around the guarantees that they can provide. There are some that provide APIs, so you can choose whether this needs to be a quorum read or not, for example, but by and large, lots of these tricks are internal to these systems. So the most critical, most strictly required strong-consistency data gets stored in lock services.
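That quorum-read-or-not choice can be illustrated with a toy model. This is just an illustrative sketch in plain Python, not etcd's or any lock service's actual implementation: writes reach only a majority of replicas immediately, so a read that consults any majority always sees the newest value (the two majorities must overlap), while a read served from a single replica can be stale.

```python
import random

class Cluster:
    """Toy 3-replica store: writes are applied to a majority immediately;
    the remaining replica lags, simulating replication delay."""

    def __init__(self, n=3):
        self.replicas = [{} for _ in range(n)]  # each: key -> (version, value)
        self.version = 0
        self.majority = n // 2 + 1

    def write(self, key, value):
        self.version += 1
        # Only a majority applies the write right away; the rest lag behind.
        for replica in self.replicas[:self.majority]:
            replica[key] = (self.version, value)

    def quorum_read(self, key):
        """Strong read: ask any majority, keep the newest answer. Any read
        majority intersects the write majority, so this is never stale."""
        sampled = random.sample(self.replicas, self.majority)
        return max(r.get(key, (0, None)) for r in sampled)[1]

    def local_read(self, key, i):
        """Relaxed read: serve from one replica; cheap, but possibly stale."""
        return self.replicas[i].get(key, (0, None))[1]

cluster = Cluster()
cluster.write("config", "v1")
cluster.write("config", "v2")
print(cluster.quorum_read("config"))    # always "v2"
print(cluster.local_read("config", 2))  # lagging replica: still None here
```

The point of the sketch is the trade: the quorum read pays for extra round trips to guarantee freshness, while the local read is fast but makes no such promise.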
A
So then, basically, what happened was the NewSQL revolution, right? And that eventually turned into what we all now refer to as distributed SQL. I think of databases in this space as CockroachDB and TiDB, but there are a couple more.
A
This era is really interesting because they basically looked at solving the problem of scaling out a relational database in a horizontal fashion: you can keep spinning up individual nodes and scale up the system without any replication lag, and you don't have to direct writes to one particular node. They're solving these traditional problems in scaling relational databases, and doing so by applying some of the research from the lock-service optimizations. So what happened was, the creators of these databases.
A
Looked at the research that had gone into the lock services and went: well, hey, the internal data in our database, the bookkeeping data we need to make sure we can actually scale these systems, we can basically store that using these lock-service techniques, using the consistency tricks those systems have developed, and that is going to make it so we can actually scale and provide our SQL systems.
A
You'll notice that, as part of this process, they're not actually passing any of that on to the end users. They're still providing the same isolation levels, transaction logic, select-for-update logic, all the things I mentioned previously about relational databases. Those guarantees are still here in these systems; they've just been made possible to expand into this new kind of auto-scaling world. But that's not to say that magically these distributed SQL systems are linearizable.
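Those preserved transactional guarantees can be shown in miniature. A minimal sketch using stdlib sqlite3 (a single-node database, not a distributed SQL engine, but the guarantee being preserved is the same one): a write inside an open transaction is invisible to a second session until COMMIT.

```python
import os
import sqlite3
import tempfile

# Two sessions against the same database file; autocommit mode so we
# control transaction boundaries explicitly.
path = os.path.join(tempfile.mkdtemp(), "demo.db")
writer = sqlite3.connect(path, isolation_level=None)
reader = sqlite3.connect(path, isolation_level=None)

writer.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance INTEGER)")
writer.execute("INSERT INTO accounts VALUES (1, 100)")

writer.execute("BEGIN")
writer.execute("UPDATE accounts SET balance = 50 WHERE id = 1")

# The reader still sees the pre-transaction value: the update isn't committed.
before = reader.execute("SELECT balance FROM accounts WHERE id = 1").fetchone()[0]

writer.execute("COMMIT")
after = reader.execute("SELECT balance FROM accounts WHERE id = 1").fetchone()[0]

print(before, after)
```

Distributed SQL systems keep exactly this kind of contract while spreading the data across nodes; the consensus machinery underneath is what makes that possible.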
A
In fact, of the examples I've given here, CockroachDB and TiDB, neither is linearizable. So really, the major beneficiaries of the consistency being used here.
A
Are the database admins, the operators, the SREs running these systems, because they're able to scale out these relational databases, using the more effective kind of consistency only for the core data that needs it, but not for any of the end-user, application-developer data. So, cool: that unlocked a bunch of new capabilities for the SQL databases, but the end users themselves didn't necessarily get anything better. And this is kind of where it gets less well-defined.
A
What I would say is the interesting new era of databases, which I'm calling "ad hoc" here, though I think of these as flexible-consistency systems. I use two examples here: Cosmos DB and DynamoDB, which are foundational data stores at Azure and AWS respectively.
A
Dynamo is actually notorious for having been a paper published many, many years ago, probably mid-to-late 2000s, and what has actually happened is that the Dynamo described in that paper is not at all like the DynamoDB that currently runs at Amazon. Which is why they're able to add capabilities to S3, right, to actually add more consistency: because now systems like this expose consistency in the end-user API.
A
So the folks consuming these databases can, on the fly, choose what level of consistency they want for the data in the response.
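DynamoDB is a concrete example of that per-request choice: its documented GetItem API takes a ConsistentRead flag. A hedged sketch that only builds the request shape as plain dicts, with no AWS client involved, and with a made-up table and key for illustration:

```python
def get_item_request(table, key, strongly_consistent=False):
    """Build a DynamoDB-style GetItem request. Consistency is chosen per
    request: ConsistentRead=False (the default) allows an eventually
    consistent read, which is cheaper and faster but may return stale data;
    ConsistentRead=True forces a strongly consistent read."""
    return {
        "TableName": table,
        "Key": key,
        "ConsistentRead": strongly_consistent,
    }

# Hypothetical table and key, for illustration only.
fast = get_item_request("Orders", {"OrderId": {"S": "42"}})
fresh = get_item_request("Orders", {"OrderId": {"S": "42"}},
                         strongly_consistent=True)
```

The same two requests against the same item can legitimately return different answers right after a write, and that is the caller's explicit choice, not a surprise.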
So if we touch back on the other systems: lock services are very strict, strong consistency no matter what you do. And then relational databases, you don't have any options there beyond the isolation level and the select-for-update mechanisms, but that is all very, very domain-specific, and it's still non-obvious.
A
There's no way in those systems to say: treat this particular glob of things, this whole operation I'm going to perform, with this amount of consistency. So it's very, very different from what we previously had. This is actually what I would describe as unifying all the benefits I just talked about, because now the end users are more in control of the consistency of their data. They get to pay for exactly what they use in terms of the performance cost.
A
When things need to be consistent, they can actually slow things down just for that part, and then for the things that don't need to be consistent, they can take advantage of the optimizations and really go for it. And what's interesting here is that SpiceDB, the database that my company builds, is an example of one of these ad hoc systems: users, per request, can specify the consistency they need, and there is a default should they not specify one.
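To make that concrete: in SpiceDB's v1 API, each CheckPermission request carries a Consistency message, one of minimize_latency, at_least_as_fresh (relative to a ZedToken returned by a prior write), at_exact_snapshot, or fully_consistent. The sketch below just builds request shapes as plain dicts rather than using the real client library, so treat the exact structure as illustrative:

```python
def check_permission_request(resource, permission, subject, consistency=None):
    """Build a SpiceDB-style CheckPermission request. If the caller doesn't
    choose a consistency, fall back to a default (minimize_latency),
    mirroring the per-request model described in the talk."""
    return {
        "resource": resource,
        "permission": permission,
        "subject": subject,
        "consistency": consistency or {"minimize_latency": True},
    }

# Fast path: the answer may be computed from slightly stale relationships.
fast = check_permission_request("document:readme", "view", "user:alice")

# Read-after-write: at least as fresh as the ZedToken from a prior write
# (the token value here is a placeholder).
fresh = check_permission_request(
    "document:readme", "view", "user:alice",
    consistency={"at_least_as_fresh": {"token": "zedtoken-from-write"}},
)
```

The ZedToken variant is the interesting one for permissions: it lets a caller say "at least as fresh as my own write" without forcing a fully consistent, and therefore slower, check on every request.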
A
But what is really, really interesting is that this is an opportunity for a lot of user-experience research. Because, at least in our domain, since we're specifically handling authorization data, we can actually tell users what is going to happen at the different levels of consistency. Because we know more about their data, because we know the domain they're operating in, we don't have to explain these distributed-systems research concepts to every user of the system.
A
Instead, we can make it obvious that, say, if you check this permission with this option, it's going to be based on a time where the time is rounded for performance reasons. So we can actually draw on analogies that folks understand about our domain, rather than trying to teach them very low-level.
A
Concepts. I'm actually super excited for more of this. I would really like to see APIs where you can sneak the concepts of consistency into them, so that folks, by choosing the proper APIs while thinking about their use case, no longer think about consistency: they think about their use case, and then, by virtue of picking that, they have just chosen the API with the proper consistency for what they need.
A
So it's very, very cool stuff. I'm really excited about this space generally, and it wouldn't be possible if not for an approach where we start with a very consistent world view and then allow folks to relax that over time. Because I feel like when you work with the relational database model, where it is very relaxed and then you layer on more and more strictness, when you do that, you just have to know too much about both the domain details and the internal knowledge of the data store you're working with, or the internal knowledge of the system.
A
A
This is a very deep subject, and I just want folks to know that I have only scratched the surface here. People spend their whole professional careers researching this stuff. That's a cuckoo clock! That's how I know I hit my number for the time for this.
A
This webinar. But yeah, consistency is a super, super important subject. I still see very seasoned developers overlooking it or downplaying it in their systems, especially when they make architectural decisions for the overall design of their system, and it really, really shouldn't be overlooked. It's one of the most important things you need to discuss and decide when you're designing something, because, excluding very few occasions, we have previously not worked with a lot of systems that give you the flexibility to actually adopt more consistency if you need it.
A
There are many large companies that folks in industry think are the bastions, the pinnacle, of engineering, that get to hire experts from all around the world. They pay well because they're a giant company building a really cool product. But a lot of these companies have huge problems because of the way they grew: they were incapable of dealing with consistency.
A
They didn't have the technology at the time to deal with the consistency of their systems, and they have outages as a result, they have data loss as a result, and they are stuck in a state where they're forever trying to migrate critical data, or data that has different consistency requirements, into new systems. I do not want to see more developers in that scenario, especially since we've made so much progress on this subject as an industry. And with that, I'd like to thank everyone for watching.
A
The link is right in front of you, and it's not just exclusively for SpiceDB users; it's an open-source community where folks that care about distributed systems can talk about all kinds of research and practical usage in industry. So thanks for your time.