Description
Speaker: Rick Branson of Instagram
It's upsetting whenever we hear that we can't have things that we want. It'd be nice to live in a world where it was possible to have things like ACID transactions, uniqueness guarantees, and sequential counters that were globally and always available. What makes this worse is that when we're told we can't have them, people just wave their arms around in the air and shout things like "CAP theorem." In this talk, I'll walk through some of these "ponies" and demonstrate the points at which things start falling apart with practical, real-world examples.
Good afternoon, how are you guys doing? Good? All right. My name is Rick Branson, and I'm an infrastructure engineer at Instagram. That means I'm in sort of a DevOps role: I read a lot of code, but I also, you know, keep things running. I'm giving a talk today called "Why You Can't Have That Pony." The idea with this is that I feel like a lot of people talk about why you can't have things like distributed transactions, but they just kind of throw out terms like "CAP theorem," and there's not a lot of stuff out there that I feel has a really decent description of actually walking through the protocols and the way things work, and talking about where you hit these walls.
You know, there's this FoundationDB thing, and they have this paper that supposedly says the CAP theorem doesn't let you do this, or does let you do this, or whatnot. You kind of read it and you're like, okay — it says that as an ACID database, during a network partition FoundationDB must choose consistency over availability. All right, that seems fine. Then you get a little deeper and it says: this does not mean the database becomes unavailable for clients. When multiple machines or data centers hosting a FoundationDB database are unable to communicate,
some of them will be unable to execute writes. That's an interesting interpretation of what availability means. It kind of shows you that everybody's got their own definition of it, and that it differs from CAP availability when you actually read the paper — they even admit it in their white paper. But ultimately, I feel like they're just trying to draw you into this discussion, and obviously they're trying to sell their product. So, let's talk about these ponies.
I mean, these FoundationDB guys are right, and there have been many people talking about this. A lot of the NewSQL people are doing it — VoltDB and some of the others are, you know, sort of trying to spin these words into saying that they have high availability. But it kind of boils down to this: the data just kind of gets stale if you can't synchronize it. If you can't talk between two systems that are trying to synchronize data with one another, well, they're going to fall out of consistency. And so CAP is misunderstood.
A lot of people think it's this "pick two of these three" thing, when in reality — I don't know where this came from; I don't know who came up with this idea of "pick two." I actually spent a lot of time last night trying to dig up a link where somebody has something legitimate behind it other than just asserting it. But really, when it comes down to it, it's just "pick one." So this is your choice, and I think the consistency part is a little misnamed. I'd actually prefer the term consensus, because "consistency" confuses it with ACID consistency, which is something completely different.
This has to do with whether things are synchronized between multiple nodes. So CAP tells us we lose some of these attributes if we want availability. Strict resource allocation — which is things like bank accounts or flight booking. Compare-and-swap — "I want to update where…": for instance, in a SQL query, I want to set this value where this value is equal to something. You can do something like that, but you can't guarantee the result — for instance, if you increment a counter, you can't guarantee that you're going to be the only one winning that result. And of course uniqueness guarantees — global uniqueness guarantees fall into the same bucket.
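As a minimal sketch of the compare-and-swap idea (the store here is just a Python dict; the names are illustrative and not any particular database API):

    # Compare-and-swap: the update only applies if the current value still
    # equals what the caller last read. On a single node this is easy; across
    # replicas that can't talk to each other, two clients can both pass the
    # check against stale copies, which is exactly the guarantee an
    # availability-first system can't give you.
    def compare_and_swap(store, key, expected, new_value):
        if store.get(key) == expected:
            store[key] = new_value
            return True   # this caller "won"
        return False      # somebody else changed it first

    store = {"seats_left": 1}
    print(compare_and_swap(store, "seats_left", 1, 0))  # True: we got the seat
    print(compare_and_swap(store, "seats_left", 1, 0))  # False: already taken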
You have to say, for instance, if you're handing out usernames — if you're Twitter and you're handing out usernames — you want those to be unique, and an AP system simply can't guarantee that. Sorry, I'm confusing myself; this is how confusing this stuff gets. The AP system can't guarantee that you have that uniqueness, and the CP system can. However, the CP system is subject to availability problems. And there is hope — there are some new things being developed in Cassandra and some new research that's coming out — but I'll talk about that towards the end.
Let's talk a little bit about this resource allocation problem; I think it's kind of the easiest one to understand. It covers things like hotel booking, financial accounts, warehouse inventories — any time you have a restricted pool of resources that you're trying to dole out to people in limited amounts. And let's talk about it in terms of the Cassandra model, which is, you know, asynchronous replication — and this is very simplified. The Cassandra model provides availability over consistency. Yes, it is tunable, but in order to really exploit the availability properties, you have to use a lower consistency level.
So say we have a client. We get a balance, and it comes back as a hundred dollars. We have two replicas, each one storing the balance. One client says, "I want to deduct $75," and the first replica says okay. At the same time, before it has replicated that change to the other replica, the other replica gives that old balance to the second client, so it tries to do the same deduction — and the result is that both of these end up with a negative balance. Now, yes, I realize that banking is a sort of tired example for this stuff, because this happens in real life — most everybody in this room has probably experienced an overdraft fee because of some consistency problem that exists within a banking system — but this is purely for illustration purposes.
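A minimal sketch of that lost-update race, using two plain Python dicts as stand-ins for asynchronously replicated copies (purely illustrative; this is not Cassandra code):

    # Two replicas that replicate asynchronously: a write lands on one replica
    # first, and the other still serves the old balance until replication
    # catches up.
    replica_a = {"balance": 100}
    replica_b = {"balance": 100}

    # Client 1 reads from replica A and deducts $75.
    seen_1 = replica_a["balance"]            # 100
    replica_a["balance"] = seen_1 - 75       # 25 on A; B still says 100

    # Before replication happens, client 2 reads from replica B and also deducts $75.
    seen_2 = replica_b["balance"]            # 100 -- stale
    replica_b["balance"] = seen_2 - 75       # 25 on B

    # The replicas will later converge, but $150 of deductions were approved
    # against a $100 balance: the account effectively went negative.
    print(replica_a, replica_b)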
So how can we do better? How can we fix this if we do want to choose consistency over availability? What does that look like when we start digging into it? Here's a system that's got two replicas.
It uses a protocol called two-phase commit. Two-phase commit involves a coordinator node, which is the system that serializes all of the writes — it aligns every read and write so that they're all in order and everything is hunky-dory. It kind of looks like this: a client would say "deduct $75," and the coordinator might say, "okay, give me a second, I need to conduct this transaction." It would send messages to the different replicas, and they would say, "okay, I've prepared this transaction."
So if another client came in and tried to do the same thing while this was running — the transaction from the client on the left-hand side is still pending — and the one on the right-hand side tries to send, the coordinator will block it; it will not allow that to go forward. That's the job of the coordinator. The client on the right-hand side is still pending, but now, since we got those messages back, we've agreed that yes, the client on the left-hand side should be able to commit this, and so we agree that the new value is $25. And then the coordinator can say to the other client, "okay, I failed that, because I have a consistency requirement to keep this balance above zero dollars." Pretty simple. It does require all the replicas to respond, which is unfortunate.
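A stripped-down sketch of the two-phase commit flow just described, assuming in-memory replicas and a "balance must stay non-negative" rule (illustrative only; real 2PC also logs every step to stable storage):

    # Phase 1: the coordinator asks every replica to prepare (vote).
    # Phase 2: only if *all* replicas voted yes does it tell them to commit;
    # otherwise it tells them to abort. One unreachable replica blocks everything.
    class Replica:
        def __init__(self, balance):
            self.balance = balance
            self.pending = None

        def prepare(self, amount):
            # Vote yes only if the deduction keeps the balance >= 0.
            if self.balance - amount >= 0:
                self.pending = amount
                return True
            return False

        def commit(self):
            self.balance -= self.pending
            self.pending = None

        def abort(self):
            self.pending = None

    def two_phase_commit(replicas, amount):
        if all(r.prepare(amount) for r in replicas):
            for r in replicas:
                r.commit()
            return "committed"
        for r in replicas:
            r.abort()
        return "aborted"

    replicas = [Replica(100), Replica(100)]
    print(two_phase_commit(replicas, 75))   # committed: balance is now 25 everywhere
    print(two_phase_commit(replicas, 75))   # aborted: it would go below zero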
It means that, basically, if I can't contact one of the replicas, I'm in timeout mode. It's almost like using Cassandra with a consistency level of ALL — one failure will cause you to have a problem. The coordinator is also a single point of failure and a bottleneck: whatever the coordinator is in charge of coordinating, it's the sole system through which you can both read and write data, which is a problem. If the coordinator dies, you're kind of left in this weird state.
Unlike in Cassandra — where, because operations are idempotent and they do converge, you can retry things — with this system you can't make sense of the state without sort of, you know, reading everything. You just have to say, "I don't know where I am," and you're in this weird sort of no-man's-land between transactions.
You can use a standby coordinator, for instance, and replicate the coordinator's log to it — this is, for instance, what they're working on doing with Hadoop, with the NameNode's well-known single point of failure: synchronous replication of the coordinator log. And this is taken exactly from the two-phase commit Wikipedia article: basically, you assume that you have stable storage, that no node crashes forever, that data is never lost or corrupted, and that any two nodes can communicate with each other — none of which happens in real life.
You can do better, though. There's a system called Paxos, and I'll try to explain Paxos in the simplest way possible, because I think most explanations out there are way overcomplicated. Granted, it's still consistency over availability. With this we have to have three replicas — Paxos requires it; essentially, you have to have a tiebreaker any time.
So, for instance, if this other client comes in and tries to propose something, the node will hold that until the first round finishes. It'll say, "okay, I've already promised this other node that this account is locked," at least until maybe a timeout or whatnot, and it'll queue that operation as pending. The next step: once the proposer has received the promises, it'll basically send out an accept and say "this value is $25" — that is, after I take the $75 out of the $100 balance, I have $25 left, so I send an accept out. Now, notice we've got two pending requests to both of these nodes, and A is still the leader in this case — everybody's locked, waiting for this transaction led by A to finish. And then, after this step, we acknowledge the acceptance of that transaction.
Everybody broadcasts to everybody, so that everybody knows that everybody else has accepted this — we've basically gone through that round. The value in this is that it's very fault-tolerant. Obviously, it's a lot of extra work to go through as well, and most people don't use Paxos for bank accounts — they use it for leader election and things like that, and then they assign a master and so on.
But this is, again, for demonstration purposes. And then it would simply say "fail," because once it got through that round, it would know that there wasn't seventy-five dollars in the account — there was only $25 — so it would be unable to deduct that from the balance.
It's somewhat fault-tolerant to node failure. Like I said, it's quorum-based, so it can say: if I have a majority of nodes — I've got A and B already — I can assume that, because I have a majority, this transaction can go through; it allows the round to continue. Also, this property of broadcasting the results to everybody in the round — these are technically considered the learners — means that you're also sure you're not going to get conflicting values at the end of this, and you're not dependent on one node, for instance the proposer, which could fail in the middle of this. So we basically broadcast to the rest of them so that everybody understands the value and everybody agrees on that value.
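Here is a very reduced sketch of a single Paxos round in the spirit of this walkthrough — prepare/promise, then accept by a majority. It ignores competing proposers and retries, and it is an illustration of the idea rather than a production implementation:

    # Single-decree Paxos, heavily simplified. A value is chosen once a
    # majority of acceptors accept a proposal; learners then hear about it.
    class Acceptor:
        def __init__(self):
            self.promised = 0        # highest proposal number promised
            self.accepted = None     # (number, value) accepted so far

        def prepare(self, n):
            # Promise to ignore anything lower than n; report any prior acceptance.
            if n > self.promised:
                self.promised = n
                return ("promise", self.accepted)
            return ("reject", None)

        def accept(self, n, value):
            if n >= self.promised:
                self.promised = n
                self.accepted = (n, value)
                return True
            return False

    def propose(acceptors, n, value):
        # Phase 1: need promises from a majority (this is the "tiebreaker").
        replies = [a.prepare(n) for a in acceptors]
        promises = [r for r in replies if r[0] == "promise"]
        if len(promises) <= len(acceptors) // 2:
            return None
        # If any acceptor already accepted something, we must carry that value forward.
        prior = [acc for _, acc in promises if acc is not None]
        if prior:
            value = max(prior)[1]
        # Phase 2: ask for acceptance; the value is chosen once a majority accepts.
        votes = sum(a.accept(n, value) for a in acceptors)
        return value if votes > len(acceptors) // 2 else None

    acceptors = [Acceptor(), Acceptor(), Acceptor()]
    print(propose(acceptors, 1, "balance=25"))   # 'balance=25' is chosen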
It's still subject to CAP, though. At first everybody gets excited — they see that Paxos survives node failure and they're like, "oh, this is great" — but the problem is that you can still have network partitions, and you still have to choose which side of that network partition you're on. And network partitions are not always just two clean sides; they can be complicated. You can have some nodes that can reach others and some that can't, and you can sort of end up with this multifaceted, weird, diamond-shaped network partition.
So, in the simplest case, node A is trying to propose and it simply can't talk to the other ones. Anything on node A will fail at that point — and even reads will fail at that point, because in a strictly consistent system you can't guarantee that nothing's changed on the other side of that network partition. So you do lose, you know, your strict resource allocation, like we talked about — banks, hotel booking, things like that — compare-and-swap, uniqueness guarantees.
Now, that doesn't necessarily mean that going with strong consistency is the best business decision, but those are the things that you simply can't get around from a purely technical perspective. And I think if I could reduce these down to anything, it's basically the ability to pick a winner. For instance, with uniqueness guarantees, somebody has to win a username.
You can't have two people with the same username, if that's your business requirement. I'm sure there might be some intrepid souls out there that allow that to happen and then, you know, lottery out the name or decide based on some kind of heuristic who gets it if there's a conflict, but I think most people would just like one person to have a given username. So consistency is something that's important in a system like that.
Can we still get ACID, though? I didn't necessarily put ACID transactions in that list, and I think that's because there's a bit of a misconception that you can't have these ACID properties in an eventually consistent system, even with distributed transactions. You definitely can within a row — think of a Cassandra row, in the sense that it lives on a single node or a single set of replicas. You can get things like durability and atomicity through the commit log.
The commit log ensures that once something's been written to it, it'll continue to get replayed — if the process crashes, it'll get replayed — and you get durability through that. That's really your A and D. You got isolation in 1.1: now, if you do different changes within a batch in Cassandra, you'll either see those changes or you won't. Isolation being: if you add a thousand columns in one batch, you're either going to see the thousand columns or not — you're not going to see 500 of those columns get inserted. As for consistency, I mean, maybe — it's very loose.
Cassandra is obviously not like a really strict, SQL-like system where you have these really complex constraints. There is some constraint checking you could potentially add on to a system like Cassandra — things like "I want the value to be within these ranges" — as long as it's not something like a counter, where you can have incremental operations. As long as it's, say, a string — "I don't want the string to exceed 255 characters" — and as long as you're only replacing that string, for instance, you can guarantee that.
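A tiny sketch of that kind of node-local check — a constraint that can be validated from the incoming write alone, so it never needs agreement with other replicas (the limit and function names are made up for illustration):

    # "This string must be at most 255 characters" can be enforced locally on
    # each replica for replace-only writes, because the decision doesn't depend
    # on what any other node has seen. "This counter never goes negative" can't,
    # because the answer depends on increments that may only exist elsewhere.
    MAX_LEN = 255

    def validate_replace(new_value: str) -> bool:
        return len(new_value) <= MAX_LEN

    print(validate_replace("ok"))          # True
    print(validate_replace("x" * 300))     # False -- reject before writing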
So what about cross-row or distributed transactions? This is kind of the Holy Grail. It's a little more interesting, I think, than "oh great, I have a row." It's useful for things like a graph database, where you want to track both the outgoing side of the node and the inbound side of the node for query purposes — it's really good to have some of those guarantees around a distributed transaction, making sure that the system will follow through with those writes.
Consistency? Probably not, really. When we think about consistency, we think about foreign key constraints and global consistency — global uniqueness requirements and things like that — and I'm going to go ahead and say, and this is my opinion, that it's probably not really feasible. It is in some ways, but again, it would only be for something like "this string doesn't exceed 255 characters" — something that's relatively easy to provide because it doesn't require information from other nodes.
Durability doesn't really apply per se — maybe distributed durability would be writing to multiple nodes — but really, ACID is kind of a poor choice of term for a distributed transaction. We'll kind of muscle through it anyway. In 1.2 they added atomic batches. I was talking to Matt Dennis, who's with DataStax, and yes, you could have accomplished this before, but it's nice that Cassandra takes care of it for you and actually deals with the edge cases.
It kind of works like this: you have a client, and it sends a batch to one of the nodes in the Cassandra ring. That node will actually replicate that batch log to another node in the ring before going forward, and then it'll do the write — write out that batch to all the nodes that it needs to write it to.
The main point of this is that it resists coordinator death. That's the hardest part of implementing this: if the coordinator dies — but let's say you've already sent the batch from the client — how do you guarantee that it didn't just fail in the middle, leaving you with partial writes and things like that?
This is really useful if, for example, you're writing indexes of your data — you have a main column family that has all of your data in it, and then you want to query it in different ways. That's the Cassandra way to do your data modeling, and it's a lot harder if you have no way of knowing that the transaction is going to finish all the way out.
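For reference, a logged batch from the DataStax Python driver looks roughly like this; the keyspace, table, and column names are made up for illustration, and only the BEGIN BATCH / APPLY BATCH construct is the Cassandra 1.2 feature being described:

    from cassandra.cluster import Cluster

    # Hypothetical schema: a "photos" table plus an index table queried by user.
    session = Cluster(["127.0.0.1"]).connect("my_keyspace")

    # A logged batch: the coordinator writes the batch to a batch log that is
    # replicated to another node before the individual writes are applied, so
    # either all of these writes eventually happen or none of them do -- even
    # if the coordinator dies partway through.
    session.execute("""
        BEGIN BATCH
            INSERT INTO photos (photo_id, owner, caption) VALUES (1234, 42, 'sunset');
            INSERT INTO photos_by_user (owner, photo_id) VALUES (42, 1234);
        APPLY BATCH
    """)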
I was actually pointing at the wrong thing, clicking on the slide — kind of a rookie mistake; I really meant to point at isolation there. Basically, there's this paper that came out of Berkeley. There's this guy named Peter Bailis, who actually wrote the predict-consistency code, I think, in the newer version of Cassandra, which can do a prediction of exactly how eventually consistent your data is and sort of predict when you'll reach the consistent state. He's really smart.
Basically, the guarantee is that data won't appear — you can't read it — until it's live on all the replicas, and that's cross-row. So if you need to write, say, two different sides of the same data set, but you don't want any of it to appear until everything is done, this is a really valuable concept. Again, it's still a concept; the paper came out something like a month and a half ago, so people are still noodling on it.
You know, it's defined among a replica group — so is that a data center? I don't know. It'll be interesting, once this paper starts to make the rounds within the Cassandra community, to hear how people think about implementing this in the system, if at all. I'd be really excited to see it. It does still require convergence, though. Ultimately, the property of Cassandra that makes it able to survive outages and repair itself is the ability to take conflicts and always converge them reliably. You are never in a state in Cassandra where you don't know how to get back to a stable state — unlike, you know, the split-brain issues you have with master-slave systems and such. So, cool. Hopefully this didn't put you guys too much to sleep.
I had a feeling that would be the first question. So, we use Cassandra for high-write, low-read-rate data. Most of our data — our social data, like likes and comments and information about photos and things like that — is stored in Postgres, just a sharded Postgres setup. It's been that way since sort of the beginning. There have definitely been talks of moving things over, but nothing's really solidified, because you only have a certain number of hours in the day to get things done, and as the data size grows it becomes definitely non-trivial to migrate data over and to do it in a way that makes sense. We're definitely growing our usage of it, though; it's just a matter of finding the right use cases and pairing it with them.
Specifically, the types of data we use it for are things like security logs, auditing, spam fighting — basically stuff that we want to make sure we have and keep, where we don't worry about performance problems or outages, and it's very low maintenance for us. Actually, we had a machine die on us yesterday, and I kind of didn't even realize it until I looked back through my pages and saw it — everything was fine even before replacing it.
No, most of the data we have there is designed to be permanent, unless we do a permanent deletion of the data — like, for instance, if a user deletes their account, we have to go through and scrub all of that data — but the idea is to keep this data around forever. So if you have any more questions about that, I'm happy to talk about it as well.
So, quorum in Cassandra does guarantee that, for a given row, if you read and write with quorum, you'll basically have read-your-writes consistency: if you write something, you'll read it back. What it doesn't guarantee is that if you have multiple rows inside of a batch that are on different nodes — which is very likely, given the distribution of a cluster — you'll actually see those changes in an isolated fashion.
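The per-row quorum guarantee being described is just the replica-overlap condition; a minimal sketch of the arithmetic (illustrative, not driver code):

    # With N replicas, writing to W of them and reading from R of them
    # guarantees the read overlaps the write whenever R + W > N. QUORUM reads
    # and writes (a majority on both sides) always satisfy this for a single
    # row, which is the "read your writes" behavior described above.
    def read_sees_write(n_replicas, w, r):
        return r + w > n_replicas

    print(read_sees_write(3, 2, 2))  # True:  QUORUM + QUORUM on RF=3
    print(read_sees_write(3, 1, 1))  # False: ONE + ONE can miss the write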
So say you make updates to two different rows that are on completely different partitions. The innovation with HAT is that it basically allows you to only see those updates when everything is finished with that transaction. It's truly transactional: you say begin, you submit things to push to the cluster, and then you commit — and then the changes would only be seen after everything has percolated, so everybody gets all of it instead of just certain rows.
ZooKeeper is built on, basically, a protocol called ZooKeeper Atomic Broadcast, which is built on top of Paxos. The Paxos that I explained was very, very simplified — it's sort of the simplest version to implement; there are several versions that are more complicated and can avoid some latency problems. ZooKeeper, specifically, only really…
So the question is: how do you model your data when you can have, for instance, a network split? It really is extremely use-case dependent, and it depends on your business decisions too. This stuff basically has to be communicated all the way up, because if people don't understand that you have to make that fundamental trade-off, they're going to make bad decisions — they're going to ask you for things like one hundred percent perfect uptime with global consistency, which is impossible.
So it's important to educate people in your business about how these things work, and then try to come up with contingency plans for what to do when there are problems like this. A good example is when Amazon oversells their inventory. Now, I'm not sure whether that's because of eventual-consistency problems or not, but they've certainly done it before — I've been there. They send you an email and say sorry, and here's a five-dollar credit or something like that.
I mean, that's a way to solve that problem, and it becomes sort of an actuarial problem. You kind of have to create a formula: if we're down for this long, how much is it costing us, versus if we maybe oversell a few things or have to give people five-dollar gift cards? It's kind of that choice.
We use S3, and we have experimented with using Cassandra for it, but I think S3 has just been perfectly fine for us. It's actually been really kind of incredible how well it's worked for us, so there's not a lot of motivation — the motivation would be to fix things that are breaking, broken, or sub-optimal, and S3 is definitely not one of those things. But I have sort of experimented with it.
Yeah, so we use flash exclusively for our Postgres. We don't use it for Cassandra, because Cassandra is, again, high-write, low-read data. We push tens of thousands of writes — probably more now — through Cassandra, whereas we may do a few hundred reads per second. So really, the write performance is great; we just don't need flash for that — that workload is easy without it.
As far as the flash drives for Postgres go, I don't think we've done any benchmarking that would make sense to anybody else. It's worked really well for us. Basically, with a relational database, in order to get any kind of reasonably good performance out of it, you need a giant drive array, or you need a lot of RAM, or you need SSDs. So, I mean, that's it.
Nope.

I've heard a lot of suggestions about how it is possible to achieve multi-row consistency — mostly theoretical mechanisms. So do you care about that? How do you do it — do you use ZooKeeper as a transactional layer on top of Cassandra? What does Instagram do?
Basically, if you read the relationship in one direction, you rewrite it in the other direction, checking to make sure it's there — similar to how, in Cassandra, you can set a ten percent read repair for replicas — and this would actually be across different rows, not just within a partition. Does that make sense at all? Maybe — okay, maybe we can discuss this offline.
Hey, hi — first time, long time, big fan. Paxos over high-latency links, or if you're doing it multi-region — I don't know, yeah.
I mean, it can — it depends. Paxos is kind of more an idea than, like, an implementation of an idea. There's Multi-Paxos and Generalized Paxos, and then there's ZooKeeper, which is its own version of it. I would say it's generally recommended not to run ZooKeeper over unreliable WAN links. I mean, the whole idea is that you can resist small sets of network failures, but ultimately,
if the link between your two data centers goes down — which is much more likely than a failure inside of your data center — then you're in this state where part of your system is basically not very functional. It also gets weird, because you have to decide how to spread things out: if you have two data centers, how do you tie-break that, right? Some people don't have a third data center to always put something in, for instance.
Typically, though, people will set up sets of ZooKeepers in different data centers — if they're doing something like consistent configuration, they'll set up a ZooKeeper installation in each data center and basically use the local one in that data center. I think that's a good pattern.
They interleaved — they didn't guarantee atomicity — and then I did some research on that and found the problem was probably the client package we use, maybe an older version, which assigned a different timestamp within the batch. Okay, so that's a specific problem. Then I got to thinking: if we have two different data centers, or even two different JVMs, it's very possible to assign the same timestamp — then that will break atomicity, right?
It won't, because all that's happening then is that Cassandra's conflict resolution is sorting that out. All the atomicity guarantee says is that that batch of operations is applied, or it's not; the conflict resolution is actually a completely separate protocol on top of that. Basically, those writes will still happen, but then, once Cassandra actually performs the read, it'll see that one timestamp turns out to be higher than the other ones and resolve it that way. Yeah.
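A minimal sketch of that separate conflict-resolution step — last-write-wins on the column timestamp (illustrative only):

    # Each write carries a timestamp; when replicas hold different versions of
    # the same column, the one with the highest timestamp wins on read, and the
    # comparison falls back to the value itself on an exact timestamp tie, so
    # replicas still converge deterministically.
    def resolve(versions):
        # versions: list of (timestamp, value) seen on different replicas
        return max(versions)[1]

    print(resolve([(1700000000123, "alice"), (1700000000456, "bob")]))  # 'bob'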
Everything would still eventually reach the same result, though — that's the idea: every replica is eventually going to receive the same result through the anti-entropy system. It may not be now, it may not be within one millisecond, but it will be eventual.
Right, right — so you're asking whether, if one update sees a different update that was there before, given how those converge, they will still be applied in order? In a way — ultimately everybody's going to see the same set of updates; that's the idea with a Dynamo-based system.
Well, I would have argued that it should be expected — it may be surprising, but it is the way that it works. Eventually, the highest timestamp is going to win for that given column. And, like you said, there are ordering rules — the delete will get ordered before that — but again, that's only if that delete operation is there. That's why tombstones, for instance, stay around for ten days: it gives time for things to catch up eventually.
Have you ever run into anybody using — I know that sometimes in high-frequency trading systems they have special atomic clocks that you can hook into your system to get really tight bounds, you know, since clocks drift using NTP in a local data center — do you have any experience with that? No.
I think it's impossible to guarantee that timestamps always go forward without having global coordination. Obviously, something like the TrueTime API would be a way to do that effectively — the global coordination is effectively that all of those systems proceed in the same fashion. But even the TrueTime API they have in the Spanner paper, which actually talks about this exact problem in detail, can't necessarily guarantee that you'll end up —
well, it can't guarantee that you won't still have conflicting writes. You have to basically have a way to converge that data together, whether it's last-writer-wins using timestamps, using vector clocks, or using immutable data that doesn't actually have to converge. And Spanner, again, is still dependent on having open communication available — no partitions — for writes.
Actually, the stuff I was talking about earlier, the HAT paper, came out of the same research group that released this thing called PBS, which is basically a probabilistic way to determine the window of consistency based on network latency and, you know, statistics that you collect. Cassandra actually has a way to do this with nodetool now: you can ask nodetool to predict consistency, give it some parameters, and it will actually give you some kind of prediction. It's not perfect, it's not a guarantee, but it is probabilistic, and it can tell you the answer within a certain percentage of certainty.
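To give a flavor of what a PBS-style prediction does, here is a toy Monte Carlo sketch: given assumed write-propagation delays, estimate the probability that a read issued t milliseconds after a write sees the new value. This is an illustration of the idea, not the actual nodetool implementation, and the delay distribution is made up:

    import random

    # Toy model: a write reaches each replica after some random delay, and a
    # read at consistency level ONE picks one replica at random. PBS does a
    # much more careful version of this using latencies measured in the cluster.
    def p_consistent(t_after_write_ms, n_replicas=3, trials=10000):
        hits = 0
        for _ in range(trials):
            # Hypothetical propagation delays (ms) drawn from an exponential.
            delays = [random.expovariate(1 / 5.0) for _ in range(n_replicas)]
            replica = random.randrange(n_replicas)
            if delays[replica] <= t_after_write_ms:
                hits += 1   # the replica we read already has the write
        return hits / trials

    for t in (1, 5, 20):
        print(t, "ms ->", round(p_consistent(t), 3))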
One more — so, how do you feel about CAP and that probabilistic determination, versus the systems that say they have an ordered set of operations, once you take into account real disk failures and outages? Do you think probabilistic consistency on average actually does better than those systems that claim to be really consistent? When you think about disks getting corrupted and other scenarios, like bit rot, are eventually consistent — or distributedly consistent — systems better at that problem?
From that standpoint, how is the replication going to keep up with not just the most recent copy of the record, but versions of that record as it moves forward in time? Does that sort of change the whole ACID discussion about locking records, and locking a version of a record — and is there any consideration in this replication for that kind of topology?
Like, when we're talking about Cassandra, we're talking about records that are being kept at multiple nodes, right. So there's this whole other concept at play when we're talking about temporal data — versions of a record for an interval of time, right. So in that respect, a lot of the issues that arise around locking and ACID for that record come from also considering the version of the record that has to be locked internally in the engines.