Apache Cassandra Cassandra Summit 2013, 26 Jun 2013

From YouTube: C* Summit 2013: Eventual Consistency != Hopeful Consistency

Description

Speaker: Christos Kalantzis, Engineering Manager of Cloud Persistence Engineering at Netflix
Slides: http://www.slideshare.net/planetcassandra/c-summit-2013-eventual-consistency-hopeful-consistency-by-christos-kalantzis
This session will address Cassandra's tunable consistency model and cover how developers and companies should adopt a more Optimistic Software Design model.

A

Good afternoon, everyone. How about a hand for DataStax? What a nice lunch that was, right? Good, good!

A

Well, this afternoon we'll change pace a little bit and have more of a philosophical discussion about embracing optimistic design in your persistence layer, which I called "Eventual Consistency != Hopeful Consistency." So who am I? My name is Christos Kalantzis. I work at Netflix, where I manage the Cloud Persistence Engineering team, which is a fancy name for the Cassandra team. My Twitter handle is chris callan.

A

If you want to link in, there's my address, and I think the slides are going to go online, so you don't have to take note of it right now. So, let's get started with a recap of Cassandra's replication and consistency.

A

So Cassandra, as we all know, is eventually consistent, which means the data will get there eventually. But "eventually" is not a day from now. It's not a minute from now. It's not even a second from now. In most cases, it's going to happen in milliseconds.

A

Also, Cassandra has tunable consistency.

A

You can do ALL, which, with a replication factor of 3, means all three nodes need to get the write or, if you're reading, you have to read from all three nodes. That's not that great on latencies. QUORUM has kind of been accepted as the de facto standard. It's a good balance between availability and latency. Personally, I think it's meh.

A

What I really like is ONE, and we're going to talk a lot in the next half hour about why ONE is okay for your reads and your writes.
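As a rough illustration of the three levels mentioned here, the sketch below computes how many replica acknowledgements a coordinator waits for at each level. The function name `replicas_required` is hypothetical and only these three levels are covered; the quorum rule used is the usual majority, floor(RF / 2) + 1.

```python
def replicas_required(consistency_level: str, replication_factor: int) -> int:
    """Return how many replicas must respond before the request succeeds."""
    level = consistency_level.upper()
    if level == "ONE":
        return 1
    if level == "QUORUM":
        # A majority of the replicas: 2 of 3, 3 of 5, and so on.
        return replication_factor // 2 + 1
    if level == "ALL":
        return replication_factor
    raise ValueError(f"unsupported consistency level: {consistency_level}")


# With a replication factor of 3, as in the talk.
for level in ("ONE", "QUORUM", "ALL"):
    print(level, replicas_required(level, 3))
```

With a replication factor of 3 this gives one, two, and three replicas respectively, which is why the talk describes QUORUM as waiting on two nodes.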

A

So let's take a trip back in time. Back in the early 2000s, we had one master database, and we needed to scale reads out. So what did we do? We had a bunch of slaves, either piggybacking or replicating right off the master. As you can see in the diagram, writes went to the master and updates went to the slaves. An application would read from the slaves and write to the master. Well, that's an eventually consistent architecture.

A

But sometimes transactions got lost in replication, and we were fine with it.

A

Sometimes we didn't even know it. There wasn't a repair function. I tried that yesterday in the latest version of MySQL, and "repair database" didn't work. It didn't repair my cluster of masters and slaves. So we trusted it, and there are a lot of big-name companies out there that still trust it.

A

So when I talk to a lot of these companies that are thinking about going to Cassandra, they've got a lot of concerns, and the concerns are usually the same. Here are the top three.

A

"I want high consistency in my reads and writes, just like I have in my RDBMS master-slave cluster." Well, you never really had it. So if you're concerned about it in Cassandra, you should really be concerned about it in your current architecture. "I want my DB to catch integrity issues."

A

That's a funny one, because if anyone's a data architect out here, we've been turning off foreign keys on our databases for years to reduce latencies and increase our write throughput.

A

As a matter of fact, with modern MVC frameworks like Rails and Grails, when you use their internal domain object creation tools, they're not creating foreign keys, even if you're creating some kind of relation between the objects. And the final one, which is one of my favorites and which we'll touch upon for a little bit, is: "Can I trust that Cassandra will replicate my data when writing at CL ONE?"

A

Well, I had that question asked about a month ago, actually within Netflix. One of the application designers said, "I write at QUORUM and I read at QUORUM, because do I really trust it? With a replication factor of three, if I write at ONE, will it actually go to all three replicas?"

A

So I said, "You know what, let me test it." So I created a multi-data-center Cassandra cluster. It just happened to be on 1.1.7, with 48 nodes in each data center. I put load on the cluster. This is wordy; I've got a diagram which summarizes this.

A

I put 100,000 total operations per second on it, 50,000 in each data center, and then I wrote a million records.

A

I wrote a million records, as you can see on the bottom right, and then I read the same million records, all at consistency level ONE, both the write and the read, and they all came back. That means not only did they replicate within the data center, they actually successfully replicated to the other data center, and I was randomly hitting nodes.

A

So I wasn't always hitting just one node. So you can trust it. Well, no test is good if you only run it once, so we ran it five times, and each time we had the same result: all records were read back successfully. So now we know we can trust Cassandra. And now we know that our current architectures aren't as highly consistent as we thought they were.
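The write-then-read check described in this test can be sketched roughly as follows. This is not the harness used at Netflix: `verify_round_trip` is a hypothetical helper, and an in-memory dictionary stands in for the cluster so the sketch runs on its own. In a real test the `write` and `read` callables would issue requests at consistency level ONE against randomly chosen nodes.

```python
from typing import Callable, Optional


def verify_round_trip(
    write: Callable[[str, str], None],
    read: Callable[[str], Optional[str]],
    record_count: int,
) -> int:
    """Write record_count records, read them all back, return how many are missing."""
    for i in range(record_count):
        write(f"key-{i}", f"value-{i}")

    missing = 0
    for i in range(record_count):
        # A record counts as missing if it is absent or holds the wrong value.
        if read(f"key-{i}") != f"value-{i}":
            missing += 1
    return missing


# Stand-in store (an assumption for illustration, not a Cassandra client).
store: dict = {}
missing = verify_round_trip(store.__setitem__, store.get, 1000)
print("missing records:", missing)
```

Repeating the call several times, as the talk describes doing five times, only means invoking `verify_round_trip` in a loop and checking that every run reports zero missing records.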

A

So why is it that we're always trying to use QUORUM or ALL when designing a schema or an application using Cassandra?

A

So this is where we're going to get into optimistic versus pessimistic design. In a pessimistic design, if you're going to do QUORUM writes, QUORUM reads and, God forbid, ALL, you're designing with high consistency, and you're punishing your users 99.9 percent of the time. And I'm being generous with that 99.9; as you saw, 100 percent of the time it was fine. But for argument's sake, let's say 99.9 percent of the time it's going to come back successfully. Higher consistency, as we all know, equals higher latency, and higher latency means you've got a diminished user experience, especially in an architecture

A

that's a service-oriented architecture, where one web page may call five or six services and each one is making a database call underneath. So embrace optimistic design. Trust your data store; I just showed you that Cassandra can be trusted. Know your business and your application. I mean, yes, okay, this is not the answer for every application, but think about it: can your application live without

A

a high consistency level in your reads and writes? I'll give you an example. With Netflix, as you know, you can watch a movie, pause, go to another device, and continue where you left off. That's being done in the background by sending signals to a Cassandra database, and we're actually writing at a consistency level of ONE. 99.99 percent of the time, it's fine. Sometimes it's not. Sometimes the movie is going to start five minutes earlier, because that may be the last ping we recorded in the database.

A

Well, we know our business. Who in this audience is going to dump their Netflix subscription because, once in a million times, you resumed on another device and it was a couple of minutes earlier and not exactly where you left it? I don't think anyone would. And if you would, maybe you shouldn't be our customer.

A

So really, you need to know your application. You need to know what problem you're actually solving, and in certain cases you're going to have to handle edge cases through contingency plans. I've got a couple of examples of companies that aren't Netflix and how they handle contingency plans in an eventually consistent architecture.

A

So, the first example: Amazon.

A

Sometimes you buy something on Amazon and it's just not there. It says it's in stock, but it's not. When they go to fulfill the order, either the robots in the warehouse can't find it or, if it's a warehouse with humans, the human can't find it. So what do they do? They cancel the order.

A

They cancel the order, and their contingency plan is that they'll credit you $10 towards a future purchase. Okay, it kind of sucks. I wanted it; we're in the generation that wants everything right away. But it happens, and they have a nice contingency plan for when it does, so it doesn't hurt that much.

A

So, okay, that's an example of retail being able to live in a low-consistency architecture. But what about financial? Surely that needs to be highly consistent, right? Eventual consistency doesn't work in the financial world. Well, banks are the most eventually consistent systems out there. I can write a check. Actually, I'm going to write you a check for a million dollars. Okay, one of two things

A

is going to happen: either, because of my very generous Netflix salary and the fact that our stock jumped really high in the past year and I cashed in options, I'll be able to cover the check, or it'll bounce. It's going to bounce, by the way.

A

So the banks have contingency plans. They're going to try to recoup the funds out of the target bank account and also charge a very handsome fee for you bouncing a check. So the myths that everything needs to be highly consistent

A

aren't true, and I've just shown that even the banking system is a low-consistency design.

A

I'm giving you the tools. I'm giving you examples of where this works and where it doesn't, and the idea is that I want you guys to go back to your companies, think deeply about the applications you're building, and really evaluate

A

how consistent you need to make your application. So you're going to go back to your company and you're going to face hurdles, right? You're going to be a little pawn against the whole chessboard of engineers and managers who probably think you drank something silly at the DataStax conference and that you don't know what you're talking about. Well, engineers are stubborn. One plus one equals two, not eventually two. So you're going to have to manage down.

A

Middle management is scared. Good on you, those of you using Cassandra, for convincing them to even use Cassandra. Now you're going to have to convince them to accept a low-consistency architecture. That's not going to be easy. You're going to have to manage up as well. You're going to have to engage the product team. A lot of engineers are like: give me the specs, go away, I'll build it for you, and then just hand over a nice

A

box of software and code that did exactly what you asked. Well, now a conversation needs to happen. You need to go back to the product team and say: hey, look,

A

I can do this, but here's what the trade-offs are: higher latency,

A

diminished user experience. Let's talk. And a lot of them will listen, because if you tell them that the user experience is going to suffer, they're going to be right there with you saying: okay, cool, let's reduce the consistency and figure out what those edge cases are. So, how to overcome those hurdles.

A

Wow, we really breezed through this. So, how to overcome these hurdles. You can prove it through a POC. That'll help manage downstream; that'll help manage the engineers. Roopa, the DB engineer who ran that test, and I really changed one of those engineering managers' minds by showing him the results of that test, and they agreed to reduce their consistency level.

A

You need to show the benefits of an improved user experience. If it's going to take five seconds to load a page, and you tell a product guy, "Hey, I can cut that latency in half," they're going to be happy about that and say, "Okay, cool, let's do it." And finally, you may be working for the wrong company.

A

Netflix is hiring, at jobs.netflix.com, and I'm looking for DevOps engineers. If you're such a person, come see me after the session and we'll have a chat. So we really blew through this. I want to open this up to questions, and whoever I pick, please go to the microphone.

A

Okay, let's start with you.

B

So when you were talking about the benefits of lowering your consistency level, in real numbers, how did that change latency?

A

Like I said, the de facto has been QUORUM, right? So people have written and read at QUORUM. For those who don't know, when you read or write at QUORUM, you're writing to two nodes or reading from two nodes. You're waiting for the coordinator to finish writing to those two nodes before it acknowledges the write, or it's reading from two nodes, checking which one has the latest timestamp, and sending that back to you. So we cut it by two.

A

When you go to CL ONE, you cut your latencies by about two-thirds at that point. It's not even half, it's two-thirds, because you're just doing a point read back and forth, as opposed to doing the extra comparisons and waiting for those writes to happen.

B

And I guess writes are what I'm more concerned about, because for reads I'm fine with consistency level ONE, but with a write, if you're sure it's written immediately, then it's a little safer, right? So did you see similar numbers as well?

A

Yeah, absolutely.

B

A

Like I said, you're writing to two nodes, you're waiting for both nodes to have gotten the write, and you're waiting for the coordinator to agree that they got it and then come back. Okay, yeah.

C

James. Hey, Chris. For your million-record comparison test, how long did you wait after feeding the data into Cassandra before doing your million-record comparison?

A

Great question. We were writing and then we were reading, so I think we were off by half a second, so 500 milliseconds.

C

Is there a standard tool, like the Percona Toolkit, for comparing two Cassandra databases or clusters?

A

I think there are new features in 1.2 that can catch possible dropped replication, or give statistics on that. However, if you just want to blindly heal the whole cluster because you think you've had dropped operations, there's a repair command in Cassandra, which will make sure everyone's got all the data they need. Thanks. Thank you.

D

Howdy. Did you do any testing where nodes failed in the midst of your test?

A

Yes, absolutely, and that ties into that repair command as well. We did kill some nodes while we were doing that test. I think it was in our third or fourth run that we killed some nodes, and we still continued reading from the cluster. However, after an event like that, we will run a repair.

D

Just to be sure: in that case, when you say you killed a node, did the node become unavailable permanently, i.e., was the data on that node lost?

A

Yes. We didn't just shut the node off; we terminated the node completely. Okay, thanks. Thank you.

A

E

A

Yes, so the question is: have we seen use cases where CL ONE is not appropriate? There are always going to be use cases where it's not appropriate. I'm not here to preach CL ONE only. There are.

B

A

Yeah, I can't lift the kimono completely.

F

Sure. Yes, so for your experiments, the

A

Hold on, sorry to cut you off. However, we have no use cases where ALL is appropriate. QUORUM is the highest we're going to go.

F

So when we create tables, we can define, like, a consistency level on the column family, but your query can have a different consistency level. Absolutely, right. But for your experiments, how did you do that? Did you have consistency ONE when you created the table and also in your query?

A

F

A

We wrote at consistency level ONE and we read at consistency level ONE.

F

The replication factor is one? No, it's three. Oh, it's three, yeah.

A

Yeah, so the replication factor of the cluster was three. We wrote, it got replicated successfully locally and to the other data center, the reads happened at the other data center, and we were randomly hitting nodes. So it's not like we chose the node that should have that token. We just let the coordinator do its thing, randomly chose nodes, and read everything back. Yes?

F

We run some extended clusters, and I've noticed there's always a problem: when you do a repair, it takes a very long time to repair that cluster. We have 16 Cassandra nodes. So would that be something we need to consider when we run extended clusters, for example, for repair situations?

A

So is the question that it takes a

F

long time to repair.

A

Yeah, well, is repair an issue? Repair is a safety net, right? And the reality is, we do run repair regularly on our clusters. So the worst-case scenario, if a node didn't get the data, is that it will get it the next time we run repair. That's the worst-case scenario.

D

A

That's in the 0.001 case where a particular write didn't make it in. So, repairs take a long time? Yes, absolutely, repairs do take a long time, but it's that safety net. It's that absolute: okay, we're drawing a line, and from now on it's repaired and everything's everywhere.

E

Thanks. You're welcome. Yes?

G

So you were using Priam in this test, yeah? So you were going token-aware, right to where one of the replicas was?

A

I think you're asking whether I was using Astyanax at the client level.

G

Yes, okay, yeah yeah, yeah.

A

So no, we were doing round robin.

G

Round robin, so your coordinator wasn't necessarily something that had the replica.

A

Exactly, of the data. Exactly.

G

So that was very

A

That was very important in our test, to prove that Cassandra's actually doing the right thing underneath.

G

So if we've run benchmarks and gotten numbers back, and we drop from QUORUM to ONE with a replication factor of three, should we expect our Cassandra turnaround to be a third?

A

The latency should be significantly faster. My particular use case was two-thirds faster, simply because that was the application. I don't know what your read/write pattern is, but it should be significantly faster. Thanks. Thank you.

E

So for that particular case, when we compare QUORUM versus consistency level ONE: if you use QUORUM, for example, and the replication factor is three, you basically need to get a response from two.

H

Four in four: yes,.

E

So does Cassandra send those requests in parallel? So basically

H

E

you're limited by the slowest node out of the two, right? You're not going to get a benefit like 50% faster, because you send the requests in parallel, and maybe one node is 10% slower.

E

That's basically your benefit, not 50%.

F

All things being equal.

A

It should be around 50% faster, assuming all nodes are the same, for

E

your write. So, let's see, and you are blasting

A

all nodes. So what the coordinator is doing is waiting for the first two to come back, and then it's returning.

E

Does it send requests in parallel to all nodes? Yeah. Assuming that all nodes are equally loaded, you expect the responses at approximately the same time. I mean, you don't send the first request, wait for the response, and then send the second.

A

No, no, it's blasting everyone.

E

So why do you expect it to be 50 percent faster if you use consistency level ONE versus QUORUM? Good, good question.

A

That's what we saw. Oh, actually, hold on. Nitish over there, one of my DB engineers, can answer.
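The question in this exchange can be explored with a small simulation. It assumes, as described above, that the coordinator sends the request to every replica in parallel and returns once the required number have answered, so the wait at a given level is the k-th fastest replica response. The 5 ms exponential response time is an invented figure for illustration, not a measurement from the talk, and both function names are hypothetical.

```python
import random


def coordinator_latency(replica_latencies_ms, acks_required):
    # Requests fan out to every replica at once, so the coordinator's wait
    # is the acks_required-th fastest response, not the sum of responses.
    return sorted(replica_latencies_ms)[acks_required - 1]


def mean_latency(acks_required, replication_factor=3, trials=10_000, seed=42):
    rng = random.Random(seed)  # fixed seed so every level sees the same samples
    total = 0.0
    for _ in range(trials):
        # Hypothetical model: each replica answers after an exponentially
        # distributed delay with a 5 ms mean.
        latencies = [rng.expovariate(1 / 5.0) for _ in range(replication_factor)]
        total += coordinator_latency(latencies, acks_required)
    return total / trials


for name, acks in (("ONE", 1), ("QUORUM", 2), ("ALL", 3)):
    print(f"{name}: {mean_latency(acks):.2f} ms")
```

In this model the gap between ONE and QUORUM comes entirely from the spread in replica response times: if every replica answered in exactly the same time, the two levels would cost the same, which is the point the questioner is raising. How large the gap is in practice depends on the workload, as the speaker says.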

I

E

A

Thank you. You're welcome. Yes?

D

Just to follow up on the testing where you simulated failures: my understanding is that there is a case where you're writing to a replica node, using a replica as your coordinator, which even with round robin you would have encountered a few times. The write is accepted on the coordinator with a consistency level of ONE,

D

but then the node immediately fails. Repair is unable to repair that write because you've lost it.

D

Did you encounter that in practice?

A

So, like I said, we ran it five times. I'm sure that if I ran this test infinitely and did an infinite number of permutations, I would have seen some failures, but that's why I said 99.9.

D

Absolutely, thank you. You're welcome.

H

So I like this idea of having the applications compensate, essentially, for inconsistency, but I'm curious what you would recommend in terms of strategies, both from an operational perspective and when you're talking to application developers, for detecting and reasoning about these inconsistencies. For instance, do people say, "Well, as long as I have an SLA of 99.9, that's good enough"? Or do they often ask for, well, one aspect of optimism is a callback or a notification when things go wrong. I'm curious

H

if that's a requirement, or even whether or not people want that sort of thing, because it's kind of annoying to give you the answer and then say it was the wrong answer.

A

Absolutely. So I'm going to give you the cop-out answer: it depends on your application. But I'll give you an example. In certain things, we might not catch it in the application; we'll catch it through customer service calls, and at that point it's a human contingency plan. It's not necessarily code that's doing it. There's going to be a combination of both, depending on your application.

H

And when you do it in code, are there mechanisms of Cassandra you typically leverage?

A

So Cassandra does have mechanisms to fix it. There's read repair, so you can turn on read repair. If you return it once and then read repair runs, then, depending on what percentage of read repair you have on, the second time you read it, it will actually be repaired. Thanks. You're welcome.
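The idea behind read repair can be sketched in a few lines. This is a simplified model, not Cassandra's implementation: replicas are plain dictionaries, each cell is a hypothetical (value, timestamp) pair, and the sketch repairs on every read rather than on a configurable percentage of reads.

```python
from typing import Dict, List, Optional, Tuple

Cell = Tuple[str, int]      # (value, write timestamp)
Replica = Dict[str, Cell]   # one replica's copy of the data


def read_with_repair(replicas: List[Replica], key: str) -> Optional[str]:
    """Return the newest value for key and push it to any stale replica."""
    found = [replica[key] for replica in replicas if key in replica]
    if not found:
        return None

    # The cell with the latest timestamp wins.
    newest = max(found, key=lambda cell: cell[1])

    # Repair: overwrite replicas that are missing the key or hold an older cell.
    for replica in replicas:
        if replica.get(key) != newest:
            replica[key] = newest
    return newest[0]


# One replica is up to date, one is stale, one never received the write.
cluster = [{"pos": ("42:10", 2)}, {"pos": ("37:05", 1)}, {}]
print(read_with_repair(cluster, "pos"))
print(cluster)
```

After the first read, all three stand-in replicas hold the newest cell, so a later read against any single one of them returns the repaired value, which matches the behaviour described in the answer.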

F

So here we're talking about the consistency level of Cassandra, but we always face the case where we have different systems. For example, we want to have search over the data, so we have some index built on top of the data. So how do you guarantee consistency between these different systems? Because that's a problem I'm facing for our group, and I think Netflix provides the same kind of search infrastructure on top of Cassandra.

A

Well, in the Netflix one, the data we index and search, like when you type in a movie and it pops down, doesn't change as often as in other applications, so we can control that, either through QA, constant QA, and so on. But we can have a discussion offline and talk about how you can, you know, if you go DSE and you've got Solr and you've got your Cassandra, and all of

F

A

a sudden it's not consistent between both. Yeah, we can have a chat later and I'll share some ideas. They might not be the absolute solutions. Thanks. Cool, I think that's our time. Thank you.

A