From YouTube: C* Summit 2013: How Not to Use Cassandra
Description
Speaker: Axel Liljencrantz, Backend Developer at Spotify
Slides: http://www.slideshare.net/planetcassandra/8-axel-liljencrantz-23204252
At Spotify, we see failure as an opportunity to learn. During the two years we've used Cassandra in our production environment, we have learned a lot. This session touches on some of the exciting design anti-patterns, performance killers and other opportunities to lose a finger that are at your disposal with Cassandra.
We accomplish this through a back-end of 70 or so services. Most of these are written in Python; a few were written in Java, one in C++, but yeah, we're mostly a Python and Java shop, as pretty much anyone who isn't a spectacular failure is doing.
We write many small, simple services. We try to keep them responsible for one thing and one thing only, and sometimes we fail, and that's when we get really badly functioning services.
So when we started out, we were pretty much exclusively a Postgres shop, and we had a really good experience with Postgres. It's a system that works very, very well, but once we started hitting really large scale, some time around when we moved to the US, we started to run into a lot of problems.
Postgres has, compared to say Cassandra, very poor cross-site replication support. If you get a network outage or a net split or something, then getting back in sync again is a painful process that often requires manual intervention, and that's not something you want to do on a regular basis.
And of course, if you need more than one node in your cluster to handle the load, then sharding throws most of the relational advantages out the window. So for certain workloads we felt that we couldn't really use Postgres anymore.
So a little bit more than two years ago we started playing around a bit with Cassandra, and by now we have something like 24 services that use it, 300 Cassandra nodes, and we store something like 50 terabytes of data; so overall, a pretty heavy Cassandra user. Back when we started using Cassandra there wasn't that much information available about how to use Cassandra, how to design your schemas, stuff like that, so we really managed to screw things up a lot, and that was quite boring.
I mean, most of you guys here don't have the same good excuses we do. There was this really good talk this morning by Patrick from DataStax about modeling; if you didn't manage to catch it, you should see it online, because everyone needs to know these things. But I'm also going to be talking a little bit about what not to do. So this brings us to the actual subject of our talk, and the first part: how not to use Cassandra.
This sounds like a pretty neat feature, right? Should be no problem doing this. Anyone who would? Okay. So it's an interesting property of read repair that it is performed across all data centers. If you, say, have the read repair chance set to one in a multi-DC setup, every single read request will send something like half a dozen requests across the Atlantic Ocean, and that can get boring quite quickly.
So we have made this mistake multiple times, and it was equally boring every time. In Cassandra 1.1 there is finally a DC-local read repair configuration item, so that you can configure read repair to only happen inside of your data center, which is probably the only type of read repair that should be allowed to exist.
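To make the multi-DC failure mode concrete, here is a minimal Python sketch of the read-repair fan-out being described. It is a toy model, not Cassandra's actual code: the replica list, the chance parameter (standing in for the read repair chance) and the dc_local flag (standing in for the DC-local setting from Cassandra 1.1) are all illustrative.

    import random

    # Toy model of a coordinator in "eu" choosing which replicas a read touches.
    REPLICAS = [("eu", "eu1"), ("eu", "eu2"), ("eu", "eu3"),
                ("us", "us1"), ("us", "us2"), ("us", "us3")]

    def read_targets(local_dc, chance, dc_local):
        # Normally one replica is enough to answer the read.
        targets = [r for r in REPLICAS if r[0] == local_dc][:1]
        if random.random() < chance:  # read repair triggered
            if dc_local:
                # Only compare replicas inside the coordinator's own DC.
                targets = [r for r in REPLICAS if r[0] == local_dc]
            else:
                # Global read repair: every replica in every DC is queried,
                # so chance=1.0 drags each read across the Atlantic.
                targets = list(REPLICAS)
        return targets

    print(read_targets("eu", chance=1.0, dc_local=False))  # six nodes, two DCs
    print(read_targets("eu", chance=1.0, dc_local=True))   # three nodes, one DC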
Okay, so another cool thing in Cassandra is the row cache. You guys all know about the key cache, right? That's where we store individual parts of the indices to speed things up, so that you don't have to go to the disk, look at the index file, and so on. Well, the row cache is even cooler: you can store an entire row there, so that if you hit the row cache, you will not touch the SSTables, not touch the disk. You can fulfill a request entirely out of memory. This was intended as something of a memcached alternative.
If you guys are putting memcached in front of Cassandra for performance purposes, you can just stop doing that, switch to enabling the row cache, and everything will just be faster, and you won't have to worry about consistency and cache invalidation, because that will just be handled transparently by Cassandra. Pretty awesome, right? Right? What do you say? Well, the catch is that the row cache always operates on entire rows.
That means that if you're doing a get request on a single column, that read, if it's a cache miss, will be promoted into a full row slice, and if you have big rows, that means your performance will probably go down by at least one, maybe two, orders of magnitude, and that can be bad. And to make things even more interesting: whenever you write to a row, the cached copy is invalidated, just thrown out the window. So in conclusion: never, ever, ever use the row cache
unless you know all of your data usage patterns, or you will just destroy your performance. We have accidentally enabled the row cache every once in a while on a few services and just wondered: why the hell did the performance go down like that? So that's a good thing to know.
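The trap is easier to see in code. Here is a minimal sketch of the row cache semantics just described, as a toy Python model (not Cassandra's implementation): rows plays the part of the on-disk SSTables, and row_cache caches whole rows only.

    rows = {}       # key -> {column: value}; stands in for the SSTables
    row_cache = {}  # key -> entire cached row

    def get_column(key, column):
        if key in row_cache:  # cache hit: served entirely from memory
            return row_cache[key].get(column)
        # Cache miss: the single-column get is promoted to a FULL row read,
        # so a huge row is read from disk just to serve one column.
        row_cache[key] = dict(rows.get(key, {}))
        return row_cache[key].get(column)

    def write_column(key, column, value):
        rows.setdefault(key, {})[column] = value
        # Any write throws the whole cached row out the window, so the next
        # read pays the full-row promotion cost all over again.
        row_cache.pop(key, None)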
So another cool, useful feature in Cassandra is compression. This is really nice: with Cassandra you can just toggle a switch and Cassandra will start compressing all of your SSTables. The compression algorithm, in this case Snappy, is really super fast: you can compress and decompress like a hundred times more data per second than Cassandra can deliver, because there's almost no overhead. So it's really super fast and efficient, and you could probably just enable it, and, well, you will have less data on disk.
But when you use compression, you pay for it on the read path. You will not notice this if you are using something slow, something that hits the disk most of the time, like a slow data set. But if you are actually handling thousands of requests per node per second, or even hundreds, you will get a quite severe performance degradation, and you will once again be a sad panda.
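For a feel of the Snappy trade-off itself, here is a small Python example, assuming the python-snappy package is installed. The point is that compression is cheap per byte, but on a node doing thousands of reads per second it still sits on every single read path.

    import snappy  # pip install python-snappy

    payload = b"some column data " * 1000
    compressed = snappy.compress(payload)
    print(len(payload), "->", len(compressed))  # big reduction on repetitive data
    assert snappy.decompress(compressed) == payload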
Okay, so that's it for how to misconfigure Cassandra. I'm now going to move into different, interesting ways to misuse Cassandra.
One tool you should know about is nodetool cfhistograms. It's part of nodetool, and what you can do with nodetool cfhistograms is basically find out: how many SSTables are my reads touching? So whenever I do a read, how many different seeks does the disk probably have to make, and so on. You can also find the time distribution and a bunch of other really useful things.
So I would strongly recommend to anyone who has to do any kind of performance work with Cassandra to really get to know cfhistograms, because there is a huge amount of very useful information tucked into that window that kind of looks like the Matrix, just a bunch of numbers. So know it and love it.
One thing that we have done at Spotify is a patch to make sure that for every single SSTable, you store the smallest and the largest column name written to any row in that SSTable. This means that specifically for things that are, or look like, time series data, or where you have very sharp time partitioning, you will get awesome performance increases, even if you're using size-tiered compaction: you will see that every single read you do will only have to touch the SSTables where your data is actually located.
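Here is a sketch of the idea behind that patch, as a toy Python model with made-up SSTable metadata: each SSTable remembers the smallest and largest column name it contains, so a slice query can skip SSTables whose range cannot overlap the requested columns. For time series data, where column names are timestamps, that turns most reads into a single-SSTable lookup.

    # Each SSTable tracks the min/max column name ever written into it.
    sstables = [
        {"name": "sst1", "min_col": "2013-01", "max_col": "2013-03"},
        {"name": "sst2", "min_col": "2013-04", "max_col": "2013-06"},
        {"name": "sst3", "min_col": "2013-07", "max_col": "2013-09"},
    ]

    def sstables_for_slice(start_col, end_col):
        # Only consult SSTables whose [min_col, max_col] overlaps the query.
        return [s["name"] for s in sstables
                if not (s["max_col"] < start_col or s["min_col"] > end_col)]

    print(sstables_for_slice("2013-08", "2013-09"))  # ['sst3']: one table, one seek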
So, I promised my coworker Marcus that I wouldn't make anyone applaud for him, but Marcus wrote that, and it's a really useful patch, so don't applaud or anything, but he's awesome. Alright!
So another thing that we have encountered is the problem that we are one of the few Cassandra users with big traffic and big sets of data spread across the Atlantic, or across a large ping chasm.
Basically, there aren't many cross-continent Cassandra users, which means that we have been more or less on our own when it comes to a bunch of issues that you only encounter there. We wrote a patch, and when I say "we", I don't mean the majestic plural this time, I mean Marcus wrote a patch, to disable TCP no delay. That way, when you are writing very quickly, the TCP network stack will actually batch up multiple writes, and that reduces the number of packets you send across the ocean.
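The knob itself is an ordinary socket option. Here is a minimal Python illustration (not Spotify's actual patch, which applied this inside Cassandra's inter-node connections):

    import socket

    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)

    # TCP_NODELAY = 1 disables Nagle's algorithm: each small write is sent
    # immediately in its own packet (good for latency on a local network).
    # TCP_NODELAY = 0 re-enables Nagle, letting the kernel coalesce many
    # small writes into fewer, fuller packets, which is what you want on a
    # high-latency cross-Atlantic link.
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 0)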
Who doesn't love this image? I mean, it's beautiful. Now, one thing that should be noted here about Cassandra: I'm being kind of mean, saying mean things about how much Cassandra sucks and about how stupid we are and how bad we are at using Cassandra. But an important thing to note is that we have never, ever lost any data with Cassandra. We have hit every major bug that Cassandra ever had,
probably. We have had so many pains, and back in the 0.6, 0.7 days Cassandra was really buggy, and still we never lost any data. That is probably due to the immutability of the SSTables and the append-only way that compactions and everything work. Because of these properties, Cassandra is really something that, in spite of being a new and immature product compared to things that have been around for a few decades, like Oracle or Postgres or MySQL or whatever,
we have still decided to trust with our data, and we haven't ever had to regret that decision. Anyway, let's go back to being mean. The upgrade from 0.7 to 0.8 was the first major upgrade of a production cluster we did, and everyone on the internet said that rolling upgrades just work: take down your cluster, upgrade it, start it back up again, and everything will be super awesome. It was not. There was, in fact, a bug
that meant it failed to start up, and nothing worked. So we downgraded, we located the problem, we wrote a patch, we submitted the patch, and everything worked. You would expect that, given that we upgraded to 0.8.6, someone else would have already discovered this problem and ironed it out, and that turned out, obviously, to be the wrong conclusion.
So our takeaway was to always try rolling upgrades in a testing environment before doing them in production, and never believe what people tell you on the Internet. By the way, if you are watching this on YouTube or some kind of streaming over the Internet, you can make an exception for me, because I am a very trustworthy person. Anyway, the next upgrade we did was from 0.8 to 1.0, and, wise from our previous experience, we upgraded first in our test environment.
Everything worked fine, and then we tried it in production, and everything worked fine. We learn from our mistakes, don't we? Except for the last cluster: once we upgraded there, all of the data was gone. No keys, no nothing. It was just empty, just gone.
So what happened was that for very large clusters there was a bug in the bloom filter code that made Cassandra think that the bloom filter was somehow empty, and if the bloom filter is empty, that means there's nothing there. So for every read you would do, you would check the cluster and it would say: no, no data there. And so everything was just gone. We did a data scrub, and the data was back, and everything was fine and dandy, and we were on the road again.
After all of the previous experiences, we did the test with production data in the testing environment, everything worked, and then we did exactly the same steps all over again. I notice people giggling, so you're reading ahead of my slides: some nodes, well, some services, were reporting missing data. We're not talking about empty data this time; it was more like half the data is there, half the data is gone, and some things are all inconsistent and weird.
So we scrubbed all of the data, restarted, and everything was well again, and since then we have never been able to reproduce the problem. We actually snapshotted the SSTables of the Cassandra clusters before we did the scrub, so we have the exact files exhibiting the problem, and when we moved those over to a new server to just try to reproduce it, everything just works. We don't know; this might have been a classical problem-exists-between-keyboard-and-chair situation.
What happens in Cassandra if a single node decides to be super slow? And when I say slow, I mean slow, not dead, because there's this thing called the snitch in Cassandra that will notice: oh, this guy is dead, I'm going to take him out, and I'm not going to send data there. But if a node is slow, and there can be many reasons why this would happen?
One of our Spotify favorites is bad RAID controller batteries, because you have to replace your RAID controller batteries roughly once a year. If you don't, they will switch to a slower mode: they no longer dare to cache writes as much, because then you might lose data that you thought was fsynced, and so on. So that can slow you down, and, well, apparently we suck at replacing our batteries.
Another reason for sudden slowness can be bursts of compactions or repairs, or a bursty load, or a network hiccup, or maybe you're doing a major garbage collection. In reality, slowness just happens in a 60-to-a-hundred-node cluster, or whatever. Maybe you barely ever encounter it with three nodes, but in a big cluster this will be a thing that happens many times a day.
So what happens when one node is slow? Basically, the coordinator queue will just start filling up with more and more requests to this one single slow node, and the queue will become full, and then, if you have a 60-node cluster, you will have one node that's hosed and 59 nodes that aren't receiving any traffic, because all of the coordinator queues are full. Has anyone ever told you that Cassandra has no single point of failure? They lied.
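A toy illustration of that failure mode, with made-up numbers: requests destined for one slow node pile up in a coordinator's bounded queue until requests for perfectly healthy nodes are rejected as well.

    from collections import deque

    QUEUE_LIMIT = 5
    queue = deque()  # one shared outbound queue on a coordinator node

    def enqueue(request, target_node):
        if len(queue) >= QUEUE_LIMIT:
            raise RuntimeError("queue full; healthy nodes are unreachable too")
        queue.append((request, target_node))

    # node-7 is slow, so its requests are never drained from the queue...
    for i in range(QUEUE_LIMIT):
        enqueue("req-%d" % i, "node-7")

    # ...and now a request for a perfectly healthy node is rejected too.
    try:
        enqueue("req-x", "node-1")
    except RuntimeError as e:
        print("healthy-node request dropped:", e)  # the coordinator is wedged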
So one way to solve this particular single point of failure is to use a partitioner-aware client, that is, a client that knows the partitioning schema and knows where all of the data is located, so that when it sends a request, it only sends it to one of the nodes that actually holds the data. This is actually also a performance improvement, because there will be less cross-node chatter and you can stay within the same process, so things will be slightly faster; but mostly I would say it's a reliability improvement, and it means that because of things like this, at most three nodes can go down instead of the entire cluster, and that's quite a big improvement. Astyanax, the Java client for Cassandra, has this feature, and you should always use Astyanax; you should not use Hector, among other things because of this. So a big shout-out to Netflix for writing an awesome Cassandra client.
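Here is a sketch of what "partitioner-aware" means, as a toy Python model: a hypothetical three-node ring where, as with the RandomPartitioner, a row key's token comes from its MD5 hash. Real clients such as Astyanax keep this ring metadata up to date for you.

    import bisect
    import hashlib

    # Hypothetical ring: sorted (token, node) pairs.
    RING = sorted([(2**125, "node-a"), (2**126, "node-b"), (2**127, "node-c")])
    TOKENS = [t for t, _ in RING]

    def token(row_key):
        # Toy tokens: the key's MD5 hash read as a big integer.
        return int.from_bytes(hashlib.md5(row_key).digest(), "big")

    def owner(row_key):
        # First node whose token is >= the key's token, wrapping the ring.
        i = bisect.bisect_left(TOKENS, token(row_key)) % len(RING)
        return RING[i][1]

    # A token-aware client sends the request straight to a node that owns
    # the data, so a wedged, unrelated coordinator never sits in the path.
    print(owner(b"spotify:user:alice:playlists"))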
Alright, our next topic is how not to delete data, and I could just make this really short and say: don't delete data. But let's go into some more detail. Specifically, the way Cassandra deletes data is kind of messy, the reason of course being the much-valued immutability of SSTables. If, once data hits disk, you never ever write to or mutate that SSTable again, how can you remove data from it? The obvious way to do it is through tombstones.
That is, you put in a write to the same column as the previous thing that you're trying to delete, with some kind of special flag saying: this is a tombstone, this data has been deleted. And this is what Cassandra does. Tombstones are basically like any other column write: they have the same kind of timestamps for versioning and so on, so we can look at them as any other write that can override other writes. And the question then becomes: tombstones are kind of special, right?
I mean, they are things that we would like to go away: once the original data is gone, the tombstones can go away. So does that actually ever happen? Well, the answer is (and this is really complicated to reason about) that tombstones can get merged into the SSTables that hold the original data, and finally the tombstones can then become redundant. But even once that has happened, because of the possibility that some nodes in the cluster have gone down and might come back up, there's also a grace time that has to pass before a tombstone may be dropped.
So when can tombstones actually be removed? I'm going to have to read this, because it's so complicated, at least to me, to express: tombstones can only be deleted if all SSTables holding values for the specified row are being compacted together. So if you have performed writes to the same row as the column you're trying to delete, and at some point in time they landed in an SSTable that you're not currently compacting, then that tombstone can never go away.
And this is the important distinction here, because maybe you only wrote to that specific column once and then deleted it, and that doesn't matter: if that column is in a row that is kind of popular, that gets written to every once in a while, then that tombstone will pretty much never, ever go away with size-tiered compaction. It's just not going to happen.
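The rule is easier to see in code than in prose. Here is a toy Python model (assuming the grace time has already passed) of when a compaction may drop a tombstone:

    # Each SSTable is modelled as a set of (row, column) pairs it holds data
    # or tombstones for.  A tombstone is only safe to drop when EVERY SSTable
    # containing that row takes part in the compaction; otherwise an older
    # value left behind could "resurrect" once the tombstone is gone.

    def can_drop_tombstones(row, compacting, all_sstables):
        with_row = [s for s in all_sstables if any(r == row for r, _ in s)]
        return all(s in compacting for s in with_row)

    sst_old = {("playlist:123", "rev-1")}            # old value, rarely compacted
    sst_new = {("playlist:123", "rev-1.tombstone")}  # the deletion

    print(can_drop_tombstones("playlist:123", [sst_new],
                              [sst_old, sst_new]))   # False: sst_old not included
    print(can_drop_tombstones("playlist:123", [sst_old, sst_new],
                              [sst_old, sst_new]))   # True: a major compaction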
So let's talk a little bit more about leveled compaction, because leveled compaction is the savior of Cassandra.
It should be, because in theory they say that 90% of all rows live in a single SSTable with leveled compaction, and this is assuming a bunch of things about the distribution of writes and so on. What we've found in our production clusters is that, depending on the type of data we're storing, somewhere between 50 and 80 percent of all reads only hit one SSTable; the others hit multiple SSTables.
That's generally not good enough, because if you think about it, with leveled compaction you will only ever be compacting separate SSTables that hold the same data space, because of the partitioning scheme used in leveled compaction. So the result of this is: if you're doing a lot of writes to a specific row, then that row will exist in all, or most, of the levels in the leveled compaction scheme.
And if that's the case, then the tombstones will stick around, because you won't be compacting all of the levels at the same time. So even with leveled compaction, this can definitely still be a problem.
The conclusion you should take away from this is that deletions in Cassandra are super messy, and the only surefire way to get rid of tombstones is to do major compactions, and, as you probably know, those have their own huge set of problems. Also, the problem will generally be much bigger with popular rows: if you have a very skewed workload, all of the rows that you don't care about will have no tombstones, but the ones that you really need to be quick will be slow. Avoid schemas that delete data.
So all of the smart people will now be thinking: well, what about TTL data? Once the TTL of a column expires, you can just compact it away. You know that you don't need it, you know there's no resurrection thing, you don't need a tombstone, right? We don't need the data anymore, we can just drop it, and TTL data will be super fast, right? Come on. Yeah, right. Yeah, awesome, cool. No. Okay.
If you happen to have old non-TTL data that you haven't compacted away, then that might reappear, and that would be so terrible, right? So because of this very esoteric use case, where people are writing the same values into the same column family sometimes with and sometimes without TTLs, they decided to do the whole tombstone song-and-dance thing that breaks everything, even for TTL data. So TTLs will not help you at all; they just won't. One thing that does help, though...
Alright, I'm going to go back a little bit more to cool ways to fail at using Cassandra. Specifically, I'm going to go into a very specific way to fail with Cassandra, which is our playlist service. Now, this is a really big service: we have something like 1 billion playlists in there, we are receiving a load of somewhere around 40,000 reads per second and a few hundred writes per second, and 22 terabytes of data compressed, almost double that if you uncompress it.
Backing that up would just take, say, several weeks, and even that didn't help us after a while, because we had so many files. So we just had a long period of time where we had no backups for one of our most important core services. That was in the old days, when we were a lot smaller and more naive. Anyway, we also had a home-brewed replication model that basically combined the worst of all worlds.
If a single node went down, we lost data, but we still had to write to a bunch of machines, and so on; it was an interesting time. We had frequent downtimes back then, though none of our users noticed this all that much. Anyway, we figured that this would be the perfect test case for Cassandra.
Let's take our most important piece of user-generated data, the one we have so, so many problems with, and let's just use that as a testbed for an entirely untested, cool new product that we don't know anything about. Isn't that a great idea? All right.
Here's an interesting thing that not a lot of people know about Spotify playlists: every playlist is actually a revision object. You can basically think about it as a slightly simplified git repository; it's a distributed version control system.
On a side note, it should be noted that this is a design decision that has kind of managed to save us from ourselves, because it is obviously in many ways silly, but on the plus side, whenever the playlist system would go down, well, people had these revision objects in their clients, and they wouldn't actually notice all that much, unless they had multiple devices, which people didn't back in the old days.
This is a rendering from a web interface of how a playlist actually looks. You can see here that we have a fork, with a bunch of concurrent modifications that have then been merged; simple stuff. Anyway, we have had huge amounts of issues with playlists and Cassandra, and I'm going to mostly be talking about one of them, which had to do with the HEAD column family. Now, the HEAD column family stores something very much akin to HEAD in git:
that is, the pointer to the latest revision ID. This actually serves somewhere around 90% of all requests to the playlist service, the reason being that whenever a client logs in, it will say: I have this version of this playlist, what is the latest version? Usually no one else has modified the playlist in question, and it will just get back: oh, you have the latest version, cool. If you get a mismatch, then you have to figure out and read the changes that have happened since, and things like that.
But 90% of the time you have the latest version, so all you need to do is access HEAD. It should be noted here that we actually, for a long time, mlocked HEAD into RAM, because it was so frequently used and we had performance issues because of page cache misses and so on, and this is something that might be a good idea to do, depending on your use case.
So, what I will talk about now is that even though we had HEAD locked completely into RAM, so that accessing HEAD was known to never cause disk I/O, we noticed that there were requests that actually took several seconds to complete, and we could reproduce this in the Cassandra CLI. We would issue a get to the right row, and what we would get back was 15 seconds of waiting before getting the data.
Why? Well, it turns out that the way we store data in HEAD is that we store the revision ID of the latest revision in the column name and have an empty column value. Why do we do this? Because when we do two simultaneous writes, remember your lovely vacation in Cambodia, we don't want the writes to clobber each other.
So when that happens, when you do two writes like that, you simply end up with two values in the HEAD column family, and then when you do a read, you do a slice request across everything in HEAD, and you will almost always get back just one result. If you get back two results, then you know: oops, I need to do a merge, and then you do that. So it's just a fork-detection thing. But of course, this turned out not to be a really optimal thing for Cassandra to store.
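Here is a sketch of that scheme as a toy Python model; the names are hypothetical, but the shape is as described: the revision ID lives in the column name, the value is empty, and a slice read doubles as fork detection.

    # Toy model of the HEAD column family: one row per playlist,
    # column NAME = revision id, column VALUE = empty.
    head = {}  # playlist_id -> {revision_id: b""}

    def write_head(playlist_id, revision_id):
        # Two concurrent writers never clobber each other; they simply
        # leave two distinct column names in the same row.
        head.setdefault(playlist_id, {})[revision_id] = b""

    def read_head(playlist_id):
        # A slice across the whole row; almost always exactly one column.
        revisions = sorted(head.get(playlist_id, {}))
        if len(revisions) > 1:
            return merge(playlist_id, revisions)  # fork detected
        return revisions[0] if revisions else None

    def merge(playlist_id, revisions):
        # Stand-in for the real revision merge: keep one winner and drop
        # the losers, exactly the kind of delete-heavy, hot-row write
        # pattern that the tombstone discussion above warns about.
        winner = revisions[-1]
        head[playlist_id] = {winner: b""}
        return winner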
We hadn't really thought about this. We designed it, like I said, two-plus years ago, back when we didn't know everything about Cassandra. We still don't know everything about Cassandra, but we do know more. We didn't realize that if you keep writing to the same row, there will be values for that row in a bunch of different SSTables, and then the tombstones can in fact never be compacted away. So for all of our most popular playlists, the tombstones just piled up.
Our solution to this, which is a bit humbling, is to perform major compactions, because with a major compaction you are actually merging all of the SSTables, and that means you know for a fact that you don't have any stray values elsewhere. So if there is no value left for a specific column, then you can just delete the tombstone.
So that's the only plausible way to safely remove tombstones. We did that, and things started looking up, and things were speedy again, for about a week, until we noticed that everything was slow again, and that was boring. Why? Well, it turns out that we were running repairs in between the major compactions.
We solved this by doing repairs on weekdays and major compactions on weekends. Hopefully you already know that you should never, ever use Cassandra to store a queue; and of course, if you think about it, the HEAD column family was a queue. A queue that always tried to hold only one value, but still some kind of queue, and Cassandra is terrible at storing queues. Don't do that.
Okay, so another interesting Cassandra feature: distributed counters. In the Spotify UI there are a bunch of things that we want to count. We want to count how many people are following a playlist, we want to count how many people are following an artist, how many times has this song been streamed, that kind of thing. And distributed counters in Cassandra are supposed to solve this: just transparently count things, you bump the counter, and everything will just work. So it's awesome. Yeah, yeah, sure it is.
The really important thing here, and this is something YouTube has not realized, is that if I click reload, then the number has to go up by one, and that will pretty much work with distributed counters. They can be off by a bunch of things: sometimes a count will go up by a thousand, or down by a thousand, but no one cares; as long as it's not the same number, people will just be happy.
So, look, another dog picture. I wonder how that got in there. Anyway: how not to fail, and I should note here that this is only a partial list. One thing that you should do, if you want to use Cassandra at scale, is treat it as a utility belt: a set of really well-tested, cool, interesting algorithms.
If you don't know what the algorithms do, you will poke it, and you will lose a finger, and you will be sad. But if you know what the algorithms do, what they do well and what they do poorly, then you can put together a very interesting storage system that will solve your problems in a very efficient and cool manner. Also, flash storage is completely awesome.
We do a lot of these things; I mentioned most of them, but I'm just going to reiterate: we do a bunch of really weird things in our clusters. We have clusters where we do major compactions on a weekly basis. We have clusters where we delete all of the data, physically delete the SSTables, and just recreate things from scratch. We mlock frequently used SSTables into memory. We do weird things, and you need to understand what the algorithms are.
You need to understand what the performance profile of these things is. If you don't, when you hit scale, you will be very sad. And this is actually a stark difference between Cassandra and, say, Postgres. In Postgres, the things that Postgres does well, it does well, and the things Postgres does poorly, it does okay. Cassandra is really good at the things it does well, and it's terrible at the things it does badly.
So you need to keep that in mind, and that pretty much goes for all the SQL things, all the old-school databases. Cassandra read performance is heavily dependent on the temporal patterns of your writes. Specifically, if you write to the same row over a long period of time, you will get bad performance.
And whenever someone shows you a benchmark claiming this or that database is faster: all of these benchmarks are pretty much useless, because they didn't actually run the benchmark for two years before measuring things. They ran it for like a week or something, and that's not relevant; that just shows you what Cassandra will do for the first few months. So don't believe the benchmarks.
Also, there is still a bunch of esoteric problems working with large-scale Cassandra installations. I think the next few years in storage are going to be super interesting, because we've probably gotten roughly as much performance as we can out of using flash in the form of SSDs that still use SATA or whatever protocol. In order to really start reaping the benefits, we will have to come up with new hardware protocols to talk to storage.
We will suddenly have really cool new abilities when database products can go down and manipulate things in the lower layers, and we're in for a really interesting time in the next few years. And if you agree with that statement, you should totally come work with us, because we're going to have a blast: spotify.com/jobs. I think there's supposed to be a microphone somewhere around here if anyone wants to ask a question. There's a hand over there, and there's a man with a microphone; they seem like a good fit.
[Inaudible audience question.] I would say that, in general, if you have performance problems with Cassandra, you probably need to change your data schema. Specifically, in the case of the user row: if you're deleting columns in your user database, then what on earth are you doing? You should not do that. If you need to remove and add columns a bunch of times, you should probably serialize all of that data into some kind of binary blob, protobuf or JSON or whatever, I don't really care, and then put that into one column with a constant name.
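A minimal illustration of that advice in Python, assuming JSON as the codec: all of the churn happens inside the blob, so Cassandra sees one stable column instead of a stream of column additions and deletions (and their tombstones).

    import json

    # Instead of adding and removing one Cassandra column per attribute
    # (each removal producing a tombstone), pack everything into a single
    # blob stored under one constant column name such as "data".
    profile = {"name": "alice", "country": "SE", "beta_features": ["x", "y"]}
    blob = json.dumps(profile).encode("utf-8")  # write this to column "data"

    # Dropping an attribute is now just an overwrite of the same column:
    # no per-column delete, and therefore no tombstone.
    profile.pop("beta_features")
    blob = json.dumps(profile).encode("utf-8")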
[Inaudible audience question about compression.] It's been a significant difference for us in some column families, and we are happily using compression in other column families. So the way to detect that is to turn on compression and measure the performance difference. Once again, nodetool cfhistograms is your friend, because it will show not only the average access time but also the distribution, so you have a lot of data at your fingertips. Just notice: is this a problem for me or not?
[Audience question:] To handle this tombstone problem, have you looked at writing your own compaction strategy? The compaction strategy in Cassandra is pluggable, right? And since you seem to have submitted a couple of patches to Cassandra: have you looked at it, and what is your experience?
The question was whether we have thought about writing our own compaction strategies, and the answer is definitely yes. My awesome coworker Marcus is still sitting over there, and he'll turn redder every time I mention him. He has a plan for a very useful compaction strategy which is based on time:
it's for time series data. You basically just compact things that have happened concurrently, and that will make for very efficient usage with, specifically, time series data, because for any given time point you will never have to check more than one SSTable, if what you're storing is time series data. We have also thought about a bunch of different things.
Honestly, though, leveled compaction is a really, really clever thing; it solves a bunch of problems of size-tiered compaction. I don't think people in general understand how it works well enough. I'm probably going, at some future conference, to try and give a talk on exactly how it works, because I find that most people only know that it's something about sharding or whatever. It's a very good thing, and it has some drawbacks, but if you know about them you can often work around them, and it's a very interesting compaction strategy.
[Inaudible audience question.] Well, what we want to do there is basically replace all of the old data: we have data that we regenerate from scratch every day. If we would just insert it, the old data would have to be compacted away; because we are just deleting the SSTables instead, we can turn off compaction entirely, just write the data, and be happy, and that works very well for us, for that use case.
[Inaudible audience question.] That is a very good question, and I cannot give you a short answer better than: it varies. We have very many different services, and they all have different storage needs, and we really try to stay away from doing any kind of one-size-fits-all solutions. We have some data that lives only in one data center; we have some that is replicated to all data centers. We are moving towards making more and more data home-sited, that is, located only in the data center where the user accesses it the most, but we're not doing that a lot currently, because we tried to do it and we failed spectacularly. I don't think there can be one good answer to that: it depends on how many writes you're seeing, how big your data set is, and a bunch of other factors; also, of course, how much of the data in question is shared between people. So it's a really complex question, and I don't think there's one single good answer. Sorry about that.