Apache Cassandra Cassandra Summit 2013, 26 Jun 2013

Previous Meeting Next Meeting

⏯

youtube image

►

From YouTube: C* Summit 2013: Practice Makes Perfect: Extreme Cassandra Optimization

Description

Speaker: Albert P Tobey, Tech Lead, Compute and Data Services at Ooyala
Slides: http://www.slideshare.net/planetcassandra/c-summit-2013-practice-makes-perfect-extreme-cassandra-optimization-by-albert-tobey
Ooyala has been using Apache Cassandra since version 0.4. Our data ingest volume has exploded since 0.4 and Cassandra has scaled along with us. Al will cover many topics from an operational perspective on how to manage, tune, and scale Cassandra in a production environment.

A

All right well welcome. This is my talk about practice. Make extreme Cassandra, optimization I did not pick that title, but it is better than the one that I've chose. I hope you'll get a lot, I'm going to cover a lot of technical information, I'm going to move fairly fast, cuz, there's a lot of content a little bit about me. This is going to be short, I'm sure you guys are tired of hearing about all these different companies and their. How they're, special and unique I'll just say y'all is really awesome.

A

We've learned an awful lot about how not to manage Cassandra clusters and how to do it correctly. I'm going to talk a lot about that. It's been a process over the last couple of years of just gradually making incremental changes and making things a lot better or more colloquially, make it suck less I'm. Introducing a term that, I think, is new.

A

That kind, I think describes what a lot of us do in terms of performance and how you make Cassandra or any distributed system, or even a single system perform better and it's what I like to call being a harass titian, which is using heuristics to rather than being more scientific, which is expensive. It takes a long time.

A

I'll talk about that more in a little bit I'll talk about tools that I use that maybe you haven't seen before or didn't know how to use them correctly and then at the end, if I have time there's a lot of settings like I'll go more into depth about specific settings that we use at a yella. And then I got a couple little things about bowel system usage, how we're using btrfs and ZFS to make our clusters on store a lot more data run a lot faster, getting a lot more flexible about me.

A

I'm tech lead of the computing data services team at who yella, where a dev ops team, and what in that means a lot of different things to a lot of different people. But what we do is kind of the more pure DevOps and that we do the entire stack top or bottom to top hardware.

A

Software operating system database tuning all the way through and what we're building is a platform for yella to use, Cassandra and Hadoop, and all of these different large big data systems, so that all the engineers don't have to get concerned with. How do I setup Hadoop? How do I interact with Hadoop so we're building services in front of these systems so that all the developers can't have to do is just know where the services is submit their jobs and then forget about it?

A

We have about 100 Cassandra nodes, spread of costs eight to ten clusters. This is probably going to more than double in the next month or so, and then, obviously, the obligatory we're hiring make a char happy. We always found in 2007. Most of stuff is pretty boring. 230 employees, 200 million unique users. The important one on here is 2 billion, analytic events per day.

A

All that data, all those analytic events have to be stored somewhere they have to be processed somewhere, and we need to be able to serve the results of that data to our customers as fast as possible, and that's where cassandra comes in we've been with we've been with cassandra since 0 dot 4, it's been a rough ride these days, it's a lot smoother than it used to be thanks to datastax and ikuno, and other companies that are contributing to cassandra a lot more than we are.

A

We really appreciate all their efforts and stabilization and operational support, use cases that we have all of our analytical data. I have a slide on this in a minute, but basically gets batch processed in the results of all of our roll-ups and things get written into Cassandra so that those can be exported to our web interfaces for our customers to use cassandra provides us that very low latency ability to fetch values and chill them to our customers. We use also use it in various places, is just a highly available.

A

Keystore forget all of the columns or storage and all of that it's a really nice distributed key value store. So you just don't have to think about cache coherency or you know. Do I connect the memcache, d1 or memcache d to do I deal with hashing. You just forget about all that crap connect to any node in the cassandra cluster and your data is there. Well, we have an internal monitoring tool called Hastur that we use to track all of our metrics, so there's a stat CT style interface.

A

It's basically in place of where you normally use graphite, arm that that's currently receiving about 50,000 in metrics. A second writing that and do a cassandra cluster of a five nodes. We use it for play head tracking. So if you've used netflix and you watching a movie on your ipad and you pause it and then you shut it down and you go to your TV and you bring up the same movie, it will resume where you, where you left off. We use cassandra for the same I'm, pretty sure Netflix uses for that too.

A

Then, machine learning, machine learning data. So we have a bunch of ml stuff that we run in Hadoop that generates outputs that we store in Cassandra so that we can use things like geo replication to get it out to the edge that's an ec2 and do cross data center movement.

A

So this is our current analytical pipeline. This is kind of the more interesting bit so what happens here is our players when, when you're watching a movie sayin ESPN com, every few seconds will send a little update to us telling me telling us where you are in the movie.

A

You know if you're having buffering issues, what bitrate you're at a bunch of different little things like that, your IP address, so we can do geo-tracking things like that. So all that data goes and it gets HTTP push to our edge arm, which I call the loggers that data gets crammed into s3, which is a big black box on purpose, because that's how we treat it and then our Hadoop cluster pulls all that data down into our local HDFS for crunching, and then that writes into Cassandra and then so on and so forth.

A

Then the ABE service is basically just a front end to Cassandra that our web tier can access to get to the data. So here's a little Oh like oh right, there's something I was going to talk about there so that read-modify-write there I don't have access to my notes. So there's a weird setup here: read-modify-write is what I want to talk about first verse for performance.

A

So if you can dip model your data for Cassandra in such a way that you can avoid doing, read, modify rights and what that means is is when you go and you fetch a data from the database, you may be incremented or you add a number or you penned a string, and then you write it back in the same location, the same Roky same column, key arm, you're, going to have all kinds of issues with consistency, it's harder to do distributed systems on top of it because I now you have to do distributed coordination which is going to get easier and to all apparently, but also like, avoiding that just you can do a lot higher throughput doing things like we do with our metric system, which is entirely insert only and the read side is completely separate, so we can just jam in those 50,000.

A

Second, we don't even think about synchronization. We run five instances of the service and we just don't have to worry about any kinds of performance issues. There. There's no sharding, there's no there's no a consistent hashing at the service layer. It just writes into Cassandra and it doesn't matter which one does the work, so it kind of an example. What happens when you do read, modify write so I came up with this example. So I was thinking about trying to think of an example and I thought.

A

Well, a good thing to model would be how much my team is going to drink their moments. I whole team is here so I said that you know with Philip he he'll probably drink a lot. He's he's a beer guy, so he's got 12 drinks. We got Kristoff he's in London, nobody in London drinks, right Kelvin, doesn't really drink a lot. Frank's Frank's, a nice guy. He drinks a few drinks every now and then and so on. So getting closer to the conference.

A

A couple days later, the mem table has been flushed to disk into an SS table and I want to make an update, so I go and I read the value out, maybe or maybe I just override it and I say I update like I'm not going to drink that much because I have to talk on Wednesday and Wednesday night. I got to go home, so I'm not gonna, drink. Anything and Philip tells me he's quit drinking. I berate him for his lack of dedication and then update the column, value and then closer to the conference.

A

I decide. Wait hold on I'm gonna be out drinking with all you, people, so obviously I'm going to drink a time. So I update my value, but instead of just updating it I actually pull the value out and I I read the value out and I add 20 to it and that I write it back.

A

So now that's sitting in a mem table in Cassandra there's an SS table on disk that has the old value so now you're back into conflict resolution territory and there have been other talks about conflict resolution not going to go into that right now, but that what this means is now you have it. You have to compact before you get totally consistent right. You have two values in your database system until this until all of the compaction stuff fires. So by avoiding read-modify-write you just don't even have that problem.

A

You're, your Compassion's are going to be a lot smoother you're, going to have a lot less conflict resolution stuff to deal with and that's kind of the lesson there and then, finally, after a while, it all gets compacted down and that's what it looks like when it's all done. So, if you have a lot of hot data, one of the one of the patterns I really push our developers to use is to do that. Insert workload to the point where we actually do stuff.

A

Where we do we waste work, so we like to insert the same value twice, sometimes and insert it over the same over the same key value, same set of keys, the same value and let kassandra's conflict resolution take care of all the overrides and it works really well for us so moving on, so that that's our old architecture. So this here's, a bit of the practice, makes perfect stuff.

A

So in 2011, when I started at yellow, we were running on Cassandra 06, all kinds of operational issues with it, but the most important one was is that if you've been with Cassandra for a while and you kind of how it works, it has this property of once. You start up a cluster and get it kind of working for the most part. It's easy to forget about. So we had this 18 node Cassandra cluster sitting there for years, and people just forgot about it.

A

People forgot to do backups, they forgot to do compaction because we were still on sized heerd. They forgot to do repairs and we have that rehab II read-modify-write workload and we had some deletes in there. So we had expired tombstones. We had all kinds of issues of the data, so what ended up happening is to get to zero dot, 7 and 0 to 8 and it being zero that eight.

A

We we looked at all the different options for reading data out of Cassandra with their tools, and there wasn't a lot of good tooling there and one of our guys said well, I'll just rip the data of the SS tables and in a big MapReduce job and write it back out to Cassandra and really that's you can do that and yeah. It would be easy right. Are you sure you know what that word means, mmm and you know he's one of these.

A

Just these wicked smart guys we have a bunch of running around, and so he went and used one to play with Scala, so you skali, he pulled the SS table classes out of out right out of the cassandra source, wrote a MapReduce job that what that could read the SS tables and right back into the front end of Cassandra. That way, the clusters are completely separate. This is all happening in production and we needed to do the whole migration with zero downtime.

A

um So what we did is it was like well if we could pull all the snapshots in the HDFS and just process them there. That would be really nice, but we didn't have enough space in the Hadoop cluster and no time to buy more city, so I said well. We'll just do this trick. Will export over NFS all the snapshots to the Hadoop cluster and do this big crossover everything mounts everything craziness.

A

It turns out the colonel. We were running at the time we've been on there for years and didn't have an NFS support, sort of all crap, and so we looking around like glossary FS point-to-point actually worked out great. So we did Gloucester FS point-to-point export the same mishmash mess and ran a MapReduce job across it completely saturated the network.

A

For three days and at the end of the day we had a nice new cluster with completely consistent data, we'd scrubbed out all the expired tombstones fixed up all the inconsistencies and rebuilt all of our indexes and that that's our first migration. If we do this again, so lessons learns there. Oh don't forget about your Cassandra cluster.

A

hmm Is you know doing those repairs regularly and things like that? If you keep, on top of it, you're not going to have these kinds of issues and have to do these kind of nasty migrations, it turned out for 06 07. We kind of had to do something like this anyway, but.

A

It's just a really interesting way to move data I. Think if you look at Priam it has some ways to actually do some similar kinds of things by ripping SS tables out of s3, and things like that.

A

So, while this transition is happening, I had the opportunity to watch this receiving cluster be under some extreme duress. You know doing 20,000 30,000 inserts a second which is higher than we'd seen before and was pretty high 40 days, so we ended up going to 08. We end up going to a 24 gigabyte heap, which is not recommended anymore I used to recommend it about a year ago. But what happens is if you don't have it a big enough cluster? You spent you're going to have a lot of frustration from your engineers.

A

What dealing with just really slow fetch it reads and writes, because the JVM is too sitting there doing GCS right, you're gonna have a hundred millisecond, sometimes 500 baby, sometimes a thousand milliseconds of GC time when you have that big of a heap just because it's got to scan all that memory. We move to java 16 moved to the newer colonel, because we were on a bunt too, and the Colonel's that ship with a bunt to just aren't as fast as what you can build.

A

26 36 had pretty good exif us some improvements in XFS, which we were using at the time, and then we moved to md raid 5 because we used to be on raids raid 0, but what we found was as we needed we needed to keep most of the space available, but when disk drives failed sure you can just go rebuild the node if it's raid 0, but the rebuild would take 2-3 days and we just couldn't afford to have that much risk exposure.

A

So we chose to go with raid 5 at the time and then the biggest thing. The most important thing that I really want everybody to remember is turn swap off on your cassandra boxes. If you don't have another reason like you're running other applications that require it turn it right off, swap off minus a is your friend you'll have more consistent performance.

A

You won't have as many WTF moments where you're just looking at and you just have these latency spikes and what the hell is going on and nothing's wrong with the system and when I log into the system, if I see swap going at all it's the first thing. I do always, if you can't disable swap I've talked to a few people that say well, we have applications that really require swap they. You know sometimes they'll just run a little over and we just don't want them to crash.

A

So we want some swap available at least set BM swap eNOS 21. So this is a lot of your sis admin's, but it's an SE sysctl comp I'll talk about it more in a little bit. But what this does is the linux.

A

Vfs cash is really really greedy and what it'll do is, if you've been using links for a long time. You remember the days when locate would run and you'd come to your box a couple hours later, and none of your would work because it would been all swapped out to your disk and you got to wait an hour for everything to come back. Sometimes you just turn the power off turn it back, Isis, faster right or you log on to log into a system.

A

You think maybe it's swapping, but you can't get in, because SSH has been swapped out. All these problems just go away. So what linux does it that causes us to happens? When you do a bunch of I/o letters will actually be trying to optimize that, for you and it'll go and steal as much memory from this free memory from the system as it can to make space for VFS cash, which is great, except that now it swapped out your sshd. It's starting to try to swap out your jvm and that's just bad news.

A

Inconsistent performance really slow and sometimes the box will just become unresponsive and never come back. If you set vm SWA penis 21, it tells the colonel it says I never want that. Stupid behavior just keep just you, take the memory, that's free use that, but never swap out my applications until unless you're under extreme duress more on that a little bit so this year we did the or toward the end of last year. In beginning this year we did this migration again. So I had gone off to another team.

A

We had some other people kind of take it on as like an extra duty to manage the cassandra clusters. They forgot about it again, repairs didn't get run.

A

We had to move to this new hardware because they all hard was going off lease stuff like that, and so we just said screw it we're just going to do the same process again revived it did the Gloucester FS p2p, but the big difference here was is instead of using our production, MapReduce cluster, we just fired up DSC on the new nodes use DSC MapReduce, which allowed us to basically segar, take all that workload off of our production.

A

Hadoop cluster run it on this new greenfield cluster and avoid all the issues with like over are overloaded, MapReduce cluster. This process works really well and you end up with a nice clean, the cleanly distribute II, don't have to worry about like snitch problems or replication problems. It's all just works, so we made a few more changes. We do that again. This cut this new cluster is different to hardware, a different different layout, and we it's time to tune again so we're on DSE.

A

We finally switched the packages we used to deploy with Capistrano, which is a nightmare.

A

We love to heap the same because we still had these massive bloom filters in memory. I'll talk more about tuning your bloom filters in a little bit and we did we stuck with the same distro on a bunt to lucid, which I definitely don't recommend the the whole job ran a lot faster this time. So instead of running on an 80 node Hadoop cluster, which is one of our production clusters, writing into this 36 note cluster.

A

We actually ran the whole thing on there, so it's doubling up its workload, but it still ran faster because of things like DSC. Cfs is a lot lower. Latency hdfs is by default. So it's it's really nice in that way, when you go and you upload a file to it, just goes BAM and it's in there because it's built on top of Cassandra instead of this HDFS architecture, that's loaded with latency and connections going all over the place.

A

So the mistakes we made, we moved back to raid 0 because we had problems with disks filling up past. The fifty percent mark against I steered compaction bidis again. I would never do again because, as soon as we did a couple days later, disks a couple discs fail on the cluster. Fortunately they were spaced out in the ring. Those nodes went down and we had to go and deal with Dell and our data centers remote.

A

We had to go and do all that stuff, which is why a lot of people love ec2, but the nice thing is if you're on raid 5 or age 6 you just don't have to. You can wait a couple days to deal with it and that's just really nice. So the from this point forward we're pretty much sticking with some kind of rate underneath Cassandra and I know, that's not generally recommended, but it just makes my life a lot easier when you're on hardware and then we're still being on a bun to lucid.

A

So more recent versions of I think also the Apache Cassandra packages and the DSC packages are built on debian six and then, basically, all the native bits that are built won't actually run on ubuntu, lucid though they don't match up with the lipsy.

A

We also took this opportunity to make few changes to our schema so when you're doing performance tuning one of the most important things, probably the most important factor in performance- is your data model in any database. I don't care if it's mysql, if it's berkeley, DB or if it's cassandra. If you get your data model right, you got a lot of overhead to get to come up with really good performance.

A

These aren't necessarily data model, but I just want to make sure I mentioned that compaction strategy move move to leveled compaction. If you love your ops, people I highly recommend sticking with leveled compaction. Even if sized here is a little bit faster. Leveled compaction is a lot friendlier to your operations.

A

People because you don't have that fifty percent threshold on disk, where you wear your size, tiered disk usage once it crosses fifty percent or two times your largest SS table or your largest column family, then basically you're stuck you've got to put more discs in the node or do these kinds of crazy migrations that we've done in the past. So if you're unleveled compaction, you only need enough free for two compact, two S's tables together. So in my systems you know I need about a gigabyte free so arm. That was a big change, bloom filters.

A

So this is changing a lot, especially with LCS and 12, in on words, but prior to that, we found that we end up running 24, gig heaps because we had these. These calm families where we'd have rose is what, with 15 million columns in them and those would just use up a ton of space for the bloom filters and I was talking to Ben covers in one days like yeah, you need to adjust your bloom filter and we did and we actually can drop down back down to eight gig heaps, so GC gets smoother.

A

The whole system just starts running a lot more consistency once you have the smaller heap size. So that's something to look at. That's what we use: zero dot, one, that's not a great value. It's just kind of a we just kind of just said: we didn't have time to really sit and play around with different values. So we just picked one that was basically effectively disabling the bloom filters see what happens. We started cramming all that data in and our false positive rates still really low.

A

So then, when you're tuning this, you really want to look at. If you look at node tool, CF stats you're going to see in there the bloom filter, false positive rate. That's what you want to look at to tell if you've got the right value, the default is 0007. I moved up to 0.1 and I still have just a few add a false positives, which is fine by me because I'm not io bone, the other big ones SS table size in megabytes.

A

So this is on a per column, family value, if I remember correctly, and what we found is the default of 5 megabytes. So if you have really deep Cassandra nodes, so the recommendation and tillery I don't know what it is right now, but you used to always be on IRC s like don't go over 200 gigs per node may be on SSD, it's a little different, but we're on spinning rust.

A

So what we found was is, with the five megabyte: we'd have 500 gigs on a node and that's a whole hell of a lot of SS tables on disk and when Cassandra starts up, it literally opens every single one of those files and em apps it. So we use a ton of file descriptor years um we run and we ran into limits with em maps, the Linux setting for max em maps and okay.

A

Right so the jb m and the current we're spending an awful lot of time, just managing all these file descriptors. These huge lists of files that were open. We saw like really high system load. We saw in just the ton of CPU usage on a load that really should just be very low, CPU in very high I. Oh, we turn on snappy compression, not terribly impressed and then we've set. We turn compaction, throttling right off, I've, been told by the day stacks guys. This isn't really necessary anymore.

A

The version that we were on we were doing the migration had some bugs in that, but what we found is a lot of people are still I see it on IRC are running into problems where compaction get behind. I. Think I might have an example of that later.

A

I have an example of that later of how to look for it, but what happens is if you're, not if you're compaction is get behind, then you kind of get into this loop, where you're inserting data most of us they're using Cassandra, have a pretty constant insertion rate right. It's not it's not a wave like this.

A

You just have this like 50,000, a second 24 7 wham into the database right, so compaction would get behind for us and we were doing these really high insert rates, and then you know the next one to come along and x1 come along and they just stack up forever until you just it was impossible to get out of that mess. We turn the limit all the way up. I said you know, this disk array should be able to do about 500 megabytes.

A

A second I turned it all the way up to you know a thousand megabytes a second and it still couldn't keep up turn it off. It does just fine right, so if you have linux tune, just the right way doing all that extra I, oh, isn't that big a deal most of it's actually happening in memory. There are a couple of settings you can play with to make that even more the case I'll time coming to, and this works just fine. So if you it might be worth something worth playing with.

A

If you're dealing with that issue, where you have really high insert rates your compaction czar getting behind on leveled compaction, it might be worth trying this especially on SSDs and then now we're actually working on yet another migration. But this one's not it because we screwed up time it's because we want to move from our old data center cage into a new bigger one, with more more racks available and all that stuff. So we're going to a much bigger cluster nodes of time.

A

I was supposed to take that number off there and again we have to do it with no downtime, so I'm going to move off of this, because I was wasn't supposed to show it.

A

So what we're moving at to is our new pipeline is going to be all those events. I talked about earlier. Those analytical events are going to be flowing into Cassandra in real time as they come in and then we're going to be running analytical jobs over that data to do successive aggregations directly out of Cassandra, so both read and write workloads are going to be pretty much equal in a few months.

A

Yesterday, I venchan talked about spark: that's what we're building so I'm building the bottom half of it he's building the top half and we're actually looking at doing amp, blab and blab just released a tool called tachyon for basically it's a front end for HDFS, because HDFS slate and see like I mentioned earlier, is really high, so they wrote a new in-memory HDFS cash that lets you do right through cash or right back, cashing with memory in it's all in memory, so we're actually looking at writing a Cassandra back in for that, so that we don't have to run HDFS in parallel with our Cassandra clusters.

A

So this is the next generation, so a lot of stuff that should be familiar to everybody in the Big Data community. If you're new to this stuff I'll quickly cover it. Kafka is a really cool tool. It's um if you're, using storm or Cassandra. Basically any of these big data stacks where you want to do real time streaming. Inserts Kafka is a really nice tool from LinkedIn open source. That, basically, is a message: queue for really high volume, rep messaging, so basically all those logs.

A

But get written real time as a request coming into calc into a casket queue, and then we have an in gesture that pulls off of those calf get queues and writes into Cassandra. What that lets us do is if we want to bounce that in gesture or have it down for ten minutes, Kafka will eat up all the rights, and that's generally, what you like message: keys for.

A

So that's that's. The overview of the architecture spark does all of our reads: we may had tachyon and later the ingest code is basically custom, Scala stuff that we wrote and that takes takes the data. Kafka does a very minimal amount of normalization and then writes into Cassandra.

A

So this is my intro into the next section. So I wanted to say something about engine, practical engineering and this video kind of says it for me. So just it's just a few seconds.

A

That's the business. A lot of us are actually in right, we're not building you know. A lot of us aren't going out and discovering new algorithms or building new hot hot snot, like computer systems, we're taking a bunch of components like Cassandra and Kafka and Scala, and bringing them together to build an aggregate service, that's unique, but from common parts off the shelf.

A

So we have it's a more practical than than pure computer science, and so, when we're talking about tuning, there's kind of a philosophy that I have that I've really worked hard to try to be able to share with people and the main point being is there's a lot more to tuning than just raw performance right. So the number one is security. It's always security. Your number one, that any kind of design decisions should always be security, even if you're turning it off, which a lot of us do with Cassandra. You turn off authentication.

A

You don't bother to do crypto, because it's a pain in the ass. You still need to think about it and it needs to be a conscious decision. um Then, after that, you know your cost of goods sold like sure it would be great if I could have a thousand node Cassandra cluster and ec2 on SSD instances. I'd never have to touch a damn thing right, but my business would would be I'd, be spending. You know fifty thousand dollars a month on that cluster or whatever it costs.

A

So you need to think about that and you're tuning parameters. You know what how much memory you choose, how much disk you choose, which kind of disk that all comes down to how much of the businesses money am I spending on this cluster?

A

How much are your Ops guys going to hate you or hate themselves? You need to think about that. So, when you're picking Cassandra it's a great database for your ops people write a note can fail in the middle of the night. They look at their phone. They go back to sleep. That's what I do that's, what a good ops person does with a distributed database, but for for the engineers in the room when you're designing your schemas, it might be tempting to mess around with your replication factor and say you know.

A

Maybe three is too many: I don't want to store the data that many times they're running rate under it. Why it's a big deal? Well, it's a big deal, because if it's, if your application factor is 2, that means your system in has to get up in the middle of night and fix that node, because if the other one fail, if the next node in the replica, if the nodes replica fails at the same time, they got it now, you got an emergency on your hands right. So just keep that in mind as you're.

A

Designing schemas, as you make messing with all these settings, is think about what happens in the middle of the night when stuff fails, because it always does it never fails during the day. It always fails during your kids baseball game or in the middle of the night, when you just having the best dream of your life.

A

You've all woken up in the middle of night, because something like that right so and then you know physical capacity do I have Rackspace, you know. Do I have, can I get the hardware so on reliability and resilience, I recommend everybody look up. John, all spas has blocked nice blog posts and a few talks about resilience and just in general, the philosophy of it I'm not going to go into that.

A

But you know going back to the replication and stuff think about reliability and resilience and then there's compromise, which is probably the one that that is the most frustrating to deal with. So you have different business use cases. You have a whole bunch of different engineers that have different opinions and how to build schemas and build databases and which one they should choose should be used. React should be used Cassandra. Should we write our own or have this like massive level d bees all over a hundred boxes?

A

You know you got to make compromises in your tuning. It actually has tuning. Your system needs to take all those compromises into account. So I think I got the hope we got the point across that you know it's not just the maximum performance that you're looking for you're looking for the the best of the best that you can do. They look not the lowest common denominator, but something a lot like it, hopefully not as depressing.

A

So I mentioned earlier a little bit about the this, the science of tuning. I won't spend a lot of time on this and bore you guys with my my philosophy, but the the video reason why I brought that up is because I'm not a scientist about it. It would be really great if I could spend a month in the lab, with cassandra playing with all kinds of different hardware and memory and and get it just running. Just super awesome before I show it buddy.

A

That would be awesome right, I've done it before other jobs and it's fun, but the reality is production comes first and you got to get the code shipped. So how do you tune a system to the best of your ability without actually going and often having a lab? So what I'm, trying to bring up this term look at this Harris Titian, like you got to use heuristics you got to use your gut. You got to use your education to make decisions.

A

You know when I'm choosing hardware, I know that you know quad channel ddr3 is going to do a lot more bandwidth than dual channel and if I imbalance, the banks and I don't have to test that like it's, just that's how the hardware is so you can make that decision without actually testing. It is what I'm kind of getting at and your your brain is really good at her. This kind of stuff right. This is what we do better than computers and for now anyway. um So you should.

A

You should learn to use that ability if there's a really good book by Malcolm, Gladwell called blink. That's all about this kind of stuff and how gut reactions will often beat clinical studies and scientific, very measured control group, and you know, being very careful, he's careful studies and that they did studies with doctors and things where they found that you know you could go through this procedure.

A

That was designed in a lab for treating heart patients, but they found that if they just had the doctor make a quick snap decision out of his gut that outcomes were better. There are a lot of factors to that, but I recommend the book highly and then it back to you know. The practical approach is just concentrate on bottlenecks. Look at your cluster in production and look at what's what's slowing you down is: are you doing a lot of I/o is? Does it look like your memory? Bus might be over saturated.

A

If you see one process, if you open up top and you hit H and you can see all the threads and there's one thread on top doing, one hundred percent CPU and all the rest of them are idle. You probably have GC problems and I just wanted to mention this one real, quick, this is kind of getting popular in the ops side of the world. Is the OODA loop. This comes out of the military, there's a really good, Wikipedia article on it. If you want to google it billy says this is my slight variation.

A

Why I'm tuning? Is you know, II observed production systems under load under your load, because it's the only thing that actually matters. Nobody gives a crap. If you can make Cassandra superfast in the lab and then you put it production and your production load crushes it right so tune with your load. If you can replicate it in the lab great, but a lot of us can't we don't have the resources, we know I'm the money, but at the time, so you have to observe your production clusters in production.

A

It make changes safe changes that you're confident about. Maybe you test those on a small set up an ec2 for a couple days, shut it all down and then try it. If you're not confident, you are confident or you're crazy, like I. Am you just make the change in production on one or two nodes and then roll through? And you can make these changes with confidence, because you just get good at knowing what the health of your cluster is. I won't spend any more time on that. So that's! Basically what I'm coming to is.

A

You know I love, testing, shiny things, I love play with these settings. I love getting an extra one percent out of my cluster and I know I'm kind of on the outside with that. But um you know, linux kernel has come a long ways and performance in the last few years, if you're on sent os6, your kernel is missing. Probably twenty percent performance overhead on a lot of things. The memory subsystem has gotten a lot better. The I/o subsystem has gotten a lot faster, a lot more parallel.

A

Most of the file systems have gotten better I've run into weird bugs, like XFS has had this bug a while back with. If you set Alex eyes to anything it would all so you could set XMS up so that when you, when Cassandra, goes to do a right on esos table right, it opens up a file and it goes to the right.

A

The first bite under the covers XFS would actually go if you set this value to like Alex I is equals 64 Meg's, it's a really cool, optimization right because it goes and allocates the whole 64 Meg chunk under the covers right. As soon as it gets the first byte written to the file system, so you don't follow the success of streaming right out of Cassandra.

A

Just there's you don't have the overhead of going and finding an open block on the disk, so it can just stream it all out as fast as possible, except there is a bug and that it would, if you were only Road out like five megabytes, that 64 make it would still leave the 64 Meg's allocated on the files system and your disc could fill up like in a couple days, especially in LCS I, like playing with different Linux distributions. I, don't typically do that in production, but you know it's something that comes into play.

A

A lot of people just pick sent to us because that's what the organization uses, or they pick a bunt too, because it's cool and popular with developers, but there are things you can do with those distributions by changing out some of the the components you know. Looking at what jbm my shipping. Obviously most of us are shipping our own JVMs.

A

For that, the colonel it ships with it, which lib see is it and how is it compiled and how much performance can I extract out of that, because the JDM depends on it very tightly playing with different file systems wearing ZFS on Linux in production. Now and it's it's awesome, you get most of the advantages of ZFS and it's in kernel, and it's fast as hell. Btrfs is really interesting because it's in the linux kernel it ships with most of the distributions. Now it has some of the features of ZFS.

A

It doesn't have any of the performance features, but it does have LGL compression. It has support for ssds with trim commands. It does its own version of raid 0, that's built into the file system, so that the file system is aware of where the blocks are distributed and can do optimizations that you can't do with these successive layers of lvm and md raid and the disk device is abstracted so far away that the fallston has no freaking idea what's going on and then you know, different JVMs is: does it openjdk viable?

A

Yet it's not, but it's worth trying every few months just to see because they are advancing pretty quickly and then I do I test all these things in production. None of the things I'm running production today started out in the lab. All of them started on a production cluster and here's. How I do it so Cassandra actually is perfect for this kind of crazy stuff right. So you have a new kernel. You want to try, say: 33 11 comes out and it's got some really cool optimizations for memory memory, bandwidth.

A

What you do is you just take advantage of the ring, so you have um you, have your ZFS filesystem between to ext4 your well-known, stable, while working nodes- and you know you that your replicas, you have replicas elsewhere in the ring, so you don't have to worry. If that note completely crashes in shoe data, which hasn't happened to me yet, but it could then I have a backup. I can always rebuild that node back to baseline. So you know this is one of those things that comes up with these new distributed systems.

A

Where your data is highly replicated. You can make these changes in production I think without having to resort to building a lab, because it's never going to be to the scale I mean some of you may have the resources available to do that. But most of us don't so. This is this is where thinking about talking to your peers, about, because they're not going to be comfortable with it. Most people aren't, but it does give you a lot of ability to move forward quicker. A little more.

A

So now we're moving the tools, which is how you stop that this I was going to do one of these charts, but brendan greg did when that scale this year. You can find this online if you search for it just search brendan Greg tool, chart or scale 11, X and you'll find it. The link will be in the slides when they post them, but this chart has every tool that you really need to do this job.

A

It has a few under that aren't really common DTrace on linux isn't really widely deployed yet, but things you on vmstat pits that mps that especially d step for one of my personal favorites, the having this around just helps you when you're looking when you know which part of the subsystem you look at this chart has most of what you need, I'm not going to go into all of it.

A

So this is D stat I leave this thing running on everyone in my system. So when I fire up my workstation and I start screen, you ever notice that your lights dim every once in a while. That's when I start up my screen session, because it logs into all of my systems automatically- and it starts this- D stack command, D, stat-, lrv, n, 10, now, d stat has a bajillion options to it. It can pick up system stats from just about anything and it's written in Python.

A

Writing plugins is something like 60 lines of code and then what it does. It just sits there and the 10 means that every one of these lines changes every 10 seconds. They update every second, regardless of what you say on the command line, and so what you can see here is when I'm looking at this from my production systems, what I see in this middle part, where you see kind of a lot of yellow over there on the right, is that's where our MapReduce load is running and writing into the cluster.

A

So I can see about how long that's taking I can see it right from the system. I don't have to look at anything else. I can see it all right here. I can see that you know my cpu the green line. That's idle cpu there that never really gets too crazy. There's a fair amount of I'll wait, but it's writing reading and writing at a fairly high rate. That's I'm pretty comfortable with these stats, but I can see how much VFS cash I'm using.

A

I can see this all at a glance, so I can flip through the cluster control, a n flip through 36 nodes and just quickly look through and see d stat tuning on a single system, absolutely the best linux performance tool. There is, in my opinion,.

A

So this is my / tool. I wrote this the time github under github com, /, tobert, pearl ssh tools, terrible name for it, but this of all the tools in that repo. This is the one that I use the most often and what this is is I started from a tool.

A

Colin razor wrote a long long time ago in Perl that basically would read the proc net proc netdev file on Linux, which actually has all this really detailed information about what your network devices are doing, but I wanted to add a disk stats and I wanted the load average and then I wanted to build. Do it at a whole cluster at a time. So all this does is.

A

It opens persistent SSH connections to every note on my cluster based on a cluster list and starts reading those files every two seconds, and this thing updates every two seconds. So I can see in real time how much network traffic I'm doing, how many discs I ops and what the load average is, which gives me a rough idea of how busy the CPUs are and then I can see an aggregate.

A

What my clusters are doing so I've had this thing up to you, know 300,000 megabit, on our news on our new stuff, just to splint and I can see all what nodes are busy and I see that something like this. This line down here this. This green line is showing me that I gotta know that's not doing anything. What the hell is going on. I should probably go look at that as Cassandra down is the MapReduce stuff failing.

A

You know that there's something wrong, so I use this for every kind of cluster we have Hadoop Cassandra react anything I'm working on that has more than one node. This is how I watch them and I'll leave this running. 24 7i. This is the old standby.

A

If you're having I OSU's, where you have a raid array, especially software raid, and you have- and it's just slower than it should be right, especially raid 0 raid 5, not so much with raid 10, but it can still help they're, usually what's going on, is you have one bad drive in the raid array right? If you have hardware raid, this stuff is hard to see and generally it'll fail at out after a while, for you automatically it's built in that black box software.

A

But if you're running md raid, what this will show you is, you want to look at the only important thing used ESET for all these other numbers, but for the a way the REO A&W a way. What you're looking for is no particular value. What you're looking for is a different value. So if you have six discs in a raid array- and this thing is ticking on every second or two seconds- and you see that there's the weight is higher on onedrive than all the others and sometimes they'll even be an order of magnitude.

A

It's probably a bad drive and you should either wipe it and reformat it and see if it comes back to normal or you should just throw the damn thing away and get a new one, H top. So if you're, using top, you should probably in cell H top it's a better top than the one that comes with Linux. It shows you all the threads by default. So you can see GC problems right away, because there will be that one thread on top: that's doing one hundred percent CPU, while all the other ones are idle.

A

That's that's almost always GC issues that goes for most java apps and I really like the cpu bars. This thing is infinitely configurable you can put all kinds of crap on there. I work on too many systems, so I just use the defaults jconsole or if you're using oracle jdk, you can use visual vm. It's a lot of the same stuff. Visual vm is just a much nicer version of jconsole, but jconsole is in every GDK, including OpenJDK and I took a screenshot of this screen, specifically because this system has a to its x.

A

Mn size is way too small. So you see the sawtooth going on. That's because what's happening is is filling up the eden space and then in GC fires, and then it goes all the way back down and as soon as you hit that GC, that's where you get screwed on latency. Now it's pretty fast but like if I want to see this get a lot smoother is so that's what we're working on we're tuning a lot of these clusters.

A

Ops center is pretty handy. This is that big cluster, again I thought that was pretty fun to show I love the ring view, because these bring these balls and ops center will get bigger and smaller, depending how much data is on your node. So it'll show you imbalances in your cluster really quickly. It's nice and visual and it looks really cool on a dashboard at the office. To just tell people you wrote it and they'll think you're think you're a rockstar.

A

Everybody knows about node tool ring it's the first thing you check if you're working on a Cassandra cluster and I just kind of highlighted, an interesting value. We built this cluster just really quickly to do some quick performance testing and there's one node that owns 11.6 % of the data. That's going to be a hot spot, guaranteed.

A

No tool CF stats, I, don't get into these commands too much, because I do most of my tuning from the system side. Looking upwards and I, don't really look at the database that often list there's something specific that I'm looking for and ill see. That's what I'm looking for is.

A

If that SS table count is really high than I might look at tuning the SS table size and megabytes to get that lower and then the bloom filter, false positive ratio that I mentioned earlier that that you'll want to tune if you're, if you're, using a lot of heap, your mem tables are flushing way too often you can either increase the heap or you can tune that downward and there's another one. There's an index interval. You can tune down to save a lot of memory. Datastax guys have some good docs on these settings.

A

I won't go too far into it. Proxy histograms, I, don't know what the hell this says.

A

It looks really cool, though, when a busy cluster compaction stats is where you look for, if your compaction, you think your compaction is getting behind. This will show you that really quickly. This is from my are metric system, the 12 in 50k a second. These were running about two minutes apart, so I can see that it's making progress on compaction, but it's it's a little bit behind it's not where I like it I'll, probably tune the I/o subsystem to get that or grow the cluster to bring that down.

A

So stress testing tools, I'm almost done so there's time for questions, Cassandra stress, obviously, ships with Cassandra. It's got a million parameters. You can fire that thing up. You can tune it. You can get that thing to do, workloads that are very similar to yours. If you spend enough time with it and so I recommend learning how to use Cassandra stress really spend some time with it. We run it production some good production clusters. Sometimes it's dangerous, so we'll do it in times when we're all it's all hands on deck.

A

So if something goes wrong, we can fix it, but it's a good tool to get used to, so that you can. Basically what you know when you bring up a new cluster, just burn it and make sure everything is functioning before you put load on it. It creates its own key space. It's real easy to clean up afterwards. I had been able spent much time with it, but yahoo has a benchmark called YC SBS on github. It looks pretty neat: it's a multi, no SQL database benchmark, we'll probably published in numbers up from our clusters.

A

In a few weeks, like I said I like to use production load distress test stuff wherever possible. If you're running DSC pterosaur is kind of the standard Hadoop benchmark, that's a good one to run it when you're burning, a new hid, dupe and/or DSC clusters, and then my personal favorite is homegrown stress, testing tools. So it's they're not that hard to write a lot of times. I've noticed with benchmarking tools. Maybe you have to a lot of the benchmarking tools.

A

It takes you months to learn the darn things and by the time you do it's very specialized information. It's not useful for anything else or you could just write the code to generate the load that you want and that might take you a half a day and then you have absolute control over what it does and measuring latency and things like that. Just isn't that hard.

A

So it's something to think about and make a case for for yourself if you're, confident in doing that, I will talk to people about this later so sysctl comp the slides come out. This is my default settings. I put this on every machine before I test anything because the Linux defaults are broken. Just keep that in mind. The defaults that Linux ships with our for Linus Torvalds workstation under his desk at his house there and he's not shy about that. So you need to tune stuff. Dirty ratio.

A

Dirty background ratio are the ones you want to read the most about the colonel pit, max 1. That's just what I do for fun. It actually doesn't fix anything. It just means that now your pits can go up to a million, so if you're processing processes are flying by really quickly, they don't roll over. As often so. You get a nicer PS listing a lot of people still put tuning information, RC local. There are a few things that don't really have a good place elsewhere.

A

The bottom one is the most important one below the bottom two, so the scheduler one where it says cfq cfq, is the default on most Linux distributions. Today you should real. If you're not running any other applications on your cluster, then you should. You should probably look at deadlines. Deadlines is almost always better for latency than cfq is especially on drives other than SATA drives.

A

So if you have fast, SAS drives ringing switch to deadline, you'll get better I Oh immediately and then the if you're running raid 5 the striped cache size I'm not going to go into that. But by bumping that to 16 k from the default of 128, we picked up about 100 megabytes, a second I'm raid, 5 jvr margs, real quick, don't mess with the defaults. If you don't know what you're doing with the JVM, but the big one that you want to tune is the xmm the new size.

A

If you datastax, recommends 100 megabytes, /, /, actual physical core, that's actually a really good recommendation and the other one we I've been experimenting with. Is the JVM support for Numa, so apache cassandra scripts shipped with newman control, interleave built-in? So if it sees the Numa control tool installed on your system, it'll use it to basically make sure that memory allocations are balanced across newman nodes.

A

If DSC doesn't do that or at least it didn't last time, I looked, but this X X, minus X, X plus use Numa means that the JVM will become Numa aware and will automatically balance its allocations and talk to the Linux kernel until it wants that memory allocated and I believe that the GC is also aware of Numa. At that point, too, I am running out of time, but so see groups I'm going to say that it's something that, if you're a systems administrator, you need to learn about. Absolutely.

A

You want this on every system that you manage, because it gives you much finer grained control over resources on Linux, then you've ever had before, so that stuff I talked about earlier with linux systems becoming unresponsive or just you know, you have weird scheduling problems right. The linux scheduler is notoriously just pain in the ass, so by turning on cgroups, you just give it a lot more information.

A

You can say these threads go together or these processes go together and it's on a per task basis, which means threads and processes, and it's all just a file system interface in the kernel. It's really easy to work with with shell scripts, and things like that. I'll probably be blogging about this in a couple weeks about how to use it without all the LXE madness.

A

There's more. This is kind of how I do the default setup of see groups. This is office to be in the slides. I just put this in if you're running, Apache Cassandra they'll be on debian, there's the etsy default Cassandra file or on d SEO, the etsy fault DSC, or for a red hat it'll, be like Etsy, sis config, something like that, and it's just a shell blob. So you can actually put shell commands in there.

A

So what I'll do it and it's running as root, so I'll actually put the sea group set up in that file which is kind of a dirty hack, but all this the child processes inherit the sea group. So you can actually do this and get put Cassandra in 1c group and I'll actually run two clusters on the same hardware and just separate them with C group resources and they can coexist just fine, no virtualization, necessary.

A

And I'll take comer meds, so we're running btrfs in production, we're almost done rate we're running raid, 0, btrfs and ec2 on the femoral drives. It's really nice. um We turn on lgo compression, which gives you a lot more compression than then snappy will ever give you I believe the LZ 44 stuff will get pretty close it's in within striking distance. So if you're on Cassandra one but two you might want to look at that. But btrfs is really interesting.

A

It's not considered stable and you a distro community yet, but there are like fedoras starting to ship it and allows you to install root file systems with it. Bug too is looking at it and I've been running it for about a year and a half now and never had a crash on it.

A

They say that the the big corner cases around filling your disk and stuff, but I just tend to tend to tune my systems to stay with the hell away from those kinds of corners cases in the first place and then finally, this is my last kind of show. Until item is ZFS on Linux, so ZFS on linux org.

A

These guys, Los Alamos labs, hit, took the ZFS code from open, solaris and because lustre had actually switched entirely over to ZFS, they ported it to Linux so that they could continue to use Linux, and then they successfully built back up all the layers of ZFS on Linux. So it all works fine with the Linux kernel. It's not the fuse stuff. It's actually quite fast. It's not a hundred percent of what it is on solaris yet, but it's pretty damn good and it gives you all the flexibility to do crazy.

A

Things like I didn't put it on here, but there's a setting where you can do. Zfs set sync equal disabled, which is a really fun performance trick, because it basically means ignore, f sink like it never happened. Tell the app it did, though so Cassandra will go and say.

A

Like ok, I wrote this data to disk now f sink, make sure it's written to disk and don't return to me until it is difficult, and then it will write it to disk later and it'll I now, I'm not recommending that you do that, but occasionally occasionally it is really fun, and then you also have compression there's snapshots built in the filesystem btrfs.

A

Has that as well, which can be handy even though cassandra has snapshots, and so finely tuned is multidimensional, like I talked about there's all kinds of different parameters that come in and performances, not by far not the most important one stability security. All these things, all our big factors. Do you have the money for it. um The bus saying no I, like the same. They like saying that production load is all that matters in my mind, like I, don't really give a crap how fast something is in the lab.

A

When I put my load on it, will it survive? Will it wake me up in the middle of the night? Those are the kinds of things that you should be thinking about. Not how fast can it be? You just need to make it fast enough, and Cassandra generally gives you that by default and then lean on Cassandra like it's a really great database, it's super stable in it and it's distributed. So you can mess with a single node. You can kill minus 9, Cassandra and just see what happens and I do this in production.

A

Sometimes just you know to keep Phil on its toes.

A

You should run experiments, and just you know, if you want to be scientific, do outside of the system, keep a spreadsheet and do a/b testing and then no one metric tells you the whole story. You know cpu load, I'll load, all these things have interplays and you can tune for cpu load and the obvious on your iOS is sub system is screwed and you really need to balance all these different things to really have a really well functioning system, and with that I'm done is my contact information.

A

I do tweet a lot about Cassandra and then I might get hub there. If you, I have a lot of just related to cassandra and then the pearl ssh tools, and then you can email me if you have any questions. Thank.

A

Does anybody have any questions to ask.

A

B

A

Like when, were you certified okay.

B

Great presentation.