Description
Ooyala has been using Apache Cassandra since version 0.4. Their data ingest volume has exploded since then and Cassandra has scaled along with it. In this webinar, Al will share lessons he has learned across an array of topics from an operational perspective, including how to manage, tune, and scale Cassandra in a production environment.
Speaker: Al Tobey, Tech Lead, Compute and Data Services at Ooyala
Al Tobey is Tech Lead of the Compute and Data services team at Ooyala. His team develops and operates Ooyala's internal big data platform, consisting of Apache Cassandra, Hadoop, and internally developed tools. When not in front of a computer, Al is a father, husband, and trombonist.
A: Welcome to this edition of our Cassandra community webinar series. I'm delighted today to have with me Al Tobey. Al is the tech lead at Ooyala and has been using Apache Cassandra since the very early days, and there are very few people in the world who have as much experience with Cassandra as Al, especially running systems in production. So Al is going to pass along some great lessons learned around extreme Cassandra optimization. Al, thank you very much for joining us today. We will pass you the ball — take it away, Al.
A: Oh, one more thing — I forgot my little housekeeping item. As always, we will be taking questions via WebEx; use the WebEx Q&A panel. If something is very contextual, Al said he doesn't mind being interrupted during the flow, but generally we will reserve the last 10-15 minutes of the session to take your questions. Okay, and with that, take it away, Al.
B: We provide analytics and a full video solution to our customers, but that's not what you're here for, so I'll start with what I'm going to go over: how not to manage your Cassandra clusters. We've learned a lot of lessons over the years working with Cassandra and making various mistakes.
We know how to fix those situations now. I'm going to talk a little bit about performance tuning and how I approach it — not as a scientist, but in what I call a heuristic way — and then some of the tools that you need to really do the system side of performance tuning: tuning your Linux machines and making everything flow a lot smoother. I'll go over a few other things at the end.
We have about 100 Cassandra nodes — actually, it's closer to 200 now. We just turned on a new cluster that's 114 nodes. And obviously Ooyala is always hiring, just like everybody else, and I already talked about Ooyala, so if you're interested in that stuff the slides will be out soon. We've been with Cassandra since 0.4 — that was before my time; I came in at about the time we were running 0.6. We use it for our analytics data; we have a Hadoop system that processes all of our raw logs.
Our analytical product is fast because when it goes to get those statistics, they just come out of Cassandra — you know, something like 100 milliseconds; it's really better than that, but there's network stuff in the middle that makes it more complicated. I'll come back to that. We use it in various places as a highly available key-value store, as opposed to a memcached or Redis. We've replaced Redis in a few places with Cassandra simply because we needed the high availability. We use it for time series data.
We have an internally developed monitoring system that currently writes about 50,000 to 150,000 inserts per second into Cassandra. I mentioned playhead tracking, which we also do for some of our customers — if you're watching on Netflix, I believe they actually use Cassandra for the same purpose. Part of what the player reports back to us is where you are in your video, and we record that in Cassandra.
So if you go to a different device and pick up the same video, it'll pick up where you left off. We also have some machine learning stuff; all the output of the machine learning system gets stored in Cassandra so that it can be used by our edge infrastructure, which has five nines availability.
What I've been describing thus far — our primary and original use case for Cassandra — is now what we call our legacy platform. We have a new platform coming out toward the end of this year where we're offering third-party analytics, so you can use it with other people's players. But this older one has been around for a long time, pretty much the entire history of the company. If you start in the upper left-hand corner of the diagram, we have all the players everywhere.
All of the players across all of our customers report that information to what we call our loggers. All that data gets stored in log files and put into S3, and our Hadoop cluster runs a MapReduce job that sucks those files down into HDFS. Then our pipeline fires up, processes the log files, and writes the output into Cassandra. You'll note the label I have on that arrow between the orange Hadoop boxes and the blue Cassandra boxes: read-modify-write. You can translate that and just say "evil."
That's getting better in Cassandra 2.0, which I'll get to in a few minutes. Then we have a service that sits in front of Cassandra with a Thrift API that basically knows what our schema is and abstracts it away from our edge web applications. So, read-modify-write — here's the problem with read-modify-write in Cassandra, and as I said, when Cassandra 2.0 hits the ground there's going to be CAS support that actually has a built-in way to do it safely. I see a question: HDFS vs. CFS — it's just legacy.
Here's an example. At the Cassandra conference that was earlier in July — I took a lot of my team to that conference — I said, well, let's talk about how much we're all going to drink, so that we can coordinate and make sure that at least one of us is sober the next day.
What I have here is a column family where each of our names is the row key — and I'm going to close the Q&A for a second so I can get through this. So the row keys are the names: Al, Evan, Frank, Calvin, Kristoff, and Phillip. Then the columns are on the right, where I have Tuesday as the column key and the number six as the column value. So I just fire all this data into Cassandra.
And then I say, well, you know, I don't really want to drink that much because I've got a talk on the second day of the conference, so I'm going to update this. I go and write over the same value — same row key and column key — and update it with a new value. Now Cassandra is holding two different values in memory.
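As a rough sketch of that overwrite pattern — expressed here in CQL for brevity, with a made-up keyspace and table — the second insert simply writes a newer cell version over the first:

    # Hypothetical illustration only: same partition key and column, new value.
    # 'demo.drinks' is not from the talk; it is invented to show the shape.
    echo "
      INSERT INTO demo.drinks (name, day, count) VALUES ('al', 'tuesday', 6);
      INSERT INTO demo.drinks (name, day, count) VALUES ('al', 'tuesday', 2);
    " | cqlsh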
If you do this really fast with modern Cassandra, it's not such a bad thing, but you have the potential for race conditions where a value hasn't replicated yet. So you need to do this with read repair at a hundred percent, and you've got to do a bunch of tricks to make sure you get consistent values — like I said, CAS fixes this. So now let's say that my memtable has been flushed out to an SSTable, and now my memtable has a different value.
What this does is a couple of things. Until compaction fires, I've got all three of those values in the Cassandra system, and there are a bunch of side effects of this. One of them is that you have a lot more work for compaction to do as time progresses. If you have old data that you just consistently update over a long period of time, you're going to be compacting files that otherwise wouldn't need compacting. Ideally you want your older data to sit at rest.
That's where you're going to get the best performance; that's why things like time series really kick butt on Cassandra. After compaction, this all compacts down and you've got a nice clean SSTable you can read from. Our system does this every 20 minutes, forever, so it just causes a lot of extra work on the system. There are other ways to approach writing to Cassandra and designing your schema — I highly recommend Patrick McFadin's talk on schema design, and there's another article.
If you look on Twitter, I'll retweet it later — it recently came out on the DataStax website and covers how compaction really works. I highly recommend that as well if you're interested in these things. I'm going to move on, but the gist is: if you can find a design for your software that avoids the read-modify-write cycle, that's where you're going to get the best performance, even with CAS down the road.
CAS is going to be a lower-performance option than straight write-only design patterns. So around 2011, when we were doing the 0.6 to 0.8 upgrade, we had some problems, and it's all Cassandra's fault — and I mean that in the best possible way. What happened is Cassandra was just ticking along while our MapReduce job was hammering it like crazy every 20 minutes, and it just worked. We had an 18-node cluster; it just sat there and cranked along, and people just forgot about it. Literally, they just forgot.
It was just there, so repairs didn't get run — and that used to be really, really important; it's not as bad today. None of the maintenance stuff was done, backups weren't really running, and we kind of got backed into a corner where we had this five-nines system sitting there that we had to service in place,
while the bus is hurtling down the road at 70 miles an hour: figure out a way to clean up all the data, get it into a new cluster, make sure we scrubbed all the old data and got rid of all the old tombstones, all in one pass, and do it without taking anything down — we couldn't schedule downtime in the system. So what we did is we played this dirty trick. Because the kernel on those systems was old and didn't have what we needed,
we ended up just using GlusterFS point-to-point mounts, which actually worked really fine. Say what you will about GlusterFS — it has its problems — but the point-to-point mode is occasionally very handy for exporting a filesystem from one cluster to another, if you just ignore all the distributed-systems part of it. So what we did is all 18 nodes exported their Cassandra data filesystems to all 80 nodes of our Hadoop cluster, and it was this big cross-matched mesh.
Then we ran a MapReduce job that went over all of the SSTables. We pulled the code out of Cassandra itself — one of the beauties of open source is that we were able to do that — read the data out of the SSTables, scrubbed it using things we know about the data that even Cassandra didn't know (because we know the business logic and what the schema really means), and then wrote it back into the front end of a new Cassandra cluster.
The other thing we did: we discovered that our indexes are manually built indexes — we're not really using secondary indexes today — so for our inverted indexes and things like that, rather than copying them, we just scanned back over all the data and rebuilt them. As a bonus we got all of our indexes cleaned up and straightened out. This took about three months from start to finish; most of it was developing the software and testing it multiple times beforehand.
We moved from — I think at the time it was a 16 GB heap — to a 24 GB heap. I don't recommend that other people do that anymore, especially with Cassandra 1.2 and upwards, because of off-heap caches and all those things; most people should be able to sit at eight gigs. Although, if you are running into problems where your Cassandra nodes are crashing when you're loading huge rows, it is something to consider trying. We updated to the latest Java 1.6.
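For reference, the heap knobs being described live in conf/cassandra-env.sh; a minimal sketch, with example values rather than recommendations:

    # Sketch only: pinning the JVM heap in conf/cassandra-env.sh
    MAX_HEAP_SIZE="8G"      # the "most people can sit at eight gigs" case
    HEAP_NEWSIZE="800M"     # new-generation size; usually scaled with CPU core count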
I think it used to be on OpenJDK, which is a really bad idea — at least OpenJDK 6; OpenJDK 7, if you're really adventurous, can be done, but it's not recommended. We moved to Linux kernel 2.6.36 — at that time, that was a custom in-house kernel built from upstream — and we moved to MD RAID 5 on XFS. So, a word on RAID: while Cassandra is a distributed system and has its own replication, one thing that's really nice about having RAID underneath it is that your ops
people will be a lot happier. The reason why is that disks are the most likely thing in any distributed system to fail. I've been working in data centers over the last 15 years, and by far nothing fails more than hard drives, so it's really common for a hard drive to fail in a larger cluster. If you have RAID 5 or RAID 10 — really anything but RAID 0 — underneath your system, you don't have to rebuild the whole node every time a disk fails.
Today, if you're on RAID 0 and one of the disks fails, that node is offline; you've got to replace the disk and then rebuild the node — Cassandra does that just fine. So it's up to you and your business to decide whether you can live with that exposure: if it takes 24 or 48 hours to rebuild that node and you're OK with that, then great, go with RAID 0 or go with JBOD.
But if that's not OK and you need maximum availability, consider using RAID 5 or ZFS, which I'll talk about later, and protect your database from single-disk failures — your ops people will be a lot happier. Then, by far the most important tuning thing — and I recommend everybody take this to heart — is that if you're dealing with any kind of database — Cassandra, MySQL, Oracle, even MongoDB — the most important thing you can do on any Linux system is disable swap entirely.
That tells the Linux kernel to never swap out my applications to make space for VFS cache. Let me explain that a little more: as you read files off of the disk, Linux will load all the pages from those files into memory as an optimization, so that if you go back and read the same page again, it can come straight out of memory rather than going back to disk. This is how we get good performance out of hard drives.
What Linux will do — and a lot of people have noticed this over the years; a lot of the old-school sysadmins you'll still see will disable things like locate/updatedb because of it — is that those jobs scan over all the disks, the page cache wants more memory for that, and Linux will actually swap out your applications to make space for VFS cache. That's totally wrong in my opinion, but it does work really well on desktops, and that's what the defaults are all set up for.
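A minimal sketch of what "disable swap entirely" usually looks like on a Linux database host (assuming root; this is the common approach, not necessarily Ooyala's exact config):

    sudo swapoff -a                                          # drop any active swap immediately
    echo "vm.swappiness = 0" | sudo tee -a /etc/sysctl.conf  # never page apps out for cache
    sudo sysctl -p                                           # apply the setting now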
So, moving on: last year we decided to expand the system again, going from 18 nodes to 36 nodes. I had gone off to do a different project for a little while, so I wasn't operating the Cassandra clusters; I'd handed them off to some other people and they got distracted as well. Again it was just sitting there ticking along, so people forgot about it. We wrote notes down in a document saying you need to run repair, and there were scripts in place, but at some point they failed and we ended up in the same boat.
Again, it's pretty embarrassing, but that's what happened, so we took the opportunity. We already knew what to do; we just did the same process again — GlusterFS plus point-to-point — except this time we threw in a twist. Instead of using our production Hadoop cluster, which runs at about 115 percent busy all the time (I realize that's a silly number, but if you look at our Ganglia graphs, that's actually what it says), we actually used DSE MapReduce this time.
Since it was a Scala job, it was really trivial for us to load up. We loaded up DSE 3.0 and ran the MapReduce there; that way only the new cluster had resources being spent on running the MapReduce job, and it was actually writing back to itself. That worked really, really nicely — we were doing about 20 gigabits per second of transfer from one cluster to the other. It was really fun to watch, and we got an opportunity to do a whole bunch of performance tuning.
One of the problems we ran into was that, prior to moving to leveled compaction, we were on size-tiered and we ran out of space — we crossed that fifty percent threshold and people got really nervous. We convinced ourselves that RAID 0 was a good idea, and that's why I spent time talking about it earlier. We moved to RAID 0 — big mistake — because I swear a week later I lost two drives on two different nodes that were right next to each other, and I had to scramble down to the datacenter.
I grumbled, dragged my team down so that we could get this all fixed up, get the drives replaced, and rebuild those nodes before we had an outage. It was all right — we've had pretty close to a hundred percent uptime with this cluster — but that's the lesson learned: if it's really important that you have five or more nines, which Cassandra will happily do, just put RAID under it. The other lesson was that we should have gone up to Ubuntu Precise.
B
Don't
want
to
listen
to
really
long
in
the
tooth
now,
not
as
three
LCS
is
old
and
also
some
of
the
native
stuff
in
DSC
three-point.
Oh,
won't
even
work
on
there
because
it
was
compiled
on
debian
six,
which
is
a
perfectly
reasonable
choice.
So
that's
just
one
thing
is
Bunty
precise
or
onwards.
Debian
six
is
good
or
Ralph
six
and
I
think
is
supported.
So we made a couple of config changes when we did this load. We switched to leveled compaction. An important thing to remember about leveled compaction: if you love your ops people, or you want them to not hate you, I recommend using leveled even if it is a slightly lower-performance option. If you're doing large-volume writes, size-tiered compaction can still be a higher-performance option, but leveled compaction is always going to be a lot easier to operate,
just because you don't have that space constraint where you have to have 2x your largest column family (or keyspace — yeah, I think it's column family) free to be able to do compaction. The other change was the bloom filter false positive chance, and I know they've changed the defaults in recent releases.
I haven't revisited this in the last 12 months, but what happened was that value was set to 0.007, I think, by default, and it uses a ton of heap space in the JVM. That's why we had to have such large heaps — those really big column families would consume a ton of space for bloom filters. The other thing we ran into, and I've seen a few people on IRC run into this, is the default sstable size in megabytes in Cassandra 1.1.
The default is all fine and good, but when Cassandra starts up and opens and mmaps all those files, you start to reach the edge of what the JVM and even the kernel can support reasonably, and you spend a lot of time in the kernel while Cassandra is just doing bookkeeping on all those files, even if they're not being accessed. So I recommend increasing it — I'm running at 256 megabytes and my systems perform really nicely.
We've tuned things so that our memtable flush rate is pretty steady, so it works really well for us. It's probably too big for a lot of people unless you're on SSD, so maybe 128 megabytes or 64 megabytes might be better for you. It's definitely something you should decide based on your environment's needs.
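A hypothetical sketch of those two per-table knobs in CQL3 (Cassandra 1.2-era syntax; the keyspace and table names are invented for illustration):

    echo "
      ALTER TABLE analytics.events
        WITH bloom_filter_fp_chance = 0.01
        AND compaction = {'class': 'LeveledCompactionStrategy',
                          'sstable_size_in_mb': 256};
    " | cqlsh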
We also switched to enable Snappy compression. Thus far we have been fairly happy with it, but I've actually gotten better compression out of using filesystem-based compression — that's something to consider if you're willing to experiment with filesystem technologies. And in cassandra.yaml we disabled compaction throughput rate limiting. That was because, as we were doing this huge data load — I think the total was 30 terabytes across from the old cluster to the new cluster —
we got really far behind on compaction. The first thing I tried was setting compaction throughput to, say, a thousand megabytes a second — which is more than the array can do — thinking that would take care of it. That actually didn't do it; it was still consistently behind. So on the next try I rolled through the cluster again and disabled it entirely, and then, boom, all the problems with compaction just went away.
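The same throttle can be flipped live with nodetool (0 disables it); compaction_throughput_mb_per_sec in cassandra.yaml is where it persists across restarts:

    nodetool setcompactionthroughput 0   # 0 = unthrottled; give a MB/s value to re-enable the limit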
This time it's not because we screwed up; it's because of good things: we built this brand-new 114-node cluster, we decided to run DSE on the entire cluster, and we're hoping to replace our old Cloudera cluster with it. That's yet to be seen, because we've got a lot of stuff going on right now — we're even doing this migration again. There are a couple of problems with these migrations, and we've been working really closely with DataStax on migrating this data.
We don't want to run that big MapReduce process again because it's just a lot of work for us, so we're actually doing a different kind of migration. We're working on a post about how we're doing that — it will probably be on the Ooyala blog in a few weeks — and we have to do this with no downtime.
The reason we're doing this is that we bought a new cage — I believe with Equinix — and the new one is much bigger: we have newer racks and newer networking and all this stuff, and the old one costs a ton of money. So we're going to deprecate that, shut it down, and move everything over. So we have another migration, and that's going to move us to DSE 3.1 at the same time. We also have a couple of new use cases coming. We have those events that I mentioned earlier.
I don't have a diagram of this yet, but for those events coming from our players to the loggers, the new architecture has the loggers forward them in real time to Kafka, which is a really nice queuing system written by LinkedIn and open-sourced. From Kafka they're pulled down into a custom ingest system written in Scala that pulls in those events,
processes them, does a little bit of massaging — just lightweight normalization — and inserts them directly into Cassandra in real time, updating two or three indexes in real time as well, so that all of our raw data ends up in Cassandra. Our goal is about three years of retention. That's basically the gist of the new architecture.
Cassandra is really good at this kind of thing, where you have a pure insert load — there's no modification happening in the database after it's inserted, it's write-only — and the ingest can fail independently of our query system. It's nicely separated; it's a nice design pattern, I think.
So we have this nice platform where that arrow up to the upper right-hand corner is the demarcation between what my team manages and what the development teams manage and develop, so that they don't have to deal with all the gory details of how to set up a Cassandra cluster, set up Spark, and connect to it. All of that stuff is abstracted away, so we really do have a platform as a service.
Now a little bit of time on performance tuning in general, and then I'll come back to some more specifics. Tuning for performance — there's a lot more to it than just performance. As the slide says, you've got to think about a lot of different dimensions, and I run into this a lot where people forget — or maybe they just haven't run into it before — but it's really important to consider security first. It should always be first.
You don't want to compromise the security of your systems to get one percent more performance; it's just not worth it. You've got to think about the cost of goods sold: I can build you a cluster that can do two-millisecond response times under extreme duress, but it's going to cost you a few million dollars. So you need to think about that — what's my budget, do I spend more time on tuning, or do I just go buy more hardware?
If I have money and no time, then I spend the money; if I have lots of time and no money, then I spend the time — that's just one of those trade-offs. And think about your operations people. If you're a software engineer, you've got to think about — well, if I design it this way, if I choose leveled compaction versus size-tiered, if I'm doing a write-only design or I'm doing read-modify-write — all of that has an impact on your operations team and your DBA teams, whoever's operating Cassandra for you.
And if it is you, you should care even more. Those are things to consider, because they've got to deal with the fallout if the system gets loaded. If you're on size-tiered and they're new to Cassandra and don't know about compaction — about doing major compactions — then you get backed into corners. So coordinate with your operations people.
Go read all the DataStax posts about compaction and these things and how to set up for clean operations, and make sure your operations people are on board and that you're considering their work life — because if you help them, they'll help you. I've been in operations for 15 years; I can tell you that's always true. When developers come and ask me or any of my colleagues and say, hey, I'm doing this new thing,
I'm really thinking about doing it this way, and these are the trade-offs — when you discuss it and everybody agrees, you're going to have a much easier time going forward. Another big one is developer happiness. There are some choices you can make in schema design and tuning settings that will give you a little bit more performance, but if you just go after the performance and you don't consider the fact that it's going to make future developers insane, it is not a worthy trade-off.
Obviously: how many racks do you have? Are you in the cloud — Amazon or Joyent or whatever? Those are decisions that you've got to take into account; what's available goes back to cost. These things are all interconnected: reliability and resilience. Three-node clusters — if you've got a small shop and you're trying to contain costs, three is the bare minimum.
Ask what your SLA is and make the decision based on that, rather than a wild guess or a Google search. And then always be ready to compromise. As I've been saying through these slides, these things are all trade-offs. Performance is awesome — I love tuning for maximum performance — but there always needs to be that compromise where you pull it back and make sure that you keep it secure and available and maintainable enough.
So, as I mentioned earlier, I'm just kind of playing with this word and passing it by people to see what they think, but the way that I approach performance tuning — and not everybody does it the way I do, but most of us in the ops trenches end up doing it this way, whether we like it or not — is this: I would really love to be scientific about it. I want a lab with hardware identical to production, and to be able to just set up clusters,
tear them down, try all kinds of different stuff, run load tests, track all the numbers in a spreadsheet, be very scientific, and make the best possible decision. The reality is that almost none of us have the time or resources to do that, so we have to rely on heuristics and educated guesses. That's the big difference in how I do performance tuning — and not being afraid of doing it
that way. You'll hear people get on the soapbox — which I'm doing now — and say you have to be scientific about it, do clean measurements and get all the noise out of the system, and it's just BS except in really large environments that have the resources. So what I recommend is leaning into the database: it has replication in place, so you can do things like make changes to single nodes, observe the performance over a couple of days, and back out the change.
And the quickest way to achieve better performance in this kind of situation is just to go after your bottlenecks. Look at your applications, see where the latency is building up, do traces through your system if you can, and go after those first. If you have one particular insert load that's got high latency, or you've got a read that's taking too long, go after that
sucker. Use Cassandra tracing, look at all your systems, and see where it's hanging up; generally you can get to pretty acceptable performance rather quickly just by going after that. That's basically where this whole approach comes from, and a great way to start with all of this.
If you have non-production systems, obviously do it there first, but nothing is going to behave the same as your production systems. Most of us have non-production systems that are scaled back — they're smaller, sometimes down to single nodes. I don't recommend relying on that, because it changes just about everything about how the system is going to behave. And yes, keep iterating. On the same subject: I really love testing shiny things — it's kind of what I do. I mess around with new kernels when they come out.
The latest 3.x series — the 3.10 and 3.11 kernels — have a lot of really cool features that benefit Cassandra, things like transparent huge pages, which will make your large JVM heaps a lot more efficient in terms of how many pages the kernel has to track. The filesystems have all been improving steadily over the 20 years that Linux has been in existence; they're constantly fixing bugs and performance regressions — and sometimes giving us new ones.
I've had some experience in the past with ZFS, and it happens that there's a native port of ZFS to Linux now, so we're experimenting with running that in production — it's really awesome in terms of management. We've also got some nodes running on btrfs, which is like ZFS's little brother: it has a lot of the similar features, and one of the really neat ones for Cassandra is that it has built-in RAID subsystems.
I think now you can do RAID 5 and RAID 6 underneath btrfs without using MD RAID, but the more important feature is that you can use the filesystem compression. We started with this before Snappy and LZ4 were available in Cassandra, and I've found very, very good compression rates using LZ4 through ZFS or btrfs under Cassandra.
We've done some things with OpenJDK 7; it's just not quite as reliable. If you're really passionate about using the entirely open-source stack, it'll work, but like I said, DataStax doesn't recommend it, and I've noticed that the Oracle JDK is just a lot more consistent. And with all of these things — you know, not everybody has larger clusters and has this luxury —
but if you have 20 nodes, you can take one node and try out btrfs on it. I don't recommend doing it on more than one node at a time when you're first getting started, but you can just replace that filesystem with btrfs, rebuild the node, and then watch and see what happens and get some experience with it. Rather than saying "ooh, scary new filesystem," throw it on one node — you've got a distributed database, so if something fails, who cares, just rebuild it and go back to ext4 or XFS or whatever your poison is.
I was going to build this chart and then I realized it was already done for me earlier this year at SCALE. Brendan Gregg at Joyent is probably one of the world's experts on DTrace, and he gave a really nice talk there. I didn't get to see it, but I got the slides afterward, and you can see the link here.
I highly recommend printing out this chart if you're not familiar with all these tools already, because almost all of them will make your life better if you're debugging systems or performance problems in production. My favorites — I'll come to those. My very favorite tool for Linux system performance monitoring is dstat. It's written in Python, and plugins are really easy to write — I haven't had to write any yet because it's got support for just about everything.
"dstat -lrvn 10" is kind of what I do in the morning: I fire up screen, it logs into all my clusters and fires this command up automatically, so I can flip through the screens and see the last hour of activity. I actually run those at 60-second intervals. What happens is the line you see will update every second, and then it flips to a new line every 10 seconds — which is that command-line argument — and you can see everything on a page.
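That invocation, as best it can be reconstructed from the talk:

    # load average, disk I/O requests, vmstat-style columns, and network traffic,
    # printing a new line every 10 seconds; run it under screen/tmux to keep history.
    dstat -lrvn 10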
You can see your disk I/O happening, your network I/O, your CPU. Another one that's kind of more advanced is context switches, which can be indicative of a lot of performance problems. On the left-hand side you have your typical load average and processes, and then your memory is in the middle there, where most of the green is.
This next one is a tool I wrote a number of years ago, and I just keep dragging it around. It's on GitHub, and I haven't seen many tools like it. What it does is log into all the systems with a persistent SSH connection, rip the performance metrics out of /proc on Linux every two seconds, and compute this display. It just updates every two seconds and rolls the screen; it's not fancy at all.
It's something I've probably spent a total of about four hours on over the last five years, but what I've found is that when I'm tuning large distributed systems, having a global view like this is really helpful for seeing what's going on. When you spin up load on your Cassandra cluster, you can see the network I/O take off; you can see how that falls down onto the disks.
The load average gives you kind of a loose idea of how busy the CPUs are, and those totals on the bottom are just really fun — when you've got total network traffic on your Cassandra cluster of 20 gigabits a second, that's just kind of cool. So, like I said, it's on GitHub; I'm probably going to push it to CPAN just because I've had a lot of complaints about installation, and I'll update GitHub when I do that. Now, on tuning a RAID system — I've got about a few minutes
left, and I want to leave time for questions. When you're tuning a RAID system especially, or even when you're on JBOD, one of the most common things to go wrong and cause performance to go belly-up is that you have one drive — especially with spinning rust — that's starting to go bad. It hasn't gone bad yet, but you'll see the latency on that one drive start to climb, and the best place to see that is with iostat.
"iostat -x 1" is what I use, and what I'm looking at is the await and r_await and all those numbers on the right-hand side. I don't really care about the throughput necessarily; what I'm looking at is the latency per drive, and I'm not looking for any particular number — what I'm looking for is outliers. You'll see, you know, I've got one there that's like 5.6 and 4.4, but if I see one in that same group that's up at 100, then I know
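The command being described, with extended per-device stats once a second:

    # the await / r_await / w_await columns (milliseconds) are where a single
    # dying drive shows up as an outlier against its neighbors.
    iostat -x 1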
that drive is the problem. Volumes in EC2 are wildly variable in latency, and when I spin up new clusters there, one of the first things I'll do is actually test all the drives with dd — I'll spin up something like four or five times as many as I want to end up with and cull all the slowest ones right away — and I've had very good luck getting good performance by doing that. A lot of people don't know about htop, and I
think that's crazy, because it's way better than the default top that's installed on systems. By default it breaks things down by threads — you can see here what we've got running, and it shows all the processors — and it just looks really neat; your colleagues will think it looks cool. It has a bajillion options for configuring the display exactly the way you want. Check it out
if you haven't used it before. JConsole and VisualVM are the other two big tools that really help you tune Cassandra clusters, or anything to do with the JVM. You want to take a look at these and start to get an idea of what healthy and unhealthy garbage collection patterns are — that varies by cluster — but what you're really looking for is that you don't want to see stuff like this, where you're filling up your new-generation size and then it's going down.
OpsCenter is really great — we really love this view and we keep it up on dashboards. With the 114-node cluster it looks really awesome there. I'm working with DataStax to get some of the bugs worked out, just because there aren't that many clusters of this size. We're really happy with the tool, and my engineers are very happy with the schema browser — it's quite handy for being able to go and figure out what data is in your database.
Obviously there's nodetool ring. I'm going to skip ahead real quick now so we can do questions. cfstats is another one to learn to look at. It takes some time to learn what all these values mean, but the interesting ones are, you know, the SSTable count, which can be useful when you're looking at whether compaction is behind, things like that, and the bloom filter numbers that I mentioned earlier — this is where you can look and see how much space you're spending on bloom filters.
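Both of those — SSTable counts and bloom filter space — come out of the same command:

    nodetool cfstats    # per-column-family: SSTable count, bloom filter space used, read/write latencies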
If you see that the false positive count is zero and the false ratio is 0.000, but there's a bunch of heap being used for bloom filter space, you should probably tune the false positive chance upwards. proxyhistograms is something that we talk about on IRC a lot in terms of performance tuning and figuring out where all of your latency is. Now, compactionstats:
if you see that the I/O subsystem on your servers is being thrashed, this is a good one to look at to see where the I/O is coming from. Very often it's just compaction that's causing the I/O thrashing — memtables write straight out to disk as streams and go very fast, even on spinning media, so it's usually compaction that's causing the I/O to be saturated. Stress-testing tools: I'm not going to go into these, but cassandra-stress is a really nice tool, quite easy to use, and we're experimenting with it.
Obviously, as I said, production is the best place for that. TeraSort on DSE is something we've used for performance benchmarking; it's a really good way of figuring out how your MapReduce is going to perform. And then we have some homegrown tools — I have one on GitHub, written in Go, that basically just tries to hammer the database as hard as possible.
These sysctls: kernel.pid_max is just something I like — it's kind of a vanity thing, but if you're running MapReduce jobs it basically keeps your PIDs from rolling over too fast. It's more cosmetic than anything, but I really like it and I haven't found anything it breaks yet. The rest of these are really nice to have on any Cassandra cluster. vm.dirty_ratio and vm.dirty_background_ratio — I recommend reading the kernel documentation on those before you mess with them. vm.swappiness — put that in every system you have, even if swap is disabled.
That way, if somebody comes along and decides to enable swap, it won't screw everything up. fs.file-max and vm.max_map_count are the important ones if you're running at large scale, and then those TCP settings are just kind of the defaults that you should put on almost any server, just to give the TCP subsystem in Linux a lot more memory to work with — it's more sophisticated than that, but that's the short version. Almost done. rc.local — I put these things in there, but the quick and dirty ones are around the I/O scheduler: CFQ.
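A sketch of the /etc/sysctl.conf entries being described; the values are placeholders to show the shape, not Ooyala's numbers:

    vm.swappiness = 0              # never page applications out in favor of cache
    kernel.pid_max = 999999        # keep PIDs from rolling over under MapReduce load
    fs.file-max = 1048576          # lots of open files for lots of SSTables
    vm.max_map_count = 1048575     # room for mmapped SSTable segments
    vm.dirty_ratio = 10            # read the kernel docs before touching these two
    vm.dirty_background_ratio = 5
    net.core.rmem_max = 16777216   # give the TCP stack more buffer memory
    net.core.wmem_max = 16777216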
The CFQ scheduler is really good if you want to use cgroups; if you don't, deadline is probably the best throughput option still on Linux. You can echo "deadline" into that second line from the bottom, /sys/block/*/queue/scheduler. Then the stripe_cache_size thing at the bottom is really important if you're doing RAID 5 — I recommend Googling it and reading up on it — and then there's nr_requests. I'm going in totally random order here.
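The runtime versions of those tweaks look roughly like this (device names and values are examples; they would typically go in rc.local to persist across reboots):

    echo deadline > /sys/block/sda/queue/scheduler       # per-device I/O elevator
    echo 512      > /sys/block/sda/queue/nr_requests     # request queue depth, value illustrative
    echo 8192     > /sys/block/md0/md/stripe_cache_size  # md RAID 5/6 stripe cache, in pages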
So, tuning is a multi-dimensional thing; you've got a lot of things to consider. Production load is going to tell you more than any other benchmark, and lean into the database, because it really will take good care of you. I've had zero data corruption failures — I've never lost a bit of data — and all of the outages we've had have been our own fault. So thank you; there's my contact information, and hopefully we can get through some questions.
A: [question]
B: I prefer software RAID because it's just easier for me to manage. Hardware RAID can have some advantages over MD RAID — batteries, for one — but it also has a higher cost, and since the database does such a good job of doing linear writes and all these things on its own, we've never had a real problem with performance on our software RAID. So that's why. If you like hardware RAID, go for it; it's just an extra cost that I don't need.
That's a little bit longer of an answer. If you're really curious about that, I believe the recording of Evan Chan's talk at the Cassandra Summit is online — if you go to the DataStax website you can find it. I'll try to find the link and tweet it a little later, but he talks about that in depth. We're using that for all of our new analytical systems.
A: [question]
B: For us, we want that compaction to happen really quickly and be over with, because our reads are coming in pretty much all the time. For our workload it works really well; we haven't seen any issues. On our newer, bigger cluster we're running into a few more interesting problems, a lot of that just due to the size of it. But no, we haven't seen any issues — and that's because we spent a lot of time really tuning our I/O subsystems to run as fast as possible.
A: [question]
B: Yeah, that's a schema-level setting — sorry, I forgot to mention that. I think it's when you define the column family that you set it; it's an option. I have some gists on GitHub, and I'll tweet out links to all the stuff I'm mentioning a little later.
A: [question]
B: We're kind of waiting on that. We want to wait for 3.1.1 — not because we don't trust DataStax, but just because we have a lot of really important data and we're doing some migrations and merging some clusters, and it's a lot easier if they're all on the same version. So once we get done with all those merges, then we'll go back to upgrading to DSE 3.1.

A: Great. [question]
B: It varies by cluster. Our older legacy cluster is at, I think, about 200 gigs per node right now — two or three hundred. On the new cluster we're going to be pushing it a lot further; that's going to require DSE 3.1. Actually, we've had a lot of varying issues with having really deep nodes — I think the recommendation generally is to keep it under 200 gigs — but yeah, on average most of our larger clusters are around that per node, and the smaller ones vary from a couple of gigs up to, you know, 40 or 50 gigs.
B: We haven't had a lot of trouble with that in our primary production systems, mostly because they're fairly mature and most of them are doing very small reads. Most of it comes down to what I was talking about earlier: having that collaboration with your Cassandra operators, if that's not you, and really thinking through making sure your software is designed to do the right thing in the first place — obviously, in a perfect world, that's what would happen. The Cassandra tracing stuff will help with it.
A: [question]
B: There have been a couple of times where I had queries that were crashing nodes because they were trying to read, you know, 20 or 30 gigabytes of data in one query over Thrift and just filling up the heap, and it just happened that we stepped through the software manually and found it. I don't know what tools there are for that, just because we haven't needed them.
B: Like I mentioned, I recommend RAID if at least five nines of availability is important to you, and if you have a separate operations team that has to service the disks — if you have a physical data center and a different crew that actually goes and does that work — putting a RAID subsystem in is the primary reason to do it. It just gives you a little bit of a buffer. Disks fail all the time — I mean literally: on our new cluster,
B
It's
only
a
couple
months
old
I've
already
had
five
discs
fail
right.
So
having
raid
in
place
just
protects
me
for
a
little
bit
longer
indicates
the
disk
failure
and
provides
an
extra
level
of
redundancy.
It
wastes
a
little
bit
of
space.
What
the
reality
is.
You
don't
want
to
fill
up
three
disks
anyway,
so
raid
five
works
fairly
well,
the
md
raid
5
is
pretty
it's
fairly
performant
raid
10
is
going
to
give
you
the
best
performance
and
that's
why
rajat?