From YouTube: Community Webinar | Large nodes with Cassandra
Description
Starting with version 1.2, Cassandra has made it easier to store more data on a single node. With off-heap data structures, virtual nodes, and improved JBOD support, we can now run nodes with several terabytes of data.
In this talk Aaron Morton, Co-Founder and Principal Consultant at The Last Pickle, will walk through running fat nodes in a Cassandra cluster. He'll review the features that support it and discuss the trade-offs that come from storing 1TB+ per node.
A: Hello, everyone, and welcome to this week's Cassandra community webinar. I'm delighted to welcome back Aaron Morton, the co-founder and principal consultant of The Last Pickle. He is also a committer on the Apache Cassandra project, and extremely well known in the Cassandra community; if you've ever been on IRC or the mailing list asking a question, odds are Aaron has answered one for you. So welcome back, Aaron. Just one piece of housekeeping: if this is your first webinar with us, we'll be taking questions at the end of the session, so please use the Q&A tab inside of WebEx and ask your question there, and we will get through as many as we can at the end. So Aaron, exciting times. I believe The Last Pickle is expanding and business is good.
B: Thanks, Christian, and good morning to everyone. As Christian said, I'm the co-founder and principal consultant at The Last Pickle, where we help customers deliver and improve Apache Cassandra based solutions. We're all DataStax MVPs; I'm a committer on Apache Cassandra, the maintainer of the Hector library, and a committer on the Apache Usergrid project, and we're based in New Zealand and America.
B: What I want to do today is talk about large nodes. Now, when we're talking about large nodes, it's important to have some sort of context for what's large; after all, we're supposedly dealing with big data. A few years ago I started saying, as a rule of thumb, don't put more than 500 gigs of data on a node. Initially this was just talking about EC2, and the reasons had to do with a little bit of what it was like to be running on EC2.
B
So
what
sort
of
throughput
could
you
get
on
the
networking
and
the
performance
of
your
discs?
It's
bound
up
in
a
bunch
of
operational
concerns,
and
these
are
the
things
that
are
going
to
talk
about
today
and
how
these
operational
concerns
have
disappeared.
A
bit
over
the
intervening
years,
but
in
general
we're
talking
about
nodes
with
over
500
600
gigs
of
data
per
node.
Nowadays
we
can
talk
about
node
in
the
one
to
three
terabytes
of
scale.
B
That's
our
framework,
but
we're
also
talking
about
nodes
that
have
more
than
1
billion
rows
per
node,
so
a
billion
rows
over
all
of
your
tables.
Now
that
could
be
a
case
where
you've
got
multiple
terabytes
of
data.
You
can
also
pretty
easily
get
to
billions
of
rows
per
know.
Just
by
having
lots
of
small
rose,
they
might
be
recording
something
like
website
hits
or
website
moles,
or
something
like
that.
You
can
pretty
easily
get
to
a
billion
rows.
B
And
over
a
billion
rows,
close
1.2,
there
are
fewer
operational
concerns.
Most
of
them
are
still
there,
but
they're
there
decreased
in
terms
of
their
impact,
but
it's
important
to
understand
what
why
we
were
concerned
about
these
things
in
the
beginning,
because
it's
good
to
have
an
understanding
about
why
changes
are
made
in
Cassandra
and
some
of
these
operational
concerns
are
still
there
and
you
could
grow
fast
enough.
You
will
still
run
into
them.
B
So
look
at
those
issues
we
had
prevalent
1.2
and
we'll
look
at
some
of
the
work
around
that
we
had
just
to
give
us
some
context
about
why
changes
were
put
in
version
1.2
and
beyond,
and
we'll
look
at
a
couple
of
the
issues
coming
up
in
2.1
that
are
going
to
improve
things.
Even
further
memory
management
was
always
a
big
isn't
was
always
a
big
concern.
There
are
some
memory
structures
in
Cassandra
that
grow
with
the
number
of
rows
and
the
size
of
data
that
you've
got
to
note.
B
The
first
one
we'll
look
at
here
is
bloom
filters,
and
these
are
probably
the
most
well-known
and
Rhys
understood
data
structure
that
we
have.
We
use
these
to
test.
If
a
roti
exists
in
a
particular
assess
table,
it
will
tell
us
either
that
the
routine
definitely
does
not
exist,
or
it
does
exist
with
a
certain
probability
that
that's
false.
We
hold
this
in
memory,
potentially
just
a
bit
set,
and
by
keeping
any
memory
we
dramatically
reduce
the
amount
of
disco
yogi
has
to
do
so.
B: Internally, our bloom filter was implemented as a two-dimensional array of longs; again, we treat it just as a bit set, this was just how it was implemented. If we look at the memory usage for this guy: along the bottom here we've got millions of rows, and we hit a billion rows at the right-hand side. The grey line indicates when the bloom filter FP chance is 0.01, which is the default when we use the size-tiered compaction strategy.
B: At that setting, it will use approximately twelve hundred megabytes of space just to hold the bloom filters. If we're using the levelled compaction strategy when your column family is created, the bloom filter FP chance defaults to 0.1, or ten percent, and you can see on the red line there that's approximately half the size that we have for the 0.01 value; so it's still about 600 megs, all stuff that we have to keep in memory.
B
We
also
have
compression
metadata
that
we
have
to
store
in
memory
now
when
we
take
your
data
and
compress
it,
we
need
a
map
that
tells
us.
Oh,
this
chunk
of
uncompressed
data
actually
starts
at
this
position
in
the
compressed
data
stream.
So
when
we
take
the
offset
for
the
start
of
your
row
from
the
index
component
of
ESS
table,
we
know
where
the
book
the
size
of
this
depends
on
the
size
of
the
uncompressed
data
and
it's
held
in
memory
as
well
again
with
implemented
as
a
two-dimensional
array
of
roms.
B
The
size
of
this
depends
on
the
amount
of
data
will
try
to
compress
and
to
a
degree,
the
compressor
that's
in
use
and
the
size
of
the
chunks
that
we're
compressing.
If
we've
got
a
terabyte
data
using
the
snappy
compressor,
we
can
expect
a
couple
of
hundred
megabytes
of
compression
metadata
again
all
suffer
has
to
be
held
in
memory
and
can't
be
released.
B
We
also
have
index
samples
so
in
our
SS
tables,
on
disk
that
the
three
most
important
files
for
each
SS
table
are
the
data
component,
the
bloom
filter
in
the
filter,
BB
and
the
index
of
primary
index.
This
is
your
row
keys
and
their
offset
into
the
data
DVD
component.
Now
in
memory
we
hold
a
sample
of
every
128
keys
by
default,
and
this
essentially
gives
us
a
skip
list
or
an
index
over
this.
B
This
was
implemented
as
an
array
of
Long's
and
an
array
of
bytes,
a
2d
array
of
bytes
for
the
routines,
probably
version
1.2.
It
was
implemented
by
holding
the
objects
that
actually
those
right
to
get
busy
realized
into
a
decorated
key,
but
in
1.2
it
look
like
this.
This
is
a
bit
easier
to
get
your
head
around
again
stuff.
We
have
to
hold
em
memory.
B: Once you get above that, there's extra work that the ParNew collection process that works on the new heap has to do, because it has to look at all of the old data (all of the data on the tenured heap, sorry) to see if any of it is pointing to objects in the new heap; and CMS is also going to take longer on the tenured heap.
B
Additionally,
if
you
put
a
large
working
set
large
amount
of
memory,
that
garbage
collection
cannot
free
you're,
going
to
see
more
frequent
and
prolonged
garbage
collection,
typically
in
this
NS
collector
concurrent
mark
sweep
on
the
tenured
heat
and
normally,
if
you're
looking
at
at
a
graph
of
your
jvm
usage,
what
you
want
to
see
is
that
it
goes
up
and
then
drop.
Suddenly.
It
looks
like
a
sawtooth
pattern.
B
Don't
we
like
to
see
that
in
a
healthy
machine
dropping
to
between
two
to
three
gigs?
If
it's
not
getting
below
three
gigs
a
lot
I
want
to
understand
why
and
it
depends
on
some
of
your
configuration
settings
and
things,
but
to
give
you
an
idea.
This
is
the
sort
of
thing
we
should
be
overseen.
Gt.
Doing
me
a
big
chunk
like
dropping
a
three
or
four
degrees
data
at
a
time,
so
they'll
give
you
some
of
the
operational
concerns.
B: When a node bootstraps, it gets assigned tokens. Those tokens specify the data that it's going to be a replica for, and it talks to the other nodes that are already replicas for that data and asks them to stream the data over to it so it can take ownership. Once it finishes the bootstrap process, it's got all of the data that the other nodes sent it, and it also has the writes that occurred from the moment it started.
B
When
you
bootstrap
it
mode
with
RF
three
and
we're
not
using
virtual
nodes
here,
this
is
premium
version
1.2
my
node
will
come
in.
It
will
have
an
initial
token
and
that
will
identify
one
pokin
range,
but
we've
got
RS
three.
So
there's
actually
three
token
ranges
that
this
knowing
is
a
replica
for
till
the
last
three
nodes,
laundry
to
those
Tobin
Rangers
to
send
it
data
to
bootstrap.
B
We
know
that
this
ending
process
is
throttled
at
25
megabytes
a
second,
so
our
maximum
center
in
is
75.
Megabytes
per
second
in
practice
is
going
to
be
less
than
that,
but
that's
our
maximum.
If
we've
got
a
one
gig
networking
in
place,
we've
got
125
megabytes
per
second,
so
there's
quite
a
lot
of
headroom
on
our
bootstrapping
node,
and
we
really
like
to
saturate
that
guy.
We
can't
control
the
time
for
sale,
but
we
can
control
how
long
it
takes
to
recover
from
failure.
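For reference, the 25 megabytes per second he mentions maps to one cassandra.yaml setting, expressed in megabits. A minimal sketch with the 1.2-era default:

```yaml
# cassandra.yaml: outbound streaming throttle, per sending node.
# 200 megabits/s is roughly 25 MB/s, the figure quoted in the talk.
stream_throughput_outbound_megabits_per_sec: 200
```

On recent versions it can also be changed at runtime with `nodetool setstreamthroughput <Mb/s>` on the sending nodes, which is handy if you only want to open the throttle up for the duration of a bootstrap.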
B
So
when
we
have
to
replace
it
mode,
we
want
Mexico
absolutely
as
fast
as
possible,
and
we
want
that
process
to
scale
as
we
grow
the
cluster.
You
want
to
get
more
value
as
about
costume
now,
if
you
don't
always
meet
the
bootstrap.
Often
we
do
a
protest
that
internally
recall
a
lift
and
shift,
and
this
might
be
that
we're
upgrading
the
new
hardware
inside
of
82
or
an
enterprise
data
center
or
moving
into
new
networking
infrastructure,
or
something
like
that.
We
don't
need
to
use
a
bootstrap
process.
B
Cassandra
is
pretty
flexible
here
we
can
just
shut
the
node
down
cleanly
copy
all
of
its
data
in
config
over
to
a
new
node
and
started
up
and
Cassandra
will
just
see
that
I
peas
have
changed,
handle
it
all
and
no
concerns.
So
in
this
case
we're
just
talking
about
that
transfer
speed
through
the
data
center
using
I.
Think
or
something
like
that.
B
So
if
we
get
50
megabytes
per
second
in
82,
that's
probably
about
what
I'd
expect
and
it's
going
to
take
us
half
an
hour
roughly
to
move
100
gigs
if
you've
got
500
gigs
multiply
that
by
five,
if
you've
got
over
500
gigs
on
the
player
by,
however
many
and
you
start
to
see
that
it
can
take
some
time.
This
is
a
copy.
B
Disk
management
is
one
of
the
hardest
things
I
think
in
in
deploying
and
Cassandra
cluster
disks,
don't
like
having
more
than
about
seventy
five.
Eighty
percent
of
their
space
used
performance
degrees.
When
you
get
above
that
on
a
spinning
disk,
we
want
to
store
more
than
500
gigabytes,
we're
going
to
need
multiple
terabytes
of
data
on
each
data
space
on
each
node.
We
could
build
a
single
volume
or
we
could
use
multiple
volumes
to
do
that.
B: So there's sort of a negative feedback loop here: build a node with RAID 0, put a lot of data on it, congratulate ourselves that we've got a lot of space, fill it up, lose a disk (because we're using RAID 0), and then go back to the beginning and have to do a bootstrap, and that can take a long time. Another option is to use RAID 10; of course, this doubles the raw capacity requirements.
B
Typically,
we
might
see
this
in
an
enterprise
data
center,
where
there's
a
standard,
build
for
machines
and
we'll
come
in
with
a
rate
10
great
for
operators.
They
really
comfortable
just
replacing
disks
in
a
hardware
level,
but
it
increases
the
costs.
You
can
use
multiple
dateable.
You
can
mount
each
desk
individually
and
tell
Cassandra
about
those
through
the
data
files
directory
Yemen
setting
now
again
we're
talking
in
the
context
of
pre
version
1.2
in
that
environment,
Cassandra
was
not
intelligent
about
how
its
distributed
and
load
amongst
those
multiple
volumes.
B
It
would
just
choose
the
one
with
the
most
free
space,
and
so
you
could
end
up
with
multiple
right
threads,
trying
to
write
SS
table
as
quickly
as
they
could
onto
the
same
volume.
Also,
if
you
had
a
single
failure
in
a
data
volume,
it
would
shut
down
the
whole
mode.
We
still
have
all
the
remaining
data
edges
that
the
noise
would
have
an
exception
and
shut
down
the
repair
process.
B
Is
you
know
that
the
repair
process
causes
problems
for
people,
but
we
know
it's
important.
The
background
here
is
that
when
we
do
deletes,
we
do
a
soft
early
and
write
a
tombstone,
and
we
want
to
make
sure
that
that
tombstone
is
fully
replicated
before
we
purge
it
off
desk
through
the
compaction
process.
Also,
repair
is
the
way
to
ensure
on
this
consistency
across
all
of
your
nodes.
The
way
it
works
is
that
we
calculated
in
court
of
Merkel
tree
and
then
we
use
that
to
compare
differences
to
build
the
miracle
tree.
B
We
have
to
read
all
of
your
data
in
your
particular
table.
Technically
we
do
it
by
reading
ranges
of
data
at
the
time,
but
mature.
So
we
have
to
read
all
the
data
in
a
particular
table.
You
can
see
this
in
no
tool.
Compaction
status
called
a
validation
compaction.
That's
because
this
process
runs
through
the
same
infrastructure
as
compaction,
its
modeled
by
the
compaction
throughput
megabytes
per
second,
which
the
animal
setting
which
defaults
to
16,
and
so
you
can
probably
guess
what
I'm
going
to
say.
Is
this
process.
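The throttle he is describing is shared between regular and validation compactions; a sketch of the relevant knobs, assuming a 1.2-era install:

```yaml
# cassandra.yaml: throttle shared by compaction and by repair's
# validation compactions (the Merkle tree builds). Default 16 MB/s.
compaction_throughput_mb_per_sec: 16
```

At runtime, `nodetool setcompactionthroughput 0` removes the throttle until restart, and `nodetool compactionstats` is where the validation compaction he mentions shows up.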
B: It grows in time as the amount of data on the node grows. If you've got 10 gigs of data on the node, your repair only has to read 10 gigs of data. If you've got 1.2 terabytes, we have to read 1.2 terabytes and calculate a hash, so it's a CPU-intensive operation, depending on how your machine is set up; and first thing, we've got to get all this data off disk.
B
The
second
part
of
repair
is
that
afterwards
exchanged
the
Merkel
tree
and
detected
differences.
We
stream
those
differences
using
the
same
process
that
we
use
for
bootstrap
the
streaming
infrastructure
and
again
this
has
trouble
so
the
same
reason:
the
bootstrap
it
struggled
and
we
don't
repair
individual
rows.
We
repaired
ranges
of
rows.
That's
where
we
detect
the
differences.
If
you've
got
very
big
roads,
we
could
end
up
streaming
a
new
copy
of
a
very
big
road
to
a
no
just
because
another
node
in
that
token
range
that
was
checked
as
out.
B
Another
road
was
out
of
sync.
If
you've
got
billions
of
small
rose,
you
can
end
up
streaming
billions
of
small
roads
because
one
of
them
and
data
sync
now
compaction
is
a
fact
of
life
in
Cassandra.
It's
a
fact
of
life
in
any
sort
of
log
structured,
merge,
storage
engine
like
we
have.
We
have
the
great
advantage
called
writing
new
things
to
desk
writing.
New
files
to
disc
every
toilet.
Flush
takes
out
a
lot
of
locks
in
the
storage
infrastructure,
but
it
requires
a
compaction
process.
B
Otherwise,
reaper
forms
would
just
fall
off
a
cliff
over
time,
as
we
had
lots
and
lots
of
new
file
to
look
at
so
what
compaction
does
is
it
looks
at
a
particular
set
of
files
on
disk
and
it
writes
the
same
truth
that
you
find
in
those
source
files
into
some
new
files,
and
it's
discard
the
information
that
you've
no
longer
required.
It
might
be
that
you've
done
an
overwrite
and
the
previous
value
is
no
longer
required
and
that
one
goes
into
the
new
files.
They
have
two
strategies
for
this.
B
The
original
one
is
called
the
slidecage
compaction
strategy,
and
this
code
is
root.
The
SS
tables
by
size
files
in
the
same
bucket
that
during
the
process
or
50,
then
within
fifty
percent
of
the
medium
of
the
size
of
the
files
in
that
bucket,
and
it
assumes
no
reduction
in
size
to
the
output
so
even
going
to
compact
53
450mm
files,
it
assumes
it
needs
200
needs
of
free
space.
So
in
theory
we
need
fifty
percent
free
space
on
the
desk.
In
practice,
we've
seen
this
run
with
the
less
than
fifty
percent
free
space.
B
Although
it's
not
recommended
really,
you
should
use
fifty
percent
free
space
as
a
soft
limit
on
your
desk.
Now
this
doesn't
sound
as
bad
as
it
is
because
we
know
that
if
we
get
above
75
percent,
we're
going
to
see
the
throughput
on
our
disk
reduced
quite
a
lot
on
spinning
disk
I'm,
not
sure
on
the
impact
on
SSDs.
B
The
other
strategy
we
have
is
called
level
compaction
strategy,
and
this
is
expired
by
leveldb
from
the
google
in
there
chromium
project.
This
group's
SS,
table
together
by
a
level
and
data
moves
up
in
the
higher
level
for
more
often
is
compacted
not
based
on
size,
just
based
on
some
other
heuristics
inside
each
level.
Your
robe
is
guaranteed
to
only
have
one
fragment.
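As a concrete sketch of opting a table into levelled compaction (the keyspace, table, and size value here are illustrative, not from the talk):

```sql
-- Size-tiered is the default; switch one table to levelled compaction.
-- sstable_size_in_mb sets the fixed size of SSTables within a level.
ALTER TABLE my_keyspace.user_events
  WITH compaction = { 'class': 'LeveledCompactionStrategy',
                      'sstable_size_in_mb': 160 };
```

The early default of 5 MB per SSTable was widely considered too small; later versions moved the default to 160 MB.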
B
This
has
great
result
that
this
can
have
a
great
impact
on
reducing
the
read
latency
level
compaction.
Also,
there's
a
really
good
job
in
a
highly
mixed
workloads
when
you've
got
a
lot
of
overrides
and
deletes.
But
to
do
this
it
requires
a
lot
of
disk
I/o,
approximately
twice
the
displayer
and
my
feeling
is.
It
requires
approximately
twenty-five
percent
disk
free
space.
B: One of the first things we can do to manage memory is reduce the bloom filter size: change the bloom_filter_fp_chance from 0.01 to 0.1 on some column families, and we know that's going to reduce the size of the bloom filters that we have to hold in memory. Now, this is probably also going to increase the read latency, because we know why the bloom filters are there: they reduce the amount of wasted disk I/O where we go and look for a row in a particular SSTable and it doesn't exist.
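The change he describes is a per-table property; a minimal CQL sketch (names are placeholders):

```sql
-- Trade bloom filter memory for some extra disk reads:
-- 0.01 is the size-tiered default, 0.1 the levelled default.
ALTER TABLE my_keyspace.user_events
  WITH bloom_filter_fp_chance = 0.1;
```

Existing SSTables keep their old filters until they are rewritten, for example by compaction or by running `nodetool upgradesstables`.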
B
We
can
play
around
with
the
size
of
the
compression
metadata
by
adjusting
the
chunk
length.
We've
never
really
been
a
fan
of
this.
This
can
increase
the
read
latency,
because
now
we've
got
to
decompress
more
data
to
find
the
piece
that
we're
interested
in
this
is
a
more
typical
thing
to
do.
We
can
reduce
the
size
of
our
index
samples
by
increasing
the
in
depth
interval
in
the
yellow
file,
and
we
typically
typically
kick
this
up
to
512
up
from
the
default
128.
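Both knobs, sketched with assumed names; the compression option is per table in CQL, while the sample interval is cluster-wide in the 1.2-era cassandra.yaml:

```sql
-- Bigger chunks mean less compression metadata held in memory,
-- at the cost of decompressing more data per read.
ALTER TABLE my_keyspace.user_events
  WITH compression = { 'sstable_compression': 'SnappyCompressor',
                       'chunk_length_kb': 256 };
```

```yaml
# cassandra.yaml: keep every Nth entry of the primary index in memory.
# Default 128; the talk suggests 512 for large nodes.
index_interval: 512
```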
B: Now, if all that doesn't work (and you'd probably actually do this in conjunction; while you're making those changes you would increase the heap) you can increase the JVM heap up to 12 gigs, sometimes run it at 16. I would say this should be seen as a temporary measure, and the goal should be to get back to running an eight-gig heap. If you're doing this, you can increase the new size of the heap to something reasonable, around a thousand to twelve hundred megabytes.
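Those sizes map to two variables in conf/cassandra-env.sh, which are otherwise computed from system memory at startup; a sketch of pinning them as suggested:

```bash
# conf/cassandra-env.sh: set both together or neither.
MAX_HEAP_SIZE="12G"    # temporary relief; the goal is to get back to 8G
HEAP_NEWSIZE="1200M"   # young generation, the 1000-1200 MB he suggests
```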
B
So
we
had
to
copy
a
lot
less
than
that
nodes
down
for
a
lot
less
remember
if
you
do
that
to
include
the
flag
in
there
to
delete
files
on
the
destination
that
has
no
longer
exists
on
the
source
node,
this
benjamin.
If
you
can
use
raid
0
and
over
provisions,
I
mean
just
you
know,
have
a
little
bit
more
than
what
you
expect,
not
a
huge
number
and
if
you're
in
82,
you
don't
need
to
do
this
because
Amazon's
or
any
provision
thousands
of
nodes
that
you
can
get
one
within
a
few
minutes.
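A sketch of that lift-and-shift flow with rsync (paths are package defaults and the hostname is a placeholder; the delete flag he mentions is rsync's --delete):

```bash
# Pass 1: pre-seed the new node while the old one is still serving.
rsync -avP /var/lib/cassandra/ newnode:/var/lib/cassandra/

# Cut over: flush memtables and stop Cassandra cleanly on the old node.
nodetool drain
sudo service cassandra stop

# Pass 2: copy only the delta; --delete removes files on the destination
# that compaction has since removed on the source.
rsync -avP --delete /var/lib/cassandra/ newnode:/var/lib/cassandra/
```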
B: Now, repair really is something I encourage everyone to use, unless it's taking several days to complete because you've got so much data. If you really need to, you can elect to only use it when data is deleted, in which case I'd recommend that the consistency level is not ONE; it should be QUORUM, to ensure that your data is written to at least two nodes.
B
You
can
also
do
sort
of
more
frequent,
smaller
repairs.
You
don't
have
to
repair
a
whole
piece
piece.
You
can
run
a
repair
that
table
level
or
you
can
run
the
repair
and
individual
token
range
and
if
you've
got
a
very
big
table,
this
was
on
the
jmx
interface
for
a
while.
It
got
moved
on
to
the
node
tool.
Repair
function,
motul
repair
until
I
can't
remember
exactly
what
those
who
have
got
moves
on
there,
but
you
should
be
able
to
use
it
now.
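The smaller repairs he mentions, as nodetool invocations (keyspace and table names are placeholders; the token-range form is the one that moved from JMX into nodetool, around version 2.0 if memory serves):

```bash
nodetool repair -pr my_keyspace                    # this node's primary range only
nodetool repair my_keyspace my_big_table           # one table at a time
nodetool repair -st <start> -et <end> my_keyspace  # one token range
```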
B: Now, compaction. You'll want to over-provision the disk capacity when using size-tiered compaction. Typically, on a modern EC2 node (an m1.xlarge, which used to be the standard build) you have 1.7 terabytes of disk and we'd put 500 gigs of data on it, so in that case we've sort of over-provisioned. If you're running low on space, a little hack you can do that can help is to adjust the min compaction threshold and the max compaction threshold and drop those down.
B
The
number
of
files
they're
going
to
compact
becomes
more
aggressive
in
the
sensor
that
runs
more
frequently,
but
instead
of
compacting
400
Meg
file.
That
will
only
ever
compact
two
and
over
the
only
needs
200
megs,
if
you
are
really
in
a
bind
that
can
allow
compaction
to
make
incremental
improvements
in
philippines
more
space.
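That tweak can be made live, per column family; a sketch with placeholder names:

```bash
# Compact as soon as 2 similarly sized SSTables exist, and never more
# than 4 at once: smaller compactions that need less free headroom.
nodetool setcompactionthreshold my_keyspace my_big_table 2 4
```

This is a runtime setting; to survive a restart, set min_threshold and max_threshold in the table's compaction options as well.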
B
Mobile
compaction
is
a
great
thing
to
use
if
you've
got
a
lot
of
over
rights
and
because
psyche
compaction
doesn't
handle
loads
very
well
or
if
you
you
care,
a
lot
about
latency
and
that
reflect
use
it
where
appropriate.
As
I
said,
it
takes
approximately
twice
the
disk
I/o
if
you're
on
spinning
disk
I'd
use
it
sparingly
on
just
to
get
your
column
families,
but
neither
if
you
are
nesting
you
can
go
crazy.
B
So
a
little
bit
of
background
about
why
some
changes
may
have
happened.
One
point
two
things
we'd
have
to
work
around
if
we're
still
on
one
point
before
1.2,
hopefully
now
convince
you
to
be
using
at
least
1.2
or
2
point
0
and
give
a
bit
of
understanding
about
why
changes
are
going
on
so
version,
1.2
lose
the
bloom
filters
and
the
compression
metadata
off
the
JVM
heap
version.
2.0
move
the
index
samples
off
the
JVM
heap.
B
These
still
take
up
memory,
then
sitting
out
there
still,
but
the
garbage
collector
doesn't
care
about
them
and
we've
reduced
the
size
of
the
working
set.
So
now
our
CMS,
which
before
we
have
to
go
along
and
couldn't
free
up
enough
space,
perhaps
because
the
bloom
filters
were
sitting
there
now
has
loads
of
space.
So
we
can
get
a
much
better
sawtooth
pattern.
B
We
have
lowered
lower
pauses,
you
haven't
in
our
process.
Virtual
nodes
were
added
in
version
1.2,
and
one
of
the
reasons
they're
added
was
to
improve
the
performance
of
the
bootstrap
process.
So
when
you
get
up
to
having
30
or
40
nodes
in
the
cluster,
you
can
get
some
value
for
having
all
those
nodes.
Vinos
deserve.
Having
one
token
range
to
note
have
256
by
default.
So
each
night
is
a
replica.
Each
node
shares
replicas
with
so
many
other
node
at
you.
Eventually,
all
moved
in
the
cluster
share
data
with
another
node.
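Enabling vnodes is one line in cassandra.yaml for a new node (changing it on a node that already has data is not a casual operation):

```yaml
# cassandra.yaml: number of token ranges this node owns.
# Left unset (or 1), the node behaves the classic single-token way.
num_tokens: 256
```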
B
So
now,
when
we
bootstrap-
and
you
know
it
in,
it-
has
256
token
ranges,
replication
factor
that
it
needs
data
for
and
it
goes
and
talks
to
lots
of
other
nodes
in
the
cluster.
So
for
bootstrapping
in
this
environment
and
we've
got
ten
modes
adding
another
one.
All
ten
loads
can
contribute
a
small
amount
of
data
and
becomes
a
lot
easier
to
saturate
incoming
mode.
B
We
also
have
jboard
support
just
a
box
of
tips.
This
really
did
improve
the
way
that
handle
multiple
data
volumes,
see
the
mountable
up
individually
list
them
in
data
files
directory
just
like
before.
But
now
when
we
go
to
write
to
one
a
lot
more
intelligent,
it
will
write
to
the
volume
that
has
the
most
space
that
isn't
currently
being
written
to.
So
we
don't
get
a
thundering
herd
going
to
a
volume
that
suddenly
got
more
the
space
because
compaction
bring
something
up
or
something
like
that.
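The layout itself is the same data_file_directories list as before; the 1.2 change is in how writes are scheduled across the entries. A sketch with assumed mount points:

```yaml
# cassandra.yaml: one entry per individually mounted disk (JBOD).
data_file_directories:
    - /mnt/disk1/cassandra/data
    - /mnt/disk2/cassandra/data
    - /mnt/disk3/cassandra/data
    - /mnt/disk4/cassandra/data
```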
B: So the 'ignore' setting for the disk failure policy makes it work like pre-version 1.2, where the exception is handled and the server shuts down. The 'stop' setting says: okay, when you get an I/O exception, handle the exception, mark that that data volume should no longer be used for reads or writes, make that information available via JMX (including via the JMX push notification interface), and then put the node into a suspended state: it disables Thrift and the binary API, it disables gossip, and the process keeps running.
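The behaviours he walks through correspond to the disk_failure_policy values in the 1.2-era cassandra.yaml:

```yaml
# cassandra.yaml: what to do when a data disk throws I/O errors.
#   ignore      -> behave as before 1.2
#   stop        -> stop gossip and client APIs, keep the JVM up for JMX
#   best_effort -> blacklist the bad volume, keep serving from the rest
disk_failure_policy: best_effort
```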
B: 'Best effort' will isolate that volume and no longer read from or write to it, log the information, make it available via JMX and the JMX push notification, and keep running. So if you've got four disks, you've suddenly lost a quarter of the data on this node and it's going to keep on running. If you're using CL QUORUM there are no problems there: the quorum process for reads and writes will still detect it.
B
Loss
of
data
is
returning
data
that
doesn't
match
and
we'll
repair
that
if
using
CL
1
you're
going
to
get
some
stale
data
in
the
best
effort,
it's
really
good.
You
can
then,
when
repair
and
repair
that
data
you
could
replace
the
disk
and
run
repair
if
you
wanted
to,
and
the
process
that
you
go
through,
the
best
effort
is
roughly
the
same
process
which
stop
you
can
stop
happens.
You
can
make
a
decision.
B: So that's a look at what happens when you take Cassandra beyond a billion rows, beyond 500 megs (500 gigs, sorry) of data per node. Hopefully this has also given you some background understanding about why changes happen in Cassandra (why the bloom filters were taken off the JVM heap, and things like that), so when you see some new features come through you can understand why they're there. So I'd like to hand over to Christian now for any questions.
A: Okay, great. So just a reminder to please post your questions in the Q&A tab inside of WebEx; we're getting a lot of good questions coming in. On your screen right now you'll see our upcoming webinars: March 6th, Patricia Gorla, also a member of The Last Pickle, and then on April 3rd we have Cassandra at Lithium. Lithium is a social interaction platform for large enterprises.
B: I can handle that. So yeah, if you're using RAID 0 you're using multiple disks, and one of the reasons we have RAID 0 is that it gives us a nice big data volume; the other reason is the increased performance from striping the writes across multiple disks. Personally, if I were using an m1.xlarge, or any Amazon instance with spinning disks, I would still use RAID 0 to get the best performance, I mean the highest amount of disk ops, out of that volume.
B: If I was using, say, one of the new i2 instances that use SSDs, I'd consider going to JBOD. I don't understand precisely how they would fail, one disk failing and not the other, so you probably want to do some research on that, but I would consider using JBOD just because, with it, if you have a single disk failure you can keep the node working. One of the downsides of JBOD is that you don't have one huge volume.
B: Would that node need much more? I would probably say no. I'm going to assume that your roll-up process is not a latency-sensitive query; by latency-sensitive I mean you're in an API request, you're doing a page refresh or whatever it is. We're going to assume it's a background process. Also, if you're doing time series you're probably not mutating your data, so I'm guessing you're not doing overwrites or deletions; it's probably an append-only data model, normally, and that's the sort of data model that works well with size-tiered compaction.
B: Right. So SSTables on disk are immutable. Once we've read all the data in a particular SSTable and calculated the Merkle tree (which is a hash tree, so the intermediate nodes are a hash of the hashes of the nodes below them, and at the leaf level they're a hash of a range of rows), once we've calculated that, it will never change.
B
Is
we
calculate
that
again
and
again
and
again
you
could
have
a
hundred
gig
SS
table
of
disk,
and
every
time
you
run,
the
repair
will
go
and
calculate
a
hash
on
that
guy,
which
is
a
waste.
So
what
that
ticket
then
2.1
does
is
take
advantage
of
their
and
store
those
those
miracle
trees.
Then
there's
a
bunch
of
clever
logic
in
there
about
understanding
how
what
excess
tables
have
had
been
dropped
and
which
ones
have
been
created
and
that
impacts
on
compassion
of
it.
B: So again, the bootstrap process, even though it's better now. And, you know, there are some physical limits in Cassandra, like just how high a long will count or something like that, but if you've got five terabytes on a node, what will you do when that node fails? How will you build a new node? How long will it take you to get five terabytes back onto it?
B: There are probably not a lot of physical restrictions, like okay, you're going to overflow this long so you can't have this much data. It comes down to things like: how long is it going to take to stream data to the node, how long is it going to take to repair the node, how long is it going to take to do a backup or get stuff off that node, that sort of stuff. It's mostly operational management things; sometimes low-level things like that can kick in and become a concern.
B
So,
even
though
say,
you've
got
five
terabytes
I've.
We
got
like
five
billion
plus
x
56
billion
rows
on
that
mode.
Bloom
filters
at
a
billion
if
we
use
inside
T
compaction.
Looking
at
1.2
gigs
of
data,
so
now
we're
looking
at
six
or
seven
gigs
off
heap
data
for
the
bloom
filters
that
can
take
it
while
to
load
and
now
we're
looking
at
yeah
10,
maybe
20
gigs
of
data
in
a
memory
to
hold
the
JVM
and
all
the
associated
off
heat
information.
And
so
now
you
know
you're
getting
bigger
dis,
operational
concerns.
B: Also, given the choice at this scale, if I had to put, say, two terabytes of disk on something, I'd rather have a couple of disks; I'd rather have two one-terabyte disks or something like that, because now I can use RAID 0 and get twice the disk I/O, or even using JBOD I can write to one disk and write to another disk, so I'm splitting my writes and effectively using the disk I/O that they both provide. So, all things being equal.
B: It's mostly because I started using Cassandra before they were there, and there are some concerns there. Secondary indexes are good in some situations. You might have a query in your data model that only gets used by people on an internal portal, or CRM-type people; it doesn't happen very often and we just need some support for it. That's a good use case for secondary indexes.
B
The
if
you've
got
a
query.
That
is
something
that's
part
of
a
hot
code,
part
I'm
like
okay,
this
happens.
Thirty
percent
of
time
we
do
a
page
refresh
this
happens.
This
gets
called
all
the
time.
Then
I
think
it's
best
to
model
that
as
a
first-class
entity
in
your
data
model
secondary
indexes,
we
need
to
a
query.
We
have
to
go
and
ask
a
lot
of
nodes.
B
We
don't
know
exactly
what
node
has
your
data,
so
they
have
some
reduced
performance
there
and
they
have
some
reduced
availability
because
they
have
to
ask
so
many
nodes
when
we
do
a
bootstrap
process
and
we
will
rebuild
the
secondary
mixes
and
when
we
do
a
streaming
process,
we
will
rebuild
the
secondary
indexes.
As
well,
I
believe,
internally,
secondary
indexes,
are
just
hidden
tables.
B: If we're mostly just writing data and doing batch-type reads, I think you could put more data per node. If you've got operational concerns about how quickly you can replace those nodes, say you're in advertising retargeting or something like that (those guys normally have really big throughputs), you want to be able to replace those nodes pretty quickly. So it's kind of a balance between all those issues.
B: One of the factors in there is the number of column families: we will flush to disk more frequently, and when we flush to disk more frequently there's the extra I/O of flushing, it creates more compaction, and that uses more disk I/O. So I don't think it fixes those problems. Now, we can have multiple tens of column families; I've seen some systems where they're at 400-plus column families, and it's just really painful to do anything.
B: If they don't, then they're just going-forward things. So, where we are now: I'd say we're at 1.2.15, and that line is essentially stable; unless there's a bad problem it's not going to get many updates. 2.0 is at .5 or .6 now, so that's pretty much ready for prime time, and once we get 2.1 into general release, then the chances of anything more going into 2.0...
A: So Aaron, this one is probably a broader, consulting-type question; it's from Yarn. If there's a short answer, great; if not, maybe just go one-on-one with Yarn afterwards. "I have a database that will contain around 50 to 75 billion rows in the future, which is spread over two tables. How stable would it be, and does it make sense to use a structure like that?"
B: Yeah, that doesn't sound any alarm bells for me. We've talked about what happens when you have lots and lots of rows, so the best thing you can do is jump on AWS, get a decent node like an i2 node, and just fill up the data and see what it looks like; you can then look at memory usage and disk usage and things like that. But wanting to store that much in two tables sounds fine, and you'll probably end up with a few nodes in your cluster.
A: Okay, great. So the next question I can take; it's a docs question from one of our attendees. DataStax has done a great job documenting the process for installing Cassandra on EC2; however, there's very little documentation on the errors. For example, I'm getting a "no user data available" error and cannot find any doc to resolve it.
B: Yeah, so we used to also use the m1.xlarges because, for the cost, they were the best. There's another machine (I'm just trying to grab the name here) called the m2.4xlarge, the memory one, and that has 68 gigs of RAM, two 800-gig disks, and lots of cores. So that can be a good situation where you've got a large working set, because you've got 60-odd gigs of memory and you can have a good page cache. Now, the new i2 instances really are good.
B: They've got a lot of cores, and we like having CPU cores to do things like sustain a high write throughput; you can generally get three to four thousand writes per second per core on a node. We also like them for compaction, because a compaction is going to sit there and run and take up one core, so it's good for that. They've got a lot of memory (there wasn't always a huge amount of memory on those m1.xlarges), they're in the 30-to-60-gig range, and they've got SSDs.
B: Again, it comes down to what you do when the node fails and how you're going to replace it. So you can put as much data on the node as you have the capacity for, but what do you do when that node fails, or you want to upgrade it and move to another node? I just think you need to look at it from that point of view.
B
You
can't
you
can't
separate
the
components
of
an
SS
table,
but
what
you
can
do
is-
and
this
is
added
back
around
version,
1
point
0
or
1
point
1,
when
SSDs
we're
still
super
expensive.
The
the
layout
on
disk
is
that
the
directory
for
the
King
spades,
another
directory
for
each
column,
family
and
so
what
it
used
to
be
was
we'd
say:
oh
okay,
you
got
some
some
850
and
some
hard
drives.
Ok,
Maldives
up
now.
B
What
we'll
do
is
put
in
some
symlinks
in
place
so
that
the
column
family,
that
you're
really
sensitive
on
performance
on,
can
go
on
SSD
and
the
Collins
family
that
you're
really
sensitive
on
the
not
so
tentative
one
can
go
in
hard
drive
and
I.
Think
by
the
time
you
make
any
sort
of
change
such
as
okay.
Can
we
split
the
SSD
components?
But
these
things
we
don't
read
much
like
the
bloom
filters
over
here
by
the
time
that
came
in
and
that
was
bedded
in
you'd,
probably
just
beyond
SSDs
Caribbean.
A: Okay, thank you very much, and Aaron, thank you very much again for today's presentation; looking forward to Patricia's next week. Yes, next week. On the screen right now: if you are interested in presenting at the Cassandra Summit, the call for papers is open, and we will also begin selling tickets very soon, so make sure to come along, it is a great event. And then also, if you want to continue your Cassandra learning, you can take the course at the link on your screen, DataStax Academy; that is free training. So...