From YouTube: BigStep: Cassandra’s Scaling Economics
Description
Speaker: Alex Bordei, Product Manager at BigStep
We all know Cassandra is supposed to scale, but what is its exact scaling pattern? How much faster does it get if you add an extra node? Is it truly linear? How sensitive is it to hardware constraints? Bigstep has benchmarked Cassandra in an attempt to understand how to size underlying infrastructure for optimum performance. We've tested using a custom JMeter sampler built with the Java driver on beefy bare metal machines with 192 GB of RAM, 10 Gbps networking and SSDs. We'll share our findings and other best practices for scaling NoSQL DBs in cloud environments, found by working with some of big data's most popular DBs.
So, I'm Alex from Bigstep, and this talk is going to be about the economics of scaling Cassandra.
So the question is: to what extent does this hold true for a particular set of technologies? Today this analysis is going to focus on Cassandra, or specifically DataStax. We're going to go over a little bit about the way we did the benchmarks, about scaling horizontally, scaling vertically, a little bit about virtualization and about Docker. I would really love to make this a conversation rather than, you know, a boring monologue, so please stop me at any time, raise your hand and ask me anything.
That's all we do: we're essentially a bare metal cloud, but we're different from SoftLayer or Rackspace in the sense that we have compute nodes and storage nodes, and we separate the compute nodes from the storage nodes. The storage nodes are all SSDs and they're distributed across the entire network; there's no direct connection between them. We have a very fast, high-throughput, low-latency network that uses cut-through switching, so we only look at the first few bytes of a packet before forwarding it to the appropriate port.
Because of this architecture, we can do things like cloning and snapshotting, and we can do them without any hypervisor. So, essentially, you have pretty much the same flexibility that you have in virtualized environments, but with bare metal. This means we're very good for essentially everything that runs in memory, because if you take virtualization out, you get a better performance level, and I'm going to show you exactly how much you can gain, not necessarily on our platform but in general, on virtualized versus bare metal machines.
We also partner with and deploy a lot of other technologies like Datameer, Elasticsearch and Splunk, and we've done a lot of benchmarking with them to try to understand what happens. So, essentially, we like to say that we're the fastest cloud in the world. It's a bold statement, but we are willing to put it to the test.
So, as I said, we have been doing a lot of testing to understand the technologies, partly because it's quite cool to play with them, and also because we get this constant, recurring question from our clients.
So, let's get into the more techy stuff. The benchmarks were done on fairly big machines. The target ones, the ones that actually hold the data, have 192 GB of RAM; the other ones have 128 GB. Both are dual-socket configurations with very fast CPUs and 10G networking: essentially two 10-gigabit networks in parallel, one for the RPC traffic and the other for the regular traffic. We used CentOS, and we used /dev/shm.
We also did a little bit of optimization with irqbalance and disabled transparent huge pages, and we used JMeter instead of Cassandra's built-in benchmarking tool. Have you used JMeter before, by the way? All right, cool. So we wrote our own custom sampler that uses the Java driver.
What we did, essentially, was insert about 10 million records into Cassandra, read them back, and then change every one of them. So we're doing three operations, an insert, a select and an update on those records, using 2,000 concurrent connections from all eight machines.
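The three operations in the workload can be sketched as CQL statements. This is a minimal illustration only: the table name `bench.records` and its columns are hypothetical, and the talk's actual load was driven by a custom JMeter sampler using the DataStax Java driver, not by string-built statements.

```python
# Hedged sketch of the benchmark's insert/select/update trio as CQL.
# "bench.records" and its columns are assumed names for illustration.

def make_statements(record_id: int, payload: str):
    """Return the three statements run against each of the ~10M records."""
    insert = (f"INSERT INTO bench.records (id, payload) "
              f"VALUES ({record_id}, '{payload}')")
    select = f"SELECT payload FROM bench.records WHERE id = {record_id}"
    update = (f"UPDATE bench.records SET payload = '{payload}-v2' "
              f"WHERE id = {record_id}")
    return insert, select, update

# The talk ran this trio over ~10 million ids, with 2,000 concurrent
# connections spread across eight load-generator machines.
ops = make_statements(42, "abc")
```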
So there are a lot of JVMs at work, and essentially this is what we got. The absolute figures are not necessarily important, although they are quite fast: those are milliseconds, single-digit and two-digit milliseconds, for the entire data set and 2,000 concurrent connections.
But if you add an extra node, you're not reducing that a whole lot; for four nodes the latency actually increases, and the same happens for selects and updates. So adding one extra node increases performance, but it doesn't necessarily double it. The same is true for throughput, and the throughput actually is amazing: we got up to 250,000 operations per second with this cluster of four nodes, which is very good, and we pretty much see the same ratios there.
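One way to quantify "increases performance but doesn't double it" is a scaling-efficiency ratio. The 250,000 ops/s four-node figure is from the talk; the single-node baseline below is an illustrative assumption, not a number the speaker gave.

```python
# Scaling efficiency: how close does an n-node cluster come to n times
# the single-node throughput? 1.0 means perfectly linear scaling.

def scaling_efficiency(single_node_tput: float, n_nodes: int,
                       cluster_tput: float) -> float:
    return cluster_tput / (n_nodes * single_node_tput)

# Assuming a hypothetical 80,000 ops/s on one node, the talk's
# 250,000 ops/s on four nodes would be sub-linear:
eff = scaling_efficiency(80_000, 4, 250_000)  # 0.78125, i.e. ~78%
```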
Okay, so keep that in mind for a second. Now, scaling vertically: essentially, we are comparing two instance types that we have. One is the 4-32, the smaller one, which has four cores and 32 GB of RAM, against the one that has 20 cores and 192 GB of RAM. And there is an increase: you would say almost twice the performance for the bigger one as opposed to the one with 32 GB and a single CPU.
Again, you have to remember this for a bit: roughly twice the power. Now, in order to really understand how that translates into money, we did what we call an analysis of the performance-to-price ratio, and in order to get a feeling for performance we had to build a composite metric out of all the other metrics that we got.
We ended up with something like this, and we divided it by the price of that particular configuration, and, long story short, it looks like this. By the way, this looks the same for other technologies as well: it looks the same for Couchbase, for Impala, for all the NoSQL technologies out there. You have this curve that drops, which tells you that the more nodes you have, the less efficiently you are spending your money, unfortunately. So actually the best-priced configuration is the one with the smaller CPU and less RAM, because that one costs about half a pound per hour, while the next one, the one with 192 GB of RAM, costs about two pounds per hour. So it's four times more expensive, but only twice the performance; hence the smaller one has a better price-to-performance ratio. Now, why does this happen?
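The argument above reduces to simple arithmetic: a composite performance score divided by hourly price. The two prices (roughly 0.5 and 2 pounds per hour) are from the talk; the performance scores below are illustrative placeholders for the talk's composite metric.

```python
# Performance-to-price ratio: composite score divided by hourly price.
# Scores are assumed (bigger box ~2x the smaller, per the talk);
# prices are the approximate figures given in the talk.

def perf_per_pound(composite_score: float, price_per_hour: float) -> float:
    return composite_score / price_per_hour

small = perf_per_pound(1.0, 0.5)   # 4-core / 32 GB box at ~0.5 GBP/h
big   = perf_per_pound(2.0, 2.0)   # 20-core / 192 GB box at ~2 GBP/h

# Twice the performance at four times the price means the small box
# is twice as cost-efficient:
ratio = small / big  # 2.0
```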
You might be asking, and some of you already know about Amdahl's law, and maybe some of you already know about hardware prices. Amdahl's law essentially tells you that, regardless of the technology you use, you'll always have something in common, something that cannot be parallelized, because no problem is, you know, embarrassingly parallelizable, so you will never be able to parallelize completely. The other factor is hardware prices. This happens for every component, but for the CPU it's easily verifiable.
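Amdahl's law, which the talk invokes here, puts a hard ceiling on how much n nodes can help when a fraction of the work stays serial. The 95% figure below is an illustrative assumption.

```python
# Amdahl's law: if only a fraction p of the work can be parallelized,
# n workers give at most 1 / ((1 - p) + p / n) speedup.

def amdahl_speedup(p: float, n: int) -> float:
    return 1.0 / ((1.0 - p) + p / n)

# Even with an assumed 95% parallelizable workload, four nodes fall
# well short of a 4x speedup:
s = amdahl_speedup(0.95, 4)  # ~3.48x, not 4x
```

This is one reason the latency and throughput curves earlier in the talk bend away from linear as nodes are added.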
There's this website called CPU Mark, which benchmarks all the CPUs on the market and rates them, and you can also find the prices for the particular CPUs, either on Intel's site or on that same site. And if you take all the CPUs in this E5 family and you try to plot them, you can ask something like: hey,
A
If
I
have,
if
I
have
a
cpu
which
costs
twice,
will
I
have
you
know
twice
the
amount
of
power
and,
of
course
the
answer
is
not
by
you
know
not
by
far
apparently
the
you
know,
the
more
you
know
the
more
power
the
specific
cpu
has
you
get
like
an
exponential
increase
in
price
right,
so,
for
instance,
for
10k
whatever?
That
is
it's
a
10k
mark
for
the
e5
two
six
three:
it
costs
about
600
bucks
and
for
about
twice
that
much
for
the
e52670.
A
It
costs
31.
You
know
three
thousand
dollars,
so
it's
a
huge
huge
change
in
in
prices
and
the
same
happens
for
every
other
type
of
component
in
in
you
know
in
a
pc
in
the
server
all
right.
So
this
is
essentially
how
money
spending
efficiency
looks
like
you're
best
off
using
the
single
node.
If
you
can't
use
a
single
node,
because
you
shouldn't
use
a
single
node,
don't
try
to
overkill
it
with
two
big
machines.
So there's a tool that you can download and install, it's called sysbench, and you can use it to read and write one terabyte of data to and from memory, and you get something like this on a virtualized host versus a native one. This is on the latest hardware we could find. As for the reason, I'm going to get a bit lower level here.
I hope you don't mind. If you've ever written any C application, you know that the number you work with through pointers is not an actual address in memory; it's actually a reference into the virtual memory address space. Every process has its own virtual memory. Now, in order to do this translation between the virtual memory space and the actual physical space, the CPU and the operating system work together.
If you did this translation naively, you would need at least two requests to memory in order to actually access one memory location, so essentially it would double your latency to memory. Of course it doesn't happen this way; operating systems have implemented something called pages, or memory pages, which are segments of contiguous memory blocks that can be allocated to applications.
The typical size for a page is about 4K. Now, in order to do the same translation, but at page granularity, the operating system maintains something called the page table, where each entry has about 32 bits and describes the translation between the physical memory and the virtual memory.
Now, the problem with this approach is that, as I said, a page is 4K long, so mapping a 4 GB memory segment onto these pages means that your page table needs about 4 GB divided by 4K, which is about 1 million entries. So for every 4 GB of RAM you have to have about a million entries in the page table; for 40 GB of RAM that's 10 million entries of a few bytes each. So the page table grows bigger and bigger and bigger.
Whenever you have to access memory, you would have to look up this entire page table and do the translation, which is not efficient. Back in the 70s a solution to this was introduced, which is called the TLB, the translation lookaside buffer, and this is essentially a small cache in the CPU which helps the CPU do the translation without looking into the entire page table.
I know it's a lot to take in right now, but I'm getting somewhere. The thing with the TLB is that whenever you hit the TLB, it's fine: you do the translation.
The CPU does the translation for you and it's very, very fast. But whenever you hit something which is not in the TLB (and by the way, the TLB's typical size is about 200 entries, which is very small compared to the amount of RAM that we currently have) you get something called a TLB miss, and a TLB miss incurs a penalty which is proportional to the size of the page table, normally about 150 cycles, and this is on bare metal.
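The cost of misses compounds quickly. A sketch of the average translation cost: the roughly 150-cycle bare-metal miss penalty is from the talk, while the 1-cycle hit cost and the two miss ratios are illustrative assumptions.

```python
# Average address-translation cost as a function of TLB miss ratio.
# hit cost and miss ratios are assumptions; ~150-cycle miss penalty
# on bare metal is the figure given in the talk.

def avg_translation_cycles(miss_ratio: float, hit: float = 1.0,
                           miss_penalty: float = 150.0) -> float:
    return (1.0 - miss_ratio) * hit + miss_ratio * miss_penalty

low  = avg_translation_cycles(0.001)  # well-behaved, cache-friendly app
high = avg_translation_cycles(0.05)   # random-read big-data workload

# A 5% miss ratio makes translation several times more expensive
# on average, which is why random-read workloads suffer most:
factor = high / low  # ~7.4x
```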
So we are hugely inefficient for applications that do a lot of random reads, and guess which applications have the highest TLB miss ratio in the world. There's a study done by Intel called "Memory System Characterization for Big Data Workloads"; it's fairly new, and I encourage you to read it. They analyzed Hadoop tasks and NoSQL applications to see which ones have the highest TLB miss ratio, and they did something even more amazing: they put a chip between the actual motherboard and the memory DIMM to record the traffic.
The problem with this is that when virtualization came along, you had a virtual-to-virtual-to-physical memory translation. It's a third level of indirection, which of course blows up every performance benefit that you would get from the regular TLB cache, and that's why the first implementations of virtualization software, like QEMU, were incredibly slow. Now, people thought about this and tried to solve the problem in hardware.
Now that's good, but when you do context switching between virtual machines, from one VM to another VM, you have to flush the entire TLB, because the entries don't make sense for another VM; one VM has one state and the other has another. And whenever you flush the TLB you get a huge performance penalty: you have to rebuild the TLB, or at least repopulate it. So that didn't work either.
So now they have this technology called VT-x (by the way, the earlier one was still VT-x, just another version). What this VT-x addition does is modify the entries in the TLB cache so that they carry an additional tag by which you can identify which VM a particular entry belongs to.
So you don't have to flush everything out every time you do context switching between VMs. But the problem with that is that TLB misses now take something like 10 to 12 times longer to rebuild that particular entry, because it's a lot more complex to build a TLB cache entry. So essentially they sacrificed a particular type of application, and a particular type of memory access pattern, to get performance for smaller applications that use a smaller memory space, and this is something which is documented by VMware as well as by Intel and other vendors.
Essentially, it's a trade-off. So now, whenever you have a high TLB miss ratio, it's actually a lot more dramatic: you get even less performance from virtualized environments for big in-memory applications such as Cassandra, or Spark for that matter. So, long story short, I encourage you to avoid virtualization.
If you can't, there are solutions. If you want to have something else, of course Docker is one potential solution, not necessarily covering the same area as virtualization, but in certain cases it could help you, and we did a benchmark on this as well. Because people say that Docker runs at native speed, I wanted to make sure that's the case. We did the test by running a native node, then a native node with one Docker container running the same benchmark in it, then with two containers and four containers.
Apparently it's a bad idea to run multiple containers on an already saturated node, which of course I could have thought about before I did it. Anyway, you do get a bit of a performance penalty with Docker. I don't know yet why; we did a lot of optimizations, and we ran the Docker container in privileged mode.
I think it's because of the additional routing that has to happen in the machines, but it's there. I mean, it's almost at the same level: you get about 21 versus 19 milliseconds for the inserts, 18 versus 19 on the other operations, so it's barely noticeable, especially when you're not driving the CPU close to 100%. So I encourage you to use Docker if you have to use something. Now, this is the throughput, this is how the throughput looks, and it's pretty much the same story. Now, the conclusions: Cassandra is actually very fast.
The only thing that was faster in some particular workloads was Couchbase, but you can't really compare the two. Again, Cassandra is very fast, and it was fast on reads as well. People usually associate Cassandra with writes, you know, write performance and writes in general, but it's very good on reads as well.
I've also tested the in-memory feature that has been recently introduced, and it gets to pretty much the same numbers as I have here, because, remember, I used /dev/shm. But the problem with the in-memory option of DataStax is that it's limited to one gig.
So if you have a table which is bigger than one gig, you can't use the out-of-the-box in-memory option, at least right now, because it runs in heap, so you would also get a lot of garbage-collection-related issues. Maybe later on it could work, or you could do what I did with /dev/shm.
Use Docker if you want; it's not a problem, but don't try to use more than two containers. Again, the more nodes you add, the less efficiently you are spending your money. But we haven't taken into account IOPS and disk in general, and that's very much platform specific and workload specific, and if you have a very write-intensive or read-intensive, IOPS-bound application, things change.
Of course, you might want to consider looking at the disks a lot more, but that's a different, separate discussion. Now, again: the bigger the nodes, the less efficient you are on the performance-to-price ratio, but if you need the increased performance for that particular node, maybe reducing latency across the entire platform, go for those.
No, I haven't done any NUMA pinning for the Docker containers, but I know that Cassandra does it natively anyway, I think because we had numactl installed.
What happens is that you have to traverse this QPI link. We did tests on this previously, and there's a talk on O'Reilly that I did that explains how the QPI link affects performance.
But the idea is that whenever you're doing this cross-QPI-link memory access, you're at least doubling the latency of the memory access, if not making it four times slower. I've seen occasions, if you do that memory test with the sysbench tool that I've been showing you, where on a two-socket machine you can get way lower performance in accessing memory than you do on a single-socket machine, except if you do NUMA pinning. That essentially means there's this utility called numactl, and there's a daemon called numad, which you can use to pin a process to a specific NUMA node. NUMA stands for non-uniform memory access; it's the architecture of multi-socket machines. So whenever you do pinning, you will only access memory from that particular node, so you get a lot more performance, and happily enough Cassandra does it automatically if you have numactl installed, which is a great thing to have. And by the way, I also encourage you to use something like irqbalance, because by default operating systems don't distribute the IRQs, the interrupts, whenever you have, for example, network interrupts.
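The NUMA pinning described above can be sketched as a numactl invocation. The `--cpunodebind` and `--membind` flags are standard numactl options; the `java -jar app.jar` program is a hypothetical placeholder, and, as the talk notes, Cassandra arranges the equivalent itself when numactl is present.

```python
# Hedged sketch: build a numactl command that pins a process's CPUs and
# memory allocations to one NUMA node, so it never pays the cross-QPI
# memory-access penalty.

def numactl_cmd(node: int, program: list) -> list:
    return ["numactl",
            f"--cpunodebind={node}",  # run only on this node's cores
            f"--membind={node}",      # allocate only this node's memory
            *program]

cmd = " ".join(numactl_cmd(0, ["java", "-jar", "app.jar"]))
```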
There are driver stacks for the network cards which can be used to move access from kernel space to user space, so you're not doing context switching between kernel space and user space, and apparently that really pumps up performance, because you're not using interrupts anymore; you're using a poll model, or an event-based model if you like. It increases your performance on the network side, lowers the latency and reduces CPU usage.
So you can use this new generation of cloud providers like us, which essentially provide you with pretty much the same level of flexibility. As I said, you can do cloning; you can hit a button and in 10 minutes you have an instance up and running, and then after 10 minutes you can shut it down. You have access to the actual physical machine via a KVM-like interface; you get to see the screen of the server.
If anything fails, that is. So there's this new breed of cloud providers which has come along to try to solve this problem of performance, the performance penalties induced by virtualization, while giving you pretty much the same level of flexibility. I can't say for certain that it is exactly the same level of flexibility. For instance, what you can't do right now is live migration between hosts with physical machines; actually, we're working on that.
But right now there's no way of doing live migration; you have to do restarts or reboots. What you can do, though, is upgrades. Say your machine has 192 GB of RAM; if you want to switch to another box with, say, 256 GB, you can do just a restart and reboot on another machine. So essentially what happens is that the virtualization layer tends to move closer and closer to the hardware.
I mean, the things that virtualization accomplishes now will be done in hardware later on anyway; not in the sense that the hardware will help the software do this, but rather that the hardware itself will enable virtualization-related paradigms to happen. And we already see server providers that automate servers with REST APIs.