Description
Patrick McFadin discusses what's new in C* 1.2 -- vnodes, collections, and CQL3.
C*ollege Credit is the bi-weekly educational webinar series for all things Apache Cassandra™ (C*) related. Each webinar features an MVP or member from the community that has hands-on experience building the next generation of applications on Apache Cassandra. The series is designed to provide a baseline of education across an array of topics, from "what is NoSQL" to "how to scale out Apache Cassandra across multiple datacenters". http://www.datastax.com/resources/webinars/collegecredit
Hello, everyone, and welcome to this edition of our DataStax Cassandra community webinar series. I'm very excited today; this one has been on my calendar for a long time. It's Valentine's Day, and I'm delighted to have with me today Patrick McFadin, the Barry White of Cassandra, who is going to entertain and educate us today. So hopefully you've got the lights dimmed. A couple of housekeeping items first.
All right, let me flip this over to Keynote here and get going. So, how's everyone doing today? It's Valentine's Day, and here we are talking about vnodes. I couldn't think of a better thing, other than maybe doing something that has something to do with people. But it's only going to take a little while, so you can go out and do something after this. So this is all about virtual nodes, and I'm going to start things out here with just a quick introduction.
So how do I go from regular nodes to virtual nodes? That's actually a pretty easy process, so we'll go through that, and then I'm going to talk about some of the benefits of switching to virtual nodes and why it was done. The Cassandra codebase is pretty much all vnodes as of 1.2, so let's talk about why. Since the beginning, Cassandra has always had clusters, and a cluster, of course, is what you get when you set everything up; let's say you have a cluster.
It's full of nodes, and those nodes all contain keyspaces, which all have column families, which hold your data. And every single column family has a row key, and this is where things really start. This is a fundamental piece that you need to know about how Cassandra works and why vnodes are important. So row keys are all about identifying information inside a column family.
Those are the unique part of a column family: you have one row key and a really wide row with lots of columns, and the row key is that fundamental unique part. It can be up to 64K in size in Cassandra, which is probably big enough for anything you need. And there are two ways to sort those. The first one is to sort them using the ByteOrderedPartitioner, or BOP as the cool kids call it, and if you notice, there's a little exclamation point there.
That's because it comes with a warning: don't do this unless you really know why you want to do it. You'll hear plenty of people, when they're talking about Cassandra and the ByteOrderedPartitioner, freak out about it. There are good times to use it, but what you'll probably see far more often is the RandomPartitioner, and that means that when you place a row key into your cluster, it's randomly put onto one node.
So it tries to distribute the data. What about the RandomPartitioner, how does that work? As you set row keys into your database, how do you create some random numbers? That's an important piece: we want to randomize it so that it distributes your data around the cluster. So how do you make that number big enough that we can use it, and also make it reproducible? That's where MD5 comes in. MD5 is a cryptographic hash, and basically what you do is take a string and put it through MD5; you run it through this algorithm and you get out a 128-bit number. And what's really magic about it is that it always comes out as a 128-bit number. So if I put in my Twitter handle, boom.
I get a 128-bit number. Now, what's interesting is that every time I put in that string, @PatrickMcFadin, I'm always going to get that particular number, and that's what's called consistent hashing. It's a very commonly used algorithm: memcached uses it, other NoSQL databases use it. It's a very handy way to get a consistent hash. And MD5, even though it's used for doing some crypto, like encrypting passwords and things like that, is really better suited for this.
It's not very strong encryption; it's really better for the rest of us, just creating these big numbers, and Cassandra has always relied on that. So whenever you create a row key, it hashes it into this 128-bit number, and that's really important, because we need a really big number, because you can have a lot of data, right? It's big data. So how big is this number? Well, the hash range is zero through 2^128 minus one.
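As a concrete sketch of what he's describing (my own illustration using Python's standard hashlib, not Cassandra's code): the RandomPartitioner's trick is just that MD5 always turns a row key into a 128-bit number, and the same key always produces the same number.

```python
import hashlib

def md5_token(row_key: str) -> int:
    """Hash a row key into a 128-bit integer, roughly what the
    RandomPartitioner does to place a key on the ring."""
    digest = hashlib.md5(row_key.encode("utf-8")).digest()  # 16 bytes = 128 bits
    return int.from_bytes(digest, byteorder="big")

token = md5_token("@PatrickMcFadin")
print(token == md5_token("@PatrickMcFadin"))  # True: same key, same number, every time
print(0 <= token < 2**128)                    # True: always fits in 128 bits
```

The determinism is the whole point: any node in the cluster can hash the same key and agree on where it lives.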
For those of you not so mathy: I put out that number, and there are a lot of commas in there and a lot of digits. I don't even know, it's like a gazillion or something, but it's big. So Cassandra actually uses 2^127, and the bonus reason, so you can tell people about this, is that we take out one bit for fun. But still, I mean, tell me.
If you can tell the difference between a 2^127 number and a 2^128 number, let me know; that would be really cool. So that's a huge number, and that gives us a lot of keys. Incidentally, a little trivia: 2^128 is actually the size of the IPv6 address range as well. Coincidence? I don't know. So this is a really, really big number, perfect for whatever we need, to fit this massive amount of keys into our cluster.
So what does that have to do with nodes? We were talking about row keys; what about nodes? Let's talk about why this makes sense, and this is going to make a lot of sense in a minute. Each Cassandra node is assigned a token, and that token is just a number inside of that range from zero to 2^127. It's a lot of numbers, but what are we trying to do?
What we're trying to do is break up the ranges across all these different nodes so that when you set data in there, it's being spread out neatly. Now, those of you who have set up a Cassandra cluster know that you create tokens; that's part of the setup. And that token is a really, really big number. It looks like a string of digits, but it's actually a number: it's a divisor of that big range.
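The manual version of that calculation, pre-vnodes, looked roughly like this (a sketch of the usual token-generator arithmetic, not an official tool): divide the 2^127 range evenly by the number of nodes.

```python
def initial_tokens(node_count: int) -> list:
    """Evenly spaced initial tokens across the RandomPartitioner's
    0 .. 2**127 range: node i gets i * (range_size / node_count)."""
    step = 2**127 // node_count
    return [i * step for i in range(node_count)]

# The by-hand homework every operator used to do for a new cluster:
for i, token in enumerate(initial_tokens(4)):
    print("node %d: initial_token = %d" % (i, token))
```

This is exactly the "divisor of that big number" he mentions: each node's token is a fraction of the full range.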
So, whenever you put a token in there, it marks the ownership: you'll see it goes from token zero all the way up to, well, I won't even say that number. But that means that everything on that node is going to be inside that range. So what happens whenever I take a row key and say, I'd like to put this into my database? You run it through MD5, so you can get that 128-bit number consistently, right?
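Putting those two pieces together, the placement he's describing can be sketched like this (my own toy model, not Cassandra internals): sort the node tokens, hash the key, and the owner is the node with the first token at or above the hash, wrapping around the ring.

```python
import bisect
import hashlib

def key_hash(row_key: str) -> int:
    """MD5 the row key and fold it into the 0 .. 2**127 token range."""
    digest = hashlib.md5(row_key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % 2**127

def owner_token(tokens: list, row_key: str) -> int:
    """Return the token of the node that owns row_key: the first node
    token at or above the key's hash, wrapping back to the lowest token."""
    ring = sorted(tokens)
    i = bisect.bisect_left(ring, key_hash(row_key))
    return ring[i % len(ring)]  # i == len(ring) means wrap around to ring[0]

tokens = [i * (2**127 // 4) for i in range(4)]  # four evenly spaced nodes
print(owner_token(tokens, "@PatrickMcFadin") in tokens)  # True
```

Every key hashes to exactly one owning range, which is why the size and spacing of those ranges matters so much.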
This is what everyone's been using up to this point, and that means a node is responsible for all of those keys, and it could be a ton of keys. If you have a three-server cluster, you're going to have a lot of keys on one server, and that's how the single-token setup works: you have one server, one token, one node, and that can get kind of out of hand. So what the Cassandra community has always talked about is these commodity nodes. You'll also hear people talk about it this way.
We want more of less: really small, tiny servers, cheap 1U boxes, not a lot of disk. That has really been kind of the way we want Cassandra to operate; it works better that way, right? Well, that's not the way people want to do it anymore, because, let's face it, you're racking up a few hundred nodes, and I know companies that are doing this. You don't want to put in 150 servers to do this. You want to try to pack it down and get a little more density.
So what you really want is density. And how do you go from a commodity node, this tiny little Velociraptor, to these big Tyrannosaurus rex server racks? That's a big problem, so we need a plan, and that plan is packing more data into a single server. Fortunately, we have a solution; that's why we're here.
So we want to make it so that one node is responsible for more data. Well, how do we do that? This is where virtual nodes are going to come in. The other thing, and this is really important, is probably one of the most challenging aspects of starting with Cassandra, and it's funny whenever I get into this conversation with people, because they just look at me like I'm crazy: that's assigning tokens. You know, you have to actually plan this out.
So let's be serious: token assignment sucks. I've been a long-time Cassandra user, and this is one thing I've never really liked, so it's hopefully something you'll never have to do again. The reason it's so bad is because you have to do this business of: all right, we need to evenly distribute all the tokens across however many servers you have. And then, when you get it evenly distributed and you go to grow it, the guidance has always been: well, double your ring. Well, doubling your ring isn't very practical.
Sometimes you only need 10% more capacity. You know, Cassandra is a linear scaler, but if you're doubling every time, that's not very linear. And if you want to put in just, say, one server or two servers, then you have to do a rebalance operation, and that sucks. And then, if you shrink a ring and take out one server, you have to rebalance again, and that's always kind of sucked. So this is not good. We don't want that anymore.
The other thing that's always been kind of a pain is that whole business of having to add the token into the Cassandra config file, and I've seen some pretty crazy Chef scripts out there that do this. It's really because that has to be done as the server is coming online, so it can insert itself properly.
And if it isn't done, if the token is randomly assigned, you're going to have to do a rebalance anyway, so all of this has not been fun for anyone in operations. So let's kick that out; we have relief: virtual nodes. The whole premise of virtual nodes is that these big servers, these Tyrannosaurus rex servers, should have many nodes, and these servers can take it. But what we should do is make each one of those nodes very small, and make them not as expansive in their range of keys and tokens.
And why are we assigning tokens at all? I mean, this is kind of silly. It's like assigning a MAC address to your network card by hand: why are we doing that? This is the 21st century. So we're going to have a whole new plan here with version 1.2. Let's see how this works. Virtual node features, here we go. This is probably why you're here: what is this all about?
Each node, or rather each server... see, we're going to have to get away from this; I'm saying this a lot. We usually say a node is like a server, but now the server and the node are different things. So you're going to have 256 nodes per server by default. That's kind of interesting to watch; I mean, you can see this.
You can see your key range get diced up into 256 different vnodes on one piece of hardware. That's a lot, but if you think about it, that's a really good idea, because each one is a very small range, and a lot of algorithms that work on key ranges, such as doing repairs and doing cleanups, are now operating over a smaller key range. That's good.
You start up a server and you bring it into the ring, and you don't have to assign a token to it. It does it itself: it figures out where you're at, creates all the tokens, and then evenly distributes them. It sounds magical and kind of unicorny, but hey, it's Valentine's: feel the love. I mean, this is good. That's what we wanted all along anyway, right? It's going to make operations' life so much easier.
I'm so glad we have this now, because I don't want to explain to people that they have to go figure out divisors of a 2^127 number anymore. The other thing that we're going to look at improving, now that we are creating these smaller key ranges, is rebuild. Look at the way rebuilds are done now, like when you lose a whole server.
Yes, you're losing a lot of key ranges, but when you bring a new server back online and it has a lot of smaller key ranges, we can do things like take smaller chunks and parallel-stream them in from all the other smaller chunks out there, and that makes it a lot faster to bring a node back online.
And when you're bringing a new server online, let's say you do need 10 percent more capacity, or 20 percent more capacity: you just add 10 or 20 percent more servers, and whenever they come online, they're auto-assigning their tokens and they're evenly distributing themselves. You don't have to do a rebalance operation. And yes, there's a new partitioner beyond MD5 and the RandomPartitioner; I'll cover that in a minute. But these bullet points are really where vnodes are at.
We had to assign our tokens back in the day, so you know, you can brag it up; you've got that going for you. So how do you transition? I'm happy to say this is really easy. You don't have to bring down your cluster: you have a running 1.1 cluster, and you can transition to 1.2. So here's how you do it. When you upgrade to 1.2, you can leave it using the RandomPartitioner and having a token assigned to each box; that, guys, is totally cool.
That's probably how you're going to start: you do the in-place upgrade, and everything's still sitting there as if it were a 1.1, non-virtual-node cluster. So, to change it to a virtual node cluster, you go into the yaml file, and you'll see there are two lines: num_tokens and initial_token. The initial_token line is probably going to have your token, that big old fat huge number that you had to assign to it when you initially created the cluster.
Don't worry about that right now; there will probably be some blog posts on that number and how it relates to your hardware, but for now just go with 256. Then restart the node, you know, a Cassandra stop and restart on the command line, and when it comes back online, if you look in the system log, you'll see it stop for a second, and also you'll see it create a bunch of contiguous vnodes in that same key range.
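In cassandra.yaml, the change he's describing looks roughly like this (the exact surrounding comments vary by version; 256 is the default he mentions):

```yaml
# cassandra.yaml on an upgraded 1.1 node, switching it to virtual nodes
num_tokens: 256     # this server will now own 256 small token ranges
# initial_token:    # the old single-token assignment is no longer needed
```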
That is the key range the older node was responsible for, and then it'll just take it up and keep going. All of a sudden that node is back online and it has 256 vnodes, and it'll operate with all the other non-vnode servers just fine. So you go ahead and just do that on each server inside your cluster, and when you're done, you'll have a nice setup. So now, once that's done and you have all these vnodes out there, there's one more operation that needs to be done, and this is pretty critical.
When transitioning from non-vnodes to vnodes, what's going on, like I said, is that it's creating a contiguous range of tokens: the range that's inside that server is going to get busted up into 256 chunks, but they're all going to be next to each other. That's not good! We want them spread out. So what we're going to do is run this shuffle operation, and this is a one-time deal. You only have to do this once, when you're upgrading.
No, seriously, the first time I ran it, I wondered whether it was doing anything. But it is, and it's meant to be as close to zero impact as it can be, and it's going to take a while. I mean, it could take a few days if you have a very, very large cluster, but that's okay. It's just moving one little piece at a time, here and there, and it'll take care of it. If you want to keep track of what it's doing, you can use the cassandra-shuffle ls command. So that's it.
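A toy model of why the shuffle matters (purely illustrative, nothing like the real implementation): right after the upgrade, each server's vnodes are contiguous slices of its old range; the shuffle trades ranges around so each server keeps the same number of vnodes, but they end up scattered across the ring.

```python
import random

def upgrade_in_place(servers: int, vnodes_per_server: int) -> dict:
    """After the 1.1 -> 1.2 upgrade, each server's old range is split into
    vnodes that are still contiguous: all next to each other on the ring."""
    return {s: list(range(s * vnodes_per_server, (s + 1) * vnodes_per_server))
            for s in range(servers)}

def shuffle(ownership: dict, seed: int = 0) -> dict:
    """One-time shuffle: redistribute ranges randomly around the ring,
    keeping the per-server count the same."""
    rng = random.Random(seed)
    all_ranges = [r for ranges in ownership.values() for r in ranges]
    rng.shuffle(all_ranges)
    per = len(all_ranges) // len(ownership)
    return {s: sorted(all_ranges[s * per:(s + 1) * per]) for s in ownership}

before = upgrade_in_place(servers=4, vnodes_per_server=4)
after = shuffle(before)
print("before:", before[0])  # contiguous: [0, 1, 2, 3]
print("after: ", after[0])   # the same number of ranges, drawn from all over the ring
```

In the toy model, every range still has exactly one owner afterwards; only the adjacency changes, which is the whole point.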
So, let's walk through it. I've got really cool graphics here; I've learned how to do graphics with Keynote, so I'm feeling pretty cool. Let's do this one step at a time. So here's our existing 1.1 cluster, and every server has a range of keys. I kept my key space really, really small, because it's got to look good.
So I have four keys on each server; this is my existing 1.1 cluster. Now I'm going to upgrade, so I'm going to change num_tokens to four, and it's going to split these up once I restart. Now, this is what I was talking about: I have the contiguous range of keys still sitting on each single server, as it was before, just broken up into four different chunks.
And after the shuffle, my data is now spread very evenly around every single server, so that is pretty much how it works. So what does this mean for operations? From what I can tell, the group that's going to benefit the most from virtual nodes is operations. I was just recently talking to Jason Brown at Netflix, who's one of the guys behind Priam, and Priam was built around the idea of assigning tokens.
But now, like half of that code is just going to have to get ripped out, because none of it is going to be needed anymore, and that's pretty amazing, because that's always been the problem. But now? Not a problem. So when it comes to your ops life, you can just add one node, two nodes, three nodes: no problem, no rebalance, no shuffling or anything going on, no token assignments. I think tokens are going to become one of those academic things about Cassandra.
It's the kind of thing people will know about when they really want to dig into the engine and how it works, but most people really won't know or care, and that's okay; it should be in the background. And now we're looking at building bigger servers, and this is another operations topic, but it's also good for everyone else.
Anyone who's got to spend money on servers benefits: if we can start building bigger servers, that's going to be good. So we're looking at how many tokens should be defined on these bigger servers. Again, you know, look for a blog post in the future; I'm actually working on this topic right now, and I want to know more about it too.
I mean, what is a good number for however big a server you have? But it makes sense that you can start adding more and more tokens to a bigger and bigger box, and you can have dissimilar sizes too: you can have fewer token ranges on one and more on another if you have dissimilar hardware. Also, the decommissioning of nodes is just like the adding of nodes.
Another thing to mention for the ops folks is that there is a new nodetool command. Now, if you have 256 vnodes sitting on a single server, the potential is that you're going to have a lot of nodes. If you do nodetool ring, and those of you who know that command know it just shows all of the different token assignments across your entire cluster, think about a 4-node cluster, or 4, I keep saying that, a 4-server cluster with 256 vnodes per server.
You're looking at 1,024 tokens that have to get displayed. So now we have a new command called status, and it just shows you each server, each JVM that's running Cassandra, and then, of course, how many tokens are on it. So it's a much, much nicer way of looking at your ring without having to, you know, pipe it to more, or worse, just watch it stream past your screen. So, one more for the baby dinos: vnodes for the win.
So I mentioned that we do have a new partitioner. Along the same lines of "hey, how do I pick a really big number, make it consistent, and hopefully fast," we have a new partitioner called the Murmur3Partitioner. Murmur3 is an algorithm much like MD5, but it's not a cryptographic one. There's a little bit to that: cryptographic means it's secret-spy stuff, you know; there are a lot of password systems that use MD5, unfortunately, and Murmur3 is not for that. It's not cryptographic.
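For flavor, here's the small 32-bit variant of MurmurHash3 in plain Python (a sketch for illustration only; Cassandra's Murmur3Partitioner actually uses the 128-bit x64 variant of the algorithm). Like MD5 it's deterministic, but it's built purely for speed and distribution, not secrecy.

```python
def murmur3_32(data: bytes, seed: int = 0) -> int:
    """MurmurHash3, x86 32-bit variant: fast, non-cryptographic hashing."""
    c1, c2 = 0xcc9e2d51, 0x1b873593
    h = seed
    # Mix each full 4-byte block into the running hash.
    rounded = len(data) & ~3
    for i in range(0, rounded, 4):
        k = int.from_bytes(data[i:i + 4], "little")
        k = (k * c1) & 0xffffffff
        k = ((k << 15) | (k >> 17)) & 0xffffffff
        k = (k * c2) & 0xffffffff
        h ^= k
        h = ((h << 13) | (h >> 19)) & 0xffffffff
        h = (h * 5 + 0xe6546b64) & 0xffffffff
    # Mix in the 1-3 leftover tail bytes, if any.
    tail, k = data[rounded:], 0
    if len(tail) >= 3: k ^= tail[2] << 16
    if len(tail) >= 2: k ^= tail[1] << 8
    if len(tail) >= 1:
        k = ((k ^ tail[0]) * c1) & 0xffffffff
        k = ((k << 15) | (k >> 17)) & 0xffffffff
        h ^= (k * c2) & 0xffffffff
    # Finalization: avalanche the bits so similar inputs diverge.
    h ^= len(data)
    h ^= h >> 16
    h = (h * 0x85ebca6b) & 0xffffffff
    h ^= h >> 13
    h = (h * 0xc2b2ae35) & 0xffffffff
    return h ^ (h >> 16)

print(murmur3_32(b"@PatrickMcFadin") == murmur3_32(b"@PatrickMcFadin"))  # True: consistent, like MD5
```

Notice there's no cryptographic machinery at all, just multiplies, rotates, and XORs, which is where the speed comes from.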
It's just meant to be a way to create hashes, and in the testing that has been done, it's slightly faster than MD5. Not ridiculously faster, but as Cassandra moves forward, we're getting, like I said, bigger, bigger, bigger, more, more, MORE, and it's good to get those incremental changes while we can. So starting with version 1.2, that will be the default partitioner. Now, I've already gotten this question, and I'm going to answer it right now.
No, you do not need to convert your 1.2 cluster from the RandomPartitioner to the Murmur3Partitioner. First of all, it's almost impossible to do that with any kind of ease, and the second thing is, it doesn't give you that much of a performance increase. It's really not going to be that big of a difference. It's really for the future: as we create new clusters and new setups going forward, those will use Murmur3. But don't feel like you're now in the lower performance tier.
It's not that big of a difference yet. And if you don't believe me, well, this is open source, right? We can go look at exactly what happened. I have the JIRA right here, CASSANDRA-3772; that has the entire details, from the proposal all the way down to when it was committed. And, you know, if you've never looked at a Cassandra ticket, or any open source ticket in the Apache JIRA, I suggest you go do it. It's really interesting.
I think I've even had some non-technical users look at one, just to see and get some insight into the process, and it's very open. As it's proposed, different people talk about it; the debate goes back and forth. I mean, it really is all out there. There are no backroom conversations where someone then comes out and says, "Okay! We decided something, and here's what we're going to do." No, it's all done right there.
So it's pretty interesting, and you know, you find out what the motivations were, and you'll see, like in this particular ticket, how at first it wasn't so performant; then there were some people trying to figure out why, and then they figured it out. So I would just suggest going and checking it out, just so you can understand how the process works. So, the conclusion: you can go do this today. This is in Cassandra 1.2. And here's my one gratuitous Valentine's thingy with a cupid, or, as my four-year-old says, the baby that shoots people.
So you can go get it. You can download it: go to datastax.com for the community version, which is going to be your RPM and deb files, if you want to use those, or you can just go directly to cassandra.apache.org and download the tarball. We're currently on version 1.2.1. If you want to try a test upgrade of one of your 1.1 clusters, that's awesome; go out and give it a shot.
It's really cool. I put these references at the end so you can read some of them; there are these two blog posts that we had. I'm going to put my slides up on SlideShare, and that's more of why I put this slide up: if you come back and look at it later, you can go look at these blog posts, and there's a lot more detail about some of the motivations.
I am going to read you some questions. Just a reminder to those of you watching: please ask your questions in the WebEx Q&A tab; I will go through those and pose them to Patrick. And just as a follow-up to what he said about posting to SlideShare, we will be emailing out the video archive of today's presentation. If you had such an amazing Valentine's Day that you want to relive it, you will be able to do that tomorrow. Okay, so this one is from Steve Brawner; let's get straight into it, Patrick. It's about the Murmur3 partitioner.
Interesting. Well, that's where the optimization was found: within the micro-benchmarks, and so we're talking about nanoseconds here. I know that Murmur3 has a lot more optimizations, especially for multiple processors. I couldn't say exactly, "yes, it's going to be better than that," but I know that, from what I've read about both MD5 and Murmur3, Murmur3 in the long run will win.
Okay, thank you very much. Jack Schmitt and Mike both have the same question, which I think I can answer. The questions are along the lines of: any idea when we'll see 1.2 in DataStax Enterprise? So, it will not be included in DataStax Enterprise 3.0, which is slated to drop on February 25th.
The big theme around that release is all around security, but we are scheduling a drop of DataStax Enterprise later in the year which will include all the greatness of 1.2. There's a little lag, as we do a lot of testing on it, making sure it's ready for production. Patrick, anything to add on that?
That is definitely the DataStax Enterprise team making sure it's baked in. So, I mean, you can go out and get the community version if you really want to play around; you know it's going to take a while for it to get fully baked for the enterprise, and for DataStax Enterprise especially, there's a lot more than Cassandra going on there.
It's interesting, for those digging through this: initially, a lot of the existing rules still apply, so the adjacency rules are still there. So, for instance, one token, then the next adjacent one, and the next adjacent one after that are where the replica pairs live. The only difference is in how, and where, those replicas live, and of course you don't want them together, and this is why the shuffle is such an important operation.
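What he's describing can be sketched as a simple clockwise walk (my own toy model of SimpleStrategy-style placement, not Cassandra's code): replicas go on the next tokens around the ring, but skipping vnodes that belong to a server already holding a copy, which is exactly why a contiguous block of one server's vnodes needs shuffling.

```python
import bisect

def replicas(ring: list, key_token: int, rf: int) -> list:
    """Walk the ring clockwise from key_token and pick the next rf tokens
    that belong to distinct servers, skipping adjacent vnodes that live
    on a server we've already chosen."""
    ring = sorted(ring)  # list of (token, server) pairs
    start = bisect.bisect_left([t for t, _ in ring], key_token)
    chosen = []
    for step in range(len(ring)):
        _, server = ring[(start + step) % len(ring)]
        if server not in chosen:
            chosen.append(server)
        if len(chosen) == rf:
            break
    return chosen

# Contiguous vnodes (pre-shuffle): tokens 0-3 all on server A, 4-7 on B, ...
ring = [(t, "ABCD"[t // 4]) for t in range(16)]
print(replicas(ring, key_token=1, rf=3))  # ['A', 'B', 'C'] despite the adjacent A vnodes
```

The distinct-server rule is what keeps two replicas off the same physical box, matching what he says about the algorithms being built to avoid that.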
No, it's not even a best practice; it won't even be allowed to happen. These were some of the questions I had initially, but that wouldn't make any sense. So that's one of the reasons the shuffle is important to do: it calculates where all the tokens and the replicas are, and then makes sure that those are not sitting next to each other.
I love that question, and that's actually a feature that I've proposed. So right now, no: the JBOD configuration is balanced by just the keyspace, and not the actual virtual node itself. I think an interesting thing would be, now that we're talking about virtual nodes and having these smaller ranges, how it would play into larger hardware. Again, one of the points of virtual nodes is to use larger boxes.
So let's say you have 8 or 16 or 24 drives in a JBOD configuration. I think it does make sense in some of those configurations, especially with SSDs, to pin certain virtual nodes to a single SSD. Now, I might get some argument from some of the developers on the Cassandra side, but I could see that, from an operations side, making a lot of sense, because then you're, you know, really sticking to that one disk.
Another topic that's been floated around is pinning, on large-core boxes, a virtual node to a particular CPU or core, and that also has some benefit as boxes get bigger. I mean, let's face it, a 128-core box is not that far away, and do you want to go through the cost, on the CPU side, of having all this context switching going on as you take the same thread for the same virtual node and move it around to all these different cores?
Sounds like you definitely got a plus-one from Chris on that feature. This one is from Steve Brawner; I will take this one. So Steve asked: any early-release or dev versions of Cassandra 1.2 DataStax Enterprise we can sign up for or volunteer for? And then the hashtag: guinea pigs welcome.
Steve, the answer is yes. Email me, christian at datastax dot com, and I will put you in contact with the right person to enroll you in our early access program.
You only run the shuffle once. When changing the replication factor, it's going to change things: it's going to put replicas on different nodes, that is, on different physical servers. So yeah, this is another one of those questions I asked early on too, and I've been assured that that is not the case: the algorithms are built so that they will not put replicas on the same physical piece of hardware.
Well, that's where I think we're looking at creating these different counts of vnodes. On your older box, maybe you only have 256 virtual nodes on that one, but on the bigger box, creating maybe, say, 512 makes sense, because you're using more hardware: more memory, more disk, more CPU. So you want to try to keep the smaller vnode size if you can, and you're just going to put more data on there anyway, right?
And a couple of links here for everyone. You know, lots of training events, probably one in your area; check it out at datastax.com. Check out the new community resource, which is Planet Cassandra: lots of good information on different use cases out there, and also lots of blog posts and good info. And then, if you are on the East Coast, on March 20th we have our NYC* Big Data Tech Day. We have great speakers presenting, including eBay and Comcast and Instagram.