Description
'Tis the season to get all of your urgent and demanding Cassandra questions answered live! Get ready for yet another very special edition of the Cassandra Community Webinar -- Q&A style. Please join us as we close out 2013 with an entire hour devoted to answering as many Cassandra questions as possible. Our panel of experts has a vast knowledge of Cassandra and will be answering questions from a variety of personas: developer, architect, and operations.
A: Hello, everyone, and welcome to this Cassandra community webinar. Thank you for taking the time today to join us and usher in the season. I am delighted to have with me today three fine gentlemen from the Cassandra community: Patrick McFadin, he's the chief evangelist here at DataStax; Al Tobey, an open source mechanic, and I'll explain exactly what that is in just a second; and Matt Stump, Solutions Architect here at DataStax. So thank you to those of you who have already submitted questions for us to answer on this webinar.
A: If you would like to go ahead and submit a question, please use the Q&A tab inside of WebEx and type your question there. We will get through just as many of them as we can in the remaining time. So before we get started: Patrick, why don't we start with you? You can give us a little bit about your background and your areas of focus for Cassandra, and then, if you could, throw it over to Al, and then I'll throw it to Matt.
B: What, you wanted to, okay, good. Well, so, yeah, I'm Patrick McFadin, as Al likes to point out, and I always tell him to stop talking about it, but all those data modeling videos that I do, yeah, so that's me. I'm chief evangelist at DataStax, formerly a solutions architect, and I work with customers on a lot of cool stuff. So I've probably seen most of the big problems, and I'm always surprised that every day I find a new one.
C: I'm Al, and I am an open source mechanic at DataStax. What that means is I mess around with little projects and find little corner cases and lines of investigation around Cassandra and storage, things like why these particular kinds of disks are very slow, and try to share that with the community. I've been doing operations prior to this for about 15 years, so I've seen a heck of a lot of storage and servers and things like that. So those are all fair-game questions, including things to do with Cassandra.
B: Hi, my name is Matt Stump. I'm a solutions architect with DataStax. Basically, what that means is I help really large customers deploy Cassandra successfully. That can be data modeling, performance tuning, advanced troubleshooting, planning applications, all that sort of stuff. Previous to joining DataStax I was a customer for two years; I was one of the first customers on Solr. So I have a lot of experience both within DataStax and actually out in the field using it as a user.
A: So last time we did this, I tried to preempt who would tackle the questions. I learned my lesson, so this time I'll just read the question, and then whoever would like to claim it, you just jump on it. This one is from soul shaker, and he says: Hi. First off, deleted keys are visible if I run a SELECT * from the table until compaction occurs for the tombstones. Is there a way to recover deleted data in Cassandra? It seems tedious.
B: I mean, the tombstone is essentially a system marker, and unfortunately the act of changing a tombstone back into something real is just rewriting data. So you would just need to rewrite your data. I haven't seen anyone create something that would go through, say, an SSTable and remove a tombstone; that seems pretty dangerous, actually, yeah.
C: That's probably the best you can do without having to hack together a bunch of code.

B: Yeah, no, I think every time I've heard the question, like, "can I undelete a tombstone," and then I delve into what the real problem is, what they're looking for is something that snapshots do a lot more efficiently anyway. So I think that undeleting a tombstone is just a really hard way of doing something you should be doing with snapshots in the first place.
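To make the tombstone mechanics concrete, here is a minimal CQL sketch (the table name and values are hypothetical): a DELETE writes a tombstone marker rather than removing data, and both the marker and the data it shadows are only purged by compaction after the table's gc_grace_seconds window.

```cql
-- Hypothetical table; gc_grace_seconds defaults to 864000 (10 days)
CREATE TABLE users (
    user_id text PRIMARY KEY,
    email   text
) WITH gc_grace_seconds = 864000;

INSERT INTO users (user_id, email) VALUES ('alice', 'alice@example.com');

-- Writes a tombstone; the old cell is shadowed, not physically removed
DELETE FROM users WHERE user_id = 'alice';

-- Returns no row, even though the data may still sit in SSTables
-- until compaction runs after gc_grace_seconds has elapsed
SELECT * FROM users WHERE user_id = 'alice';
```

As the panel suggests, the practical way to "recover" deleted data is to restore from a snapshot taken before the delete, not to dig tombstones out of SSTables.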
A: Okay, great, thank you very much for that. Next question; it's a two-parter from Simon. Part one: when can we use triggers in production environments? I think that's more a stability question, maybe a DataStax Enterprise related area there. And part two: can we use triggers across different virtual data centers? For example, the insertion of a new user profile into a user profile table in virtual data center one can trigger the insertion of a new-user event into an event table in virtual data center two.
B: So I'll take the first bit. Triggers became available in Cassandra 2.0. Cassandra 2.0 is shipping; there are people using it. We've gone through three minor passes so far, so currently 2.0.3 is available. However, Cassandra 2.0 isn't available yet in DataStax Enterprise, so that means the largest accounts, the largest users of Cassandra, aren't on 2.0 yet. So take that for what it's worth. As for using triggers how you described: first, you can use a trigger to fire on a mutation, which then executes a Java class.
B: That Java class can do anything. That could be an insert into another column family; that could be a call-out to an external service. It's just Java; as long as you implement the interface, you're good to go. The triggers are executed asynchronously, and they have their own thread pool that's separate from the mutation stage in Cassandra.
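For reference, the Cassandra 2.0 trigger DDL looks like the sketch below; the trigger and class names here are hypothetical. The Java class implements the ITrigger interface and its jar must be available on every node:

```cql
-- Fires the Java class on every mutation to user_profile;
-- com.example.UserAuditTrigger is a hypothetical ITrigger implementation
CREATE TRIGGER user_audit ON user_profile
    USING 'com.example.UserAuditTrigger';

-- Triggers can be removed again with:
DROP TRIGGER user_audit ON user_profile;
```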
B: Now, I think there's a misconception about how data centers work in Cassandra. If you have two data centers, A and B, they both participate in the same ring, and so they both have views of all of the information. Typically, you would have a policy that says: I want two replicas of my information in data center A and two replicas of my information in data center B. Cassandra just handles that for you, and so if you write to one of the data centers, that will automatically be replicated to the second data center.
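The per-data-center replica policy described above is expressed on the keyspace. A minimal sketch, assuming the snitch reports data centers named 'dc1' and 'dc2' (the keyspace name is hypothetical):

```cql
-- Two replicas of every row in each data center; writes arriving
-- in either DC are replicated to the other automatically
CREATE KEYSPACE app
WITH replication = {
    'class': 'NetworkTopologyStrategy',
    'dc1': 2,
    'dc2': 2
};
```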
B: So the answer to that question is no, right. I was thinking that it was about moving data, like having a staging table, or having data propagate from one table to another, which is not a replication thing. If that's the case, I would just replace the word "data center" with the word "keyspace," and there you go.
B: SSDs typically cost twice as much, but if you start to delve into it, it's actually much, much cheaper to run SSDs. With SSDs you can use leveled compaction, which means you can drive your disk utilization up to ninety percent. So right there you're almost getting twice the capacity of an equivalent fast drive or SATA disk, so you're going to recover the cost. Also, the bottleneck in Cassandra is almost always the disk I/O subsystem, as soon as the data per node exceeds the buffer cache of the OS.
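Switching a table to leveled compaction, as recommended for SSDs here, is a one-line schema change. A sketch with a hypothetical table name; the SSTable size shown is a commonly used value, not a requirement:

```cql
ALTER TABLE app.events
WITH compaction = {
    'class': 'LeveledCompactionStrategy',
    'sstable_size_in_mb': 160
};
```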
B: I just got this question the other day. No, the cells that are inside a partition are what they are; there's no auto-partitioning or anything like that. So that's where you really need to consider your data model and how you partition your data using the primary key. When you have multiple partition key columns, you want to think about how that's going to break down your data and how many cells will be in there.
C: CQL does help you, if you do have wide rows, in a case where your data model isn't the problem, with pagination. You can read through those wide rows without having to load them all into memory at once and pass it all over the wire like it used to do. You can actually page over them, you know, a thousand at a time or something like that.
B: It's an artifact of how counters are implemented. When you mutate a counter column, you just write down the increment. So if I want to increment the counter by one, I just write down plus one; if I was going to increment it by three, I write down plus three. Then the read path actually merges those values when you request it, and so you're merging multiple columns together. So it's difficult to do TTLs for that.
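A minimal counter sketch illustrating the log-of-increments model described above (table name and values are hypothetical). Note that counter updates do not accept a TTL:

```cql
CREATE TABLE page_views (
    page_id text PRIMARY KEY,
    views   counter
);

-- Each statement logs an increment; nothing is read before writing
UPDATE page_views SET views = views + 1 WHERE page_id = 'home';
UPDATE page_views SET views = views + 3 WHERE page_id = 'home';

-- The read path merges the logged increments into a single value
SELECT views FROM page_views WHERE page_id = 'home';
```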
B: Yeah, I think you're misunderstanding how snapshots work. A snapshot is just hard links to SSTables, and so you can essentially have Cassandra create a snapshot, which is just a series of hard links to the SSTables, every time an SSTable is created; you're thereby getting an incremental backup. You're not creating a full copy of your data with every single snapshot; it's not like you're copying a terabyte or two terabytes of information. You're just pointing links at your existing information. That's why they are so cheap in Cassandra.
C: That's a little bit complicated. If it's a purely deleted SSTable, where there's no overlapping data with other SSTables that are still loaded in the system, I'm pretty sure you can just copy those into the data directories and use nodetool refresh, and it will pick them up live. You need to make sure that, ideally, you get it on the right node.
C: Get it in the right part of the token range, but that can be done live. Snapshots, and SSTables in general, can be moved around while the system is hot, since an SSTable on disk is an immutable file. It's not recommended that you mess around too much in those data directories unless you really understand how everything is laid out and what the different files are, but yeah, an SSTable can be loaded online.
B: Worst case scenario, there is sstableloader, which you can use to load SSTables from a different machine into a cluster. You can also use that if you're restoring SSTables into a cluster with a different size, or one using a different partitioner than the previous cluster. There's also a bulk-load operation over JMX, which is one of my favorite tools: you can give it a directory full of your SSTables, and it will read all of them and stream them back into the cluster in order.
A: Okay, thank you very much. So, Patrick, I think you may be pimping your data model talks with this question, but awesome asks: how is Cassandra's data model different from Bigtable and HBase? That's part one. Number two: are super columns removed in Cassandra 2.0, or simply not recommended? Part three: is the following statement precise: "a column family is a map of maps; keys of each map are always sorted"? So let's take part one, because this is probably a long question. How is the Cassandra data model different from Bigtable and HBase?
B: It's subtly different. I mean, they both share the Bigtable roots, which means that you have a row with lots of columns, and that's about where things depart. Maybe some smaller subtleties, like secondary indexes, exist in both systems. The bigger difference, of course, is that HBase is more of a sequentially ordered row system, so it can do things like row scans, whereas with Cassandra the row placement is randomized. And there's, you know, more of an implementation difference at that point, where you have region servers and you can have scanners and filters that are just very different in how you use HBase; it's much more complicated in that regard. Now that we've added CQL, the differences are huge. CQL makes the programming techniques and the DDL of working with Cassandra so much easier, and from a development standpoint it's just a lot easier. If you look at some of the projects around HBase, it's about trying to build schema, or something that resembles schema, whereas with CQL you just have it; it's enforced at the database level. So as we go forward, both of those systems are diverging quite a bit. Maybe two or three years ago you could say they were really, really close, but they're starting to diverge quite a bit at this point.
B
The
no
they're
not
removed
and
they're
still
not
recommended
I
actually
just
went
through
this
exercise
so
that
the
implementation
and
storage
engine
has
changed
quite
a
bit
and
we're
before
super
calm.
The
biggest
issue
is
super
columns
with
their
created.
This
massive
destabilization
problem,
where
you
had
to
read
in
the
entire
super
column
and
into
the
jb
m
I
I,
could
tell
you
that
I've
seen
more
garbage,
collection
issues
or
super
comes
and
anything
else,
and
that
is
usually
the
case
where
you
do
a
lot
of
reads
up
to
the
columns.
B: It truly gets a little bit more complicated in CQL 3, because you can have multiple clustering columns, which are essentially columns that map back to a partition, and the last element of your primary key is going to be the column that's used to determine sort order. So the thrust of the question is generally true; it just gets a little bit more complicated with CQL 3.
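In CQL 3 terms, the "map of maps" reads like this sketch (the table is hypothetical): the first primary-key component is the outer map key, the partition, and the clustering column is the inner, always-sorted key:

```cql
CREATE TABLE events_by_sensor (
    sensor_id  text,        -- partition key: the outer map key
    event_time timestamp,   -- clustering column: the sorted inner key
    reading    double,
    PRIMARY KEY (sensor_id, event_time)
) WITH CLUSTERING ORDER BY (event_time DESC);
```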
C
No
one,
it's
probably
an
ec2
if
yours
to
what
you
were
using
OneNote
for
your
seed
note
in
that
you
had
to
replace
that
instance,
you
just
go.
You
just
go
through
and
just
read
a
different
note
for
your
seat.
Node
I
mean
you
can
have
as
many
in
there
as
you
want,
but
yeah
just
pick
another
one.
Any
node
in
the
cluster
can
be
seen.
B
That's
probably
one
of
the
bigger
misconceptions
that
good
speed
note
is
special,
nothing
specialized.
If
she
knows
just
a
designation,
if
not
kind
of
thing,
he
turn
it
into
like
a
master,
node
yeah,
it's
some
people
will
actually
use
just
you
know
round
robin
DNS
for
their
seed
node
and
points
to
a
different
note.
Every
time.
It's
essentially
the
first
note
that
it's
going
to
be
contacted
in
order
to
get
cluster
information
when
a
node
joins
the
ring.
B
Architecture,
so
you
all
right
have
to
be
performed
through
a
single
machine
and
then
those
rights
are
some
points
replicated
down
the
slaves.
There's
a
couple
of
problems
with
that.
The
first
is
that
your
white
throughput
is
limited
by
the
right
throughput
of
a
single
machine,
or
you
have
to
shard
that
machine.
B
The
other
is,
you,
can
you're
always
exposed
to
the
possibility
of
data
loss
unless
you
turn
on
synchronous,
replication,
so
I
write,
a
piece
of
information
for
the
master
master
will
acknowledge
that
right
to
the
the
client's
innocence
then
persisted
to
disk
and
everything's
good.
Well,
if
the
master
were
to
go
down
at
that
point
before
that,
piece
of
information
is
replicated
to
the
slave
and
that
piece
of
information
is
gone
forever
and
you
don't
have
a
way
to
determine
how
much
information
lip
gloss.
B
It's
just
gone,
so
you
can't
have
true
durability
if
you're
using
master
slave
complication
and
you're
doing
that
replication
is
incrementally,
which
is
what
most
advanced
solutions
do.
So
Cassandra
took
a
different
tack.
We
use
a
ring
of
machine.
Every
machine
is
the
same.
They
are
all
peers.
They
participate
in
the
ring.
Equally
and
when
you
perform
a
write-in
Cassandra,
you
specify
what
consistency
level
you
want
that
rights
to
have.
B: You can go even further than that: using a quorum consistency level with a multiple data center setup, you can have a guarantee that your data exists not only on two machines, but also within two data centers. For some customers, those that are storing things like cryptographic keys, this is particularly important: that information can never be recreated, and there's a very large real cost in terms of dollars if it's lost.
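In cqlsh, the per-request consistency level described above can be set with the CONSISTENCY command; the keyspace, table, and values below are hypothetical:

```cql
-- Require a majority of replicas to acknowledge each write
CONSISTENCY QUORUM;
INSERT INTO app.users (user_id, email) VALUES ('alice', 'alice@example.com');

-- Or require a quorum of replicas in every data center
CONSISTENCY EACH_QUORUM;
```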
A: Okay, thank you very much; those were fantastic explanations, thanks a lot, man. Okay, Konstantin, and actually we forwarded you this question in your inboxes because there's some level of detail, but I think we can probably tackle it at a high level here. So Konstantin had a four-node cluster running Cassandra 1.2. They upgraded to 2.0 and enabled vnodes, but didn't shuffle them. They added two new nodes and ran a repair and cleanup on each of the six nodes.
B
Well,
there
is
no
rebalance
operational
v
nodes
that
for
one
thing-
and
that
is
opinion,
I-
think
it's
a
pinion
gear
at
this
point,
I
think
the
first
problem
was
enabling
vinos
without
doing
a
shuffle,
because
that
means
that
all
that
data
is
still
parked
in
the
same
place
and
if
somebody
looks
with
I'm
looking
at
the
email
now
you
know
there
you
can
tell
that
there
is
talk
and
those
that
were
added.
It
didn't
get
all
the
data,
and
this
is
an
x1
and
d1
projects.
B
Without
going
into
details
about
what
we're
looking
at
here,
yeah
you
can
have,
you
can
have
imbalances
pretty
easily
with
vinos,
with
not
impossible
at
all.
So
one
thing
you
should
know
about
be
nodes.
You
will
not
get
a
perfect
balance.
You
will
see
variation,
five,
seven
percent
between
the
nodes
and
that's
based
on
the
amount
of
those
you
have
in
the
system.
B
Know
if
you,
if
you
upgrade,
you,
can
take
a
cluster
in
your
navel
vinos.
You
have
to
run
a
shuffle
after
that.
That's
the
step
to
the
way
we
we'd
like
to
do.
This,
though,
now
is
add
a
data
center
with
v
nose
in
it
bring
up
that
data
center.
Let
the
data
stream
into
that
that
cluster
and
then
use
that
as
your
new
cluster.
C: No, not at all. I mean, it depends on what you want to do in your development and testing environment, but you should be able to run just fine with a single node. You'll push your schema and run with a replication factor of one. We don't recommend that in general, but if you want to run it on workstations, or you have small environments, absolutely you can run a single node, with down to a gigabyte of RAM; people have gotten smaller.
C
Your
performance
isn't
going
to
be
seller,
but
for
non-production
uses
that
can
suffice
and
save
you.
A
lot
of
resources.
I
personally,
have
always
preferred
three
to
five
nodes
for
at
least
staging
clusters.
So
I
get
some
real-world
behavior
I
want
to
see
the
replication
system
in
action
in
my
staging
environment,
but
for
development.
It
doesn't
need
to
be
that
big.
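For a single-node development setup like the one described, the keyspace is simply created with a replication factor of one. A sketch with a hypothetical keyspace name; as the panel notes, this is for development, not production:

```cql
CREATE KEYSPACE dev
WITH replication = {
    'class': 'SimpleStrategy',
    'replication_factor': 1
};
```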
B
So
the
caveat
of
courses,
don't
don't
spin
up
a
test
cluster
and
then
expect
it.
Yeah
I've
seen
this
too
many
times
where
well,
it's
not
performing
very
well
and
or
it's
not
what
I
expected.
You
have
to
balance
that
out.
If
you're
going
to
put
suboptimal
machines
or
a
smaller
configuration,
it's
not
a
performance
environment.
It's
more
of
functional!
Well,.
A: So, Ramakrishna, if you go to PlanetCassandra.org, we have a five-minute interview there, and a use case, with BlueMountain Capital Management. They're a hedge fund manager, and they are using Cassandra fairly heavily for their exchange system. So you can go and read up about that, and there are a couple of presentations as well that they've given at meetups.
B: I can say, in general, the types of use cases we see in financial institutions are fraud detection and user-screen analysis; also mobile applications that need that kind of scaling, the mobile front-end information, and things like their customer-facing websites. So these are generalities; the five-minute interviews are always going to have more detail.
A
I'm
not
sure,
but
that
someone
has
an
IM
on
that
sounds
like
a
duck
as
being
stabbed
or
something
okay.
Next
question
from
Vijay:
if
we
have
different
data,
sets
with
the
same
t,
say
flow
underscore
ID
coming
in
at
different
time
intervals.
What
is
the
best
way
to
model
the
table
so
that
these
new
columns
can
be
inserted
and
are
there
any
performance
implications
with
updates.
B: This is straight-up time-series data modeling, and this has been talked about endlessly, too, so that's good; I mean, there's plenty of information about how this works. Essentially, what you're looking at is a date-bucketed storage row, and in CQL it works quite well. You could create a primary key, say, with that flow_id. It looks like, and I've seen this several times, this is modeling network flows, where flows are the TCP stream or something like that, and I did a meetup actually on this very topic.
B
Nasa
uses
it
for
this
very
use
case.
So
might
wanna
look
that
up,
but
the
it
works
quite
well.
If
you
use
the
time
staff
as
as
the
value
to
in
your
primary
key,
that
creates
the
partition
for
the
clustering
key
and
the
performance
is
really
good
and
that's
one
of
the
reasons
Cassandra's
so
good,
a
time
series
how
it
lays
out
that
data,
then
the
storage
engine.
So
if
you're
updating
that
data
say
with
a
flow
ideas
X
and
have
several
event,
it
hit
time
Syria
time
anyways,
it
was
started
memory.
B
It
is
stored
on
disk
and
assorted
for
master,
and
you
go
to
retrieve
it
assisting
the
slices
in
the
seat.
The
thing
I
always
mentioned,
though,
is
it,
is,
if
you're
doing
something
with
a
generality,
actual
ID,
think
about
the
size
of
that
of
that
store
or
even
embed
that
that
cluster
is
going
to
get
much
bigger.
So
the
partitioning
you
want
to
try
to
do
is
make
a
slow
ID
in
a
base
like
a
single
day
or
week
find
a
way
to
break
that
down.
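Putting that advice together, a sketch of the flow table with a day bucket folded into the partition key to bound partition growth; the column names beyond flow_id are hypothetical:

```cql
CREATE TABLE flow_events (
    flow_id    text,
    day        text,          -- e.g. '2013-12-19'; bounds partition growth
    event_time timestamp,
    payload    blob,
    PRIMARY KEY ((flow_id, day), event_time)
);

-- All events for one flow on one day land in one partition,
-- sorted on disk by event_time
SELECT * FROM flow_events
WHERE flow_id = 'flow-42' AND day = '2013-12-19';
```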
C: And with updates, the thing to watch out for is: if you're doing updates on old, or what we call cold, data, you could end up triggering compaction of data that's been cold for a long time, depending on which compaction scheme you're on. That's the one to watch out for: if it's not actually the kind of time series where your data settles down, gets stored on disk, and doesn't change anymore.
B: There's a data modeling title on this, actually. I believe I did time-series data modeling in one of the data modeling videos, yeah. So I cover it; actually, I have a paper up on Planet Cassandra, "Getting Started with Time Series Data Modeling," that's probably exactly what you want. Just replace the weather station ID with flow_id and go forth; you'll love it. And that's not even a video.
C: I'm pretty sure Patrick or Matt have seen that before. The most I've seen in production is a couple dozen, and I believe there are some problems you can run into if you get into the hundreds, but it does happen; it's just not recommended. Right, Patrick or Matt?
B
It's
the
number
of
columns
families,
but
that
there's
a
lipid
tuning
problem,
but
each
each
column
family
has
a
mag
of
overhead
and
splat
allocation,
which
you
can
turn
off.
So
thinking
about
memory
usage,
if
you
had
say
a
thousand
column
families
in
a
in
a
key
space
button,
you
would
have
a
Giga
data
sitting
there
and
that's
just
an
overhead
for
those.
But
you
can
turn
that
off
and
that's
that
is
a
teacher
hasn't
like
one
to
six
or
something
that
but.
B: Typically, when you see that, somebody has taken an existing data model from a relational database, where you have to break out a separate table for every single many-to-one relationship, and they attempt to just shoehorn that into Cassandra. But the types that you can represent in a Cassandra column are so much richer, and you have wide rows and things like that. Typically, what you'll see when you migrate from a relational database to Cassandra is that your overall table count shrinks.
B: So that would be my first inclination: somebody hasn't gone through that process, and they're probably doing more queries than they need to. When I see underscore-laden table names, I get a little freaked out, because it's clear that it was just an export of a relational database. Then the next bad thing you'll see is where people take multiple tables of data and then join them in memory in the application.
B
And
it's
just
that's
a
clear
misfire
on
the
data
model
where
I've
seen
a
lot
of
conf
amelie's
is
when
you
have
a
lot
of
indexing
of
your
building
and
that's
a
cool
use
case
and
doesn't
call
families,
but
yeah
I
really
did
sad
with
Matt
scenario
there,
where
I
see
people
just
dumping,
a
relational
database
and
Cassandra
and
hope
for
magic
unicorns.
It's
just
not
really
works.
A: Pennies on the dollar. Okay, next question. This is from mpojoins; apologies if that's not how you pronounce your name, but I'm going to go with it. And I think, again, this is a little misperception that needs clearing up. So: how can Cassandra work with, let's say, Hadoop? They're actually saying Cloudera or Hortonworks. So how does Cassandra work with Hadoop, and with their examples, Cloudera or Hortonworks, given that Hortonworks uses HBase? What are the best practices?
C: Well, as with earlier questions, there's a lot of "it depends" around that one. One of the use cases I've personally worked on that was very successful with Cassandra and Hadoop together was at Ooyala, where there was a bunch of log files that needed to be processed. They would process those in Hadoop, and then the aggregates would get written up to Cassandra. That's a really sweet use case of where the two come together nicely.
C: That works if you just use the CQL client from your MapReduce job to write your results out in the final stages. The other form, you know, where you want to do things like Pig and Hive over your Cassandra data, it depends; it gets more nuanced, and it depends on what you're trying to do. There's the Hadoop support that ships with Cassandra, but there's also the Hive and Pig side, especially from DataStax, that is improving a lot, especially in future releases. Matt or Patrick, anything to add?
B
This
pretty
much
spot
on
I
mean,
like
you
know,
if
you
need
to
import
data
from
Cassandra
into
your
Hadoop
cluster
there
you
know,
we've
got
a
bunch
of
classes
which,
in
years
that
utilize
the
streaming
protocol
that
usually
and
efficiently
it
did
it
in
and
out-
and
you
know
if
you
need
to
do
MapReduce
or
do
analytics
over
the
data
13
Cassandra.
Well,
we
have
a
product,
we
can
tell
you
and
call
the
Mystics
enterprise
and
it
works
really.
Well
it
does.
B: Whoa, Jim, oh no, don't do it. Make sure you're on the same version on everything. I mean, your DataStax Enterprise version usually corresponds with a version of Cassandra underneath, and, for instance, you're not going to mix a 1.1 and a 1.2 cluster together; it's just not going to happen. So it's best really to keep on the same version to eliminate any kind of unexpected issues that could happen. You just need to be on the same version. I don't know why you would want to mix yours.
C: So if your organization is particularly nervous about upgrades, or maybe you personally are, you can have two data centers in the same cluster, and you can upgrade one data center. Try to only do this over minor releases, but do that minor upgrade on one data center and then leave it to sit for as long as it takes for you to be comfortable that that release is good, and then you can do the other side. But you definitely shouldn't leave the versions mixed for more than that period of time.
B: In an incremental upgrade, sure, you're going to be stuck in that situation for a while, but leaving it that way for any particular reason, I wouldn't recommend. I mean, if you're talking about an incremental upgrade, then yeah, of course, you're going to go from, say, 3.1 to 3.2, or step through a couple of minor versions in a row; that sure makes sense.
A: Hey, you guys remember Vijay's question about the data model, flow ID, time series? You do? Good, because he has a follow-up. We need to generate a lot of top-N reports based on the flow information stored as described above, that is, the flow_id data model question. Should we use CQL or Solr for this purpose, as DataStax Enterprise supports both?
B: It's a question of where you want to do the work; yeah, either can do this. So it's: where do you want to spend the time, essentially performing the work to get your top-N? With Solr, you're indexing the information as it comes into the system, so you're putting an additional cost on all inserts to the column families that are indexed. So you have to price that in. If you do it on the analytics side, then you're doing the work after the data is in the system, and you're going to have slower response times for those queries, whereas with Solr you get immediate response times. So you're doing the same amount of work; it's just whether you do it up front and get immediate responses, or you do it on the back end and get slower responses. So it's up to you and what your use case dictates.
A: This one's from Malay, from my health care provider, Blue Cross Blue Shield, here: can you talk about HBase versus Cassandra? And, you know, obviously we are Cassandra experts, so we have much more insight there, but we can probably talk in some generic terms. Earlier we talked a little bit about the architecture. Who wants to take a crack at HBase versus Cassandra?
C: I have run it in production, and the big things that stand out, from when we evaluated it a couple of times when I was at Ooyala: HBase is incredibly complicated to set up and operate. You have to have ZooKeeper, you have to have HDFS; pretty much all of Hadoop has to be there, and then you have to set up all the components of HBase on top of it, and there are a lot of knobs to turn and a lot of moving pieces.
C
So
the
big
contrast
there
is
that
cassandra
has
eight.
The
nodes
are
all
identical
there
they're
homogeneous
arm,
whereas
HBase
has
a
bunch
of
other
moving
pieces
to
think
about
when
you're
trying
to
figure
out.
What's
going
on
so
from
an
operational
perspective
that
that's
always
been
really
important
to
me,
there's
obviously
a
higher
level
different
design.
What
the
other
guys
go
and
do.
B: I know HBase is more of a Hadoop-first type of implementation, and it works well in those environments, but programmers, I think, would probably prefer to use something like Cassandra, and I think that's where we're headed, and I think that's probably a good reason to consider Cassandra. Yeah, I mean, it's like: HBase is more complicated, it's got single points of failure, the cross-data-center replication isn't as strong, the APIs aren't great, the data model is sort of a pain in the butt, and it's twice as slow.
C: Yeah, the Facebook example comes up a lot; we hear it a lot. What's interesting is, when you talk to the people at Facebook, the reason why they stuck with HBase was that, at the time Cassandra was being considered, they were already deep into HBase; they had years and years of investment in it. And so that's why that went down. But it's interesting to hear the amount of effort that Facebook has invested in making HBase serviceable for their needs.
A: Okay, thank you very much. And yes, with any questions like that, please bear in mind that we are, you know, Cassandra experts first; we're always going to come at it from that standpoint, but I definitely recommend asking the folks who are more familiar with HBase for their views as well. Okay, so here we go, another question; I think we probably have time for two more. Since Pig and Hive have similar capabilities, do you recommend or prefer one over the other to use with Cassandra, and why?
B
So
the
they're
both
going
to
do
the
same
thing
hive,
is
going
to
be
a
little
bit
more
friendly
to
your
e
I,
guys
that
aren't
necessarily
coming
from
programming
background
pig.
It
has
a
DSL,
it's
a
little
bit
more
extensible
and
is
has
a
little
more
in
terms
of
capabilities
and
extensibility
that
a
programmer
or
somebody
with
life
programming
experience
I
can
tap
into
that's.
The
only
real
difference
isn't.
B: Yeah, I haven't heard of a difference in speed. I think it's a question of who wants to be writing it, the operations take: you know, there are a lot of people who know Pig for its semantics, and then Hive, of course, is like SQL, so that may be the easier approach if you're coming from that direction. But the speed question hasn't come up for us.
A: Okay, we've got another comparison question here; let's hit it really quick. You're probably going to talk about the document-oriented approach, but it is Cassandra versus MongoDB. Before you guys answer this one, it's from George: George, if you go out to Planet Cassandra and look at the five-minute interviews, we have dozens of examples of people migrating from MongoDB to Cassandra, and it is always when they hit either a scale issue, where MongoDB cannot keep up, or operational complexity, which can get very severe with MongoDB. But anything else to add there, folks?
C: The big difference is: if your data is important to your business, then you shouldn't be using MongoDB. I mean, yes, I work for DataStax, but if you follow the news online, especially the recent stuff, the writing on the wall is pretty clear: at a certain point it's going to turn on you and lose your data. I mean, that's just its track record in the industry. You know, use anything else, actually, if Cassandra is not a good fit for your application.
C: Postgres has hstore, which a lot of people actually move to from Mongo; that's a different animal altogether.
A: Great, and with that, thank you very much, gentlemen, for joining us today and answering so many questions. We are off for the holidays; well, we're not actually, but we are off from webinars until January 23rd and "Cassandra: Back to Basics." In the meantime, I know you all want to get going with your Cassandra training. We offer free online training, Java Development with Apache Cassandra, at the link on your screen right now, at DataStax Academy.