Apache Cassandra Meet Up Presentations, 1 Sep 2011

From YouTube: Apache Cassandra: noSQL, Yes to Scale

Description

In this presentation, SriSatish Ambati is going to talk about The Apache Cassandra Project, a highly scalable second-generation distributed database.

He'll cover:
- Use cases
- Why Cassandra
- Brisk and Hadoop
- FUD: Consistency
- Facebook and Cassandra
- Community, Code, and tools

A

Without further ado, let's look at what Apache Cassandra has to offer. NoSQL is really about knowing your queries and your applications, and obviously a lot of good introduction to NoSQL has already happened. If you take one slide away from this whole talk, it would be this: you have to know your queries up front, and you have to know them very well. At least eighty percent of your application uses mostly the same queries, so you're paying a cost for being versatile. SQL is great.

A

It is very versatile. It gives you a lot of flexibility to change your queries on the fly, at runtime, after you've built your application. But it turns out that most applications, when they scale, end up using only a very small set of that functionality most of the time: the 80/20 rule. As a result, how your application performs really depends on a small set of queries.

A

So imagine you redesign your application, or redesign your schema, such that those queries are answered really well. That's the crux of the problem that most of NoSQL is trying to help with, and I imagine most applications will end up fitting into that space. There are some applications that would not, and those are not the ones we'll talk about. With that, I'll give a brief overview of the topics. These are just talking points; feel free to stop me and ask questions as we go.

A

It's going to be a run through some use cases, why Cassandra, and why Cassandra should be the choice for your NoSQL adventures. We'll talk a little about the use cases; the talk is obviously peppered with lots of different ones. There are more use cases in my mind, so if you have questions I'll probably answer them with a use case, and you can see whether it fits your kind of use case.

A

There are some obvious discussions whenever Cassandra comes up. Eventual consistency pops up, so we'll talk about that: why and how, which use cases fit and which do not. The number one reason for adopting Cassandra is the vibrant community around the code. The code also speaks for itself; it's very new code, so take a look at it. And there's a growing set of tools that have been built in the last few months as well.

A

It's a fast-moving project, so the community around Cassandra is definitely rich, and that's one of the reasons I got excited. Use case one: users. This is the age of millions of users, or the same users producing millions of clicks. You're trying to funnel your user: see what he did before he got to the shopping cart, see what he did next, where he went, what he tweeted, and try to connect the dots around your user. That's one common use case.

A

We see that when people move to Cassandra, they're looking to funnel and shape their users. Netflix is the common example, the big marquee example for us. They are in production, and all your user data for Netflix, for all the movies you're watching, is on Cassandra, running in real time. A couple of concepts are introduced here.

A

Your key: most of these NoSQL stores have originated from key-value stores, so being able to build a good key is actually a big part of the answer to your problems. You're representing your data, and your data is also part of your key. So you key by customer. It's very read heavy, with all your plays as you play them.

A

Plays on different devices are stored in columns as they go, and it's a long, growing row of columns. Cassandra can support up to 2 billion columns per row, and if you're not using lots of columns, then you're probably not using Cassandra's strength. At the same time, you don't want a big fat row, so that's the other thing to look at. Key by customer and movie: if a bunch of customers are stopping a movie at a particular spot, you want to know about it.

A

That's probably a problem in that video file, so again you want to see how the customers are watching. That's the other side: a write-heavy operation. You're writing all the time, and high-speed writes are another common strength that Cassandra brings to the table.
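The wide-row idea described above can be sketched in a few lines of Python. This is only an illustration, not Netflix's schema or a Cassandra client: the `ColumnFamily` class, the row-key formats and the sample values are all assumptions made for the example. It shows one row per customer whose columns keep growing as plays arrive, plus a second layout keyed by customer and movie.

```python
from bisect import insort
from collections import defaultdict


class ColumnFamily:
    """Toy column family (hypothetical): row key -> columns sorted by name."""

    def __init__(self):
        self._rows = defaultdict(list)  # row key -> sorted list of (name, value)

    def insert(self, row_key, column_name, value):
        # Every write adds one more column to the row. insort keeps columns
        # ordered by name, so timestamp-named columns stay in time order.
        insort(self._rows[row_key], (column_name, value))

    def get_slice(self, row_key, start, finish):
        # Return the columns whose name falls in [start, finish].
        return [(n, v) for n, v in self._rows[row_key] if start <= n <= finish]


# Layout 1: key by customer, one column per play event (read-heavy history).
plays = ColumnFamily()
plays.insert("customer:42", 1314900300, "movie:7 stopped at 00:41:10 on tv")
plays.insert("customer:42", 1314900000, "movie:7 started on tv")
plays.insert("customer:42", 1314903600, "movie:9 started on phone")

# Layout 2: key by customer and movie, to ask where viewers stop a movie.
stops = ColumnFamily()
stops.insert("customer:42|movie:7", 1314900300, "00:41:10")
```

Keeping the columns sorted is what makes "give me this customer's plays between two times" a cheap slice of a single row.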

A

Another common use case is time series. We see a lot of customers trying to gather data around their devices, or wanting periodic readings, even on their own applications. They want to know how their production application is doing and what the performance of the different stacks is. Twitter is a common example: Rainbird is based on Cassandra and collects statistics on different pieces.

A

Here is another common use case: time-series-style data. We see a lot of young startups, not up there among the big names, who are using it. Cloudkick was one of the earliest use cases for time series as well. Metrics: it turns out that the metrics gathered around data you've already collected are a much larger data set than the actual data. Basically you're trying to get the top 10, and you want to shape your users based on the different time series, the footprints they have left, all the data surrounding your data.

A

The stats tables end up growing fast and furious, and that's where you want to attack the problem with something that scales horizontally, without worrying about filling up your disk or about how to partition it. Partitioning comes naturally to time series data. So that's another common use case people have when they're trying to shape their end-user experience.
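One way to see why partitioning comes naturally to time series is to put a time bucket into the row key. The sketch below is an assumption for illustration, not a scheme from the talk: the `row_key` and `store` helpers, the metric name and the one-day bucket size are all made up. Each metric gets one row per UTC day, and the column name is the reading's timestamp.

```python
from datetime import datetime, timezone


def row_key(metric, epoch_seconds):
    """Build a row key of the form '<metric>:<YYYYMMDD>' (UTC day bucket)."""
    day = datetime.fromtimestamp(epoch_seconds, tz=timezone.utc).strftime("%Y%m%d")
    return f"{metric}:{day}"


def store(table, metric, epoch_seconds, value):
    # One row per metric per day; the column name is the timestamp itself.
    table.setdefault(row_key(metric, epoch_seconds), {})[epoch_seconds] = value


table = {}
store(table, "web01.cpu", 1314835200, 0.42)  # 2011-09-01 00:00:00 UTC
store(table, "web01.cpu", 1314835260, 0.47)  # one minute later, same row
store(table, "web01.cpu", 1314921600, 0.39)  # 2011-09-02 00:00:00 UTC, new row
```

Because each day lands in a different row, rows stay bounded in size and spread across the cluster by key, without any manual sharding decision.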

A

So these are the common pushes towards NoSQL, and some of these users have chosen Cassandra. Here's a quick compendium of why. Why Cassandra: it's operationally simple.

A

Cassandra inherits a lot of its distribution model from Dynamo, Amazon's large-scale store, and has a schema from Bigtable. For the most part, the distribution model is peer-to-peer: every node is a peer. Peer-to-peer has historically proven to be more resilient at scale. If you look at DNS, it's a very peer-to-peer store of IP information, and it has scaled for the last many decades. Having no central point of anything makes Cassandra very resilient to a lot of things.

A

There's a quote one of our customers made: when an ops guy picks which NoSQL store to use, he eventually gravitates towards Cassandra. That's been the case for most of our customers. They spend less time on the operations of running a distributed data store. When you're spending more and more time sharding your MySQL store, that's when you've arrived at big data, and that's when you're ready for Cassandra.

A

A little structure. Cassandra is arranged in a ring-like structure, and each one of those represents a node. Every node has a commit log, which is the classic database concept of write-ahead logging: basically sequential, append-only logs. So when you send a write with, for example, three replicas, then depending on what consistency level you've applied to it, you have three nodes participating in that write. The request goes in and finds the key.

A

The coordinator node finds the nodes that are participating, stores the write, and at the same time fires it off to the other two asynchronously. Sequential writes are very fast, and what that leaves is a very simple write model that also does replication. It happens while you're doing the write. Replication is not postponed until after the fact, after you've built your system and have your data on it. All your data is being replicated in the write.
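The write path just described (find the replicas on the ring, append to each commit log, count acknowledgements) can be sketched as below. This is a toy model under stated assumptions, not Cassandra's implementation: the `Node` and `Ring` classes, the MD5-based token and the one-token-per-node layout are all hypothetical, and replicas are written one after another here, whereas a real coordinator would send to them in parallel.

```python
import hashlib
from bisect import bisect_right


class Node:
    def __init__(self, name):
        self.name = name
        self.commit_log = []  # append-only, sequential
        self.memtable = {}    # in-memory view of recent writes

    def write(self, key, value, timestamp):
        self.commit_log.append((key, value, timestamp))  # log append comes first
        self.memtable[key] = (value, timestamp)
        return True  # acknowledgement


class Ring:
    def __init__(self, names, replication_factor=3):
        self.rf = replication_factor
        # Each node owns one token; sorting by token forms the ring.
        self.tokens = sorted(((self._token(n), Node(n)) for n in names),
                             key=lambda pair: pair[0])

    @staticmethod
    def _token(text):
        return int(hashlib.md5(text.encode()).hexdigest(), 16)

    def replicas_for(self, key):
        # Walk clockwise from the key's position and take the next rf nodes.
        tokens = [t for t, _ in self.tokens]
        start = bisect_right(tokens, self._token(key)) % len(self.tokens)
        return [self.tokens[(start + i) % len(self.tokens)][1]
                for i in range(self.rf)]

    def write(self, key, value, timestamp, required_acks):
        # Replication is part of the write itself, not a later step.
        acks = sum(node.write(key, value, timestamp)
                   for node in self.replicas_for(key))
        return acks >= required_acks


ring = Ring(["node1", "node2", "node3", "node4", "node5", "node6"])
ok = ring.write("customer:42", "played movie:7", timestamp=1, required_acks=2)
```

After this single call, three of the six nodes hold the value in their commit logs, which is the sense in which the replication cost is folded into the write.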

A

It's an amortized cost of replication, so you're getting a lot of distribution out of the box. What does that mean? Why do I care? Back in the day, only the financial services guys would have highly available, multi-data-center uptime. Now anybody can get it. Anybody who's able to have two nodes in two different availability zones will get that kind of availability, whether on EC2 or on your own machines.

A

You can do rack awareness within your own data center if you want. All that comes at the very inexpensive cost of a single write, since every write replicates. Another example here shows two different data centers, DC1 and DC2. EC2 now has multi-region, so you can use that on EC2 as well. When you read, you don't necessarily have to cross the boundaries of the data centers. What's important to know is that you can actually separate your write performance from your reads, so you can continue to answer your queries locally.

A

So you get that performance; you don't pay for it on the reads. Has anyone here deployed on EC2 and hit the outage on April 21st? Most people kept watching Netflix, and Netflix was running on AWS. Netflix was using the multi-data-center, multi-region, multi-availability-zone features of Cassandra, and other Cassandra customers survived as well. So let's switch gears to fast, durable writes. The reason I got excited about Cassandra was that I ran a benchmark.

A

It was the YCSB cloud benchmark. I looked at the results and thought: maybe the numbers are wrong, maybe the digits are off, maybe the data is not in there. Coming from Coherence and Oracle and other big stacks from the J2EE world, I looked at this number and thought, this doesn't make sense. The writes are super fast. You have to run it to know it, and the data is in there. The crux of that is the commit log that we saw earlier.

A

It's append-only and fast, with no seeking, so cheap, inexpensive disks can get you fast performance, and it's not SSDs that are delivering it. In fact, even in the cloud, ephemeral disks, which are the local disks, will get you better performance than the more expensive ones. In some sense writes are the workhorse of Cassandra. You can do a lot of writes, maybe orders of magnitude more.

A

So single-digit-millisecond writes are common for our customers in production. Then fast reads, where you're not paying that cost at write time; we'll double-click on the reads part. Writes: single-digit milliseconds, append-only. Reads: the other interesting nuance of Cassandra, which attracts a lot of talent, is that you pay your repair costs while you're reading. When you're reading data, you're actually repairing the entire distributed system in some sense.

A

Just like you replicated while you wrote, and amortized the cost of replication when you wrote, the reads pay part of the cost of housekeeping: they go through the system, see if any data is off, and fix it. So you're paying the cost of repair while you're reading. It does look like you're paying a cost up front. Is that a good thing? Maybe not at first glance.

A

It turns out the reads are fast, not super slow, and the repairs keep your data sane most of the time. So the second interesting nugget from the Cassandra world is that amortized repair actually pays off in the long term. Your data gets to be in better shape, and any one node is not too far off from the others. There are also a bunch of caches, key and row caches, and kudos to the HBase team, which implemented off-heap caching.
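To make the read-repair idea from this part of the talk concrete, here is a minimal sketch. It assumes each replica is a plain dict mapping a key to a `(value, timestamp)` pair and that the newest timestamp wins; the function name `read_with_repair` is hypothetical and this is not Cassandra's actual code path.

```python
def read_with_repair(replicas, key):
    """Read `key` from every replica, return the newest value, fix stale copies.

    Each replica is a dict mapping key -> (value, timestamp).
    """
    answers = [(replica, replica.get(key)) for replica in replicas]
    found = [cell for _, cell in answers if cell is not None]
    if not found:
        return None
    newest = max(found, key=lambda cell: cell[1])  # highest timestamp wins
    for replica, cell in answers:
        if cell != newest:
            replica[key] = newest  # repair the stale or missing copy
    return newest[0]


a = {"k": ("v2", 20)}
b = {"k": ("v1", 10)}  # stale copy
c = {}                 # missed the write entirely
value = read_with_repair([a, b, c], "k")
```

The read returns the newest value and, as a side effect, leaves all three replicas holding it, which is how the repair cost gets amortized over ordinary reads.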

A

We also have off-heap caching in Cassandra, so you get the benefit of escaping from JVM garbage collection problems. It's JNA-based off-heap. Indexes: secondary indexes are in the new Cassandra world. But for the most part, it's a denormalized way of looking at things. Don't expect joins; joins are not there. So denormalize your schema to fit your queries. It goes back to slide one: at some point you're paying the cost of materializing your data up front, your schema up front. So, clients.

A

The preferred client these days is CQL. Thrift is the serialization format within Cassandra, and it is still something customers use, because I have seen a lot of Python and PHP in our customer base, and a lot of roll-your-own types from Scala and Clojure. But Hector leads the pack among the Java clients; the largest number of clients using Java use Hector. Pelops is a simple way to get your hands dirty quickly.

A

You can go and roll your own quick Cassandra client. There are Ruby and Clojure clients and others, and we have Scala as well.

A

Use case number three: Hadoop. It turns out that as we started helping a lot of Cassandra customers scale, we would help them get up to speed quickly and see what was going on, and eventually they'd tell us the whole story. "We want to improve the performance of reads" is a common request that would come up. We'd look at it and see a Ruby client trying to pull data from Cassandra or push it into Cassandra, and looking further, that client was actually reading from a Hadoop cluster.

A

It would be a Cloudera Hadoop server or plain Apache Hadoop. You'd see that Hadoop was siphoning a lot of data from log files and running a bunch of jobs, and then they were storing the results in Cassandra and serving that data from Cassandra, whether for analytics or for real web apps. This turned out to be a pretty common case for our customers, and that led us to invest time in building what's called Brisk. Brisk is a truly peer-to-peer Hadoop, and we'll see where that gets us.

A

Brisk is essentially Hive plus HDFS plus Cassandra. That's Cassandra's entrance into the Hadoop space, where we're trying to see how we can solve problems from there. The name node has been a problem in Hadoop and HDFS for a little while, where you're unable to fit all the inodes. I'm expecting Todd to go deeper into this and explain more of the HDFS and Hadoop internals.

A

Anyone who has seen a Hadoop distribution installed will see that they're limited by the size to which the name node will scale. So we saw an opportunity to make this all peer-to-peer. In Brisk, we basically took HDFS and laid out the core, the inodes and the blocks, as just basic tables, and any table essentially scales peer-to-peer across all the nodes in Cassandra.
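The idea of laying a file system out as two ordinary tables can be sketched as follows. This is a simplified illustration, not Brisk's actual layout: the table names `inode` and `sblocks`, the block-id format and the tiny block size are assumptions chosen for readability. One table maps a path to its ordered block ids, and the other maps a block id to its bytes.

```python
BLOCK_SIZE = 4  # bytes; tiny on purpose so the example is easy to follow

inode = {}    # path -> ordered list of block ids (the file metadata)
sblocks = {}  # block id -> bytes (the file contents)


def put_file(path, data):
    """Split `data` into fixed-size blocks and record them under `path`."""
    block_ids = []
    for offset in range(0, len(data), BLOCK_SIZE):
        block_id = f"{path}#{offset // BLOCK_SIZE}"
        sblocks[block_id] = data[offset:offset + BLOCK_SIZE]
        block_ids.append(block_id)
    inode[path] = block_ids


def get_file(path):
    """Reassemble a file by following its block list."""
    return b"".join(sblocks[block_id] for block_id in inode[path])


put_file("/logs/app.log", b"hello brisk")
```

Since both structures are just key-value tables, the metadata no longer has to live on one name node; it can be spread across the ring like any other rows.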

A

Here you see a picture where the nodes with the elephants inside are Brisk nodes. You can continue to run the rest of your cluster as a Cassandra cluster. This was three months in the making and is now in adoption at several customers. It's a very good play in the BI space, and people trying to use low latency and batch together are working with it.

A

We'll double-click on some of the use cases there. Column families are essentially Cassandra's, or Bigtable's, way of talking about tables; you'll hear about column families more as you read through this space. Essentially we took inode and sblocks and made them real tables, and that puts them on Cassandra, peer-to-peer. For low latency you have Cassandra data center nodes, and for batch analytics you use Brisk data center nodes. What does that do for me as an application provider?

A

You're now putting logs through Hadoop into a cluster. As far as you know, it's a Hadoop cluster. That data becomes available for you to query through Hive, or through other basic operations, so that you can make small tables that serve real-time or near-real-time, low-latency data to the rest of the world. This solves a problem our customers were working hard on: connecting the dots between the different pieces of the NoSQL space.

A

The NoSQL space is the tail end of a really big space, which is the Hadoop space. The Hadoop space is basically all your machine-generated data. Most of your machine-generated data today is going to HDFS. That's your store.

A

That's the true high-scale store. And at the tail end of that, once all this data has produced a small table, or a small set of tables, that you want to put out, those tables are sitting on Cassandra and you're serving them off of Cassandra to the rest of the world. That's a typical use case that we see.

A

Of course there is the flip use case where you want to put petabytes of data in Cassandra, and that happens too. But Hadoop itself is a market that has taken off, and it's something we are aware of and paying attention to. All right, a pause in the talk for some fun. Let's look at the FUD surrounding the Cassandra space, and we'll look at some real flaws as well. Consistency: people talk about consistency and the CAP theorem, the paper that Brewer put up, and Nancy Lynch.

A

She eventually proved it. She was also involved with Leslie Lamport's work; she reviewed Lamport's paper on Paxos, which is another interesting tidbit from back in the day. Anyway, consistency. You hear about R + W > N when people talk about the CAP theorem. What are R, W and N? R is the number of reads, W is the number of writes, that is, the number of copies of a write that you make sure have to agree on a particular value, and N is the total number of replicas.

A

Given that, let's look at how this works. What the rule states is: if your read consistency plus your write consistency is greater than the total number of copies inside your cluster, you have consistent data. Now, how does this really work? Let's look at an Oracle two-node failover replication scenario. How does a two-node Oracle stay consistent?

A

It reads from one node, so it asks for any one of the nodes to answer, and every time you write to one node it makes sure it also writes to the second node. So W equals two, and the total number of copies is always going to be two. If you made a third copy and still only wrote twice, then R + W would not be greater than N, and so it would be inconsistent.

A

So if you made a backup last night and did not write to that backup, it would be behind today's data. The reason R + W > N holds, the reason Oracle replication works, even for the big Oracles, is that the total number of nodes is two and you always made sure that R + W is greater than N. This is the simple logic behind eventual consistency. It's a simple logic, and what we're saying is that it's not going to bite your data. Your data is not getting inconsistent.
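The R + W > N rule and the Oracle example above can be checked with a short script. This is an illustrative sketch with hypothetical function names: `is_consistent` applies the rule directly, while `overlaps_always` brute-forces the underlying reason, namely that every possible set of R replicas read shares at least one member with every possible set of W replicas written.

```python
from itertools import combinations


def is_consistent(r, w, n):
    """The rule from the talk: reads see the latest write when R + W > N."""
    return r + w > n


def overlaps_always(r, w, n):
    """True if every choice of r replicas intersects every choice of w replicas."""
    replicas = range(n)
    return all(set(reads) & set(writes)
               for reads in combinations(replicas, r)
               for writes in combinations(replicas, w))


two_node_oracle = is_consistent(r=1, w=2, n=2)   # read one, write both
unwritten_backup = is_consistent(r=1, w=2, n=3)  # third copy never written
```

With R = 1, W = 2 and N = 2 the rule holds, and adding a third copy that is never written breaks it, matching the stale-backup example.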

A

There is an inconsistency window for every complex system, and that's a different matter. The eventual consistency model has worked. DNS is the most popular eventually consistent system, and it has scaled for us for years. We like to think about this more as tunable consistency. Tunable consistency gives you flexibility: you can program consistency for the first time. All along we paid the cost of being always consistent, all the time, for every little application. Take, for example, the geocode of this particular site.

A

It's not going to change in an eon, so why do I have to make sure it's heavily locked? It's immutable data. So let me read something that's not strictly consistent, and not pay the cost of consistency for it. For the first time you actually have an application paradigm that allows you to program this. And yes, there is a cost with that, and that's the big pushback you get.

A

The pushback is that it's expensive to program while thinking about all this, but that's what some of our customers are actually gaining from: being able to say, I don't need to lock these pieces, and these pieces are fine with a consistency level of one, or a consistency level of quorum for high consistency. The Cassandra programming model makes all the levels of consistency available.

A

Cassandra itself has changed a little in the last few releases, and we have added consistency levels from zero to all, for reads and mostly for writes. So you can enforce a very high level of consistency, and you're trading off high availability. That's the trade-off.

A

If you enforce, for example, that all the nodes must acknowledge every write, then even if one node is down you're paying the cost of not being available. That's the trade-off, literally, and if nothing else, hopefully this part of the talk gets that across.
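The consistency-versus-availability trade-off above comes down to simple arithmetic on acknowledgements. The sketch below is a hedged illustration: the helper names `required_acks` and `can_write` are made up, and only three of the consistency levels mentioned in the talk (one, quorum, all) are modeled.

```python
def required_acks(level, replication_factor):
    """How many replicas must acknowledge a write at a given level."""
    if level == "ONE":
        return 1
    if level == "QUORUM":
        return replication_factor // 2 + 1
    if level == "ALL":
        return replication_factor
    raise ValueError(f"unknown consistency level: {level}")


def can_write(level, replication_factor, live_replicas):
    """A write succeeds only if enough replicas are up to acknowledge it."""
    return live_replicas >= required_acks(level, replication_factor)


# Three replicas, one of them down:
all_with_one_down = can_write("ALL", 3, live_replicas=2)
quorum_with_one_down = can_write("QUORUM", 3, live_replicas=2)
```

With one of three replicas down, a write at ALL is refused while QUORUM and ONE still succeed, which is exactly the availability being traded for stricter consistency.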

A

If you want a very highly available system, you write lots of copies and make sure you read from any one of them, or check that any two of them agree on the data, and you're fine. You can have pretty highly available data with that. That's the last piece there. There is a ton of confusion, especially because of R, W and N: N is usually confused with the total number of nodes, which is different.

A

You can have a hundred nodes and still have N, the number of replicas, be two. That's another nuance. The way it was introduced in the original Dynamo paper has definitely led to a lot of discussion on this topic.

A

Another common question I get asked is: why is Facebook not using Cassandra anymore? I just typed 'Facebook and Cassandra' on Quora, and you see about a dozen questions that all pretty much ask the same thing: which application is using it and which is not. I actually spoke recently to the team that wrote Cassandra at Facebook and asked why. It turns out the removal was quite recent.

A

They removed the application on Cassandra, Inbox Search, only a couple of months ago. Inbox Search was the original application running on it, and it did scale for them. The crux is that it scaled for Facebook from 100 million to 500 million users. That's a true story; it's not made up.

A

So if you're running into scale problems, remember that your context may not be the context of Facebook. It did scale for them, they were using it, and we were all using it as part of that. The average NoSQL deployment is not nearly that size. It's very small; the one we usually see is 12 nodes.

A

So it's not the problem that you are going to face going out. Not all of us are in that case.

A

That also gives a hint of another use case, which is search. Cassandra is used in conjunction with Solr as Solandra, an interesting product that was baked in the labs by Jake, whose GitHub account has it. They're essentially able to run a Solr store on Cassandra. You can store in Cassandra and get indexes at scale, so you get the same kind of search interface that you got in plain Solr and Lucene, and use it on, essentially, Cassandra.

A

That's another common use case. Now, a myth: eventual consistency is harder to program. That's true, actually, but it's also a flexibility you never had. ACID's C was always handed to us, whether it was MySQL or Oracle, and we were always paying for that.

A

But if you put yourself in the place of a person who's sharding MySQL for the first time, you understand that you're inventing eventual consistency while you're doing that. Some of our customers chose Cassandra when they were filling up a shard faster than they could land a good new shard. But the crux of the argument I make is that the average customer has mostly immutable data. If you're using Hadoop, you wrote the log file last night; that data is already immutable.

A

It's not changing, so you have mostly immutable data, and complex systems at scale are only half consistent anyway. If you had a big GC pause on one of the nodes, that node was behind on data. That was true for WebLogic, so it was true for the previous tier as well. Other miscellaneous myths about Cassandra are wrong, like: you probably have partial rows, or you have data loss.

A

No. We have a commit log, it's per row, and it's sequential and append-only. If you lost the disk, you still recover from that with the commit log. We have a test, and some of our customers have also tested it: they kill the disk and make sure everything is fine.

A

I was actually with a customer who, around Christmas, had a big customer of their own jump onto their analytics cluster, which is not very large. One node of the cluster filled up and they didn't know what to do. I said, just don't worry. We migrated everything onto a bigger disk system and then doubled their setup, and it really did not stop them from running their app.

A

That node was out of action: one of the six nodes was down, and the remaining five kept performing fine. Essentially that's what you're paying for. When you're in the thick of it and you need to scale, that's when you would thank Cassandra. Three more reasons for using Cassandra before I run out of time.

A

One is tools. A bunch of AMIs have come out; the DataStax AMI is the preferred one. OpsCenter is a DataStax tool as well, which allows you to look at your data, your app and your cluster, and perform operations on it. It's kind of a JConsole with a very clean JMX presentation. Every little nuance of Cassandra has been exposed through JMX, so you can look at all these metrics.

A

I worked on JBoss when it was pre-1.0, and when I saw the JMX richness of JBoss I was like, wow. Cassandra has made me say wow again, because it tracks everything through JMX. So it's a very well-instrumented project with its own metrics. AppDynamics has a pretty cool tool around Cassandra and other apps as well, so definitely check that out.

A

Another big reason that attracts people to Cassandra is that it's beautiful code. It's new code, and it's a lot smaller than you think. When I started looking at it, it was about 75,000 lines, and with the last 0.8 version we're close to 90K as of last night. It uses the latest Java, so most people here who come from Java land will automatically be able to read it. Looking through every piece of it, it uses mostly concurrent collections, such as skip lists, and you'll see that at the core of the architecture.

A

You see very interesting collections. If you're an avid reader of code, you'll love Cassandra. It uses annotations, Bloom filters, Merkle trees, lots of interesting data structures. These are real hard problems, guys. Distributed counters are a very hard problem. People have done that before, but you get to invent them again, and it's happening now, with the product getting to be 1.0 soon. It uses non-blocking I/O and is staged. You can run nodetool tpstats.

A

You will see the stages, where each of these thread queues stands. It's a staged architecture, so you can see a lot of interesting nuances there.

A

The current focus is on counters and CQL, which is trying to make it easy for people to get in. One of the biggest critiques of Cassandra, and a somewhat valid one so far, is that it's very hard to get into using it from an end-application standpoint. A lot of gaps exist there and are being fixed, and we're trying to make a simple client so you can just use it. The flip side is true for MongoDB, where the client is dead simple.

A

You can use it right away, so credit where credit is due. Cassandra is now focusing on making that dead simple for end users, and there's a lot of operational smoothing and hardening happening before 1.0, and more of that will follow as we go forward. The community behind Cassandra, as I mentioned, is the biggest reason to choose and learn Cassandra and use it as your NoSQL store. It's very robust. The IRC channel, #cassandra, is active

A

24x7; I've not seen it empty at any time of the day. Jonathan Ellis, a founder of DataStax, is one email away, and usually his email arrives much faster than my own boss's email. These are really passionate engineers, a young set of engineers, behind this project: DataStax, as well as the people who did the original work on it, and a bunch of engineers from independent consultancies, startups such as Reddit and others from San Francisco and the rest of the world, and large companies such as Rackspace, Twitter and Netflix. All of them are behind this.

A

It's not a project that's going away. Come join the effort; that's the one thing I would say before I leave. Here are other trends that we see. Job trends and downloads speak well: job trends for Cassandra are up, even if the scale of the numbers is off. Another use case is "first NoSQL, then scale": your first NoSQL store may not be the one you're going to be using for scaling.

A

Netflix actually characterized their move: going from RDBMS to NoSQL was one long year, and once they got there, moving to Cassandra was a week. Getting to NoSQL is a much harder developer journey, because you have to think queries first. You have to think about the queries and how the data is growing, and on that journey you have to give up a lot of the goodness of RDBMSes to move to NoSQL.

A

That's the bigger journey. So a common pattern we see is that people go to SimpleDB, which is easy, and then move to Cassandra; Netflix is an example there. Or you go to MongoDB, and I just converted a MongoDB customer, and then to Cassandra.

A

So it's common: it's easy to get in there, and then, when you really need to scale and you're pulling all the operational pieces together by yourself, when you're rolling them on your own, that's when you're moving to Cassandra. So job trend two is Cassandra, and job trend three is Cassandra and HBase. So HBase is definitely on the rise, and so is Cassandra. So it's good. And actually there's a job trend four, which I did not put up, but it's the total sample of NoSQL.

A

It's these three guys, and they are playing very well. It's a healthy environment for the space, and we need that. There was one that was coming up [inaudible], UnQL.

A

We want a standard, right? So we want to standardize all these, and then we create a newer version of QL. So [inaudible], but there are different flavors of NoSQL, and you'll hear about the rest of them, and there is healthy learning that's happening within the system. We learned a lot from HBase, and HBase is picking pieces from Cassandra, and so all these things are not static. They're changing as we speak.

A

If you google any of these, some of those blogs on 0.6 are no longer valid for the 0.8 version of Cassandra, and some of the pieces about HBase or NameNodes will not be true in the future. So things are changing as we speak, and these slides are current, but they will be dated too. I see a future where all of these guys will be robust enough to become your database of choice. So, in summary, Cassandra is a high-scale, peer-to-peer distributed database. Questions?

A

The question is about Apache CouchDB and Cassandra. Are you asking for the difference between Apache Cassandra and Apache CouchDB in terms of how Apache treats them? I'm not the Apache spokesperson for this, but I've seen real fairness on their part so far. I don't see a reason one should be preferred over the other when, I mean, in Apache's ways they're mostly peers. There's no reason to be worried about losing or gaining from an Apache standpoint.

A

Historically, Apache placed community over code, and they stuck to that, and there's a community around CouchDB; it's going to be there for a long time. So that's probably not the reason to go either way. But the main reason is what you said: one is very distributed and peer-to-peer; the other one gets you up and ready with NoSQL, and that's good. So they play their roles very well. Yeah, so I think that's more of a good panel-like question.

A

It would be great to have the other speakers finish, and we could get back to the same question, if that makes sense. The question is: how do these different players work? And I think Cassandra is clearly designed to make sure your data gets durably written, and written across lots of nodes. So you don't need to worry about availability, and it's partitioned, and so it's clearly focused on making availability and partitioning the key focus.
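As a rough illustration of "partitioned and written across lots of nodes", here is a minimal sketch of a token ring with replication. It is a toy under stated assumptions, not Cassandra's implementation: the node names, the `Ring` class, the use of MD5 and the replication factor of 3 are all hypothetical choices made for the example.

```python
import bisect
import hashlib


def token(key: str) -> int:
    """Map a string to a position on the ring (MD5 is used only for illustration)."""
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)


class Ring:
    """Toy token ring: one token per node, replicas on the next nodes clockwise."""

    def __init__(self, nodes, replication_factor=3):
        # Never ask for more replicas than there are nodes.
        self.replication_factor = min(replication_factor, len(nodes))
        self.entries = sorted((token(n), n) for n in nodes)
        self.tokens = [t for t, _ in self.entries]

    def replicas(self, row_key: str):
        """Return the nodes that would hold copies of `row_key`."""
        # First node clockwise from the key's token, wrapping around the ring.
        start = bisect.bisect_right(self.tokens, token(row_key)) % len(self.entries)
        return [
            self.entries[(start + i) % len(self.entries)][1]
            for i in range(self.replication_factor)
        ]


ring = Ring(["node-a", "node-b", "node-c", "node-d", "node-e"])
owners = ring.replicas("user:42")  # three distinct nodes hold this row
```

Because every row lives on several nodes, losing one node still leaves copies to read from, which is the availability property being described.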

A

So if you're a small website, you don't have to be a Netflix or a big, large company to be using Cassandra. We have a lot of small startups actually using it. If you're on EC2 and you don't want [inaudible] the East; even today, there was an EC2 outage. So if you're worried about uptime on your EC2 setup, you should be thinking about Cassandra. So it's for when you can't fit things in memory on one box, so as John.

A

So with big data, the definition we get is: it's big if it's not fitting in today's memory, all in memory. And you can fit all the users' data; it's a billion rows.

A

6 billion rows can actually fit in memory on one box these days, but once you start mining that data and creating data around it, that's when data falls off one node. So if you're doing multi-node data stores, and you can't pay the big bucks for Exadata and the big systems, then your choice is having to partition it, having to run on multiple nodes, and that's the context behind NoSQL, essentially.

A

So the question is: at what point do I move from Mongo on to Cassandra, right? It's a much longer, nuanced question; apologies for chopping it. So Mongo is dead simple to program, and, I mean, even I love that it's an interface that's very simple. It speaks like JavaScript: you write it and it's there, right? It's JSON-style tables, so everything is JSON. So you understand that as a JavaScript developer; you can see what your app looks like.

A

So it's a very good programming world for end users. What happens is when it actually goes to a larger-than-one-node setup, when you start putting many nodes on it, when your data scales. Remember the whole reason we were coming to NoSQL, throwing away all the goodies that Oracle built for us, or IBM built for us, I mean all the big databases, right? That's the database theory of 10, 20 years. We threw all of that off, as well as the caching vendors, because we want scale, right?

A

So the only reason to come to NoSQL, or the prime reason to come to NoSQL, is scale. And when it scales, you'll see the whole replication, the whole read repair, and being able to get a large system to agree on a few queries, that whole nuance of maintaining it.
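The read repair idea mentioned above can be sketched in a few lines. This is a simplified model under stated assumptions, not Cassandra's code: replicas are plain dicts, `Versioned` and `read_with_repair` are hypothetical names, and every replica is consulted on every read.

```python
from dataclasses import dataclass


@dataclass
class Versioned:
    value: str
    timestamp: int  # hypothetical write timestamp; larger means newer


def read_with_repair(replicas, key):
    """Read `key` from every replica, return the newest value, repair stale copies.

    `replicas` is a list of plain dicts standing in for nodes.
    """
    found = [r[key] for r in replicas if key in r]
    if not found:
        return None
    newest = max(found, key=lambda v: v.timestamp)
    for r in replicas:
        current = r.get(key)
        if current is None or current.timestamp < newest.timestamp:
            r[key] = newest  # the "repair" step: push the newest version back
    return newest.value


node1 = {"user:42": Versioned("old@example.com", 1)}
node2 = {"user:42": Versioned("new@example.com", 2)}
node3 = {}  # this replica missed the write entirely
result = read_with_repair([node1, node2, node3], "user:42")
```

After the read, all three toy replicas hold the newest version, which is how replicas that disagree are brought back into agreement over time.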

A

That's when people move from using a traditional MySQL or a Mongo into Cassandra. So the use cases we see today are: suddenly your app becomes more popular than you expect, and then you see a higher number of users, a higher number of reads and writes. And write speeds obviously speak very well for Cassandra. And so, I mean, language is not the reason people are moving to Cassandra.

A

The main reason they're moving from Mongo is really because of scale, and being able to kill your app mercilessly. I mean, I actually tell my customers, if they are worried about restarting their Cassandra, I say just kill it. It's designed to replay everything that's in memory and not flushed: the commit log. Every time it comes up, it replays everything. So, I mean, we actually believe that the apps that live long enough are designed so that they're fault tolerant.
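The "just kill it, the log gets replayed" behavior can be illustrated with a minimal write-ahead log sketch. This is a toy under stated assumptions and not Cassandra's actual commit log format or flushing policy: `TinyStore`, the JSON-lines log and the fsync on every write are hypothetical choices for the example.

```python
import json
import os
import tempfile


class TinyStore:
    """In-memory table backed by an append-only log that is replayed on startup."""

    def __init__(self, log_path):
        self.log_path = log_path
        self.memtable = {}
        self._replay()
        self._log = open(log_path, "a", encoding="utf-8")

    def _replay(self):
        """Rebuild the in-memory table from whatever the log already holds."""
        if not os.path.exists(self.log_path):
            return
        with open(self.log_path, encoding="utf-8") as log:
            for line in log:
                record = json.loads(line)
                self.memtable[record["key"]] = record["value"]

    def put(self, key, value):
        # Log first and force it to disk, then update memory.
        self._log.write(json.dumps({"key": key, "value": value}) + "\n")
        self._log.flush()
        os.fsync(self._log.fileno())
        self.memtable[key] = value


log_path = os.path.join(tempfile.mkdtemp(), "commit.log")
store = TinyStore(log_path)
store.put("user:42", "alice")
store.put("user:42", "alice-updated")
del store  # simulate a kill: no clean shutdown, the in-memory table is lost

recovered = TinyStore(log_path)  # startup replays the log
```

Since every acknowledged write is already in the log, the restarted store ends up with the same in-memory state it had before the kill.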

A

That way. So, fault tolerance, and when you're getting to data sets that are not fitting on a small number of nodes, and replication, when you do multi-data-center replication. And all these features are coming into those databases too, so I will let those authors speak for them. Actually, those are the features that they are also seeing customers and the market want. The market wants the same thing. They have the same problems. They want availability. They want some level of consistency.

A

They want partitioning, right? So the market wants all of them, actually, and they want them all to be easy to use. So that's what the average designer wants, and so all these NoSQL stores are slowly heading towards that. So anyway, I hope I answered. There are some numbers around it too, but really, when you try to get beyond a few nodes, you quickly find yourself at the mercy of Mongo's scale. Yeah. So the question is: how does Cassandra handle backups?

A

So it's not something you don't worry about. Our customers have incremental backups, and there's a global snapshot. So you can take a global snapshot and you can take node snapshots and store these. So you have commit logs and data logs; they are usually run on different LUNs as well, so they get good performance. But so you take snapshots, and then you take incremental backups. Our customers take copies of these files and put them on S3, for example, and then take incremental deltas on them on their local.

A

So it's a regular file system backup. The actual app does not have any state, so all the state is in the YAML files, or the database and log files. It's all file based, and we don't do anything fancy on the file itself. It's real raw data, so hex data, so there's nothing that stops you from taking that snapshot and creating a new cluster. So yeah, and one of the features of OpsCenter is to actually press the button and do the same thing: get a snapshot.

A

Essentially, there is a set of things that you do before you take that snapshot, which is basically flush and repair and compact, and then take a copy. Yeah, people do that all the time. Yes. Thank you for being here.
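The pre-snapshot sequence described above (flush, repair, compact, then snapshot) could be scripted roughly as follows. This sketch only builds the command lines and does not run them; `backup_commands`, the host and the keyspace name are hypothetical, and the exact `nodetool` arguments differ between Cassandra versions, so check them against your own install.

```python
def backup_commands(host: str, keyspace: str):
    """Return nodetool invocations in the order described: flush, repair, compact, snapshot."""
    steps = ["flush", "repair", "compact", "snapshot"]
    # Assumed argument layout: nodetool -h <host> <step> <keyspace>
    return [["nodetool", "-h", host, step, keyspace] for step in steps]


commands = backup_commands("127.0.0.1", "my_keyspace")
```

Each list can be handed to `subprocess.run(command, check=True)` in order; copying the resulting snapshot files off the node (to S3, for example) is a separate step left out here.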

A