Apache Cassandra Cassandra Community Webinar Series, 30 Oct 2012

From YouTube: Cassandra Community Webinar | Is My App A Good Fit for Apache Cassandra?

Description

Speaker | Eric Lubow (CTO, SimpleReach)
Date | Tuesday October 23 @ 8:30AM PST

Join Eric Lubow, CTO of SimpleReach and DataStax MVP for Apache Cassandra, as he examines the types of applications that are suited to be built on top of Cassandra. Eric will talk about the key considerations for designing and deploying your application on Apache Cassandra. This webinar is 101 level.

A

Okay, so without further ado, we will get cracking. In our College Credit webinar series we're taking you all the way from "what is NoSQL" through to pretty advanced topics. This is the third in our series, and we're very happy to have Eric on with us today. Eric is an MVP for Apache Cassandra and he is also the CTO at SimpleReach. Having met Eric in person and heard about his mixed martial arts, he's definitely not someone to mess with.

A

So I'm glad we messed up from afar and he's not sitting next to me today. Many apologies, Eric. For questions and answers: in the previous webinars we've had tons of questions, so please go to the Q&A tab in WebEx and ask your questions there. I will be monitoring them, and at the end of the presentation we will ask Eric those questions and get through as many as we can.

A

You can also ask questions via Twitter using Cassandra QA, as you can see down here. And Eric, I will hand over to you. Just make sure you can advance your slides.

A

It looks like I got it. Okay, perfect, take it away.

B

Cool. So, first of all, thank you very much for having me. I will try to do my best to get through some of the superficial stuff a little more quickly, to give everyone time for some questions. So thank you very much.

B

My name's Eric Lubow. As he said, I'm the CTO of SimpleReach, and I'm going to talk to you a little bit about whether or not your app is going to be a good fit for Cassandra, or whether Cassandra is going to be a good fit for your app, depending on how you want to see it.

B

So I'm going to talk to you a little bit about how you can plan for deciding on a good data store, and about some of the data stores that are available out there. I'll show you how I made the comparisons, and maybe that sort of thing will work well for you. I'll give you a few use cases and some final thoughts, and then hopefully we'll have time for plenty of questions.

B

So, where are you? If you're in the planning stages, then this is definitely a good thing for you to be listening to, especially if you need to decide what the goal of your application is going to be. Sometimes that's obvious, and sometimes it's not. Just because you know what specific problem you're solving doesn't mean you know the best way to get there. I can speak to that problem from experience.

B

If you're at the stage of building the minimum viable product, then hey.

A

Eric, sorry to interrupt. Should we still be seeing your title slide, or your agenda slide? Because it's still on the title slide.

B

According to my screen, it says I'm on the "Where am I?" slide. Perfect.

A

It has now just changed. Great, thank you.

B

Okay, so there's a bit of a delay. I'll keep that in mind.

B

So if you're at the minimum viable product stage, my suggestion is that you're probably going to spend the time trying to decide what your application needs to be capable of coming out of the gate, in which case I would go with whatever is easiest for you to use. For some people that might be MySQL or Postgres, for some people it might be MongoDB, and if you're comfortable with Cassandra, then that might be a good place to start if you know you're going to be moving into that big data neighborhood.

B

But again, I'm going to talk a little bit more about that later. If you're on your iterative steps, where you've already built your minimum viable product and you kind of see this going in the direction of something a little larger than what you're used to handling, then that may be one of the times that you start exploring whether or not this is a good fit for you.

B

Or, if you're on your final decision, and it is a final decision, you probably should step back and go back to the iterative stage. There's really no end state. You need to make these decisions on a continuous basis, because sometimes the needs of your app are going to change.

B

It also depends on what you're building. If you're building a small user app, maybe MySQL is going to be good. A hobby project or a learning project: these two could be one and the same.

B

If you're building a learning project, then make sure your learning project is apt for the data store that you're going to be building it with. Building a blogging platform on Cassandra is not typically the best move. It's certainly doable, but it's just not what it excels at.

B

If you're building your own Twitter setup, then Cassandra's a really solid example and a great way to learn. If you're curious about whether or not it's going to work, it seems to work pretty well for Twitter, and they've got a fairly large setup going on on their end. And if you're really just after building a big data system in general, then Cassandra's certainly something that you're going to want to look at.

B

But again, you need to know where its strengths and its weaknesses are. So what does big data actually mean? Well, big data really depends on how you, the user, define big.

B

How you define big could be that it's slightly bigger than an Excel spreadsheet. For most people the Excel spreadsheet is less than 60,000 rows, and if you're in that realm, sometimes even 60,000 rows and a ton of columns could be very difficult to deal with.

B

Sometimes your data is bigger than can fit on one server. For some people one server might be a terabyte of space, and then you're starting to move into big data territory.

B

If you're looking at one rack, maybe that's 12 servers. Maybe you've got a couple of terabytes per server and about 15 terabytes in total. Maybe that's big data. But you've got to remember the big data companies: Google and Facebook are both big data companies, and they have petabytes.

B

So the term is very loose and very much a buzzword, so make sure you define what it means to you before you go forward. And here's the biggest truth about dealing with large amounts of data, even with the right tools: eighty percent of the work of building a big data system is acquiring the raw data and refining it into usable data.

B

That is something that most people do not understand going into the process, I being one of them. The vast majority of the terabytes and terabytes of data that we look at day in and day out at SimpleReach is stuff that we had to spend a lot of time turning into usable data, so make sure you keep this sort of thing in mind.

B

You may need multiple data stores and an intermediary processing system, where the initial piece is just for taking in raw data, and then you have a final data store that actually holds the usable data. So again, things to keep in mind.

B

Now, I'm not going to go through all of these planning questions; they're really just notes and a guide. I went through all of them for SimpleReach and it took an immense amount of time, but your answers will be very specific to what you are looking for.

B

Some of the basic questions are broken down into categories. Your data: what type of data it is, how you're querying it, how you're loading it. What type of schema you need to store it in is going to be a factor of that. So, if you need to aggregate data on the fly, then you're perhaps going to think MongoDB might be faster for aggregates and counters.

B

Cassandra has distributed counters, which means the reconciliation might be a little bit slower, on the order of a few milliseconds or a few seconds, and depending on your application, that might be entirely too slow. So you really need to look at what's important to you in terms of technology. Does it need to be fault tolerant? Does it need to support encryption standards, depending on the type of data you're storing? Does the data need to be distributed?

B

I think everybody got a healthy reminder of fault tolerance and distribution yesterday, with Amazon's us-east-1 data centers having trouble. The folks who were able to distribute their data and had a fault-tolerant system experienced less of a headache than those who didn't, one would hope.

B

Another big concern is whether there's support for your language. Cassandra, for instance, has support for a great many drivers and languages, but when we got into it, we found out that the Node.js support wasn't great, and we ended up having to roll our own driver. Another company that I've spoken to is interested in using Cassandra, but most of their technology is written in Go and there isn't a Go driver. So that's something that you need to be aware of.

B

If you work in the financial world, you're certainly going to have issues that you need to think about in terms of data center distance. Off the top of my head, I'm not 100 percent sure, but I believe it's 60 miles between stored data.

B

So you have to be able to support a distributed system in that respect. The other financial concerns are around how much it costs you to build things out.

B

If something goes wrong and you have 20 servers in one location and 20 servers in another location, that's people you may have to send to both locations if you've got a physical data center. The other things you need to keep in mind are what legal requirements you might be bound by. Folks in the medical field, I know, have HIPAA to deal with.

B

You need to know what the community is like around certain data stores. Cassandra's got a great community. There's always people helping you out.

B

So that's certainly stuff to keep in mind. So, what's out there? Well, there are a ton of tools out there, and here's a slide of just a few of them. I know I left off quite a few, but for the most part these are the high points.

B

So when you're comparing applications, you need to know what is good about each one and where you may run into problems based on your needs and again the languages and the support for each one of those data stores is going to vary based on what it's used for and how people who are existing in that community have chosen to use it.

B

So what am I saying, really? Basically, I'm saying you've got to find the right tool for the job. Sometimes that's easy, sometimes that's hard.

B

When you're looking at cassandra and when we looked at cassandra, we knew we needed a few things. We needed to be sure that there was a large volume of data that could come in at high velocity.

B

We look at approximately 150 to 200 million events per day, which translates to about 2,000 events per second. So even on a light day, we're going to be looking at about two or three times that in writes, because each event requires more than one write.

B

So, if you're looking at 6,000 writes per second, that's a lot to deal with. That's a high velocity and a high volume that your data store needs to be able to handle. Maybe MySQL is not so great at that.

B

Maybe it is. I'm going to get into how you compare data stores in a moment, but I'm just going to go through some of these basic pieces up front.
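The throughput figures quoted above can be checked with a little arithmetic. The following is only a back-of-the-envelope sketch: the function names are made up for illustration, and the choice of three writes per event is an assumption taken from the "two or three times" remark.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds


def events_per_second(events_per_day: float) -> float:
    """Average event rate, assuming events are spread evenly over the day."""
    return events_per_day / SECONDS_PER_DAY


def writes_per_second(events_per_day: float, writes_per_event: int) -> float:
    """Average write rate when each event needs several writes."""
    return events_per_second(events_per_day) * writes_per_event


low = events_per_second(150_000_000)
high = events_per_second(200_000_000)
peak_writes = writes_per_second(200_000_000, 3)  # assumed: 3 writes per event

print(f"{low:.0f} to {high:.0f} events per second")
print(f"about {peak_writes:.0f} writes per second at 3 writes per event")
```

These are daily averages; real traffic is bursty, so the peak rate a data store must absorb will be higher than the average.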

B

So when you look at your query patterns: for us, when we want to query something, we look at things in terms of page views, tweets, Facebook actions, and referrer data. The way we get to that data is we slice groups of information out of the rows. Let's say we want to see all the page views that happened in a particular hour: we can slice out page views. We want to see all the tweets that happened in a particular hour: we slice out those tweets.
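To make the idea of slicing columns out of a row concrete, here is a minimal sketch in plain Python. It is not SimpleReach's schema or a Cassandra client call: the column naming scheme (`type:timestamp`) and the sample values are assumptions, and a dictionary with sorted keys stands in for a wide row whose columns are kept in sorted order.

```python
from bisect import bisect_left

# One wide row for a single article (hypothetical data). Column names sort
# by event type first and then by timestamp.
row = {
    "facebook:2012-10-23T08:05": 3,
    "pageview:2012-10-23T07:59": 1,
    "pageview:2012-10-23T08:01": 1,
    "pageview:2012-10-23T08:30": 1,
    "pageview:2012-10-23T09:00": 1,
    "tweet:2012-10-23T08:15": 2,
}


def slice_columns(row, start, end):
    """Return the (name, value) pairs with start <= name < end, in order."""
    names = sorted(row)
    lo = bisect_left(names, start)
    hi = bisect_left(names, end)
    return [(name, row[name]) for name in names[lo:hi]]


# All page views recorded during the 08:00 hour.
pageviews_8am = slice_columns(
    row, "pageview:2012-10-23T08", "pageview:2012-10-23T09"
)
for name, value in pageviews_8am:
    print(name, value)
```

Because the columns are sorted, a slice is one contiguous range, which is why this style of query is cheap when the column names are designed around the questions you plan to ask.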

B

Another really important piece of determining what's good for you is finding things that have toolkits. Mature applications have toolkits built around them. Cassandra has OpsCenter. Monitoring a system that is distributed in this fashion is not very easy, especially when you need to monitor in different data centers.

B

You have to worry about latency. There are certain things that you can monitor very easily through typical monitoring applications like Nagios or Icinga, but there are certain things that require a little bit more work. The toolkits that get built around them, as I said, are a sign of the maturity of the application and the feature sets that come with it.

B

So if you find that there's a particular data store that can do roughly what you're looking for and there are no tools built for it yet, that's typically a sign that you're going to have to do a little more work than you may want to do, especially if you're still in the early stages of your product, like the MVP or the early iterations.

B

One of our favorite features, and I just like to say this all the time, is the TTLing of certain columns or rows. TTL stands for time to live, and it basically allows you to set an expiration date on your pieces of data.

B

So if I'm bringing in all of my data for the month of October, and then we hit November and I say, you know what, I don't need that anymore, I can set the expiration date as it comes in.

B

Then at the end of November all of October's data goes away, and that frees up a few terabytes of space, at least in our case, without us having to do any additional labor.
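As a rough illustration of the time-to-live idea just described, the sketch below computes a TTL in seconds and applies it in a toy in-memory store. Everything here is hypothetical (the `TTLStore` class, the key name, the dates); in Cassandra itself the expiry is handled by the database, for example through the `USING TTL` clause on a CQL write, rather than by application code like this.

```python
from datetime import datetime, timedelta


def ttl_seconds(written_at: datetime, expire_at: datetime) -> int:
    """Seconds between the write and the desired expiry (never negative)."""
    return max(0, int((expire_at - written_at).total_seconds()))


class TTLStore:
    """Toy in-memory store that hides values once their TTL has passed."""

    def __init__(self):
        self._data = {}

    def put(self, key, value, written_at: datetime, ttl: int) -> None:
        self._data[key] = (value, written_at + timedelta(seconds=ttl))

    def get(self, key, now: datetime):
        entry = self._data.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        return None if now >= expires_at else value


# Data written in mid-October that should disappear at the end of November.
written = datetime(2012, 10, 15)
end_of_november = datetime(2012, 12, 1)
ttl = ttl_seconds(written, end_of_november)

store = TTLStore()
store.put("pageviews:2012-10-15", 1234, written, ttl)
print(ttl, "seconds until expiry")
```

The point of setting the TTL at write time is that nobody has to run a cleanup job later; the expiry date travels with the data.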

B

So that's a very handy feature for us. To give you an idea of what the data looks like for us, here is a more specific example. We get to take a look at social data from all around the web. Any time an article gets published, we look at all the social data that comes out around that article in real time. So we take a look at the Facebook actions, we count the page views, we count the tweets, and we go across all the rest of the social networks.

B

So if we only want to slice out a little bit of that data, we can say, hey, just pull out those page views, and you'll see it pull out those four items. Then we can dig in and find more information: find out a little bit about the user, find out about the referrer, and find out what browsers are being used most frequently.

B

If we want to pull out the Twitter data, we can find out how many tweets happened over the course of that hour. And because we can segment by, in our case, social network, it gives us a really good idea of what's happening and what's trending.

B

That's something that's very specific to us, but it's an example of a use case.

B

So let's talk a little bit about MongoDB. One of the things that's really good about MongoDB is that it is incredibly fast at atomic increments, meaning that you can add one or two or three, or whatever the number is, to a number that's already in the database.

B

If you're coming from a JSON-based or JavaScript-based language like Node, it works very fast. There's no change of language, serialization, or deserialization that has to happen, and it allows you to do things very quickly. That's a huge advantage for us because of our infrastructure: we happen to be a Node.js-heavy shop.

B

The way MongoDB shards is actually also very interesting and quite different from Cassandra.

B

For example, Cassandra takes the hash of the key, which is how it decides where your data is in the cluster, and says, I know roughly where that is, and it can go out and get the data. With MongoDB, it knows exactly where the data is in every case, which means that finding out where your data is may be slower, but then getting at your data is faster. Cassandra takes the opposite approach.

B

Finding out where your data is is going to be faster, but actually bringing the data back could potentially be a little bit slower. So these are trade-offs.
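The point about Cassandra using a hash of the key to decide where data lives can be sketched with a toy token ring. This is a simplified illustration, not Cassandra's partitioner: the node names are hypothetical, each node owns a single token, and MD5 is used only as a convenient stable hash.

```python
import hashlib
from bisect import bisect_left


def token(key: str) -> int:
    """Map a key to a large integer token using a stable hash."""
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)


class Ring:
    """Toy hash ring: each node owns the token range ending at its token."""

    def __init__(self, nodes):
        # Sort nodes by their position on the ring.
        self._tokens = sorted((token(name), name) for name in nodes)

    def owner(self, key: str) -> str:
        """First node whose token is >= the key's token, wrapping around."""
        idx = bisect_left(self._tokens, (token(key), ""))
        return self._tokens[idx % len(self._tokens)][1]


nodes = ["node-a", "node-b", "node-c"]  # hypothetical node names
ring = Ring(nodes)
for url in ["example.com/story-1", "example.com/story-2"]:
    print(url, "->", ring.owner(url))
```

Locating a key is just one hash and one lookup in a small sorted list, with no central directory to consult, which is the "finding out where your data is is fast" half of the trade-off.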

B

If you're building the system from scratch and you're early on, you're going to want something that has a good tie-in, especially if you have a web app. Take the ORM, the object-relational mapper, for Rails: we use Mongoid, but there's a ton of them out there. So you're going to want something that allows you to build that web layer, and for us there's just nothing out there like that for Cassandra, so we needed to use MongoDB as well.

B

The other thing that MongoDB gives us is a pub/sub system. If you're not familiar with what publish and subscribe does, it really gives you the ability to say, any time something new comes in, pass it back to the client immediately. It also gives us B-tree indexes, which is something we don't have in Cassandra.

B

A B-tree index essentially says: if I have 10 numbers, I can ask for a few in the middle, say the ones that are below four and above one. That's something that's only available with B-tree or B+ tree indexes, and you won't find it with the hash indexes that are available in Cassandra.

B

And because we store everything in JSON, as you noticed from the previous slide, the document model that MongoDB uses is very handy for us. It again allows us to TTL data, which means that we can get rid of it whenever we need or whenever the clock expires, and that's really helpful for us in terms of keeping our amount of free space available. This is what the document looks like for us. We get to take a look at all of the social data for a particular URL, and in this case we just use the increments.

B

Because again, the increments are very, very fast. So this is one way in which MongoDB, for us, performs better than Cassandra.
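The earlier remark about B-tree indexes supporting "below four and above one" style queries, which hash indexes do not, can be illustrated with a small sketch. A sorted list searched with `bisect` stands in for an ordered index and a dictionary stands in for a hash index; neither is how MongoDB or Cassandra implement their indexes, and the document names are made up.

```python
from bisect import bisect_left, bisect_right

values = list(range(1, 11))  # the "10 numbers" from the example

# Ordered index: keys are kept sorted, so a range is a contiguous slice.
ordered_index = sorted(values)


def range_query(index, low, high):
    """Return keys strictly between low and high using two binary searches."""
    return index[bisect_right(index, low):bisect_left(index, high)]


# Hash index: constant-time exact lookups, but no ordering between keys.
hash_index = {v: f"doc-{v}" for v in values}


def range_query_hash(index, low, high):
    """With a hash index, a range query has to examine every key."""
    return sorted(k for k in index if low < k < high)


print(range_query(ordered_index, 1, 4))    # uses the ordering
print(hash_index[3])                       # exact match is where hashing shines
print(range_query_hash(hash_index, 1, 4))  # same answer, but a full scan
```

Both approaches return the same keys here; the difference is the work involved, since the ordered index touches only the matching range while the hash index must scan everything.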

B

Now Redis is a whole different animal. The reason I'm talking about Redis is that I want to be able to make the comparison. A lot of people compare MongoDB and Cassandra directly when, in fact, they have different use cases. The same thing goes for Redis, but it's slightly more obvious.

B

So if you want to talk about what's good and what's bad about Redis: Redis, just like MongoDB or Cassandra, can support thousands or hundreds of thousands of transactions per second, but they all do it slightly differently.

B

But what's great about Redis is that you can guarantee that every transaction you go after is going to be stored in memory. Everything is memory mapped, and that's a speed thing, but it also binds you to the amount of memory that you have on that particular machine. So again, that's a compromise you're going to have to make. If you're on Amazon, that's typically going to be 64 gigs, and that's not a whole lot of data.

B

It's just enough to keep you going, which is why we tend to use it as a caching engine. You can expire your caches, again just TTLing like the other systems do, and that allows you to minimize the amount of data that you're storing in memory.

B

It also allows some really cool data types like sets, sorted sets, and lists. A lot of those things are coming to Cassandra, I believe in 1.2, but they're not out yet, so you need to keep that in mind when you're making that decision about where to go.

B

It also acts as an excellent centralized locking system. Part of the problem that you could end up running into with Cassandra is that Cassandra is eventually consistent. So if you try to write a lock on one server, it may not have reached the other server by the time you go to check whether that lock exists.

B

So while it's great for distributing your data and writing things really quickly, the fact that it may not be there when you go to read it immediately, the idea of eventual consistency, is something again that you need to take into account. With Redis, because everything happens and is stored in memory, you're guaranteed that as soon as you write that lock it's going to be available on the following read.
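The locking problem described above can be reproduced with a toy model of two replicas. This is an illustration of the race only, under the assumption that a write lands on one replica and reaches the others later; the class and key names are hypothetical and this is not how Cassandra's replication is implemented.

```python
class EventuallyConsistentStore:
    """Toy store: a write hits one replica now and the others later."""

    def __init__(self, replica_count=2):
        self.replicas = [{} for _ in range(replica_count)]
        self.pending = []  # (replica index, key, value) not yet applied

    def write(self, replica, key, value):
        self.replicas[replica][key] = value
        for other in range(len(self.replicas)):
            if other != replica:
                self.pending.append((other, key, value))

    def read(self, replica, key):
        return self.replicas[replica].get(key)

    def replicate(self):
        """Deliver all pending writes, as the background sync would."""
        for replica, key, value in self.pending:
            self.replicas[replica][key] = value
        self.pending.clear()


store = EventuallyConsistentStore()

# Worker A takes the lock by writing to replica 0.
store.write(0, "lock:report-job", "worker-a")

# Worker B checks replica 1 before the write has propagated.
seen_by_b = store.read(1, "lock:report-job")
print("worker B sees:", seen_by_b)  # looks free, so B would also proceed

store.replicate()
print("after replication:", store.read(1, "lock:report-job"))
```

A single in-memory server avoids this window because there is only one copy to read, which is the property the speaker relies on when using Redis for locks.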

B

Now I did talk a little bit about what was negative about Cassandra and a little bit about what was negative about MongoDB. You can't just look at the good parts of each system.

B

Redis is fantastic as a caching engine, but its limit is that, other than being able to store the data in memory, it can only utilize a single core. So if you have a 64 gig server on Amazon and you're only using one core, you're quickly going to overwhelm the registers on the CPU and you're not really going to be able to get your best performance.

B

So knowing the internals of your system is also something that you need to keep in mind. I talked a little bit about why B-trees are important for us and how they could be important to you. One of the things that's difficult to deal with with MongoDB, especially if you're in the cloud, is that it forces the ping times to be very, very short. How they check whether or not the sister servers are alive is coded directly into the application and not configurable.

B

So if you have a widely distributed application, say you're in Amazon US East and Amazon US West, odds are your servers will not survive in a replica set. So that's something to keep in mind: if you need that distributed capability, then MongoDB may not be the right answer for you.

B

So what are some cool use cases, now that I've talked all about what things are good for and what they are? Cassandra is great at storing time series data, and the reason Cassandra's so good at storing time series data is that it's a very write-heavy system.

B

As soon as you write something it stores a timestamp, and it writes sequentially. Then it goes back at the end of a large amount of writes and does something called compaction, which puts all that stuff together into an easily accessible format for the system to read, and that format is always in time series. So storing time series data is native to the application itself.
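The write-then-compact behaviour described above can be sketched as a toy store. This is a heavily simplified model and not Cassandra's storage engine: the names `memtable` and `sstables` are borrowed from Cassandra's vocabulary as labels only, the flush threshold is arbitrary, and timestamps are passed in by hand.

```python
class ToyStore:
    """Toy write path: buffer writes, flush sorted batches, compact later."""

    def __init__(self, flush_threshold=2):
        self.flush_threshold = flush_threshold
        self.memtable = {}   # in-memory buffer of recent writes
        self.sstables = []   # immutable, sorted batches written in sequence

    def write(self, key, value, timestamp):
        # Every write carries a timestamp; nothing on disk is modified.
        self.memtable[key] = (timestamp, value)
        if len(self.memtable) >= self.flush_threshold:
            self.flush()

    def flush(self):
        if self.memtable:
            self.sstables.append(sorted(self.memtable.items()))
            self.memtable = {}

    def compact(self):
        """Merge all batches, keeping the newest timestamp for each key."""
        merged = {}
        for table in self.sstables:
            for key, (timestamp, value) in table:
                if key not in merged or timestamp > merged[key][0]:
                    merged[key] = (timestamp, value)
        self.sstables = [sorted(merged.items())]


store = ToyStore()
store.write("sensor-1", 20.5, timestamp=1)
store.write("sensor-2", 18.0, timestamp=2)  # triggers the first flush
store.write("sensor-1", 21.0, timestamp=3)
store.write("sensor-3", 19.5, timestamp=4)  # triggers the second flush
store.compact()
print(store.sstables[0])
```

Writes are cheap because they only append; the cost of tidying up is deferred to compaction, which is why this design suits write-heavy workloads such as time series.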

B

So that's something to keep in mind when you're working with time series data, like for sensors or for events or anything along those lines. Counters and voting systems are really great to do in Cassandra, especially if you have a high volume, because ultimately you can make those increments anywhere. Then there's feed-based activity, which is just like events. You could think of GitHub; I know a lot of people use GitHub.

B

Every time you get an event, you have the ability to just throw that into a column based on the user, say the user is the row, and you can start pulling that back into the application in chunks. Say, I want everything that happened on this day for this user, and you get a really good cross-section of what you're after.

B

Also, when you're after large amounts of data, you really need to think about what's going to be good at storing it and accessing it for your data and your pattern, and that again goes back to the large volume and high velocity ingestion.

B

This slide, just so we all know, was made prior to Amazon going down yesterday. It's just that much more entertaining that it happened and I have this slide.

B

The purpose of the slide, however, is to explain that when you want to iterate quickly and test your assumptions, when you're building that MVP and doing your iterative testing, the easiest thing to do, and sometimes the cheapest if you want to use spot instances, is just to fire up a couple of things in the cloud. Fire up one or two machines, build a small cluster, whether it's Cassandra or MongoDB or Riak, and test your assumptions. There are libraries for doing just about everything in the cloud.

B

If you want to use Cassandra, you can use, I believe it's called CCM, and you can spin up a cluster with about four lines of Python to test your assumptions. That's a really good way to get going, and I believe those four lines of Python are even in the README. So it's really not that much more work than copying and pasting.

B

This is something I like to talk about regardless of whether or not you choose to use Cassandra. Whatever application or data store you intend to use, you really need to think about what happens when you need help. Who are the experts? Can you get to them easily?

B

Is it going to cost you an arm and a leg? This is not intended to be a pitch for the guys over at DataStax, but these guys have by far the best customer service I've had from any vendor in the years I've been working in IT, so keep that sort of thing in mind.

B

10gen, the guys who make MongoDB, typically have great customer service as well. But with some of the data stores that are up and coming, or the ones that have been around for a very long time, you may not be able to get all the help you need.

B

I'm sure that many of you who have dealt with Oracle have horror stories about trying to get Oracle to help you out when you're in need, when maybe you've lost some data. Having that expertise available to you is something you don't think about until you need it, and when you need it, it's probably too late to look for it. So make sure you keep that in mind.

B

So what happened? What did we talk about? Well, we talked about planning. We talked about finding good write and read patterns for your data. We talked about toolkits. But the biggest thing I think we talked about, and that everyone really needs to be aware of, is what compromises you're willing to make in order to get your data into a fashion that's good for you, your application, and your use case.

B

So I hope we still have enough time for questions, and thank you very much for listening.

A

Thank you very much indeed, Eric. So you can start to submit your questions in the Q&A tab on the WebEx, or you can go to Twitter and use Cassandra QA, and we will pick them up there.

A

In the meantime, as I said at the beginning, this is one of a series of webinars. Today we had Eric with "Is My App a Good Fit for Apache Cassandra?" In two weeks' time Aaron Morton will be back on to take a look at data modeling for Apache Cassandra, so we look forward to seeing you there. And then also, just to put a little plug in there, I was talking with our recruiter yesterday and he said, hey, make sure you tell everyone that we're hiring.

A

This is certainly an area where there is a skill set shortage all the way along, so by attending this webinar series and sharpening your skills, you are definitely making yourself very attractive in the market today.

A

Okay, any questions here? I'm not seeing any come in yet, or maybe I'm looking in the wrong place.

B

Yeah, there's one question here: which data management system did I move to Cassandra from? So...

A

Oh, you may, you may... your own ones, Eric. I am not seeing questions. Okay.

B

Sure. So the question was, which data management system did we move to Cassandra from? The answer is that when we started out, we were using MongoDB, and that was it.

B

But the fact is, we actually haven't ditched MongoDB, and we really have no plan of getting rid of it anytime soon, because what we found was that there is really no one good answer for our problems. We have terabytes and terabytes of data, and the required view of it is different for various parts of our system. We provide a social analytics package, and sometimes the data needs to be viewed in aggregate.

B

Again, MongoDB is fantastic at providing those counts, and we've kept MongoDB around explicitly to do those increments and get at that data very, very quickly.

B

The other thing we use it for, as I had mentioned, is pub/sub. So with those two things MongoDB is continuing to serve a purpose as a primary data system for us. The only other thing it would have been good for, had we decided to use it later on, is probably the ORM: being able to have the user and account management for the front-end application.

B

We moved to Cassandra because we found that we were pretty much bringing MongoDB down on a regular basis, based solely on the fact that the data input speed and volume was too much for it to handle. We also couldn't get the distribution and the fault tolerance that we needed. So while we still do use MongoDB, it is not our authoritative data source any longer.

A

And Eric, we are starting to get lots of questions that I can see in here now. Rodolfo asked: could you explain how Cassandra is better than MongoDB for active-active with multiple data centers geographically distributed (Europe, Asia, US West, et cetera)?

B

So Cassandra works better for the distributed case, because there's no real concept of active-active with Cassandra.

B

The whole idea of active-active is basically about whether there is required to be a master, and whether every write or read has to go through that master at some level.

B

The fact is that when you have a system like Cassandra that's distributed, without getting too much into the nitty-gritty, you can actually query any node for any bit of data, and that node will act as a coordinator for the query itself. So even though all the data is not stored on a particular node, that node knows where it can go to get the data.

B

And beyond that, it doesn't just know where it can go to get the data, it knows where it can go in terms of distance. When I say distance, it might be geography to us, but it's topology to the computers themselves, to the network. So if it has a couple of other nodes in the same data center, say us-east-1a, and there's a couple of nodes in us-west-1a, there is something called a snitch that will say, you know what, I know that it's closer for me to get to the US East nodes.

B

So I'm just going to ask these guys, and if they have the answer, then I'm going to go here rather than going all the way to US West, which has a slower ping time to me.

B

With MongoDB, the requirement for the replicas to be very close to each other actually puts you in a somewhat limited position, because you can't have things that are too geographically distributed. If I remember correctly, the required replica ping time is like one second, and by the laws of nature it takes a few milliseconds just to get from one side of the country to the other, let alone either side of an ocean. If you get any blip in that, then you're going to have a problem keeping that connection up and keeping that replica in sync.

B

So it just presents a few challenges in that respect. They do have the concept of tagging, but that's still in a pretty nascent state.
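The coordinator-and-snitch behaviour described in this answer can be sketched as a simple ordering of replicas by topological distance. This is an illustration of the idea only: the `Node` type, the three-level distance score, and the node names are assumptions, and in a real cluster the snitch is configured in Cassandra rather than written by hand.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    name: str
    datacenter: str
    rack: str


def proximity(coordinator: Node, other: Node) -> int:
    """Lower is closer: same node, same rack, same data center, elsewhere."""
    if other == coordinator:
        return 0
    if other.datacenter == coordinator.datacenter:
        return 1 if other.rack == coordinator.rack else 2
    return 3


def sort_by_proximity(coordinator: Node, replicas):
    """Order replicas so the coordinator asks the nearest ones first."""
    return sorted(replicas, key=lambda node: proximity(coordinator, node))


# Hypothetical topology: the coordinator sits in us-east, rack 1a.
coordinator = Node("east-1", "us-east", "1a")
replicas = [
    Node("west-1", "us-west", "1a"),
    Node("east-2", "us-east", "1b"),
    Node("east-3", "us-east", "1a"),
]

for node in sort_by_proximity(coordinator, replicas):
    print(node.name, node.datacenter, node.rack)
```

The coordinator can then try replicas in that order, so a request is only sent across the country when no nearby replica can answer it.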

A

Okay, thanks Eric. We will try to get through as many of these as possible. Tom asks: did you consider HBase? Do you have any perspective there?

B

Yes, we did consider HBase. In fact, when we sat down to do our original considerations, we basically looked at three systems: Riak, Cassandra and HBase. I know there are more out there; Google has theirs and Amazon has DynamoDB. We ruled those out because we didn't want to be put in a position where our data is stored in a place where we can't get it out, so you're sort of bound by what Amazon or Google gives you as a feature.

B

For instance, we're still growing, but we have about 25 to 30 terabytes of data, and if we wanted to pull that out and still maintain some sort of continuity, that's a pretty challenging feat in and of itself, let alone dealing with the 20 to 30 terabytes on your own. So we ruled those out.

B

We looked at Basho's Riak, and we didn't like the fact that there weren't too many tools around it and it was still a little immature at the time, although it's gotten quite a bit better in the last six to nine months. So we ruled that out from a maturity and tooling perspective. When it came to HBase, we looked at the two, Cassandra and HBase, and said: okay, what are the real differentiators here?

B

Well, with HBase we knew we would have to spend some time with ZooKeeper and dealing with all the region nodes, I'm sorry, the regions, and some of that stuff just gets a little complex. What was nice about what DataStax does is that they actually have a product that kind of bundles that and takes care of it for you. If you choose to deal with it, you can, but otherwise you can sort of ignore it and leave it as a little black box under the hood.

B

But the biggest thing for us was: how do we get support, and how do we know that the product can do everything we're going to need it to do, not just now, but a year from now? In terms of support, Cloudera does a great job, and they've been doing a great job of taking ownership of HBase.

B

But going forward, we just didn't really see that they were having a whole lot of control in terms of determining the roadmap. In terms of all of our tests, we found that the speed and everything was pretty similar between the two of them on a lot of our query patterns and our write patterns. When it came down to making the decision, we said: you know what, Cassandra looks pretty good in the fact that we can help determine the roadmap. We've gotten to do that by speaking to the folks at DataStax pretty regularly and building some of the drivers out.

B

But we've also been able to write some of the features ourselves and hand them up, and a couple of things have been brought into Cassandra as a result of that. We just didn't see that as something that was possible with HBase.

A

Great, thank you very much. Al Jurgensen asks: is Cassandra a poor data store for an app that requires search, i.e. I want to retrieve by equality on almost every column? Just a little plug there, Al, for DataStax Enterprise: Cassandra is the engine that powers the platform, but it also integrates Solr for search and Hadoop for batch analysis as well. Eric, I don't know if you need search for your app, but maybe you could answer from your perspective.

B

Sure, I can give you a very small and basic answer, because we are not heavy Solr users; we're actually in the midst of testing Solr out. Solr does not work very well at the moment with what are called wide rows. It works very well for skinny rows, and just about everything we do is wide rows and composite-column based. Once you get a little bit more into the Cassandra schema, you'll know a little bit more about what composite columns are.

B

So in our specific use case, it's actually not as good as we'd like it to be, though if the support was there we would almost definitely use it. What we've ended up doing is that we load all of our search data into Redis and then we query Redis, but the search data is all loaded from Cassandra.

B

This way, Redis being the sort of ephemeral store that it is, if it goes down, we could bring up a new one, or bring up multiple Redis servers, and just warm the data in the cache. So I don't have a great answer for you. I would love to know more about Solr in that case, but if Solr happens to be something that's a good fit for you, I do highly recommend looking into Solandra or DSE.
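
The pattern in this answer, an ephemeral search cache that can always be rebuilt from the durable store, can be sketched roughly as follows. Plain Python dictionaries stand in for both Cassandra and Redis here, and the row keys, the text and the helper names `warm_search_cache` and `search` are hypothetical; the talk does not describe how SimpleReach actually structures its search data.

```python
from collections import defaultdict


def warm_search_cache(source_rows):
    """Build a fresh term -> row-id index from the durable store's rows."""
    cache = defaultdict(set)
    for row_id, text in source_rows.items():
        for term in text.lower().split():
            cache[term].add(row_id)
    return cache


def search(cache, term):
    """Look up a single term; unknown terms return an empty set."""
    return set(cache.get(term.lower(), set()))


# Stand-in for data held durably in Cassandra (hypothetical rows).
durable_rows = {
    "post:1": "Cassandra data modeling",
    "post:2": "Warming a Redis cache",
    "post:3": "Cassandra and Redis together",
}

cache = warm_search_cache(durable_rows)
before = search(cache, "redis")

cache = None  # simulate losing the ephemeral cache
cache = warm_search_cache(durable_rows)  # rebuild from the source of truth
after = search(cache, "redis")
```

Because every entry in the cache is derived from `durable_rows`, losing the cache costs only the time it takes to warm it again, which is the property the answer relies on.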

A

Thanks Eric, and that may be a good topic for a future webcast for us. Frank asks: would Cassandra be a good fit for an ordering system, and then, in parentheses, with lack of transactions, rollbacks, ACID?

B

Well, if transactions are really important to you, I probably wouldn't go with Cassandra. You could certainly fake transactions, but I don't think it's a good thing to build transactions on top of a non-transactional system.

B

It's because you're going to end up having to over-engineer something, and if they decide to build transactions in on the roadmap, then you're going to have to undo what you did in order to keep up with it. So I would recommend against doing something that low-level and building that on top.

B

But you could do it on the longer end: you store the initial order using a system like MySQL, or something that does transactions very well, in the front end.

B

And then you move all that data, perhaps with a batch job, to Cassandra at the top of the hour or at the end of every day. You can use Cassandra as a back-end system and keep MySQL very lightweight. That's certainly a possibility.
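
A minimal sketch of that split, assuming SQLite (from the Python standard library) as a stand-in for MySQL and a plain dictionary as a stand-in for a Cassandra table. The table layout and the helper names `make_front_end_db`, `place_order` and `export_new_orders` are hypothetical, and a real batch job would use the actual database drivers and a scheduler.

```python
import sqlite3


def make_front_end_db():
    """Create an in-memory transactional store (SQLite standing in for MySQL)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT NOT NULL)")
    conn.execute(
        "CREATE TABLE order_items "
        "(order_id INTEGER NOT NULL, sku TEXT NOT NULL, qty INTEGER NOT NULL)"
    )
    conn.commit()
    return conn


def place_order(conn, customer, items):
    """Insert an order and its items atomically; any failure rolls back both."""
    with conn:  # commits on success, rolls back if an exception escapes
        cur = conn.execute("INSERT INTO orders (customer) VALUES (?)", (customer,))
        order_id = cur.lastrowid
        for sku, qty in items:
            if qty <= 0:
                raise ValueError("quantity must be positive")
            conn.execute(
                "INSERT INTO order_items (order_id, sku, qty) VALUES (?, ?, ?)",
                (order_id, sku, qty),
            )
    return order_id


def export_new_orders(conn, archive, last_exported_id):
    """Copy orders newer than last_exported_id into the archive.

    Returns the new high-water mark so the next batch run can resume from it.
    """
    rows = conn.execute(
        "SELECT id, customer FROM orders WHERE id > ? ORDER BY id",
        (last_exported_id,),
    ).fetchall()
    for order_id, customer in rows:
        archive[order_id] = {"customer": customer}
    return rows[-1][0] if rows else last_exported_id


conn = make_front_end_db()
archive = {}  # dict standing in for a Cassandra table

place_order(conn, "alice", [("sku-1", 2)])
try:
    place_order(conn, "bob", [("sku-2", 0)])  # invalid quantity: whole order rolls back
except ValueError:
    pass

mark = export_new_orders(conn, archive, 0)  # the hourly or nightly batch step
```

The transactional guarantees stay in the relational front end, and only committed orders ever reach the archive, so the back-end store never has to emulate a rollback.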

A

And Eric, I'd like to say that that's what we see a lot of customers doing. If you think of your old OLTP paradigm, the OLP goes into Cassandra and the T stays in a relational database. We see Oracle, we see MySQL quite a bit, so that's pretty much what you just described.

A

Cool. Trevor asks: it's worrying that Cassandra is at such a low version level. He has an Oracle background, and this suggests it will be buggy, as it is not yet heavily used. Any comments there?

B

So, that's always a tough thing to deal with. You have to decide what your tolerance is for bleeding edge, and there certainly are bleeding-edge versions of Cassandra, but the stable versions, especially the ones that are in DSE, DataStax Enterprise, are a different story.

B

We've been using it for probably about nine or ten months at this point. We found a lot of bugs early on. Then we moved to DSE and we found very, very few, and the ones that we found were really edge-case bugs. So you're always going to run into that bleeding-edge software sort of problem.

B

But the fact is that if you want to be a part of progress, if you want to be progressive, Oracle is maybe not the best at being progressive.

B

Consider that at this year's OpenWorld they just decided to really take on this cloud thing head on, and that's been around for quite a few years. I think you always have that risk. But again, this is a chicken-and-egg thing: the fewer people that adopt it, the fewer people are going to find bugs; the more people that adopt it, the more bugs can get worked out. So if you want to take the chance, it's a pretty safe system.

B

At least it has been in our experience. But that's a personal decision. That would be a great question for the planning slide: are you willing to take that sort of risk?

A

Master or chowdhury asks: can you share your data modeling experience with Cassandra? I'd just like to reiterate that our next webinar, in two weeks' time, is focused on data modeling in Cassandra. Aaron Morton will be doing that one. Can you share your experience briefly?

B

Sure. Iteration: that is our data modeling experience.

B

I would say that when we first started, we probably did it wrong about 10 or 12 times in a row, but we knew we were going to get it wrong the first few times. We did something and it worked for a couple of days, and then all of a sudden we found out that we couldn't query like that very efficiently.

B

So we tried something else, and this happened over and over, because when you're solving problems that either haven't been solved before or don't get solved very frequently, you're going to run into these data modeling challenges.

B

So the idea is: keep your ability to iterate open, and try not to tie yourself down until you've found something that actually works for you. One of the things we did was ask the folks at DataStax if they could send someone out to our office to spend a day with all the engineers and teach us about modeling. We ended up doing that, and we made a lot fewer mistakes after that.

B

But there is going to be a learning curve, just as there is with any system. You're going to find out, through the query profilers, which things may work better if you're using your data in one form or another. Some query optimizers are going to work better than others; each data store has its own particular issues.

B

So what you may think would work well coming from a MySQL background may not work in Cassandra. With data modeling, there are some basic paradigms to follow, and I'm sure Aaron's going to cover those very well in the next webinar. But the fact is, you have to be flexible and you have to be able to iterate.

A

Thanks Eric, and I think, given we started a little late, let's squeeze one more in. Rationale asks: aside from needing to handle more volume or increasing velocity, can you talk about some other considerations for moving from relational to a big data model? And just before you give your answer: in your webinar you highlighted time series and multi-data-center replication. Those are obviously two. Are there other things that you would bear in mind other than volume and velocity?

B

Sure. The biggest thing that I think many people have a discomfort with when moving to big data is that in relational systems you're always concerned about normal form: third normal form, fourth normal form, whatever it is. You find that when you get to these new data modeling systems that involve big data, you're going to store your data once, twice, three times, four times, in some cases even five times, and you're really storing the same thing, just in a few different ways.

B

There are a couple of particular sets of data that we store six different ways. I know that sounds completely eccentric, but it's really the best way to access your data. And when I say six ways, I don't mean six ways in Cassandra. I think we actually store it three ways in Cassandra, then two ways in Redis and one time in MySQL. Seven, actually, because it'd be one of...

B

The issue then comes from having a comfort with the new paradigms. People who come from the old-school RDBMS world tend to have a little bit of resistance to change. That's not everybody, but you'll find that when you say: I understand that it takes one terabyte to store this data, but in order to really access it the way we want, you may have to store it three or four times, then all of a sudden you have four terabytes that you need to store.

B

It becomes a little bit more of a cost challenge and a paradigm change that some people just aren't very comfortable with. So it's a lot more cultural than one would think when you're moving from relational to big data, and I certainly think that that's something to keep in mind.
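
The "store it several times, once per access pattern" idea from this answer can be illustrated with a toy example. The `DenormalizedEvents` class, its three views and the sample events are hypothetical; they only show how one logical write fans out into several physical copies and why storage grows with the number of query patterns.

```python
from collections import defaultdict


class DenormalizedEvents:
    """Toy store that writes each event once per query pattern."""

    def __init__(self):
        self.by_user = defaultdict(list)  # answers "what did this user do?"
        self.by_day = defaultdict(list)   # answers "what happened on this day?"
        self.by_url = defaultdict(list)   # answers "who touched this URL?"

    def record(self, user, day, url):
        event = (user, day, url)
        # The same fact is written three times, once per lookup key.
        self.by_user[user].append(event)
        self.by_day[day].append(event)
        self.by_url[url].append(event)

    def copies_stored(self):
        """Total stored entries across all views."""
        views = (self.by_user, self.by_day, self.by_url)
        return sum(len(events) for view in views for events in view.values())


store = DenormalizedEvents()
store.record("alice", "2012-10-23", "/a")
store.record("bob", "2012-10-23", "/a")
store.record("alice", "2012-10-24", "/b")
```

Three logical events end up as nine stored entries, the same multiplication that turns one terabyte into three or four in the example from the talk. Each read, in exchange, becomes a single lookup by key rather than a join.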

A

So Eric, thank you so much, and apologies again for the issues we were having. Please join us in a couple of weeks' time, on the 7th of November, for data modeling for Apache Cassandra.

A

We will make the slides and the archive available on the website in a couple of days, and you will be notified when they're available. Thanks, everybody.

B

Thanks for having me.

B