From YouTube: C* Summit 2013: The World's Next Top Data Model
Description
Speaker: Patrick McFadin, Principal Solutions Architect at DataStax
Slides: http://www.slideshare.net/planetcassandra/c-summit-2013-the-worlds-next-top-data-model-by-patrick-mcfadin
You know you need Cassandra for its uptime and scaling, but what about that data model? Let's bridge that gap and get you building your game-changing app. We'll break down topics like storing objects and indexing for fast retrieval. You will see that by understanding a few things about Cassandra internals, you can put your data model in the spotlight. The goal of this talk is to get you comfortable working with data in Cassandra throughout the application lifecycle. What are you waiting for? The cameras are waiting!
This is, like I said, the third part of a series, and so the saga continues. There will be no models on stage; this is a data modeling discussion. So if you do not want to do data modeling on Cassandra, you're in the wrong place, but don't move, because I like the numbers. That's good. I was going to try to beat Adrian on headcount, and I don't think I'm going to do it, but I'm close. I'm really close.

So we started this whole series together. Well, by show of hands... I guess not all of us, but you'll get there. "The data model is dead, long live the data model" was really all about moving from the relational modeling world. My background: I was a relational DBA, a very relational developer, and I think a lot of you have probably been in the same boat. So how do you get from point A to point B? That was really the start: how to get from a relational model, with multiple tables and normalized data forms.
It's like bash scripts, right? There was only one dude who ever wrote a bash script; the rest of it is just copies. Same with Perl. Mike, if you're here, I'm sorry, it's the truth. "Ah, yes: you wrote a Perl driver for Cassandra; I'm going to copy it and then I'm going to make my modifications." So we're just going to go through a few little topics to catch everybody up.
Why does the data model matter? I've said this before, and I just said it this morning to Pankaj, my friend: Cassandra lives closer to your application. This is really where the data model is much more important with Cassandra, because if it's living close to your application, there are no generalizations, and it makes you think a little differently about how you deploy your applications. We always talk about how Cassandra has a right use case. Yes, it does have a right use case. I was an Oracle DBA.
What did I want to do? Pankaj, I'm going to go back to using you as an example. We'd just use Oracle for everything, because it was what we had, but then it didn't work out for a lot of things: I wanted to do multiple data centers, or things like that. I love memcache, I used a lot of memcache, but that doesn't work for everything either. So we've gotten to this point: polyglot persistence is where we're at now, in 2013. We store our data differently for different applications, for different use cases.
So when we do that, we have the opportunity to really screw it up or to make it cool. I've seen this a hundred times. I work with customers all the time, I go out and talk to people, and I see the wrong data model or I see the right data model. The right data model is winning, and with the wrong data model there's a sad panda. So let's not be sad pandas. This is what we're trying to do, right?
So, when to use Cassandra. This is the nutshell version, and I put it in little tiny text here; it's real tiny, about that big. These are orders, not hands. Here are a few bullets, and I think you've probably heard this enough, but I'm just going to reiterate for the rest of the crew here. We need to be in more than one data center: that's a requirement. Okay, well, if you need multi-master, active-active, what other choices are you going to have? That eliminates a lot of options. So how about the scaling problem?
I used to spend tons and tons of time trying to figure out performance and capacity planning, and what did that mean? Well, it meant that somebody had to give me an accurate number: "This is how many users we're going to have." Okay. As an engineer, what did I do? I just multiplied by ten, and then, all right, I'd call HP or Dell and say, "I want the biggest box you've got." That was capacity planning when you really have no clue, or you're really not wanting to put that into the problem domain.
Cassandra's a good choice because it scales so well when you need it. The other thing that I hear is "I need maximum uptime," and what's funny is that I feel like people have been lying for so long about uptime, because I did it too: maintenance time doesn't count against your uptime. "Oh no, that was planned maintenance; of course it's down." Well, what do you mean? You were down; your customers couldn't get to your website. "Well, yeah, but that was planned." Okay. So what did that mean?
You probably had to do it at like three a.m. on Sunday, and that was exciting. So getting out of that world is awesome, and getting closer to a hundred percent uptime, there you go: Cassandra is a very good choice for that. We also have the problem of money, because, let's just face it, we go to the VP of finance and say, "Yeah, we need an unlimited amount of money for this application, because it's going to be cool." What's that answer going to be?
No. So we have to think about dollars, and when we want to do scaling, or we want to do multiple data centers, how is that going to play out money-wise? I bought a lot of Oracle in my day, and when I used GoldenGate, that just quadrupled the price, and sometimes that meant, "Well, I guess we're just going to have to deal with it staying in one data center." And even then it wasn't that good of a solution. So Cassandra just makes sense economically. And then the final thing: that's why we're here, right?
So here we go. What we're going to do today: we're going to go through four real-world examples. Real world, meaning I get to see these all the time, and we're going to try to get through them. It's not going to be a deep dive into each application; let's face it, we've got an hour here, and I'm going to be crunched to fit that much in. I want to leave some time at the end for questions and answers, because I know you have a lot. Just a quick disclosure: I'm going to be over there
after my talk, so you can come up and talk to me. But keep in mind, everybody has questions; I know that. I'll try to get through them as quickly as possible, but I'm not going to be able to dig through your app with you today. There's plenty of time left in the year, though, so we can figure it out. So we're going to go through each use case: basically, here's what they were trying to do, what they were trying to accomplish, and then how we did it.
"Cop! That's me! You're giving away my company secrets!" No, not at all. I see these all the time, so I'm not giving anybody away, and in the cases where I was close, I probably anonymized it some, and I blended a couple together too. So don't worry, you're not going to pop up here. Maybe a little bit, but not a lot. All these examples are in CQL 3; we don't do any Thrift on these.
CQL 3 makes this a lot easier for us to exploit, so I'm going to express the data models in CQL 3. It's going to be a lot easier, because it is a very elegant way to describe a data model: here's exactly how you're going to store your data. There are some caveats with that, which we'll go through quickly.
This is probably the thing that we all need to just get over, if you don't believe it: there are some terminology changes. As you know from the community, this has really been kind of an interesting transition. It's like moving from one house to another, but trust me, we're getting there together, and I've seen a lot of help on this. If you've never heard of Thrift and you've never used it, then this isn't a problem, but I know where we are:
we have some people in this world and some in that world. Cassandra does do wide rows, meaning that you could have lots and lots of columns, and when I tell people that, they're like, "But this is a fixed schema!" No, really. Jonathan did this really great blog post, and I really suggest everyone read it:
it's about rows and columns and what's really happening behind the scenes. It will help you, and we're going to look at some of these models here. So... I can't put my hand in my pocket. My friend Rebecca said I put my hand in my pocket too much, so I'm not going to do it any more. It's a thing with your pocket.
First data model: the shopping cart. It's not yours; it's somebody else's. I put it on here because customers giving you money is a good reason for uptime, right? That's true: you don't want your shopping cart to go offline. So here's the use case. We want to be able to store it reliably, meaning that when someone says "I want to buy something,"
it goes in there and stays there, and we want to eliminate downtime, because if that's not available, we're not making money. Again, let's go back to what the point is here: we're making money on our website. So, multiple data centers. Really, if you're chasing uptime and you're in one data center, you're faking it; you're not doing it right. The other thing is the Cyber Monday problem, and everyone has a type of problem like that.
So what we're trying to avoid here, and what had been the problem in this one, is that for every minute you're offline, you're losing money, and that's really not okay. That was kind of the driver here. And speed: that was the thing that Amazon did a couple of years back. In my history in web performance, that was one of the things we were chasing: the fact that every millisecond, or 10 milliseconds, of latency on your web app costs you something. Amazon calculated it.
Those of you who have gone through my whiteboard discussions before will find this familiar. So here's my whiteboard. My plan is that each customer is going to have one or more shopping carts. That's right, we're going to have something really cool: they can have two or three or four shopping carts, and we're going to denormalize that data so that we can get it fast. Denormalizing means putting all that data into the same row.
So when I ask for a shopping cart, I get all of the information back. One shopping cart equals one partition, or one row on the storage side. That means we're going to get isolation: row-level isolation. I'm not going to go through row-level isolation here, but if you go back a couple of my webinars, I talk about it some, and that's really how you make sure you have consistent data on a single row. Then each new item is going to be a column. And see, I've got a laser pointer.
So as the cart goes in, I'm going to have this partition key, which is a row key underneath in the storage engine, of a username with a cart ID, and then this wide row is just all these items, and it's going to be who knows how big. But that's cool, right? Because what if somebody wants a hundred and three items? Great. Two hundred? Great. Five...? Well, you've got to work with me; I have another data model to get them to six. But we're going to try to make it so that it's flexible,
so whenever someone puts random amounts of data in there: great, let's do it. And I really love this big screen, because last year I was on a smaller screen and everything I put up was micro. I watched the people in the back almost fall out of their chairs, like, "What is that?" This is huge. So what do I do? I create a couple of tables here. I have my user table up here.
Whoops, back, back, back. There: my user table at the top. It's just our normal user entity, where we have a username and a first name and last name, but part of this is that I have a set. Now, in part two of my webinar series we went into collections, and collections are a really useful tool in CQL, because they give you some dynamic portion, but they keep you on the same row.
You can denormalize data really effectively, but also put that random element on top of it, and so using a set, list, or map, you can do things like that. Now, I use them quite a bit just because I like that flexibility, but I think you'll see that they do have some really interesting uses. So what I'm doing here is just storing a set of shopping carts, so one user can have 3, 10, 100... fine.
They can have that many. So whenever someone logs into our website, again thinking about this from an application level, they get the list of their shopping carts as part of the result set. Then, whenever I want to display those, I already have them, and I can go get the cart that is storing that particular shopping cart's items. So what I'm going to do at this point is that I have a shopping cart table, and there are a couple of things I'm doing here that are kind of interesting.
So again, it's going to be keyed by the username (who are they) and the cart name, and then I'm going to start having these item IDs down here. Those first two things are going to make up my partition key; that's the part right here that I created when I created my primary key, and that's basically one row. So: one user, one cart, one row. Awesome. The item ID makes it so that the partition goes into a wider format, and it'll put it all on one row.
So whenever I put in new items, like up here: here's my one storage row, one partition. Here's me, and I have a shopping cart called "gadgets I want," and then here's the actual item that I want to put in there. And if you notice, I've created a map here at the end for dynamic information. Yes, I said I use maps a lot,
but it's effective. I'm going to use that map part down here to put in random stuff, like related information or volume discounts, just things that are applicable to that particular item. So when someone goes to my website and they click on their shopping cart, I'm going to be able to get all the items out. I have all of this information in one pull, and it will be one row, so it's going to be pretty fast.
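A minimal CQL 3 sketch of what those two tables could look like; the table and column names here are my own illustration, not necessarily what was on the slide:

```sql
-- Users, with a set of cart names handed back at login.
CREATE TABLE users (
    username text PRIMARY KEY,
    first_name text,
    last_name text,
    shopping_carts set<text>
);

-- One username + one cart name = one partition (one storage row);
-- each item_id becomes new columns in that wide row.
CREATE TABLE shopping_cart (
    username text,
    cart_name text,
    item_id int,
    item_name text,
    item_detail map<text, text>,   -- dynamic info: discounts, related items...
    PRIMARY KEY ((username, cart_name), item_id)
);

INSERT INTO shopping_cart (username, cart_name, item_id, item_name, item_detail)
VALUES ('pmcfadin', 'gadgets i want', 1, 'Raspberry Pi',
        {'related': 'SD card', 'volume_discount': '10+'});
```

Pulling the whole cart back is then a single-partition read: `SELECT * FROM shopping_cart WHERE username = 'pmcfadin' AND cart_name = 'gadgets i want';`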
So let's go back to why our data access is fast in Cassandra: a row lookup for speed, and then we do that slice, or in this case we're looking at a full partition, for efficiency. This keeps us all on one node, which is very quick; we're talking about milliseconds of access time here. So this is great; we're satisfying a couple of things. It's going to be fast for our users, so they're not going to walk away. The other thing is that this is going to work really well in a multi-data-center model. It's not spread over multiple rows,
it's on one row, and Cassandra just does multi-data-center replication perfectly. So that's my shopping cart data model. See, I'm doing fine with time here. Next: user activity tracking. Now, I'm not going to go into a full-blown data science thing here; we've got people who do that all the time. What I want to do is get into more of the where-the-rubber-meets-the-road part, the actual use case where people are using that data.
If you go to Strata, you'll hear about all these cool things people are doing: "Yeah, we've got this guy with a PhD in statistics who figured out that if this one user clicks on something, and then they click on that, they'll go buy three more of these things," and that was awesome. But that's not what we're trying to accomplish. What we're trying to do is find action in the data that we have.
We can watch our users, and we want to react to that in real time, as users are doing things on our website. We've all seen this before; it's a little creepy when it happens. You go on a website, you're on Amazon or Google or something like that, and you click on something, and all of a sudden things start popping up that seem a little related, and you're like, "I do want that!"
Well, that's part of what's going on behind the scenes, but you need to be able to react to it in real time. If they came back an hour later and said, "Wait, I think you might have wanted some nails": too late, I'm gone, dude. So that's where the real-time component comes in.
So we have all these application pods that need to be supported. A pod, in an application sense, might be in different data centers, or maybe different racks, but they need to be spread out all over the place. And it could be that we have different applications themselves watching activities and cross-talking to each other; there are plenty of companies that have multiple properties underneath one umbrella, so you get the crosstalk between those. And the scale: I think that's always going to be a good reason.
The bad part of this is that the company in question was having a hard time, because they were losing those moments. They knew that they had actionable items going through, and they were missing them, and there's nothing worse than leaving money on the table. So in that case they needed to be quick on this, and Hadoop just takes too long.
Hadoop can create models and do things very well, but in this case they needed to act on it; they needed to be right on it, in milliseconds, and they needed to be ready to go. So here's our whiteboard again; I love whiteboarding. Here's kind of a high-level diagram of what it would be. Here's our dude; he's on our website, and as he's walking through our website we're making decisions, and we have this interaction decision algorithm.
A
That's
algorithm,
not
algebra,
so
that
that's
going
to
be
making
those
decisions,
but
they
have
to
have
input.
That's
the
hadoop
or
data
science
is
really
good
for
feeding
up
on
figuring
out
those
models.
But
then
it
comes
down
to
you
need
to
have
data
to
put
into
the
function
machine.
You
know
input
and
output
and
that's
a
little
harder
sometimes.
So
what
are
you
going
to
do
with
that
data?
And
sometimes
it's
not
just
one
thing.
It
may
be
a
course
of
action,
maybe
four
or
five
things
so
I'm
getting
feedback
from
my
website.
A
All
the
time
now
keep
in
mind
we're
trying
to
do
this
at
speed
velocity.
How
much
are
we
trying
to
do
this?
Okay,
five,
a
lot
of
people
on
cyber
monday,
I'll
click
on
on
my
website
and
I'm
trying
to
get
page
lift
then
we're
probably
talking
thousands
or
hundreds
of
thousands
of
clicks
per
second.
So
we
need
to
be
ready
for
that.
A
So
every
interaction
point
so
as
they
go
through
the
system
is
being
stored
in
a
table,
and
that's
that's
where
all
that
speed
is
going
to
come
from
the
long
term,
interaction
we're
going
to
break
that
out
into
a
separate
table.
Now
there
here's
a
concept
and
I've
been
if
you've
had
me
in
your
in
your
office,
whiteboarding
and
footing.
You've
probably
heard
me
say
this:
a
hundred
times
do
not
be
afraid
to
write
to
multiple
tables
because
Sandra
loves
rights.
So
if
you
got
to
do
five,
ten
hunter
tables,
/
interaction,
awesome
it'll!
A
Do
it
no
problem
got
that
and
what
does
that
mean?
That
means
you're
gonna,
be
ready
for
the
read
or
whenever
you
need
that
data
so
do
that.
So,
in
this
case,
I
have
this
requirement.
I
want
a
really
short
table.
I'm
gonna,
I'm
gonna,
show
you
how
I
do
this,
but
I'm
gonna
have
one
table
for
that,
and
I'm
gonna
have
a
longer
table
for
that.
Longer.
Interaction
like
I
want
to
store
it
out
later,
and
that
gives
me
some
options.
A
Then
I'm
going
to
use
a
dupe
on
that
longer
table.
But
that
means
that
the
old
did,
the
data-
that's
hot,
that's
fast,
I'm,
just
gonna
dump
it
yeah,
that's
right:
I'm
gonna,
get
rid
of
data
and
that's
kind
of
a
cardinal
sin
in
data
science,
but
I'm
gonna
do
it
so
the
other
thing
is
I
want
to
use
a
reverse
series,
and
you
probably
see
me
do
this
a
lot
and
I'm
going
to
show
you
why
this
really
makes
it's
about
speed.
We
want
to
be
as
fast
as
possible
on
the
database.
A
So
here's
my
data
model,
the
data
models
I'm
going
to
have
this.
These
two
user
activity
tables-
one
of
them
is
hot,
and
one
of
them
is
more
of
a
long
tail
table
and
really
the
biggest
difference
is
how
I'm
storing
that
data
in
them.
So
the
first
one
all
right
class.
Remember
this
one!
Don't
we
and
reverse
order
is
kind
of
my
own.
My
secret
weapons,
not
secret.
It's
just
a
great
weapon.
Reverse
ordering
again
means
that
as
I'm
storing
data.
So
I
have
my
user
activity
table.
here's my person, the time that it happened, and then some activity codes with some details. But look at my primary key: the row partition is going to be the username, and then my columns are all going to hang off the interaction time. So that means I'm storing this really dynamic row, data columns going like crazy, all with timestamps on them, until it falls off the stage like I'm about to. But I don't want to look up data way over there;
A
I
want
to
look
at
the
last
thing
that
happened
so
I'm
gonna,
reverse
that
meaning
that
I'm
gonna
have
all
my
time.
Series
data
reversed,
naturally,
for
me
and
I
see
some
people
not
in
their
head.
Cuz
they've
seen
me
do
this
before
it's
cool
right,
but
that
means
when
I
say
I
want
the
last
ten
things
that
happened:
you're
looking
at
the
last
ten
things
on
that
storage
room,
you're,
not
iterating
over
a
hundred
thousand
items
or
10,000
items
to
go
to
the
end.
A
That's
just
not
efficient,
so
we're
just
going
to
store
it
in
a
reverse
format.
On
the
longer
tail
table,
I'm
going
to
put
the
interaction
date.
So
now
I'm
going
to
partition
the
Rose
themselves,
so
every
day
has
all
the
interactions
for
users.
This
is
going
to
make
it
easier
for
me
to
later,
when
I
want
to
run
hadoo,
where
I
can
create
a
Hadoop
job
that
will
iterate
over
all
those
rows
per
day.
If I wanted to do a range of days, it's just an easier query to do; it's formatted in a better way. And I'm not reversing this one; I'm just going to keep it in its natural order, which is fine, because with Pig or Hive it doesn't really matter: I'm going to be iterating over a lot of data anyway. So let's just keep it like this. So what am I doing whenever I'm inserting my data into user activity?
You will notice right here that I'm using this TTL: I'm going to expire that data after 30 days. I just want to have 30 days' worth of data in that one table. The other table doesn't have that; it's just going to keep data forever. That's cool.
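The insert on the hot table might look something like this (column and table names are my own illustration; the TTL is in seconds, so 30 days is 60 × 60 × 24 × 30 = 2592000):

```sql
-- Writes to the hot table expire automatically after 30 days.
INSERT INTO user_activity (username, interaction_time, activity_code, detail)
VALUES ('pmcfadin', now(), '100', 'Logged in')
USING TTL 2592000;
```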
Now, a bonus. It's not up here, but I'm going to tell you another thing you could do: your column family, you can name it with a different name. You could put the month in there, or a quarter, or something like that.
So if you actually did want to keep your column families separated, or just wanted to use that as a way to separate your data, you can do that; that's just an option. But really, this 30-day expiring: with TTLs, you get a delete for free, pretty much. I mean, when we run a delete on Oracle, or even MySQL: would you run a delete in the middle of a production day if you had to, say, delete a million rows? Rhetorical: no.
The first thing you get is a phone call from the DBAs telling you you're crazy, because it's going to create a lot of redo logs. This is a great solution to that. How many batch jobs have we all written called "cleaner"? Yeah, so this is going to really help out, because now your data is going to be gone after 30 days. Now, keep in mind, there's no kill switch; this is going to happen.
It's really hard to undo, so just make sure that's what you want in your data model. So now, how am I going to use my data once I have this? This is where that reverse ordering really helps things out. Up here I have my SELECT from user activity with LIMIT 5; by doing that, I'm only getting the first five items: great, very efficient. And what I got back was this whole story right here: okay, so I logged in, I went into my gadgets cart, I deleted it,
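Against a table clustered in descending time order, that query might look like this (table and column names are my own sketch):

```sql
-- Because the clustering order is DESC, the "first five" rows
-- in the partition are the five most recent interactions.
SELECT interaction_time, activity_code, detail
FROM user_activity
WHERE username = 'pmcfadin'
LIMIT 5;
```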
then I created one, and I went to jewelry. Well, oh, the decision engine's on, man: okay, it's time. Let's put some flowers in front of this guy; I don't know, maybe a trip to Cabo. Where are you, Chad? All right. So you can really use that information effectively. This is kind of a funny example, but you can think of some really creepy things you could do with this, which is cool, and that's actually what's being done in the real world.
All right, how are we doing on time? I want to make sure we have plenty of time for questions. All right, we're good. Log collection. This is where I came to Cassandra: I had a lot of logs, and I think a lot of people find themselves here because of the way that Cassandra stores data. So there are two problems, right? Log collection means that something is spitting out a lot of logs, and it's never just one at a time, in small bits; it's tons, it comes like a torrent.
There are amazing amounts of data out there being created now by machines, so we have to have something that can keep up with it. The high-speed logging part was a requirement in this particular application. We also need to have our Cassandra nodes near where the logs are being generated: if we're in multiple data centers with our applications, we want to make sure the Cassandra nodes are right there too. We don't want the applications writing across a data-center link and racking up a bunch of bandwidth bills
with all this data just getting streamed all over the place. I want Cassandra to manage the replication in a more efficient way, and make sure that the applications just connect to those local Cassandra nodes. That was one of the big requirements, and then I also have this requirement where I need to kind of pre-dice my data.
This is an even better example of what I was talking about earlier, about writing to a lot of tables; I've advocated this so many times, and this really drives it home. That was one of the requirements, because we have dashboards. What do dashboards do, other than make C-level people happy? I mean, if there are a lot of things on there: "Ooh, it's pretty." But they do
have some use. I've been to the Etsy office, and they have this wall of plasma screens, and it's just a bunch of graphs on there, and people look at them; sometimes there's meaning in it. But that kind of stuff requires speed, because a graph from yesterday is pretty boring, and it really isn't going to give you any idea of what's going on today.
The bad side of this is the scale. This actually happened to me: I just couldn't get the scale out of my relational database for some of the logging that needed to be done. If I denormalized my data for one thing, it hurt the ingestion speed; I couldn't index anything; it just turned into a big problem. And, as so many times before, my Oracle box was scaled right up until I ran out of money, and when you're storing logs it's really hard to get a lot of money
for that, because it's hard to say, "Well, this is worth a million dollars of Exadata." No, it isn't. So that was one of the really big problems to get around. And batch analysis is just too late. Batch has a lot of uses, don't get me wrong; I love doing data science on a lot of data. But when you need a dashboard and you fire up a Hive job, it's over. You've lost everyone's attention span, and the people looking at dashboards have a short one.
So that's one use case from a single data point, and then, with the latest successes, I'm going to throw that up into my fancy graphs over here, for eye candy later. Awesome. So I'm really taking that one data point and doing multiple things with it. Really cool; it's not quite recycling. This data model is not that hard; it's just kind of a concept you have to think through. You've got to really put your mind to it, and you have to look at it from where it's going to be consumed and ingested.
So I have my three tables here. The log lookup table is pretty boring, but that's okay, because I'm going to have a lot of other things going on. The log lookup is just: I have my source and my date, and each row is going to be one minute of data, because, let's face it, those logs can come in milliseconds. So we have the date up to the minute.
By using this compound key, we have a source and a date-to-the-minute as the row key, and then just timestamps, and that's going to create however many columns we need and just keep going, and going, and going. The other thing I'm going to do here is store the raw log, and I'm going to gzip it, and that's a really good idea.
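The lookup table might be declared something like this in CQL 3 (names are my own sketch of what's described here):

```sql
-- One partition per source per minute; each log line is one column,
-- keyed by its full timestamp, with the raw (gzipped) log as a blob.
CREATE TABLE log_lookup (
    source text,
    date_to_minute text,       -- e.g. '201306111042'
    log_time timestamp,
    raw_log blob,              -- gzip-compressed log line
    PRIMARY KEY ((source, date_to_minute), log_time)
);
```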
I've seen this have a good effect in a lot of places: if you just have some JSON, or XML (God help you), it takes just a few microseconds in Java to run it through a compressor, and it saves so much on wire speed. If you're putting a 2K block of text over the wire at, you know, 10,000 per second, it's going to add up, but you can probably crunch that down by fifty percent or more, and all you're saving is wire time. That's good!
And let's face it, when you do a lookup like this, you can then reverse it and say, "Give me all the logs from this source, in this time period," then pull it back and deserialize and decompress it at that point. That's an application concern. It works pretty well, but our other two tables here are much cooler.
Now we're actually going to get into some interesting uses of counters. Counters have good uses and bad uses; in this case I'm going to say this is a good use case, because what we're trying to do is just create numbers that will be graphed for people looking for eye candy. So in the login successes table we're going to have this source and date-up-to-the-minute again.
We're going to do this with the same type of key, except in this case we're not compounding it, so we're just going to be marking these sources and the date up to the minute, and then marking each one of them, so we get counts. That means, from a counter standpoint (and I'll show you the code for how to do it), it's going to be incrementing a counter in that minute. So if I have 10 things that happen in that minute, the count will be ten; a hundred, 100.
That sort of thing. That way I can ask how many things happened over these many minutes, and I'll get multiple counts. I'm also, because I love it, going to reverse this, so I can ask, "What are the last 10 minutes of things that happened?" So I've created two tables; you'll notice I've created a login success table and a login failure table. I'm really busting this out; I'm dicing.
So this is what I'm going to do with that: actually creating some data. We have this one simple SELECT command, and this is really the beauty of what we've done: when we ingested that data and diced it up, it made it so that we now have the opportunity to write this really stupid-simple SELECT, and it's not going to take very long to run.
A
That's going to be a few milliseconds to run, and what I get out of it is — I say, give me the last 20 minutes, because I know everything's in there per minute. So I get this nice graph of data, and if I hit refresh I'm going to get up-to-the-minute on everything. Now, if I want to change that to something like milliseconds or whatever, that's fine, and the counters are going to be counting along in the background. So notice that there is this little window here.
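To make that concrete, here's roughly what the increment and the graph query could look like, assuming a hypothetical counter table keyed by source with a date-to-the-minute clustering column (the names and values are mine, not from the slides):

```sql
-- Each login event increments the counter for its minute bucket.
UPDATE login_success
   SET hits = hits + 1
 WHERE source = '10.1.1.5'
   AND date_to_minute = '2013-06-11 10:32:00';

-- The dashboard query: last 20 minutes for one source --
-- a single-partition slice that runs in milliseconds.
SELECT date_to_minute, hits
  FROM login_success
 WHERE source = '10.1.1.5'
   AND date_to_minute > '2013-06-11 10:12:00';
```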
A
I'm going to put this in as an example, because this is where log data sometimes is really important. And I know, Eddie — I know you're in here, and Splunk is awesome — but this is a cool use case too. Okay, you have these two minutes where something bad happened with your logins. Well, if you only knew about that tomorrow, because you ran it in batch, then you're just going to be scratching your heads. So this is where this kind of usage in these cases really makes sense. It's like, whoa, something bad happened.
A
It is really a use case that I've found in a lot of places. So here's the problem that we're dealing with: we want to store the versions of a form indefinitely, and we just want a very efficient way of doing this, so that we have version 1, 2, 3, 4 — however many versions there are. We want to scale to any number of users, and that's, you know, always a requirement.
A
We want unlimited scaling, because we're going to have a million users, just like Facebook. And we want to be able to commit and roll back our data: if I make a mistake on version 2, I'm going to go back to version 1, and if I say I want version 2 to be my gold version, it's going to turn into the right form. Cool.
A
So I've tried to do this before in a relational database, and it's not easy. I mean, it's a lot of tables that have to be joined, so it wasn't a very easy model there — it's just not an easy model anywhere, really. But in this case I think this will work really well with the way that rows and columns work in Cassandra.
A
We also have this need where it needs to be all over the place. We have our local data center, which is where most of our data is, but we also have cloud components — some of it's in Amazon, or some of it's in Rackspace — so it needs to live in both places, and that's really difficult, especially if you're trying to create a homogeneous persistence layer where it's maybe the same technology.
A
So here's how it's going to work. Our whiteboard session says we're going to have this partition key, which is going to be a username and a form ID. And I'm going to store these blocks of form attributes each time they make a version change, so I'm just going to keep growing that out a random number of times. If we have somebody who's a real busybody on a weekend and they create 10 versions —
A
Fine. I'm not going to have to deal with that in any other way; my data model will just maintain it. We also want to separate the tables: we have the data that they're working on, and we want to have some stuff that's live on the production site, so we're going to have these two things going on. So that's an easy requirement. The exclusive lock, now — I hate seeing the word lock. It's not so much an exclusive lock for a computer, which is a different problem.
A
It's more just making sure someone doesn't stomp on somebody else. Now, it's funny, because I've had this discussion with somebody else about: why don't we just teach all our users Git? Okay, yeah — so we're going to get a lot of admins in, like, HR to start using Git? No, ain't gonna happen. So let's think about this a little more in their domain. How about a web page? Whenever you go to look at a form, you can see who's currently editing it. There you go.
A
That's good enough. I mean, can you imagine somebody doing a cherry-pick on a form in HR? No, it's not going to happen. All right, so here's our data model. We're going to have a working version table, which is more or less where all the activity is going to be happening. We have the username and the form ID, which are going to make up the row key. We have a version number, which is going to make all those rows.
A
We're going to have a locked-by column, and then one of my favorite collection items: the map. The map back here is going to have all those different form attributes that I'm going to be using for this particular form ID. So that means that if I change any of this, I'm going to increment the number in my application, and it's going to store the form in its entirety, so that each version can be different — and that gives us a lot of flexibility. And why not?
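A minimal sketch of that working-version table — the table and column names are my own choosing, and the slides may differ:

```sql
-- Working-version table: one partition per (username, form_id),
-- one row per version, the whole form stored as a map.
CREATE TABLE working_version (
    username   text,
    form_id    int,
    version    int,
    locked_by  text,
    form_attrs map<text, text>,
    PRIMARY KEY ((username, form_id), version)
) WITH CLUSTERING ORDER BY (version DESC);
```

The DESC clustering order is the "reverse it" trick: the newest version always sits at the front of the row.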
A
Because I like to do it every time anyway — it's kind of become my boring trick, but it's very effective: reversing that, so that the last form I was using is the first thing on the stack, more or less. So whenever I want to get the latest version, I know that it's right there; I don't have to iterate over a bunch of stuff. That's just, in general, a very efficient pattern with Cassandra. So we have our first version that goes into the system; it's going to look like this.
A
So I have my own little coding scheme here, where I'm going to have a text box called first name; there's a display name, and here's what's going to happen with the HTML. So this is my first version — when I do that, I create that, and that's version 1. So I'm going to lock this and edit it again, which just means I'm going to put a username inside that table. So whenever I go to edit, one insert puts the username in there.
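As a sketch — assuming a working-version table with a locked_by column and (username, form_id) as the partition key; the names and the form ID here are my own illustration:

```sql
-- Taking the "lock": one insert marks the version as being edited.
UPDATE working_version
   SET locked_by = 'pmcfadin'
 WHERE username = 'pmcfadin' AND form_id = 1138 AND version = 1;

-- Any other user checks before editing: a non-null locked_by
-- means someone else is already in there.
SELECT locked_by
  FROM working_version
 WHERE username = 'pmcfadin' AND form_id = 1138 AND version = 1;
```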
A
So if other users come in to use it, you can read that and say: is it blank, or is there somebody there? Oh, somebody's there — okay, I'm going to go back and display to the user: user pmcfadin is using that form right now; can't touch it. That's a lot easier than Git, and I think for our use case it'll be fine. Now, if you want to modify that a little, that's completely possible — I'd love to talk about some of these. All of these models probably have variations.
A
That I would love to talk about. These are meant to be somewhat simplistic, so we can get through it in an hour, but there are lots of variations in here, and I know you probably have some really cool ideas here. So we have this version number that is locked. Whenever I finish it and I say, here, I'm done — you know, I got my form finished up — I'm just going to put a null in there when I increment to the next version, and that pretty much releases the lock.
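Releasing the lock is then just writing the next version and nulling the column on the old one — again a sketch against a hypothetical working_version table (my naming, my sample values):

```sql
-- Done editing: write the next version with no lock held...
INSERT INTO working_version (username, form_id, version, locked_by, form_attrs)
VALUES ('pmcfadin', 1138, 2, null, {'fname': 'text|First Name'});

-- ...and clear the lock on the version we were holding.
UPDATE working_version
   SET locked_by = null
 WHERE username = 'pmcfadin' AND form_id = 1138 AND version = 1;
```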
A
I know this is a need, so I figured it out — and this is a community, so I'm going to try to help you figure this out. You can do this; this is very doable. So that means that yours is next, and I want you to think about what you're doing right now. I know a lot of you are working on projects right now, and things that you're doing currently, because I've talked to you. But, you know, try out a few things — and really, iteration is the key here.
A
There are a lot of different ways to do these, so try different ways to do them and see how they work. And if it doesn't work one way and you really get frustrated, find us — we're out there; the community is out there. Engage an expert. You know, I'm not that hidden — if you've noticed, I'm kind of obvious — so come out and find me. And, you know, Twitter: if you follow me on Twitter, I talk about data modeling and Cassandra topics all the time, and you can use it like an RSS feed.
A
If you want. I've had a lot of people reach out and contact me about data modeling, and as much as I can do in 140 characters, I can help you — but sometimes it turns into a larger discussion. The point I'm trying to make is: engage people. I really hate to hear how people tried a project with Cassandra and failed because they couldn't get their data model to work, when they didn't even ask. So ask — people are here to help. I mean, look at all the green shirts we have in this room.
A
Raise your hand if you've got a green shirt on. Oh yeah — okay, so that's good for you. All right, that is all I had; I think we can do some questions. So here's the rules on questions: I'm not going to rewrite your app, but if you have general questions about why did you do this, or how did this work, you can go ahead.
A
G
So when you do multiple writes, you're replicating your data to multiple locations, and obviously there are bugs in your code and you're going to make mistakes — you're going to write it incorrectly to one location, maybe to another. Are there processes or tools or concepts that have emerged to help make sure that you haven't hosed yourself as you write that data to multiple locations? Because writing bug-free code is not an option.
A
All right. So if you're in a position where you're writing code and deploying new code into your cluster, and you have the potential of writing bad data — essentially that's what you're talking about, right? Yeah, that is probably reason number one why you do a backup on Cassandra: not because you're going to lose your data, but because somebody hosed it for you. And that happens all the time.
A
So in this case, snapshot is your friend. I always tell people: before you do a code push, do a snapshot. And a snapshot is literally that — it's a point in time. So you do a snapshot right before the code push, and then all of a sudden your code deletes a bunch of email addresses — right, Pankaj? You remember that, don't you? All right, so I'm going to point out —
A
Solr question. So Solr brings out a whole new thing, and really, you probably want to think more of your data in terms of your Solr schema than Cassandra, if you're using it for just Solr. If there's a blend, then you have to think about that. But keep in mind that if you're overlaying Solr on top of Cassandra, the Solr schema is fixed — that's what you have. You can put dynamic fields in, but really it is fixing the schema in some way. So there's a little different way of doing that.
A
I knew I was going to get that one — oh, I broke the rules. Now, the read-before-write rule is really about: you don't want to do a lot of those, like if you're doing read-write, read-write, read-write. In this case, I'm doing one read to see if there's anybody there, and it's user-driven. So if I click on a link, I get a read, and then I'm going to make a fork in the action: I'm either going to move to the next form, or I'm going to give them a different form,
A
saying it's locked right now. So in the application, I'm not reading and writing one after the other; I'm doing a read, and then, if there's something going on in the application, that will result in a write. So I guess my cheesy way of saying it is: I'm not doing a read-before-write in the way we wouldn't want you to do it, which is one right after the other, in the same block of code.
H
A
You're talking about secondary indexes — why didn't I use them? All right. So in my second webinar, I think it was, I talked a little bit about secondary indexes. I feel like secondary indexes are a crutch for relational folks, because they think it's speed. Secondary indexes are built for convenience and not for speed, and I try to avoid them in general.
A
Just because there's a lot of confusion. But if there was a case where you did need one, I would call it out; in this case I didn't have any. And I think you'll find that if you do your data model correctly, from an application standpoint, the need for secondary indexes is very minimal. There's always a case for something, right, but I feel that it's a very small case.
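To illustrate the convenience-versus-speed trade-off with a hedged example (table and column names are my own, not from the talk): a secondary index is one line, but a purpose-built query table usually serves the read path better, because the lookup becomes a single-partition read.

```sql
-- A base table, for illustration.
CREATE TABLE users (
    username text PRIMARY KEY,
    city     text
);

-- Convenience: a secondary index on an existing column.
CREATE INDEX ON users (city);

-- Speed: denormalize into a table keyed for the query itself.
CREATE TABLE users_by_city (
    city     text,
    username text,
    PRIMARY KEY (city, username)
);
```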
H
A
That's what I hear too — I was just reading that blog post. It's not like that, and I think there are just a lot of people that have been using Thrift, and I was one of them. It took a while for me to get my head shifted around and understand it. But really, in that blog post — and if you read my comment on it — it's about understanding what's going on in the storage engine. What's really going on here? CQL is an abstraction.
A
It's a way to make it look better for the user and for the application. The storage engine hasn't changed that much — there are some different things going on, but not that much. It's still doing the right thing, but CQL is giving you the correct path to get to that right thing. That's why I advocate it all the time: you're going to have a better time.
A
F
A
No one's going to get that one either. He never put a lock up — that's why I said, oh no, I put a lock up for you. You always think you'll get the lock, guys. You know, wait a minute. So, the first question: collections do have practical limits, and the reason there's a practical limit is because you have to deserialize them, so that can cause a performance hit.
A
I talked about that in one of my other webinars — it's really about where you're at with your performance. If you're looking for the most performance, the serialization is not something to take lightly. So if you put, say, a hundred thousand things in there, or ten things in there, there's going to be a different cost, right? I believe there's actually a hard limit of 65,000 right now — I don't know if there's anybody from Cassandra core who can verify that.
A
I think that's what it was — 65,000 right now. But then again, you probably don't want to put that many in there anyway. It's not a substitute for a good data model; it's an augmentation. All right, second question: how do I manage the lock? I don't have a good answer for that — I mean, there are collisions; it can happen. At that point, if you're really that worried about those collisions, what I would probably say is that you can create a different model where you put a timestamp along with the user.
A
You create a locking table, and then, that way, if you do have two people in there at the same time, you're going to see that there are two. So you just augment that row, or that column value, so that you have a little more information in there. A timestamp is really good for that, so that if you do have two in there, you can flag it as a problem. It's all about the knowledge that there's an issue, yeah.
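A sketch of that variation — a separate locking table where each would-be editor records a timestamp, so concurrent editors become visible instead of silent (all names and values here are my own illustration):

```sql
-- One row per editor attempt; two rows at once = a collision.
CREATE TABLE form_locks (
    form_id   int,
    locked_by text,
    locked_at timestamp,
    PRIMARY KEY (form_id, locked_by)
);

INSERT INTO form_locks (form_id, locked_by, locked_at)
VALUES (1138, 'pmcfadin', '2013-06-11 10:32:00');

-- More than one row back means two editors grabbed it;
-- the timestamps tell you who was first.
SELECT locked_by, locked_at FROM form_locks WHERE form_id = 1138;
```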
A
I
So in your user activity table, you had four or five columns plus a clustering column, so every insert will write essentially five columns to the storage engine. Is there a concern of the table getting far too wide — above the typical ten thousand or twenty thousand columns that's considered, you know, safe for Cassandra, which —
A
Which table was that? Oh, you mean the initial one. Well, that was date-to-the-minute, so it only stored up to a minute of that data — there would only be one minute's worth of data in that one row. Now, if you were collecting at, say, millisecond resolution, that's going to be a lot. Yeah, I mean, in general —
A
What I look at, for the width of the row, is more about the volume of data that's in there and not the count, because what you're trying to avoid in this case is an incremental compaction — so we're really getting into the weeds now. The incremental compaction that I try to avoid is, say, that 64 MB limit on a memtable. Now, you can make that bigger.
A
If you wanted to, to keep that from happening — but in general, that's what I look at. Now, if that's going to be a problem in this case, let's change it to, say, every 100-millisecond bucket, or something like that. You know, you tune that. There are some teams, probably in here, that I've worked with where we've done that — where we've modulated that row key so that we get a different column count, based on what you're trying to accomplish. If you don't mind having an incremental compaction, because it's not really an impact on your system —
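Modulating the row key usually just means folding a time bucket into the partition key — a hedged sketch, with names of my own choosing:

```sql
-- Same events, but the partition key now includes a bucket,
-- so no single row grows without bound. Shrink the bucket
-- (minute -> second -> 100 ms) to trade row width for row count.
CREATE TABLE user_activity (
    username    text,
    time_bucket text,       -- e.g. '2013-06-11 10:32'
    event_time  timeuuid,
    event       text,
    PRIMARY KEY ((username, time_bucket), event_time)
) WITH CLUSTERING ORDER BY (event_time DESC);
```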
A
Fine — let's have three hundred thousand, five hundred thousand, two million columns. But if that's more of a concern because of performance later — you don't have the I/O to take it — then, okay, let's figure this out. That's the cycle: try things, then measure, then do it again. This is how we do our development.
D
Again, on that same table: there was a clause in there — I think it was WITH CLUSTERING ORDER BY, right — and you had the timestamp in there, right? I clearly don't understand how that's working, because it seems like you'd have all of the same timestamps together, and if you're looking it up by user and timestamp, you'd want it clustered by user and timestamp.
D
A
Okay, we'll beat it out of you as a community — we do that as a service, right? No, it's fine, because that's where you think about what your application needs. I really wanted that in some sort of laid-out format, and this is one of the great use cases for Cassandra: temporal data. It's probably the go-to move for a lot of people. Unfortunately, I hear a lot of people say, well, it's only good for that — it isn't.
A
But, you know, thinking about it — wow, that's perfect, because now I have this one row that I'm looking up for a source, and I need that data over a certain range, like a time range. It makes perfect sense, because what do you always do with time data? You say: give me the last 10 minutes, or give me three days' worth, or something like that. So yeah, putting those into individual columns does one seek on the disk. That's what we want — very fast.
A
You can, and that's where I always say: compress if you can. I have worked with people that are doing that — large log files or something like that, even images; there are plenty of use cases. The thing you have to consider whenever you put something large in a column value is the wire cost — you know, how long it takes to get something over the wire.
A
Maybe you have a larger SLA, so that's really the biggest consideration — and then again, how many columns you have. You know, you may get to a physically large column, like a lot of megabytes, that will generate a different type of compaction, but it's more tuning, and that may not be a problem either, especially if it's somewhat like cold storage, so to speak. We always look, again, back to our application.
A
What is the important thing here? If your SLA is the most important thing — like, I have to have every read come in at 20 milliseconds, 30 milliseconds — okay, let's tune for that. If yours is: I need to be able to write a hundred thousand per second of this, but I don't care how fast I read it out — completely different story. That's why we're thinking about our application first and not our data. The relational way was: I've got a lot of data; how can I dice this up?
A
Five bucks for answering that — thanks. Yeah, that'll be in a couple weeks. So, DataStax Enterprise, which is where the Solr integration comes from — this has been one thing that our DataStax Enterprise team has been working on very hard. Really, what we're talking about is Cassandra 1.2, which has a lot of great features: virtual nodes, and the CQL native transport.
A
Those are two very good reasons to be on it, but DataStax Enterprise is still on version 1.1. To get to version 1.2, we had to make Solr and Hive be okay with using things like virtual nodes, and that's pretty much done. So that'll be out pretty soon — you'll probably have it before you know it, and I'm happy to say that. So, DSE team, you rock; thank you very much. So, luckily — are there any more questions? Uh-oh, Patrick doesn't get asked any questions? All right, fine.
C
Thanks. So I had a question about size-constrained columns, or size-constrained rows. I know it's not practical to constrain your rows based on a set number of columns, because then you have to do the full read for each row every single time you do it, right? But could you have a use case where you set a counter column for each row, and then check that counter column only, and sort of go through and do a hack constraint that way? I think —
A
You just answered it yourself — you said the hack word. No, I wouldn't do it that way, because — all right, there's your read-before-write: you'd have to read the column value and then do a write, or even do another read with that. So, you know, you're asking: if the column count is, say, 10,000 or a million or something like that, what would be the constraint? What are you worried about in that case — having too many columns? And you're keeping a count just —
A
So it's kind of a more advanced topic. There are more interesting ways to modulate how many columns you have by using the row key, and I would prefer to explore that until you run out of options, before you do anything with the word hack in it. You know, if you have the word hack in your solution, you're probably going to —
A
That's going to come up at some point, because someone's going to say: you can't have a hack. So yeah, I would look at why you're even worried about that, address that problem from the beginning, and just eliminate it — because if you're creating a counter column for every time you put in a new column, then I really feel like that's an anti-pattern. And that's, unfortunately, the answer. Thanks.
C
A
No — creative, but no. Yeah, I don't think you could do that. Wow, I've never done that. No, you cannot, because of the way that the collection is created; it would not work. But that's an interesting idea — why would you want to do that?
B
A
B
A
If you're creating an immutable set, then just create the right data model in the first place. Okay — I mean, I'm not trying to be harsh. I'm just saying that's one of those things where we'd sit there and talk about it, because if you really say this is immutable, or I have fixed fields and you don't need a dynamic data structure — sure, yeah, sure. Okay, all right, I'll be over here.