Description
You know you need Cassandra for its uptime and scaling, but what about that data model? Let's bridge that gap and get you building your game-changing app. We'll break down topics like storing objects and indexing for fast retrieval. You will see that by understanding a few things about Cassandra internals, you can put your data model in the spotlight. The goal of this talk is to get you comfortable working with data in Cassandra throughout the application lifecycle. What are you waiting for? The cameras are waiting!
A: Welcome, everybody, to this edition of the Cassandra Community Webinar series. I am delighted today to introduce Patrick McFadin. He is the chief evangelist for the Apache Cassandra community here at DataStax, and this is the third in his data modeling series. We hear from so many people in the community that getting your head around data modeling is really a prerequisite to being successful with your application on Cassandra.
A: So in this webinar Patrick will recap some of the fundamentals around data modeling, and by the end of the webinar you will really be in good shape as you move forward. As always with our community webinars, we will be reserving Q&A until the end. Please use the WebEx Q&A panel and post your questions there. We will reserve time; I will read through them, and we will get through as many questions as we can in the time allotted. So without further ado, I will hand everything over to Patrick. Welcome, Patrick.
B: Got it, okay. So this is the third part of the series; let's get into where we were. The first part of the series was "The Data Model Is Dead, Long Live the Data Model," and it was really about bridging that knowledge gap between the relational database world and Cassandra. It was all about getting into context and trying to help you get through some of those concepts. That's always a requirement, because I talk to people all the time: we've learned relational databases — we as the community.
B: We as technical people have learned relational databases for years and years, and there's just some ingrained knowledge. How can we use that knowledge — instead of replacing it, augment it — and get to this newer type of technology that will give us things that we want, that maybe a relational database won't? And then the second one was "Become a Super Modeler." In that one I really got into the hardcore part of data modeling, and it's all about CQL, the Cassandra Query Language.
B: I really wanted to use that as just the CQL discussion, and give you some really good tools, some of the more advanced features, some good topics around that. So what we're going to do this time, now that we've got our background and our foundation set, is I'm going to walk you through some real-world examples. I work pretty exclusively with customers, and I get to see what people want to do, and I've learned a lot about some of the more common use cases. And hey, I'm an engineer — I learn from examples pretty easily. I think that's a really great way to learn.
B: So data modeling in Cassandra, in my opinion, is the path to a good time with Cassandra. Whenever Cassandra data models are created, they're all about your application. I do not care as much about the data as I do about your application, and I say that a lot, because the application really is the important thing: you're not building data, you're building an application that will store data. You need a persistence tier that will work the way you want it to work. So that's why I always take Cassandra data modeling from the application tier.
B: What are we doing there? Cassandra doesn't work for every single use case, and that's okay, because I don't think we're in a world now where we should say there's only one solution. I've said this in my previous webinars: polyglot persistence. It is true — you've got to have that. We've got different ways to store data now. If you're going to do search, you're probably not going to use something like Membase or memcached, because those are key-value stores; you'd probably use something like Solr, which does search really well.
B: So let's talk about this a bit — when to use Cassandra. I go to its core strength: it is very good at replicating data. Cassandra is very good at making sure your data is available, across data centers or inside one data center. It's hard to kill Cassandra. We've lived in a world of single points of failure, and everyone preaches the good word: don't use a single point of failure.
B: Well, alright, we don't want to use that — use Cassandra. And when you need to scale: scaling is another issue that is difficult with a single point. One database, or one database server, usually means scaling it up — building a bigger server, one bigger server. I prefer to work with things that are horizontally scalable. I mean, it's 2013; we shouldn't have to worry about vertically scaling things anymore.
B: So whenever you use Cassandra, you get horizontal scaling. You don't really know what the scale is going to be today, or tomorrow, or the next day — you have question marks about how far it's going to go — and Cassandra is great for that: it scales linearly. You need ten percent more load for your application? Add ten percent more servers. It's just that easy. And uptime guarantees: I've worked in engineering, like I said, a long time, and I always got the question, how do we get five nines of uptime?
B: You always have to be cost-conscious. I don't know anybody that says, "hey, my IT budget is unlimited." It just doesn't happen. You're going to have boundaries on how much you can spend on things, and I've been involved in this too many times, where we had to make really bad decisions because of money: yeah, we'd really like to have close to 100% uptime, but we can't afford to put it into two data centers, or something like that. It's just bad.
B: Whenever you have to make that decision — so this gives you more abilities, and it doesn't cost as much. You know, I say this a lot: Oracle scales right up until you run out of money, and that's pretty bad when that happens. So these — I put a little note there: there are a lot of ORs and ANDs on this list, and these are all ORs, not ANDs, so it doesn't have to be...
B: ...all of these don't have to be true, and there are caveats on everything, but this is a pretty good list. So you have to get the data model right — that's why you're here. The data model is everything, because it's going to express the way you want to use Cassandra. Directly with the data model, that's how we're expressing it: I'm using Cassandra this way; here's how I'm going to use it.
B: So here are my core examples that I'm going to present today. They're real world — they are, I should say, expressions of the real world, and I put an asterisk there. It wouldn't be right, in my view, if I gave this talk and actually called out somebody specific, so these are generalized cases. I hear these all the time, and that's probably why I use them, but I want to make sure you have something that's generally useful, but also very specific as well.
B: What I'm going to present is really just the use case — what they're trying to accomplish, here's how we did it, here's the model — and everything that I'm going to show you is in CQL 3. CQL 3 is really the path forward for Cassandra, and if you look at our drivers — and I really encourage you to go out and check out the DataStax drivers; we have C#, Java, Python, and more coming — all of these drivers use the same methodologies, and CQL is the core for all of them.
B: Go out and download them. They'll make your life a lot easier, and it's just a joy to work with these, I think. If you come from maybe two years back, say, or a year back, using Cassandra was a lot more difficult whenever you did programming. This makes it a lot easier, and I really like working with them.
B: So let's go through some comments here about CQL. You know, I get this "CQL doesn't do dynamic rows" — I hear it all the time, and yeah, it does. This is just my Mythbusters slide, to make sure I keep reiterating this point: CQL does do a wide row, meaning what we've always talked about in Cassandra, where you have a row with a lot of columns, and that's a really good data model. You don't want to make things too tight. CQL just makes it look like that.
B: It looks like you don't get a really wide row anymore — like it's turning it more into a relational database. That's not really the case. It's really an abstraction: we're doing best practices under the hood, and it's an abstraction layer. I encourage you to go check out the blog post on here. My slides are up online on SlideShare; you can go check these out, so you have an actual clickable link here.
B: There's more to it, so read Jonathan's blog post, and then read my — I don't know what you would call it — addendum in the comments. What I put in there is actually answering a question that somebody had, but this is a big deal for me, because I really want people to understand it. I'm not just telling you to hit the "I believe" button. I want you to go do it and look at it and understand it.
B: Here's a good example: hey, I need as close to 100% uptime as possible. Because think about it — that's people giving you money, and you do not want to lose that, and you want to be online all the time, because people giving you money is probably the reason you're in business. So what's the use case? The reliability of that shopping cart is critical. You have to know that that data is getting stored. I mean, I've seen this happen: you put something in a shopping cart...
B: ...something happens to redirect you to the home page, and then everything in that shopping cart is gone. It doesn't happen as much anymore, but when it does happen, you don't want to go back and re-look for stuff and have to redo your shopping. How many times are you just going to walk away from it? You do not want to present your customers with that choice of just walking away when they had money sitting in the shopping cart. Downtime — and minimizing that downtime — means multiple data centers.
B: You cannot be serious about your uptime until you start talking about multiple data centers. It's just a fact of life, because data centers go down. They will go down — maybe not today, maybe not next year, but they will eventually, and if they don't, your other data center will probably go down. So doing multi data center is how you can get the most, and the best, high availability.
B: So Cyber Monday is now in our vernacular. It used to be Black Friday, you know, when people would kill each other to go into a store after Thanksgiving — US Thanksgiving is in November, so it's always around Christmas shopping. It's really kind of a problem that has more of a generalized case now. The other thing you would call it is a thundering herd problem, where all of a sudden all your customers show up at the same time. Cyber Monday in the US is a problem with the Monday...
B: ...after the weekend of Black Friday: all the online shoppers show up at the same time. Well, you want to be ready for that. You want to be able to scale. Let's say your load is X; if you want to do X plus 50% or X plus 100%, be ready for that. The last thing you want to do is, again, leave all that money on the floor.
B: So what you're going to get out of this, from the bad side — and this is what your boss is going to tell you — is: we cannot be offline for a minute, because that's money. You don't want to tell your boss, "we're offline for a while; there goes all your money." And online shoppers are all about speed as well. If a site starts to lag, that's bad. We do not want your site to lag when you have that X plus 100%. So thinking about scale in that way is really important.
B: So how do we do this? I've got my little whiteboard here showing how we're going to do this. Each customer can have one or more shopping carts. Now, that "one or more" is kind of cool — I've done this on a couple of websites, and I think it's more common now — where you want to keep a couple of shopping carts going, and so you want to be able to store that. We're going to denormalize all the data, so it's built for fast access.
B: It's fast access. We have one shopping cart in one partition of data, and that's row-level isolation, so that gives you the ability to update that cart from multiple sources and make sure you never get a dirty read out of it either. That row-level isolation is already in Cassandra, so it will save you from some embarrassing moments where you might have half a shopping cart show up in somebody's inbox.
B: That's not going to work for anybody. And so now, with this model, you can see that each new item will be a new column — dynamic. It will grow out to however many columns it needs, based on how many items you want to put in there, and that makes sense; it fits the application model, where I may want to order one thing, or ten things, or a hundred things. It shouldn't be limited by some sort of hard limit, like how big an array is, or something.
B: So — a set, if you recall, is a collection that's available in CQL, a really cool method of gathering random bits of data inside your static model. That set is going to contain — and I put this example over here on the right, where I have this set that shows all the stuff I've put in my shopping cart — I could denormalize that if I wanted to, but in this case, you know, I just put in a bunch of information about my shopping cart.
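As a rough CQL 3 sketch of the kind of table being described here — the table and column names are illustrative, not taken from the original slides:

```cql
-- One partition per customer; the set collects that customer's cart names.
CREATE TABLE user_carts (
    username text,
    cart_names set<text>,
    PRIMARY KEY (username)
);

-- Adding another cart is a single collection update.
UPDATE user_carts
SET cart_names = cart_names + {'anniversary gift'}
WHERE username = 'pmcfadin';
```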
B: A map is another thing you can use in there, if you want to do that. Now, the shopping cart itself is really where all the juice is, so I have a list of all the shopping carts. This is where the wide row comes in. If you look at the example on the right — sorry, both of these are shopping carts, my bad — if you look at when you're inserting data into the shopping cart, it's really dynamic, and it's storing all kinds of interesting information in there.
B: If you order an e-book, it's going to have so many pages, and it's going to be so many kilobytes in size, but it's not going to have any dimensions. But if you order something physical, say a skateboard, it's going to be so long, and it's going to be so tall, and it's going to weigh so much to ship. These are all individual details of each item that you can store dynamically inside your shopping cart.
B: Now, why would you store it there? You're just denormalizing at this point, and it just makes things easier: you're bringing in the data that's relevant. So whenever the shopping cart gets to the checkout phase, maybe there's some information in there that helps figure out, say, shipping costs, or how it can be delivered, or anything like that. So take this model, and you can see now how it can be done. You can also mix and match on how you want to make yours. What's cool about this...
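A hedged sketch of the wide-row cart table this describes — one partition per user and cart, one clustering row per item, with a map holding the item-specific details (page counts for an e-book, dimensions for a skateboard). The names here are assumptions, not the original slide schema:

```cql
-- Each new item becomes a new clustering row, so the partition grows as wide as the cart.
CREATE TABLE cart_items (
    username text,
    cart_name text,
    item_id text,
    item_details map<text, text>,  -- attributes vary per item: pages, kilobytes, length, weight...
    PRIMARY KEY ((username, cart_name), item_id)
);

INSERT INTO cart_items (username, cart_name, item_id, item_details)
VALUES ('pmcfadin', 'gadgets I want', 'skateboard-42',
        {'length': '80 cm', 'height': '12 cm', 'ship_weight': '3.2 kg'});
```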
B: Right — sorry if that wasn't very viewable; we try to be interactive here in real time. Oh yeah, what was I talking about? Real time, right. So, user activity — and there you go, now I'm reacting to user input. Good segue, thanks. So, user input: you do need to react to it in real time. You can't wait, and so many times I've heard people say, "I'm going to use Hadoop to do that." Well, all right — you're going to get that tomorrow.
B: So that's the bad, right? Waiting for batch. Getting input and then dealing with it is really critical in today's world. Support for multiple application pods: the application pod is more of a design architecture concept, but really it's all about building out your application so that you can deploy it in pods, and I will show you in my slides in a minute what that looks like. And then speed is always important: if you're reacting to something in real time, you need to be able to do it in milliseconds.
B: You know, humans tend to notice things that are over 200 milliseconds — or about 150 milliseconds. So when you start reacting much slower than that, it becomes very apparent to the end user. So let's try to work that into the time budget and react in that time. The "too bad": I mentioned batch, which, you know — if you've got to have it now, Hadoop is not the choice. It's just not going to do what you need to do right now.
B: You have to have another plan. And losing those interactions — when somebody is on your website, say they're in the shopping cart, or they're doing something on your website — you want to react to that. It can potentially give you more lift, or give you more money, so you need to interact with them now. So how do you plan that out and do it? My activity data model is somewhat different. This is...
B: This is more of a flow diagram sort of thing. The interactions are stored per user in a very short table. So let's say someone logs into the website: we have a decision engine that can make things happen, and as things are happening, we can take those long-term interactions and store them for later, but we can also interact with them now and make decisions. So, like, when they log in, then maybe we have...
B: ...we have something like an A/B choice, so they can go either to an A page or a B page. Well, the feedback is going to be: whenever they log in to the page, we can store what this person has and what their next destination will be. Maybe, based on this user profile, we want them to go to a different web page; maybe it's based on something that they did.
B: That's the kind of freaky part about this, which is cool and creepy at the same time — we're kind of getting into people's heads a little bit — but that can only be done if you're doing this in real time. I've done a lot of data mining and data science tests on previous interactions. They're cool, but they just don't have the same kind of impact as real time, like what I'm talking about here.
B: One of the other things we're going to do is store these long term, so we can create feedback. There are still cases for Hadoop; there are still cases for long-term batch analysis, and that's to validate our models and things like that. We'll still store that data, and we're going to use a reverse time series. Reverse time series is one of my favorite little time-series tricks, and we're going to look at that.
B: So I have my user activity table, and this is the one that stores that interaction as we go through getting the data. It's just someone clicking on web pages: basically, we put a widget or something like that in the web page, and whenever they click, and click to the next page, and so on, we're tracking that interaction, and it goes into the user activity table.
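A minimal sketch of a reverse-time-series activity table like the one described, assuming illustrative column names; the DESC clustering order is what puts the newest event at the front of the partition:

```cql
CREATE TABLE user_activity (
    username text,
    event_time timestamp,
    action text,
    PRIMARY KEY (username, event_time)
) WITH CLUSTERING ORDER BY (event_time DESC);

-- Every click is one small insert; Cassandra's write speed makes this cheap.
INSERT INTO user_activity (username, event_time, action)
VALUES ('pmcfadin', '2013-06-27 10:15:00', 'viewed jewelry');
```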
B: What I want to point out here is that when someone logs into the website, for instance, we're storing that data in two different tables, and that's totally cool. You can do it in five different tables, six different tables, ten different tables — and I've done this a lot — because you have such good write speed. You can do multiple writes per second — thousands and thousands, millions potentially — on Cassandra. That should not be the limiting factor in your data model.
B: When I was an Oracle DBA and dev, I worried about how much I was inserting into the database at one time. That is just not a consideration with Cassandra. So think about what you're going to do to lay out your data so you can read it more effectively — how am I going to use this data? Here's an example of why that reverse works so well: if you look at the SELECT from user activity with LIMIT 5, what I'm saying is I only want the last five things that happened.
B: I don't want the last ten; I want the last five, and it doesn't have to go over a massive amount of data to get that out. It's just going to go over the first five items in that row — very efficient. And if you're running a trace of some sort, trying to figure out where your performance is: if you didn't do the reverse and you asked for the last five things, you'd be iterating over a ton of data before you got to what you want. Reverse prevents that. So this is fast.
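With the DESC clustering shown above, the "last five things" query only touches the head of the row. A sketch of that query, with the partition restriction assumed:

```cql
-- Newest-first storage means LIMIT 5 reads just the first five cells of the row.
SELECT event_time, action
FROM user_activity
WHERE username = 'pmcfadin'
LIMIT 5;
```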
B: This is an interaction pattern that makes it feel really fast. So what I've got here is this person: they went into the jewelry section, they created this anniversary shopping cart, and then they deleted the "gadgets I want" cart. So my algorithm picks that up, like — whoo, wait a minute, hold on. We deleted "gadgets I want," we created this anniversary gift cart, now we're looking at jewelry. Awesome — maybe we should put a sale item up for flowers, because this dude must be in trouble.
B: How about some flowers with that? It's my dumb example, but in my opinion, you know, this is the kind of stuff you can do. You've just got to come up with what you want to do — and that's probably when it gets fun, and probably a little more interesting: what do you want to do with that data, simple or complex? You have the data. That's the important thing.
B: My third one is log collection, and that's probably the one I hear the most, easily, because people want to store logs in an efficient way. It's a big problem, and there's just a lot of it. Machines generate a lot of data. Man, it's weird — there are so many cool and scary statistics out there, like, you know, humans have generated more data in the past five years than in all of history, or something. We just love generating data.
B: So, to store logs — if you're an application developer, I'm sure you put enough log statements in your code to make it so that you need something like this. So it's always going to be: we need to collect data at high speed, because it's coming in at high speed. If you have 30 application servers in your app tier, and they're generating 10, 15, 20 application log events during a single interaction, you can be generating hundreds of application logging events in very short order.
B: Cassandra will do this near where the logs are generated, and I think that's — when you talk about how we deploy applications, we're probably going to do multi data center. If you are multi data center and you're collecting your logs into one data center, you're sending a lot of data over the wire, and it may not be that relevant.
B: So we want to make sure that the application generating the data just goes to a local node, and then let Cassandra do the replication for us. There's the need for dicing the data as well. The classic use for logs is something like a dashboard or a lookup — those are our use cases — so we need to be able to dice up all that data, to put it into a usable form.
B: The bad side is the scale. You know, don't try to do this with your really expensive relational database: it's just a lot of data that it's going to have to collect, and you're going to have to scale that thing to the point where it's not very cost effective. There's a better way to do this. And the batch analysis of logs arrives way too late as well. Let's go back to my previous use case, where the data you're getting now is relevant.
B: Maybe it's relevant for the next five minutes, maybe the next 10 minutes. I can think of many times where I've actually built this. Years ago I built this system in Cassandra, and it was really so that we could get up-to-the-millisecond information about, say, how many 500 errors we were getting on our website, or how many 404s we got after a code push. These are relevant information points that we needed now.
B: Batch picked it up and gave us a report the next day — but it's usually affecting customers right now. If all of a sudden we spike up and get a delta on our 500s of, you know, plus 100 percent in the last five minutes, we'd better start looking at what's going on. That's relevant information that we're getting from our logs.
B: Here's what we're going to do: we're going to use Flume in this case. For those of you not familiar with Flume, it's a pretty cool tool — another Apache project. It's actually a subproject of Hadoop, but it can stand alone pretty well; it doesn't really need Hadoop. It originated as a way to get logs into Hadoop, but it's found a lot of general-purpose uses. I've worked with some teams that use Flume in all kinds of interesting ways, especially dumping data directly into Cassandra.
B: Now, what that means is that we're going to have tables for — I put some on here — like one just for the raw log. Just store the raw; I don't really care what format the data is in. I just want to have those in case I ever want to do batch analysis on them. And then I have my latest successes and my latest fails. So this is the dicing action I'm talking about.
B: When data comes into Flume, you can look at the event, pick it apart, and start setting up counters, or maybe put that information into a separate table. I wrote an application where, for every web event we got, we wrote it out to 90 different column families, and those 90 different tables were collecting data for different reads. Perfectly okay — Cassandra will scale to that for you. And so there are different things we may want to do with it, like a lookup or a dashboard.
B: Dashboard applications are really cool for that kind of stuff, and you need access to that data in just a short window of time — a great time-series model. So here's our log lookup. I want to look up the raw data in my logs, and let's say I just have a time frame or something like that. So here's the case: let's say we're using an access control device, and we want to look up, based on the source and the date, when we actually collected that data.
B: So I want to look at an event that happened at a certain time. Well, in the log lookup table I'm going to have the source, and then the date up to the minute, as my partition key. That will make sure that I have a nice partition of data — it's not collecting tons and tons of data all in one row — but I can still get to it, because I have the keys. One of the keys I've found to creating the queries and putting data into the table...
B: ...is using data that you have. You're going to have a timestamp on that data; use that data to create your key. You've got data to create the actual columns. You don't have to derive anything — you just use the data you have. And then it's natural on the other side: if I'm looking up data, I'm saying, hey, for everything from this one access controller, I want to know everything that happened from this certain time to that certain time. So, a range of dates — awesome, you can do that.
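A sketch of a log lookup table along these lines: source plus date-to-the-minute form the partition key, which bounds the partition size, and the event timestamp is a clustering column, which allows the time-range query. The column names are assumptions:

```cql
CREATE TABLE log_lookup (
    source text,
    date_to_minute text,    -- e.g. '2013-06-27 10:15'; keeps any one partition small
    event_time timestamp,
    raw_log blob,           -- gzipped at the application before it goes over the wire
    PRIMARY KEY ((source, date_to_minute), event_time)
);

-- "Everything from this one access controller, from this time to that time."
SELECT event_time, raw_log
FROM log_lookup
WHERE source = 'controller-7'
  AND date_to_minute = '2013-06-27 10:15'
  AND event_time >= '2013-06-27 10:15:00'
  AND event_time <= '2013-06-27 10:15:59';
```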
B: Now, what's interesting is we have this blob of data, this raw log. Here's my ninja trick — everybody should pay attention to this, because I've seen it used to great effect. If you're storing any kind of raw log, or even something that you don't need to get insight into — you just need to store something in bulk, like even JSON or XML — gzip it at your application. Compress it at your application before you send it over the wire. Great idea, because you're minimizing the amount of data that you're putting on the wire.
B: That's a good thing, but it's also more efficient on the Cassandra side. You're just going to store the data — I mean, you're storing data to retrieve later. There's almost no cost to zip it; we're talking microseconds of time to gzip, you know, a reasonably sized object. So just use that. On this one, we were also storing that exact same web event into two other column families — or tables. If you look to the right, I have my login success table. Now, what am I doing with that? It's so, like...
B: Let's say, for this access controller, I keep the source — which access controller it is — and the date up to the minute as the key, and I'm using that to create a wide row; and then my successful logins use a counter, the counter column. What I'm doing, if you notice — again, like I said, it's one of my favorite tricks with time series and time-series data models — is clustering order by: reversing the clustering so that the last object is the first one in the row. Here we go.
B: You could get everything within a time frame, but usually it's going to be the last ten things — think about a dashboard. So here's my SELECT: I'm going to select successful logins up to this minute, and I'm going to LIMIT 20, so I'm grabbing the last twenty seconds' worth of stuff — sorry, twenty minutes' worth — because I'm collecting it by the minute. So here's my counter.
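A sketch of that counter table: one counter bucket per minute, clustered newest-first so a dashboard can grab the latest 20 minutes off the front of the row. Names are assumed:

```cql
CREATE TABLE login_success (
    source text,
    date_to_minute timestamp,
    successful_logins counter,
    PRIMARY KEY (source, date_to_minute)
) WITH CLUSTERING ORDER BY (date_to_minute DESC);

-- Each successful login bumps the current minute's bucket.
UPDATE login_success
SET successful_logins = successful_logins + 1
WHERE source = 'controller-7' AND date_to_minute = '2013-06-27 10:15:00';

-- Dashboard query: the last 20 minutes, newest first.
SELECT date_to_minute, successful_logins
FROM login_success
WHERE source = 'controller-7'
LIMIT 20;
```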
B: This is really fast, because I have the exact row key, and it's a very small slice of data. It's only twenty items, but it's pre-optimized — it's ready to go for this dashboard. A very cool way to do this. So in this case, you know, I'm looking at it: here, my successful logins are going up.
B: And if they start going down, and my failed logins are going up, maybe I have an access problem — my LDAP server on the back end isn't working, or maybe one of my access controllers isn't working very well. This is relevant information right now. If I had to find that out from a report tomorrow, I'd still have a whole bunch of users not very happy with me. So this is the kind of stuff you want to know about and react to immediately.
B: That's real time — the need for real-time data. The last one is user form versioning, and that was one that I called out, but I think this one is a little different — definitely not as mainstream, but it gives you a bunch of interesting ways to model your data. So here's — and this is an actual real one; I've had a couple of places do this — a real need for user form versioning and being able to roll back. So I have a form...
B: ...a user form on my website, and I want to store every one of those versions efficiently. I mentioned this before, and I think it's funny: I do not expect people like admins in the front office to learn how to use Git. I mean, can you imagine that? It just wouldn't work. I think that's the kind of functionality that you may need to present to them, but you need to find a way to do it a little better, and that's where we can probably get a little more creative with the model.
B: And scale is really important. If we have a lot of users on our website — let's say you're creating web forms for users, and they use them on their websites — it's got to be able to scale, and that's where things like version control, with its issues of locking, just turn into a real problem at scale. And the other really critical thing is that you need to be able to commit a change, but then, more importantly, roll it back.
B: You need to be able to say, "oh, that was a bad change; I'm going to take it back to a previous version," so you need a way to scroll up and down the versions. In a relational database this is a very complicated relationship model, and this is probably one of those times where the relational model breaks down. It's just so complicated to do the joins in these cases that it becomes prohibitive — it's very complex and very slow. Let's look at our data model.
B: So what we can do is create that data, and that data will be used by our application to create the forms on the web page. But if we ever need to change it, we can just roll back to a certain version number inside the form table. And the other thing is, we need to have this business of an exclusive lock. So — well, how do we do that? Did I say lock? I did. So, working version is the table that we're going to use for our forms, and most of this is pretty basic.
B: You have a username and a form ID, which create the partition key, and then a version number. The version number will make it so that every one of these attributes is locked into a particular partition — a CQL partition — so it'll look like a row of data when you look at it in CQL. The locked_by column is how we're going to make sure that we only have one user editing this at a time, and then the form attributes map is where all the real business is.
B: That's a very cool use of the map: you have a key-value pair inside of your working version that's locked by a version number. So if you look at my insert of my first version: the working version that I'm creating is version 1 of the form, and you'll see that it's my username and my form ID, and then I have just a map of all the different things.
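A sketch of the working-version table as described: username and form ID make the partition key, the version number is the clustering column (reversed below so the latest version sits first), locked_by carries the optimistic lock, and the map holds the form attributes. The exact names are assumptions:

```cql
CREATE TABLE working_version (
    username text,
    form_id int,
    version_number int,
    locked_by text,
    form_attributes map<text, text>,
    PRIMARY KEY ((username, form_id), version_number)
) WITH CLUSTERING ORDER BY (version_number DESC);

-- Inserting version 1 of a form, with its attributes in the map.
INSERT INTO working_version (username, form_id, version_number, form_attributes)
VALUES ('pmcfadin', 1138, 1,
        {'title': 'Signup form', 'field1': 'name', 'field2': 'email'});
```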
B: I made it an integer, so it will do natural ordering if I don't reverse it. If I reverse it, then the highest number will be the first thing on the row, which is what we want: we want the older versions to start trailing off, and the current latest version right there. So how do I do this? I insert my first version, so that's in the system; now, whenever I start editing it...
B: This is an application concept. So what happens is, someone logs into my site and says, okay, I'm going to work on that form. When they click on that, we just do a single write. We say "locked by this person": I do an UPDATE where username equals 'pmcfadin' and the form ID equals 1138, for the particular version that I'm locking. That's cool — that's what I would call an optimistic lock.
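Taking the optimistic lock is that single write — a sketch against the same assumed schema:

```cql
-- One write marks this working version as locked by the editor.
UPDATE working_version
SET locked_by = 'pmcfadin'
WHERE username = 'pmcfadin' AND form_id = 1138 AND version_number = 1;
```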
B: It doesn't have to be deterministic, and for a website that works pretty well, because you're only going to have two or three people, probably, working on the form, if ever, and you just need to know if someone's doing it. So whenever you go to edit a form, you can see if anybody has it — whether locked_by is even filled in. Now, what happens whenever you insert a new version? Say you want to put in version two. Well, what we can do to quote-unquote "release"...
B: ...the lock is to just nullify that locked_by name. So in one pass you're pretty much releasing the lock: you insert your new data, and the lock just doesn't exist anymore. So yes, this is kind of a poor man's lock, but it's really accomplishing what we're hoping for, which is that we just want one person working on this form at once. This isn't like a transactional lock in a database.
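Releasing the lock in the same pass as the new version might look like this — again a sketch, not the slide's exact statements:

```cql
-- Write version 2; the new latest version has no locked_by, so it is unlocked.
INSERT INTO working_version (username, form_id, version_number, form_attributes)
VALUES ('pmcfadin', 1138, 2,
        {'title': 'Signup form', 'field1': 'full name', 'field2': 'email'});

-- Nulling locked_by on the version that was held releases the old lock.
UPDATE working_version
SET locked_by = null
WHERE username = 'pmcfadin' AND form_id = 1138 AND version_number = 1;
```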
B: This is more of just a user lock — keeping one person out while another is in. In a later webinar, when we start talking about 2.0 features, I will further enhance this using the CAS functionality, compare and set, and that is probably a good use for it: whenever you want better transaction quality, if you're really being paranoid about it, like having multiple users going against it at the same time. That functionality was designed around this problem, so stay tuned for that. So there you go.
B: We saw our four data models, and there's Master Yoda — he's the data model master. People don't know that he's not only pretty good with the lightsaber and carrying dudes through a swamp, but he's also really good at data modeling. This is important stuff to think about whenever you're creating your data models. These are all good use cases, and this is how I walk through the process. It's not meant to turn you into an absolute ninja, but you're on your way.
B: I hope — and I know people have watched some of mine — what I've seen now, especially with groups that I work with, is that when they watch the data modeling stuff, they're starting to really get some great use cases out there. So I hope someday I'll do more of these, and they'll be your use cases. And I think, yes, that's about it. You can always, you know, get with me — I have a lot of people that get with me on Twitter about various things.
B: If you have questions, ask me there. You might have to get me by email, or get on IRC, or the user group — I mean, there's a very strong community out there, willing and ready to help you, so please, let's do that. And I would hope that everybody, you know, takes this and does what they can to make their own data model really cool. So I think that's it, Christian. We need to do some questions, don't we?
A: Sure do. We have 11 minutes; let's see how many we can get through pretty quickly. Here's one from Tom — or, Paige Lewis took it away from me; I was on the cusp. Tom asks: can ORDER BY be used for any column in the primary key? For example, with primary key columns (a, b, c), can I set the ORDER BY on just column c?
B: [inaudible]
A: Okay, great — I love questions I can answer; here's one for me. So many people are asking about getting access to the slides and the recording. We always send the slides and the recording out to everyone who registered for the webinar in about 24 hours, so stand by for that. Also, people are asking where they can view Patrick's...
B: Whoever asked that — I swear I did not make them do that. Yes, we have an awesome driver, the DataStax Java driver. It is in GA. You know, I know it's new and not a lot of people know about it. Use that driver — our driver has a lot of things in it that you will need. I saw questions that said, is it node-aware? Of course it's node-aware.
B: It does things like CQL — it's actually built around the idea of using CQL — but our drivers, and this is not just the Java driver but the C# driver and the Python driver as well, also embody the methodology: failover, and clustering, and doing things like setting up quorums. All these things are built into it, and they're very, you know, they're very...
B: No, compaction doesn't scan all the SSTables. Compaction takes the SSTables that need to be compacted, reads them, merge-sorts them, and then writes them out to a new SSTable. The only way you would ever hit all of the SSTables in a compaction is if you did a major compaction, meaning a user-initiated compaction. I do not recommend doing that; that is not a good thing to do. There are two other things that will touch...
B: No, no — the TTLs will expire at the application level: you won't see that data whenever you ask for it, and it will naturally expire over time whenever compaction is run, but not deterministically. Now, with 1.2, there are settings that will trigger what's called a TTL compaction, but only in those specific use cases. It's not that whenever we run compaction it looks at every single SSTable — that is not what's going on.
B: All right, so there are a couple of things, actually. The CAS functionality would be where, if you were worried about simultaneous access, the new version 2.0 feature will prevent that; that's using more of a deterministic-type lock. What I presented was more of an optimistic lock. There are probably some methods you can use to detect conflicts, but this is meant to be a more simplistic data model, not a more complicated one. We can go into your specific use case if you're worried about that.
B: Clustering order by, again, is how you specify the sort order of the data as it's being created. An SSTable — a sorted string table — stores data on the disk in sorted fashion, and when you use CLUSTERING ORDER BY, you're specifying that order. So in this case, what I'm doing is I have the date to the minute — the varchar — so it's a date, and when I say CLUSTERING ORDER BY, I'm using that date-to-the-minute and basically reversing it: instead of ascending order, ASC, I use descending order, DESC...
B: ...so that the last thing that happened is the first thing on that row. It's just somewhat complicated syntax to mean "I want to reverse the order of the data that's being stored on the disk" — much more efficient. The key is that it allows for SELECT queries just like that, where you say SELECT with the date-to-the-minute and LIMIT 20: I want the last 20 things that happened.
A: Thank you. So, a little clarification here on replication. Maury is asking: in any data center containing three nodes, but with replication factor two, does the replication happen clockwise if the topology is in a ring? Say data center one is two and data center two is two, but each data center itself has three nodes. So I think a little clarification on how replication actually works would be good, Patrick.
B: Inside the data center, yes, it's the adjacent nodes — like you said, it's clockwise. But across data centers, the coordinator will communicate with just one node in the other data center; it does not replicate data to each individual node. So when that coordinator sends data to the other data center, it talks to a coordinator on the other side, and then that coordinator's job is to replicate it within — and it replicates in parallel. It's not one at a time; it's not serial replication. It does everything...
B: ...in parallel. Determining the replica pairs is done in a clockwise fashion — maybe that's the combination you're trying to mix together here. When data is actually replicated, when you create it, it's done in parallel; but when you specify the replication pairs, or the replica servers, those are determined in a clockwise fashion.
B: They are completely separate. They're cousins who talk to each other at the family reunion every once in a while, but they don't need each other. They exist so independently that there's not even any crossover. For instance, Hadoop is pretty complicated as an ecosystem: it requires a NameNode, JobTrackers, ZooKeeper — you know, there's a lot going on there. Cassandra is completely self-contained; there's nothing it needs to run by itself. You start up Cassandra — that's all you need to start.
B: There are some integration patterns with DataStax Enterprise that we've created to make the two work well together, and it's really more glue than anything. In our DataStax Enterprise version of Hadoop, Hadoop will use Cassandra as the data store instead of HDFS — and yes, at that point they are pretty inextricably linked, but that's just because we're layering Hadoop on top of Cassandra.
B: Ingestion — boy, you're asking the wrong guy, because I like to write my own code for ingestion. But if you want to use something off the shelf: if you're gathering log data, Flume is awesome for that, and there's plenty of example code out there. If you want to use an ETL tool — Pentaho, Talend — they all will take data. With DataStax Enterprise we also use Sqoop for grabbing data out. And again, you can use something like Talend, or you can write your own code.