Apache Cassandra Cassandra Day Seattle 2014, 12 Aug 2014

From YouTube: Cassandra Day Seattle 2014: How I Learned to Data Model + 5 Do's and Don'ts

Description

SlideShare: http://www.slideshare.net/planetcassandra/c-data-modeling-37298974

Shehaaz Saif: Software Development Engineer, Expedia Inc.

A

Hi, thank you for coming to Cassandra data modeling. To start, a little about me, just quickly. My name is Shehaaz, and this is how I got started with Cassandra.

A

DataStax had a big data developer contest last April. During finals I randomly found this contest and joined, and I built a deal-finding app on Android. I was one of the five finalists, and they flew me over to San Francisco. That's where I met Expedia, and now I work there on the SSL [unclear] group team. It's been five months at Expedia. By the end of this talk,

A

I just want you to be confident modeling data. It's just like John's talk this morning, and I'm using examples from Patrick McFadin's YouTube videos, so it's basically a layman's understanding of modeling data, with examples. So you can relax and not stress too much. And this is my Twitter. So, with Cassandra, you have to know the answers you want to get out of the data before you put

A

The data inside the database. So it's query-driven data modeling, as John mentioned. Don't worry about redundant data: don't worry about doing a bunch of writes to optimize for the read. And don't have large partitions. This is one of the mistakes I made when I built my deal-finding app. I had the row, as you can see here.

A

This is the timestamp, and the data was the deal that the person would post. This would run out really quickly, because you'd reach 2 billion really quickly, and it's wrong. So you shouldn't have large partitions. Use CQL. Initially I used Virgil, which was a REST API. It didn't support CQL; it was using Thrift. So I had to do everything on my own in the application, which is not good. And this is a funny cartoon.

A

You can read it. But yes, CQL looks like SQL, so it's exactly like what John showed this morning: super simple. Then let's go into an example. We have this weather station, and they give you this information: a weather station ID, a time, and a temperature. You have to be able to show this on a dashboard for the last five minutes, for example. How would you do it? The table would look like this.

A

For example, you would have weather station ID and event time, where the weather station ID would be the partition key and the event time would be the clustering column. So what's wrong with this? Eventually you're going to run out of columns, because you're going to keep on adding event times and it will reach 2 billion. So this is the way you would insert, very simple, and then we do a range query.

A

You would do it like this. The better way to model it would be to also put in the date, so you'd have a partition key with the weather station ID and the date. So after each day you would roll over and create a new partition. And the trick to finding how to set the partition

A

Key, I found, was to ask: what's the data I have when I create the table? I have the weather station ID and the date, so that would be the partition key. And the clustering column is what I want to sort by on disk. So, for example, if I want to find the last five minutes, there's just one disk seek, and I can take the data and put it on a dashboard. You add this ordering clause and order it by descending.

A

So, yes, that's basically what I explained, and this is what it looks like on disk: the weather station ID and date would be the key, and everything would be sorted by the clustering column, which would be the timestamp.

A

So this is the way you would do a range query: I want the temperature from 7:01 to 7:04, from this weather station, for this date.
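
The bucketing idea above can be sketched in a few lines of Python. This is only a toy in-memory model, not Cassandra code: the class name `WeatherStore`, its methods, and the sample station ID are assumptions made for illustration. It shows one partition per station per day, with rows kept newest-first to mimic a descending clustering order.

```python
from collections import defaultdict
from datetime import datetime


class WeatherStore:
    """Toy stand-in for a table partitioned by (station_id, date)."""

    def __init__(self):
        # partition key -> list of (event_time, temperature) rows
        self.partitions = defaultdict(list)

    def insert(self, station_id, event_time, temperature):
        # One partition per station per day, so no partition grows forever.
        key = (station_id, event_time.date())
        rows = self.partitions[key]
        rows.append((event_time, temperature))
        # Keep rows newest-first, mimicking a descending clustering order.
        rows.sort(key=lambda row: row[0], reverse=True)

    def range_query(self, station_id, day, start, end):
        # A range query touches exactly one partition: (station_id, day).
        rows = self.partitions.get((station_id, day), [])
        return [(t, temp) for t, temp in rows if start <= t <= end]
```

Inserting readings for two different days creates two partitions, and a range query for 7:00 to 7:04 on one day only reads that day's rows.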

A

Now let's talk about collections really quickly. John mentioned them this morning but didn't really give an example. So let's talk about set, list, and map. What's great about them is that you can have dynamic items in a row: a user could have email one, email two, and you could have a list of emails. But the con is that there are serialization costs, because when you're reading a collection it has to be deserialized before it's shown, so it will slow down your application.

A

The collection must be smaller than 64,000 elements, so keep it small.

A

So this is a set. For a set, basically, you put curly brackets: one, two. And the thing I want you to take away from a set is that the ordering is determined by the CQL type. The CQL type here is text, so the set is ordered by that type.

A

One and two: I mean, when you do an insert, you just add three and it goes at the end, and when you add zero, it goes at the beginning, because it orders by the CQL type.
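
As a rough illustration of the set behavior described above, here is a small Python sketch of a set that keeps its elements ordered by value. `SortedSet` and its methods are hypothetical names for this example only; it is not how Cassandra stores sets internally.

```python
import bisect


class SortedSet:
    """Toy set that keeps elements ordered by value, with no duplicates."""

    def __init__(self, items=()):
        self._items = []
        for item in items:
            self.add(item)

    def add(self, item):
        # Find the sorted position; insert only if the item is not present.
        i = bisect.bisect_left(self._items, item)
        if i == len(self._items) or self._items[i] != item:
            self._items.insert(i, item)

    def remove(self, item):
        # Removing needs no prior read by the caller; a missing item is a no-op.
        i = bisect.bisect_left(self._items, item)
        if i < len(self._items) and self._items[i] == item:
            del self._items[i]

    def as_list(self):
        return list(self._items)
```

Starting from the text values "1" and "2", adding "3" places it at the end and adding "0" places it at the beginning, because position follows the value's ordering rather than insertion order.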

A

When you want to delete something, you use minus with that element, and there's no read before write, and it goes away. With lists, you decide the order of the elements: if you want to append something at the end, you put plus three after, and if you want to put it at the beginning, you put it before. Deleting is the same thing, like I said. Maps are more interesting.

A

Again, a map is ordered by the CQL type of the key. So in this example, you have one and two.

B

A

When do you pick a collection? Well, the list is a type of collection; collection is the interface, yeah. So, for example, let's look at a map example, because the map is very simple. If you want to delete something, you give the key. For adding, or if you want to modify it, say you want to put it in Spanish, you put it like this. So, the map example: you have this user-with-location table, and you want to have a map of the user's locations keyed by a timeuuid.

A

So we want to see when they logged in and in which location. Remember, there was a serialization cost with collections, right? So I set a TTL of 30 days, in seconds, so this update would go away in 30 days. And then you would call the now() function, and that gives the time in milliseconds.

A

So it creates uniqueness, and this is the way you would model that problem. TTLs can be set per insert and per update, and there's also a default TTL per table. I was talking with dvd patel yesterday, and she was asking me this question, and I looked it up, and yeah, you can actually set a TTL on an actual table so that the data goes away.
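
To make the TTL arithmetic concrete, here is a minimal Python sketch under stated assumptions: `TtlMap` is a hypothetical toy class, and the caller passes the current time in seconds explicitly so the behavior is easy to follow. It only mimics the idea that each write carries its own expiry; it does not use any Cassandra API.

```python
# 30 days expressed in seconds, the unit a TTL is given in.
THIRTY_DAYS_SECONDS = 30 * 24 * 60 * 60


class TtlMap:
    """Toy map where every entry carries its own expiry time."""

    def __init__(self):
        # key -> (value, expires_at), with expires_at in seconds
        self._entries = {}

    def put(self, key, value, ttl_seconds, now):
        # Writing the same key again replaces both the value and the expiry.
        self._entries[key] = (value, now + ttl_seconds)

    def live_items(self, now):
        # Entries whose expiry has passed are simply not returned.
        return {k: v for k, (v, expires_at) in self._entries.items()
                if expires_at > now}
```

An entry written at time 0 with `THIRTY_DAYS_SECONDS` is visible one second before the 30 days are up and gone once they have elapsed.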

A

So in this example, we give the user a one-day expiration for his password, and if you want to update it, we send an update command with a different TTL, and it updates.

A

The actual column.

C

A

The password, yeah, the password column.

C

Or is it still accessible? Well, that password is still valid, but it means they need to re-login. You know, it's, yeah, but.

D

The session has.

C

E

His example was the password that you can't use the temporary.

F

A

So replication factor is how many copies of the data I want in the cluster, and consistency level is the acknowledgements you get from the nodes after you do a read or write. So let's define what quorum is, because John talked about it this morning: it's the replication factor divided by two, rounded down, plus one. That's the quorum. So if you set a replication factor of three, your quorum would be two. So after you do a write, you wait until you have two acknowledgements. There's also something called row-level isolation, which I found interesting.

A

Also, let's say a person's login is eric21, and he wants to change it to eric22 and set a new password. After Cassandra 1.1,

A

It updates both the login and the password for the same row key, and no concurrent read would get eric21 with the new password, or eric22 with the old password. Either both get written or none get written. So it's just kind of interesting.
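
The quorum rule stated earlier (replication factor divided by two, rounded down, plus one) is simple enough to check in code. The function names below are assumptions for this sketch, not part of any Cassandra driver.

```python
def quorum(replication_factor: int) -> int:
    """Number of acknowledgements that make up a quorum."""
    if replication_factor < 1:
        raise ValueError("replication factor must be at least 1")
    # Integer division rounds down, then add one.
    return replication_factor // 2 + 1


def write_succeeded(acks_received: int, replication_factor: int) -> bool:
    """True once enough replicas have acknowledged a quorum write."""
    return acks_received >= quorum(replication_factor)
```

With a replication factor of three the quorum is two, so a write that has collected only one acknowledgement has not yet succeeded and the client would have to retry after a timeout.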

A

It's really interesting. But, for example, if you write at quorum, you've got to wait for two responses to come back. Say you've got only one and there's a timeout: the client has to retry to make sure the write goes in, because you didn't get the second acknowledgement from the cluster. So, indexing. Secondary indexes, like was mentioned this morning, are evil and you shouldn't use them. So I'll

A

Just give you a quick example. You have this user table, and you put an index on state. So, for example, we can do a select star from users where state equals Texas. This is kind of an okay example, because there are many rows that contain that indexed value.

A

So it's not really bad, but you should avoid this type of situation. When not to use it: when you have something unique, like an email address, a product ID, or a video tag. You might want to put an index on a video tag, because you can tag a video with "funny cat videos", "grumpy cat", a whole bunch of tags you could have on videos.

A

So, for example, this is the way you create a tag index table for the videos, and this is very efficient. Every time the user adds a tag for their video, you update the tag index with the tag and the video ID. Why is this better? Because there will be many distinct tags, so you create a separate index table, because we don't want many disk seeks for a few results.
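
The manual index table described above can be sketched as two structures that are written together. This is a toy Python model: `VideoStore`, `add_tag`, and `find_by_tag` are hypothetical names, and plain dictionaries stand in for the two tables.

```python
from collections import defaultdict


class VideoStore:
    """Toy model of a videos table plus a separately maintained tag index."""

    def __init__(self):
        self.tags_by_video = defaultdict(set)   # video_id -> tags
        self.videos_by_tag = defaultdict(set)   # tag -> video_ids (the index)

    def add_tag(self, video_id, tag):
        # Every tag write also updates the index: one extra write,
        # so that a lookup by tag reads a single partition.
        self.tags_by_video[video_id].add(tag)
        self.videos_by_tag[tag].add(video_id)

    def find_by_tag(self, tag):
        # Direct lookup by tag instead of scanning every video.
        return sorted(self.videos_by_tag.get(tag, ()))
```

The design choice is to pay for an additional write on every tag update in exchange for a lookup that goes straight to the rows for one tag.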

A

So the final example I have: you have a car lot, right, and you want to find a car according to the color, model, and make. So you have seven combinations of color, model, and make. An example entry would be Ford, Mustang, blue, and you want to be able to say, "I want to find all the blue cars," or "I want to find all the Fords." So what would be the partition key in this situation?

A

So what would be the partition key?

A

You could make three separate tables, but a more efficient way is a compound partition key. You could do make, model, and color. The trick of using make, model, and color is that you have this information before creating the table, right? The vehicle ID could change. That's the trick to thinking about how to choose a partition key: something that's unique to make, model, and color.

A

So this is the way you create the table. The thing where Cassandra comes into play is: remember, if you want to query for a blue car, or a specific model, or a specific make, how would we do that?

A

So you would write seven inserts for this specific blue Ford Mustang, with all the combinations. For example, you put empty string, empty string, blue, because you want to tag it as blue. So we'd have all these combinations written, and when you want to do a search, the read is really quick, because you can say, "I want to find all the blue Fords," and you'll quickly get them, sorted by the vehicle ID. And if you want to find all the cars that are blue,

A

You just do empty string, empty string, blue, and that gives you all the blue cars. So it's kind of a neat way of doing things. Let's.
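
The seven-insert pattern described above can be sketched in Python. This is a toy model under stated assumptions: `lookup_keys` and `CarLot` are hypothetical names, a dictionary stands in for the table, and the empty string marks a part of the key that a query does not care about.

```python
from collections import defaultdict
from itertools import product


def lookup_keys(make, model, color):
    """All seven (make, model, color) keys, with '' for unused parts."""
    values = (make, model, color)
    keys = []
    for mask in product((True, False), repeat=3):
        if not any(mask):
            continue  # skip the key where all three parts are empty
        keys.append(tuple(v if keep else "" for v, keep in zip(values, mask)))
    return keys


class CarLot:
    """Toy table keyed by (make, model, color), holding vehicle IDs."""

    def __init__(self):
        self.partitions = defaultdict(set)

    def add_vehicle(self, vehicle_id, make, model, color):
        # One write per query shape: seven writes for each vehicle.
        for key in lookup_keys(make, model, color):
            self.partitions[key].add(vehicle_id)

    def find(self, make="", model="", color=""):
        # Every query is a single exact-key lookup, sorted by vehicle ID.
        return sorted(self.partitions.get((make, model, color), ()))
```

Calling `find(color="Blue")` reads the `("", "", "Blue")` key, which is exactly the "empty string, empty string, blue" lookup from the talk.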

C

Use Cassandra. What if your data was very sparse? Yeah.

A

I know yeah, but this is a specific example.

D

But this is saying.

A

Like, use Cassandra's ability to handle a lot of writes; use that power it has.

B

The same table.

A

Exactly. So Ford, empty string, empty string is a specific entry, so you can search all the Fords by the vehicle ID.

B

Or you have to provide the other no.

A

When you do a query, you can do this. For example, the last one at the bottom there: I'm doing empty string, empty string, blue, and that gives me all the blue cars, because that's what I sent when I inserted it. Yeah, you can't, because that's the way I wrote it in, right? You have to give the same values when you read as when you wrote, because it's the partition key.

A

It's not going to find it, because when you're writing it, you're writing it with empty strings. You've got to give it something: it's a partition key! It's.

B

A

But it's going to get confused right! Think about it like you would have like yo. You have to say.

B

A

Want nothing, nothing blue.

E

It's not additional people he's doing one insert per search.

F

F

A

But there's only one that's empty string, empty string, blue: nothing for make, nothing for model, and only blue. So that's.

B

That's the way you can get it yeah.

B

D

Yeah inserting redundant.

D

F

So this can you use this model as opposed to my application, because you have three columns of payments. The card is by making the cards by.

C

Color and then put them together on the salary right.

F

This is more efficient know that the queries.

C

You're, going to generally be doing for all three is that right.

A

Yeah, exactly. It makes your application code very simple. When you write your app, it's very easy to find things, because you can just put in an empty string: just fill in make, model, and blue, and send.

C

It just seems like it could get really out of exponential.

A

Oh yeah, but there are only three of them, yeah.

D

E

E

As opposed to needing 50 tables to manage all the different.

A

You have to give it a make and a model, because that's the partition.

B

Key. So yeah, it's different from traditional SQL databases, where you can specify part of the primary key and it can scan the entire database. Here it's not scanning, exactly.

F

It's okay checking from the table.

D

A

Yes, exactly. It's query-driven development, so you think, "What are the questions I want to answer?" before.

E

Today to answer any combination of the structure: yeah, there's there's seven, what seven different search queries.

A

Yeah, there's seven; there's a list.

E

Each permutation of search queries is already concerned. That's why you won't get the same row back twice.

E

A

Answer all those questions yeah, then you have to do.

B

A

Yeah, you think, "I want to answer these questions."

A

I'll give you everything, so you can watch.

D

It yeah yeah for sure yeah. How will you send this.

A

uh Like through the.

E

A

So don't worry about redundant data and.

E

A

I was going to give a quick, high-level overview of how writes work.

A

Just a quick overview, because he went really in depth this morning, and I want to go over it again really quickly. When a client wants to do a write, they give the row key. It goes into the append-only commit log and the memtable, and the node quickly sends back the acknowledgement. That's how we can do a bunch of writes: it quickly acknowledges the client, like,

A

"I have the data, it's all cool." Then it gets flushed into the SSTable, which is a sorted string table that's on disk. Now, for example, what happens when an update comes in? When an update comes in, it comes into the memtable here, and then, because there's a new SSTable added,

A

The SSTables go into memory, a merge sort is done, and the data gets merged through compaction, which happens automatically, and gets written back into an SSTable.
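
The write path outlined above (commit log, memtable, flush to an SSTable, compaction by merging) can be sketched as a toy Python model. Everything here is an assumption made for illustration: the `Node` class, the tiny `memtable_limit`, and the integer logical clock used as a timestamp. It is a simplification of the idea, not Cassandra's actual storage engine.

```python
import heapq


class Node:
    """Toy write path: commit log -> memtable -> SSTables -> compaction."""

    def __init__(self, memtable_limit=2):
        self.commit_log = []      # append-only record of every write
        self.memtable = {}        # row_key -> (timestamp, value), in memory
        self.sstables = []        # immutable lists sorted by (key, timestamp)
        self.memtable_limit = memtable_limit
        self._clock = 0           # logical timestamp, for this sketch only

    def write(self, row_key, value):
        self._clock += 1
        self.commit_log.append((row_key, self._clock, value))
        self.memtable[row_key] = (self._clock, value)
        if len(self.memtable) >= self.memtable_limit:
            self.flush()
        return "ack"              # acknowledged without touching old SSTables

    def flush(self):
        # Write the memtable out as a new sorted, immutable SSTable.
        sstable = sorted((k, ts, v) for k, (ts, v) in self.memtable.items())
        self.sstables.append(sstable)
        self.memtable = {}

    def compact(self):
        # Merge the sorted SSTables; for a repeated key the entry with the
        # newest timestamp arrives last in the merged stream and wins.
        merged = []
        for key, ts, value in heapq.merge(*self.sstables):
            if merged and merged[-1][0] == key:
                merged[-1] = (key, ts, value)
            else:
                merged.append((key, ts, value))
        self.sstables = [merged]
```

Writing key "a", then "b", then updating "a" leaves two versions of "a" in different SSTables until `compact()` merges them and keeps only the newer one.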

D

May I ask a question about.

A

D

That was very clear; the query optimization was very cool. Now, say I'm going to update this vehicle ID. That was a blue Ford Mustang: nothing, nothing, Mustang; you know, blue Mustang.

F

You know all that right.

D

Okay, now I'm going to update that vehicle ID, because I painted it red.

F

D

So do I now need updates for the same number of queries, because it's no longer a blue one, but.

A

Same vehicle ID, right?

D

A

Yeah, so you have to. It was blue, so yeah, everything that had the color, you would update. You have to delete.

E

It because blue is part of the partition.

A

E

Set because it's attached to a particular server and location.

D

Do you actually.

E

Have to create a new record.

E

D

A

Yeah, you have to delete it before updating it if you want to change the color.
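
The repaint case discussed here can be sketched as delete-then-insert on every key that contains the color. This Python sketch is illustrative only: `lookup_keys` and `repaint` are hypothetical helpers, and a plain dictionary of sets stands in for the table.

```python
from itertools import product


def lookup_keys(make, model, color):
    """All seven (make, model, color) keys, with '' for unused parts."""
    values = (make, model, color)
    return [tuple(v if keep else "" for v, keep in zip(values, mask))
            for mask in product((True, False), repeat=3) if any(mask)]


def repaint(partitions, vehicle_id, make, model, old_color, new_color):
    """Move a vehicle from its old color keys to its new color keys."""
    # The color is part of the key, so it cannot be updated in place:
    # delete the vehicle from every key that includes the old color...
    for key in lookup_keys(make, model, old_color):
        if key[2]:
            partitions.get(key, set()).discard(vehicle_id)
    # ...then insert it under every key that includes the new color.
    for key in lookup_keys(make, model, new_color):
        if key[2]:
            partitions.setdefault(key, set()).add(vehicle_id)
```

Keys that do not include the color, such as `("Ford", "", "")`, are left alone, so only four of the seven entries need to be deleted and rewritten.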

B

Make, model, color.

A

B

You change it any.

B

A

B

So when would you pick make, model, color or any other combination?

A

Ah, it doesn't matter when you create.

B

This this matter, I don't think that.

D

Doesn't matter as.

A

You're entering it just make sure it maps to.

B

A

He's saying, could you put color, make, model? No, no.

A

So yeah, when you come here, you don't have to do color, make, model.

D

F

Mean you do as.

D

An application developer, so people know what you're doing, but.

C

Can I call that kind of data.

A

Yeah, you can add it yeah in this. In this situation,.

C

A

C

It's gonna be sorted by vehicle.

E

E

It's actually possible at the database level to have the same make and model in there, and the same vehicle ID, with two different colors. The restriction on the primary key doesn't, it wouldn't just, like.

D

Complain: yeah.

C

Question: how many items can you have in a primary key like that? Say, make, model, color, region.

E

E

The thing that makes that work is because you put the empty string and that's kind of awesome.

E

Well, I mean there is, but it's like a you're gonna hit the practical limit a lot more technically internally. Is that.

C

Primary key, just literally a concatenated string of those treatments.

A

Exactly, so it's like the weather station ID and the date I had, for example: together that's the key, and then everything else is good.

A

I'm not exactly sure, but, like you.

D

Read the documentation.

A

It's like it's like one yeah, I think it's it's cola.

A

And, to summarize, I had all my references in this link; I'll send it to you guys. There are four data modeling videos that I based my talk on. Just watch all four of them; there are links to them, and they're very useful. The example that I used, the car make and model, is from Patrick, and he actually had it in one of his talks. So yeah, Patrick McFadin, the chief evangelist of DataStax, and.

C

A

Okay, sure. So for the read, you would come into the key cache, and the key cache would check the row key and see whether it maps to an SSTable. If it does, it would go directly into the SSTable and get the data, send it to the memtable, and send it back to the client. If it misses, it goes to the bloom filter. A bloom filter is a probabilistic data structure.

A

It basically finds where the data is not. I tried reading about it last night; it was really complicated, so I gave up. Then it continues on, so it finds it.

A

It goes into the SSTable, sends it to the memtable, and sends it back to the client. So that's the way reads work.
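
Since the bloom filter was only mentioned in passing, here is a minimal Python sketch of the general data structure. The class, the bit-array size, and the choice of SHA-256 for hashing are all assumptions for illustration; this is not Cassandra's implementation. The key property is the one stated above: a "no" answer is definite, while a "yes" only means "maybe".

```python
import hashlib


class BloomFilter:
    """Toy bloom filter: can say 'definitely not here' or 'maybe here'."""

    def __init__(self, size_bits=1024, num_hashes=3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = [False] * size_bits

    def _positions(self, key):
        # Derive several bit positions from one key by salting the hash.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, key):
        for position in self._positions(key):
            self.bits[position] = True

    def might_contain(self, key):
        # If any bit is unset, the key was never added (no false negatives).
        # If all bits are set, the key may or may not have been added.
        return all(self.bits[position] for position in self._positions(key))
```

A read can use such a filter to skip any SSTable for which `might_contain(row_key)` is false, which avoids a disk seek for data that is certainly not there.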