From YouTube: Webinar | Oracle to Cassandra Core Concepts Pt 2
Description
Third normal form? That’s so 20th century. Learn the newest techniques to make your Cassandra database sing from the rafters in performance and scalability. AND it uses concepts that you already know and apply every day. You can do this. This is the must-see half hour of your professional life! These developers found a new way to work with databases. First you will be shocked, then you will be inspired!
A: Well, you use it effectively, so we're going to just launch right into this. So what do we do? The last episode was about the transition from Oracle to Cassandra, and most of it was more of a conceptual problem that we're dealing with: how does it work? I highlighted some of the differences just to ease you into the concepts. So, if you're coming from an Oracle background, again, in case you didn't hear this, I was an Oracle DBA forever, since the 1990s, and then that happened to me.
A: This is going to be gory details, so much more of a technical discussion this week. By far, I'd say this is probably the meatiest of the three, and we'll be digging right into the Cassandra Query Language, which is CQL. It's going to be the beef. Where's the beef? It's going to be part two, and the next episode is going to be about building an application. We'll preview a little bit more at the end of this talk, but it's really going to be about top-down: how do we build an application?
B: Last week we ended it like, well, where's the catch? You know, everything about Cassandra seems pretty cool, right? It was designed for the new world of applications: multi-data-center replication, always on. I mean, it kind of seems like a rainbow-farting unicorn, it can do everything. Oh well, let me talk about the fact that it was meant for transactional purposes.
B: You can choose to have a replication factor. In this case we're doing a replication factor of three, meaning that each piece of data will live on three nodes on the ring. It was pretty straightforward, but this is how we got replication to work in Cassandra. Go ahead, right? Perfect.
B: We're going to talk about a static table, so let's start with something that everybody's very familiar with. This is a CREATE TABLE statement. It should look vaguely familiar even to those who are completely uninitiated to the world of Cassandra. You have things like a table name and column names.
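As a concrete sketch of what the speakers mean by a static table (the column names here are illustrative, not the exact ones on the slide), the CQL looks almost exactly like its Oracle counterpart:

```sql
-- A "static" Cassandra table: a fixed set of typed columns,
-- one row per primary key, just like a relational table.
CREATE TABLE users (
    userid       uuid,       -- unique id for the user
    firstname    text,
    lastname     text,
    email        text,
    created_date timestamp,
    PRIMARY KEY (userid)     -- userid is also the partition key
);
```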
A: And that's really because of the way that Cassandra works: it's sparse data. The data that you store is the data that you have on disk, whereas with Oracle, when you create a database and you create a data file, it will allocate that space in the blocks and everything before you even write data to it. That's not the case with Cassandra.
A: Yes, and a lot of people ask, what's the limit? Well, on a varchar, or any of these types actually, the size is two gigabytes. So I like to call out a practical limit: don't ever try to write two gigabytes into a column. Your application will barf before you finish writing it. So it is a big number, and it is doable, I've seen people try, but that's very interesting, right? That means that every type in Cassandra has a pretty huge limit.
B: The question is, why would you even have types, then? Well, a type in the CREATE TABLE statement actually tells Cassandra to make sure the data being put there really is an integer or text or a varchar. So it does have a reason to live, and its reason is to make sure that the data type is correct when you use your data.
A: Yeah, that's a huge thing. A lot of folks think NoSQL means no schema, or that it's just whatever you throw in the database. Cassandra has a very strongly typed schema, and you're right, it enforces the types. So if you declare an int and try to put in some string of some sort, it will throw a violation. You can't do that.
B: It's a uniqueness thing, but you know, we like to be multitaskers here, right? It's not just for uniqueness; it also becomes our partition key. Whatever that first value of the primary key designation is, that is what's going to get hashed, if you remember back to that previous slide, to tell where that data lives in the ring. And that hash value is going to be between, well, we'll talk about that in a little bit.
B: So then we have the concept of a keyspace, which is a collection of tables, and you may think of this also as a schema or a database. It is pretty much exactly the same thing, so the concepts of tables and keyspaces are probably very familiar to you if you lived in the relational database world: a table is a collection of rows, and a keyspace, or a schema, is a collection of tables.
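A minimal sketch of creating a keyspace, tying in the replication factor of three discussed earlier (the keyspace name and strategy are illustrative, not from the slides):

```sql
-- A keyspace groups tables, like a schema or database in Oracle.
-- Replication is configured per keyspace.
CREATE KEYSPACE killrvideo
  WITH replication = {
    'class': 'SimpleStrategy',   -- simple single-data-center placement, fine for a demo
    'replication_factor': 3      -- each piece of data lives on three nodes of the ring
  };
```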
A: Let's talk about how this works. Here's your first shot at DML in the Cassandra Query Language, CQL. This is just an INSERT statement, and it should look really familiar at this point. If you came from Oracle, I don't know if there's a whole lot of difference, maybe other than some very small syntax issues like single tick versus double tick. But this is an insert. So we have a table called videos, and this videos table is going to store our videos.
A: We have this data model online, and we'll be talking a lot about it next week especially. There's a data model online called KillrVideo, and it actually exists as a web application; you can look at that online. We will provide the link in the show notes later. But this videos table is an entity: it's just a video, and a video has users, etc. This is an insert statement for that, and what I want to look at is how does this work?
A: When I write an insert statement, how does it work in Cassandra? We have a table name, we have some fields, and then we have our partition key. Now, the partition key is required when we do an insert, just like a primary key is required on an insert with Oracle.
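A trimmed-down sketch of the kind of insert being discussed (the columns are a simplified take on the KillrVideo videos table, not the exact statement on the slide):

```sql
-- Inserting into the videos table. The partition key (videoid) is required,
-- and strings use single quotes, as in Oracle.
INSERT INTO videos (videoid, userid, name, added_date)
VALUES (6fbe7f30-ae1c-11e5-a837-0800200c9a66,   -- a type 1 (time-based) UUID
        70bc6b10-ae1c-11e5-a837-0800200c9a66,
        'Intro to Cassandra',
        '2015-11-02 00:00:00');
```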
A: Yeah, that is an odd number. If you are familiar with a UUID or GUID, that's exactly what that is, and we use this quite a bit in Cassandra. When we get into more of an application design we can talk about when to use it, but what I can tell you is: this is a really good way to create ID numbers with Cassandra, especially in any distributed system, mainly because what you're getting here is a guaranteed unique number in the universe.
A: Like I said, a GUID. If you're a Microsoft fan, it's the GUID: they call it globally unique, not universally unique. I like that; they got burned on that 640K thing, so they're pulling back a little bit. The UUID is in, I think, about every driver I can think of, in every language: Java, Python, Rust, you name it. They all have a way to create a UUID, and there are different types of UUIDs.
A: But in this case it's a type 1, time-based UUID, a TIMEUUID. So the partition key is required, and we're going to throw in some values. I want to talk about the partition key, though; let's dig into that right now. We have a couple of inserts here. I've created a much smaller version of that table so we didn't have to get too big. So I have two inserts here, and if you'll notice the first two partition keys, yeah, it may take a minute to parse that, but those are different.
A: So what will happen with those two inserts? Now, you mentioned this before about the hashing; let's look at the specifics. When I pass a partition key to Cassandra, what it does with it is a hash, and we use Murmur3. Now, we have used MD5 in the past, and that may still be what you're using today. But this is a consistent hashing algorithm, and consistent hashing algorithms have the guarantee that if you have the same input, you will always get the same output.
A: So if I hash these UUIDs, I will always get a number out, and that is assigned to another 128-bit number, which we call a token. Notice where this might come in, right? So tokens, again, are that consistent hash. This is going to show its position in the Cassandra ring. Because Murmur3 is a random partitioner, it will randomly partition this value across your ring, and so you have the data spread out around the nodes.
A: If you have a thousand nodes, each gets one one-thousandth, and if you have two nodes, each is going to have one half. This token value is always assigned to that value of the partition key, so this is how we can always get it back, and this is how we always know where it will go, and it will be evenly distributed throughout the system, which is great.
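CQL actually lets you see this hash directly, which may help make the idea concrete (a small aside, not from the slides; the table and column names follow the earlier videos example):

```sql
-- token() returns the Murmur3 token that places each partition on the ring.
-- Consistent hashing: the same videoid always yields the same token.
SELECT videoid, token(videoid) FROM videos;
```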
B: So this is a NoSQL database, right? That static table that we just described to you is where relational databases end, right there. It has a number of columns and a primary key, and it's stored on disk as rows, and that's it, because that is what could be done. We now have a new concept: this is a dynamic table, and there's only one difference in this particular table.
B: If you notice, it's still got all the UUIDs and timestamps and text, nothing new there. But when you look at the primary key designation, it has two values, which again is very typical in the relational database world. In the Cassandra world, though, that second value has special meaning, and what that special meaning is, is what's called a clustering column. So video ID is actually metadata.
B: Video ID is not actually stored in the table the way you'd expect; the actual video ID, a UUID, is stored as the column name. That UUID is the new column name. Yeah, let's talk about this, because this is definitely jumping down the rabbit hole, right? We have completely left the realm of traditional data modelling and we're in someplace else.
A: This is very specific to Cassandra. You and I talk to users all the time, and I think this is probably the concept that you have to get over the fastest, because it will make or break your data model. So pay attention, class, this will be on the test: the primary key relationship. How does this work? If we look at the primary key that we took from the last table, we have a tag and we have a video ID. So we know that the tag is the partition key.
A: We know that the video ID is the clustering column. This is just terminology, right? It's what we designate these as when you and I are talking about it. But what do they actually mean? We know that the partition key, in this data model that would be one of the tags, we know that that's hashed, so we're going to use that. And with a partition key and clustering column together, we'll put these things into a logical row by themselves.
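A sketch of the tag-plus-video-ID table under discussion (a minimal version; the real KillrVideo schema has more columns):

```sql
-- A "dynamic" table: tag is the partition key, videoid the clustering column.
-- All videos for one tag are co-located and ordered by videoid.
CREATE TABLE videos_by_tag (
    tag     text,
    videoid uuid,
    name    text,
    PRIMARY KEY ((tag), videoid)   -- (partition key, clustering column)
);
```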
A: Sorry, into a physical row by themselves. The partition key says: hey, all this data is co-located. So everything that has to do with the tag "datamodel", all the video IDs associated with that word, I want to be physically next to each other on the disk. I want to cluster those together on the disk.
A: Hence the term clustering column. And there is an order to this. In this case the order is not as easy to figure out, and I do have a better example in a minute here, but it's the UUID. If you'll notice, for "datamodel" we have three videos associated with it. This is a one-to-many relationship.
A: I don't do 128-bit sorting in my head all the time, so I'll just say that's correct. When we look at something that's a little more human, like an integer or a string, it will make a little more sense. But this is what we mean by clustering: it's actually clustering these values together on the physical disk. So back to the select. I want to talk about how a select works.
A: So we, as programmers, need more control over where the data is, and that's about data locality, and that's what the partition key is going to give us with the hash. But we also have some control over the order that we put the data on disk. What we're trying to build here is a high-performance data model, right? And I feel like that's always been the case. If you remember last week, we talked about the problem we're trying to solve: the database is slow, right?
A: So yes, here's a more complicated example. This is another example that I use quite a bit: KillrWeather. We have a lot of killers out there. This is storing time-series data, and raw_weather_data is just a table that does that. There's a weather station ID, and then I've broken the year, month, day and hour down into individual parts, and then we're just storing a temperature. So look at the inserts here.
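The table being described looks roughly like this (a sketch of the KillrWeather example; the real table stores more measurements than just temperature):

```sql
-- Time-series weather data: one partition per weather station,
-- clustered by year, month, day and hour.
CREATE TABLE raw_weather_data (
    wsid        text,    -- weather station id
    year        int,
    month       int,
    day         int,
    hour        int,
    temperature double,
    PRIMARY KEY ((wsid), year, month, day, hour)
);

-- The inserts that only vary by hour (7, 8, 9, 10 ...):
INSERT INTO raw_weather_data (wsid, year, month, day, hour, temperature)
VALUES ('727930:24233', 2015, 11, 2, 7, 54.0);
```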
A: If you're good for questions: the primary key, if you notice, has a lot more going on than just one partition key and one clustering column. Now, I put the parentheses, or the brackets if you're in Europe, around wsid because I wanted to make sure that you know that this is the partition key; this is being very correct with my syntax. Now, if I wanted to add more columns into those parentheses, I could group together columns into one partition key.
A: Let's say I wanted my weather station ID and my year as one partition. This is a design decision, again something we'll talk about in the next episode, but this is what that means. Now, when you have multiple clustering columns, it creates a new thing for us; it gives us some control. If you look at the inserts here, Rachel, look at this: it's inserting data, and the only thing that's really changing is seven, eight, nine and ten. Those are the hours, so one row every hour.
A: So this is what it would look like. Now let me go back one and look at that last statement here: the CLUSTERING ORDER BY. This is when you start leveling up on your usage of Cassandra, and it's one feature that I feel is going to make your data model sing. This is a performance thing. On insert, I want to control the order that the data is sorted on disk. So if you look, the natural sort order for an integer is ascending.
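The statement being referred to would look something like this: the same weather table, but with the on-disk sort order reversed so the newest readings come first (a sketch based on the KillrWeather schema above):

```sql
-- Reversing the natural ascending order at table-creation time,
-- so the most recent hour is physically first in each partition.
CREATE TABLE raw_weather_data (
    wsid        text,
    year        int,
    month       int,
    day         int,
    hour        int,
    temperature double,
    PRIMARY KEY ((wsid), year, month, day, hour)
) WITH CLUSTERING ORDER BY (year DESC, month DESC, day DESC, hour DESC);
```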
B: So here's a question, right? I said earlier that sorting is expensive. Before I came to DataStax I had 15 years of data warehousing experience, and I spent a lot of that time sorting data, and a lot of time trying to push your ORDER BYs into memory and all that stuff in order to actually make it perform. The amount of twisting and turning that I had to do...
B: We say that we're doing this for speed, speed of retrieval, yeah. But where do we pay for that speed of retrieval? And here's where I think it's really cool, the way that Cassandra amortizes the cost of a write over time. So we have the client, and it's inserting data. We recognize our insert statements. It finds the node: it hashes that partition key, determines which node it needs to go to, and then it writes to two places.
B: The first is the commit log, and that commit log is a sequential file; it just keeps writing. When a new file needs to happen, a new file comes up. There's nothing really managed, except for just sequential writes. The second place it writes is to memory, and it's going to put in the partition key and all of our clustering keys and our actual values, which in this case is the temperature, into memory. Memory is very fast to retrieve from, so somebody reading gets it from memory.
B: But here's the thing: eventually memory fills up. We're not to the point yet where we have unlimited memory on our machines; it'll be great when that day comes, but for now, no, RAM is limited. Eventually, data has to go to disk, and this is where the sorting happens. This smaller amount of data that lives in your memory is actually sorted at the time it's flushed to disk, and it's flushed to something called an SSTable. SSTable stands for sorted string table.
B: I was saying the data is sorted into these sorted string tables, and it's sorted by that CLUSTERING ORDER BY statement that we saw earlier. These SSTables are immutable, and they are written every time the memtable flushes. Eventually a number of SSTables build up, so a process comes through called compaction, which will merge various sorted string tables together, put the data correctly in the right order, and create a new SSTable.
B: That's called compaction, and compaction is something that you do need to be aware of, because it probably will run all the time in a well-tuned system, and it's something you do want to tune for. Again, you're amortizing the cost of your writes over time, and this compaction is a cost, particularly in CPU and I/O.
A: I get asked this question a lot, Rachel: what happens if I have out-of-order data? The compaction process is what covers that. If you have that weather station, let's say it's kicking out its weather data and it goes offline for a period of time, or you get some error and it misses a few hours, and then later it comes online and re-establishes its connection. The out-of-order data will be recombined during the compaction process. Yeah.
B: Compaction also gets rid of data that's been deleted. We'll be talking about deletes later on, but deletes happen in two phases. First, data that is going to be deleted is tombstoned, and then, after an amount of time, compaction will remove those tombstones from disk and give you back that space, right?
A: So here we are, back to the storage model; now we'll just wrap this concept up. We're back to selecting some data from the disk, and just to reinforce what we've talked about before: when I do a select from the database, I give it a particular partition key, this weather station ID, but then I give it the clustering columns: I give it a year, a month and a day.
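A sketch of the kind of select being described, against the weather table above (the station ID and values are illustrative):

```sql
-- Partition key (wsid) plus clustering columns narrow the read to
-- one partition and a contiguous slice of it: a single disk seek.
SELECT temperature
FROM   raw_weather_data
WHERE  wsid  = '727930:24233'
  AND  year  = 2015
  AND  month = 11
  AND  day   = 2
  AND  hour >= 7 AND hour <= 10;  -- range scan on the last clustering column
```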
A: This is really what you need to understand about the partition key and the ordering. Now, if I do a CLUSTERING ORDER BY and reverse it, whenever I insert new data it's actually going to the beginning of that record. That means that as new data appears, it's becoming the first record there, which helps a lot, because it's merged and it's sorted and it's stored sequentially. So now, when we get back to the read path, this is where we actually get the payoff, right?
B: Yeah, exactly. So we now have our select statement, and notice that our partition key is there, as required, and because we want to go all the way down to the hour, and we're doing a range scan on the hour, we also include the year, month and day. We could include just the year; the year and month; the year, month and day; or the year, month, day and hour. So the client sends out the select and again chooses the node.
A: It checks the memtable, and that's a really important point, I think: a lot of people will ask me the question, hey, if it inserts into the memtable, why not just read from the memtable? Sort of, but it needs to go to disk. This is a persistent system. It has to go to the disk to get that data, and there's no two ways about that.
A: This is why it will be fast: when we ask for the data, we're going to ask for the partition key, and the partition key goes to a single node, no matter how many we have in the universe. You could have a small cluster or a big cluster; it doesn't matter, it's still going to go to a particular node because of the partition key. And we're going to do that single seek on disk, because disk seeks are the worst part of your system. Do not forget:
A: Disk seeks are measured in milliseconds, microseconds if you're lucky and you have really fast flash storage or maybe even SSD, but that will be the slowest part of your operation. So doing one seek: awesome. And once it does that single seek, it reads it all in one sequential read, all into your CQL table, and you get rows and columns sorted by event time. Everyone is happy, and programmers love it, right?
A: Because programmatically, working with rows is a lot easier. When you ask for a set of data, you can iterate over rows and grab the columns. It is just an easy way to work with data. Rows and columns are a preferred format. Now, there may be better ways to do it, but I'll tell you, this has always worked well for me and I liked it.
A: Going back to our first discussion: we've always tabulated data as humans, and creating tabular data is not a bad plan, right? Because we seem to like it; we've done it for a thousand years. Now for something completely different... except we're not really doing it that differently, right?
A: So these are really the Cassandra specifics, and these will help quite a bit. This is where we move on from the traditional Oracle-style data model, and where we have some really cool features, some of which I use all the time in my data models. You saw it already; let's start out with collections. Here's a very simple one: tags is now a set of tags. That means that we're denormalizing our data here, and when we get into application design...
A: We'll talk more about that strategy, but by denormalizing we're grouping our data together and we're making it faster, and this is because of how Cassandra works: we have a partition key and all our data. Well, a set allows us to have a very dynamic part of that data model inside of a dynamic table.
A: So a set of tags would be, like, think about tags on YouTube or something. A set has a column name and then the CQL type inside the angle brackets, and that type is used for ordering. So whenever I create a set, because the set is ordered by the type, it's going to be collated by, in this case, the UTF-8, like a string ordering.
A: Yep, so if I did a set of timestamps, it would order by time, and a set of integers would be ordered by ints. A list is very similar: a column name and a CQL type. But the CQL type in a list is not used for ordering, because the list is the order you put it in: wherever you put it in is the order you get it out. Not my favorite collection.
A: It's a little heavier and requires more overhead, because it has to sort every time you do an insert, so I stick with set as much as possible in these cases. List is one that I'm not so hot on, but it's there, and sometimes a list is important. And if I wanted to change my list's CQL type, or if I wanted to make my set a list... I'm sorry, I put a set in there, didn't I?
A: So if I had my list of tags instead of my set of tags, then I would be saying that the order is somehow important, which it isn't really. My last one here is the map, and this is probably one of my favorite data types of all, not just in CQL. It's very dynamic, because you get a key and you get a value, and so it creates a nice...
A: I don't want to say it, because someone's going to hold me to this: it's like a database inside of a database, right? It's a key/value database inside of your table. But here's the warning: do not go nuts with this thing. The most you can put in there is 65,000 entries, and for a reason: there's a lot of serialization cost you're going to start incurring. It's really meant for small, dynamic key-value type sets, and if you need more than 65,000, rethink the model.
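The three collection types side by side, as a sketch (table and column names are illustrative, continuing the videos example):

```sql
CREATE TABLE videos_with_extras (
    videoid  uuid PRIMARY KEY,
    tags     set<text>,        -- ordered by the type's collation (UTF-8 here)
    credits  list<text>,       -- keeps the order you inserted
    metadata map<text, text>   -- small key/value "database in a database"
);

-- Adding to a set uses collection syntax rather than a full rewrite:
UPDATE videos_with_extras
SET    tags = tags + {'cassandra', 'tutorial'}
WHERE  videoid = 6fbe7f30-ae1c-11e5-a837-0800200c9a66;
```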
A: It's interesting, because what we're doing here is smoke and mirrors in a way, because it's still being stored as a clustering column down on the disk. So all of these values are being clustered together in the same partition, and that's pretty critical, right? We know that if we have a partition key and then we have them all clustered together on the disk, you will get a faster data model. Well, the collections abstract that for the user and then put those together on disk as well.
B: There are currently four built-in aggregates, so no, it's not all of the things yet. We've got four; we're very happy about four. They just came out recently, but they are built in, and again, you need to use them with the partition key: they don't take away the requirement to have a partition key in your query.
B: They act just like any of these other aggregates; I'm not going to go into the details of what a max does. But what I want to point out, down at the bottom here in our CQL shell, is that there's something missing, and it bothers me, because I write aggregates with GROUP BY and HAVING, and that is typically how you would write them.
A: But that is the partition key, no? What we've known already, right, is that partition keys create their own grouping of data, and that's how you gather data together, and you use the clustering column for ordering. So you get a GROUP BY and an ORDER BY by using a partition key and clustering column.
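In other words, the partition plays the role of the GROUP BY. As a sketch against the weather table from earlier:

```sql
-- The aggregate runs over one partition: the wsid predicate is the
-- "grouping", and the clustering columns already give the ordering.
SELECT MAX(temperature)
FROM   raw_weather_data
WHERE  wsid = '727930:24233';
```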
A: Full table scans: we try to avoid those because they sucked. Now, if we do a full cluster scan, this will suck even more, right? Let's not do that. And that opens up the whole conversation of using something like Apache Spark with Cassandra, which is definitely built to do full cluster scans, but that's out of scope for this discussion.
A: They're actually just built-in user-defined functions. So this is another thing that's now in Cassandra 2.2, and it is not quite the trigger that you expect. I say that because, if you're used to using PL/SQL, or you're doing any kind of trigger work in any other language or any other database, you know that there's a lot of internal introspection you can get: you can do things inside the database.
A: User-defined functions are built in as pure functions. So you have a function that takes a CQL type, usually part of a column, and that pure function doesn't rely on any outside data at all; that's why it's pure. The input parameters are manipulated by something pure inside of Java; you can also use JavaScript, you can use Ruby, you can use Python. That function will somehow manipulate that data, and the functions can then be used to create aggregations, and that means, hey...
A: You can do those aggregations over a certain range of data that I provide, like a partition, and then you can create your own user-defined functions. Now, max, min, average, count: those are all included, because those are kind of the easy button. I've seen some pretty interesting things already with these, and I'm kind of excited to see what users come up with for user-defined functions. But this is definitely an interesting and useful change to Cassandra.
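A minimal sketch of such a pure function (the function name and body are illustrative, and UDFs have to be enabled in cassandra.yaml before this will run):

```sql
-- A pure function over a CQL type: it sees only its input parameter.
CREATE FUNCTION fahrenheit_to_celsius (temp double)
  RETURNS NULL ON NULL INPUT
  RETURNS double
  LANGUAGE java
  AS 'return (temp - 32) * 5.0 / 9.0;';

-- Applied per row over a partition we provide:
SELECT wsid, fahrenheit_to_celsius(temperature)
FROM   raw_weather_data
WHERE  wsid = '727930:24233';
```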
B: Cassandra is not an ACID-compliant database; specifically, the consistency portion of that is a little bit different, and we're going to talk more about that next week, about how consistency works in Cassandra. But that doesn't stop us from sometimes requiring a little bit of locking. Just a little bit goes a long way. Not every single transaction requires a heavy lock, but there are some transactions that do, for example, race conditions.
B
So
when
you
are
in
a
distributed
system
and
you've
got
somebody
signing
up
with
the
same
username
in
different
parts
of
the
world,
there
needs
to
be
a
way
to
make
sure
that
those
don't
end
up
colliding
with
each
other,
because
the
the
way
that
Cassandra
works
is
the
last
one.
The
last
right
wins
and
that's
not
very
good
for
application
consistency.
So
there
is
a
concept
of
called
lightweight
transactions
with
Cassandra
and
they're
fairly
easy
to
implement.
I
mean
you
get
a
you,
get
a
pretty
big
hammer
with
just
a
couple
of
words.
B
So
right,
there's
and
if
not
exists,
you
add
that
into
your
insert
statement,
that's
pretty
much
it
that
will
actually
initiate
the
process
internally
to
make
sure
that
this
data
is
going
to
be
written
and
it's
going.
It's
not
going
to
be
overwritten
by
a
another
transaction
coming
very
quickly
after
it
yeah.
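Those couple of words look like this in practice (the users table here is an illustrative sketch for the username race described above):

```sql
-- A lightweight transaction: the insert only applies if no row with this
-- primary key exists yet, so two sign-ups can't both grab 'patrick'.
INSERT INTO users (username, email)
VALUES ('patrick', 'patrick@example.com')
IF NOT EXISTS;
```

The result set includes an `[applied]` column telling you whether the insert won.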
A: However you could possibly mimic this yourself, because this is built into Cassandra, we're looking at Paxos under the covers, and that's a pretty established consensus protocol. It lives inside the cluster, and I think that's the key: things that happen outside of the cluster are independent, bad things can happen, you don't know it, and you get a really inconsistent state of your data. Dogs and cats living together, total chaos.
A
Yeah,
it's
not
the
cheap
option,
but
it
I
will
save
you
and,
if
so,
like
anything
else,
and
with
Cassandra
nets
are
just
many
times
before
is
you
know,
understand
the
internals?
Well,
this
is
an
internal
you
should
understand.
Is
that
there
is
round
trip,
and
so
it
will
take
longer,
it
does
I,
don't
use
it
for
all.
The
things
tells.
A: So what about updates, though? This is a standard update, and there's some danger here, because an update can overwrite data. And how is that true? Because there are no constraints; that is not something that's built into Cassandra. A constraint violation would happen if this data existed. So what's it going to do, Rachel?
A: Exactly. It's like my seven-year-old on a skateboard: just don't get in its way, because it'll mow you down. Yeah, bad. So updates are perfectly fine in the case where you know that you have something somewhat idempotent, or you're working on a progressive data model. Updates are rarely used, I think, in Cassandra. It's interesting: I see more data models with inserts than anything; updates are not as common.
A: So this, right here at the end, if you'll notice, is a conditional; it's a conditional update. And what I'm saying here in this conditional update is: hey, update the video with this ID and give it this name, but only if it's from this user, if this user ID equals this user ID. So I've created a condition where I'm going to make sure I'm not overwriting someone else's video. Now, this is a very simple example.
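A sketch of that conditional update, using the videos columns from earlier (the UUIDs are placeholders):

```sql
-- Conditional update: only rename the video if it belongs to this user.
-- Like IF NOT EXISTS, the IF clause makes this a lightweight transaction.
UPDATE videos
SET    name = 'A better title'
WHERE  videoid = 6fbe7f30-ae1c-11e5-a837-0800200c9a66
IF     userid  = 70bc6b10-ae1c-11e5-a837-0800200c9a66;
```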
A: It's okay, it's all right, it's all valid. Tombstones are a marker in the system; we won't go into it here. If you want to learn more about tombstones, there are much better, longer discussions about them. The thing you need to know is that a tombstone is a marking in your database that that data is no longer valid. It doesn't actually physically delete it off of the disk.
A: I think it would be pretty cool, yeah, and I think that's the biggest fear: your data returning, like somebody saying, yeah, I got that data... you missed that... oh no. So here's my favorite feature of all time, and I say that because I use it all the time. ORDER BY is like my second; my favorite is TTLs. And why?
A
What we're looking at here is the time to live, the TTL, and expiring data, especially in time series data, or in a very transient data model where you have data coming and going. I've seen some crazy use cases with this. Essentially, what you're doing is marking your data whenever you insert it: you say, expire that data after 30 days. And the way I explain a TTL is this: it's a free delete. It's actually less than free, and the reason I say it's less than free...
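The 30-day expiry described above can be sketched in CQL like this (hypothetical table; 30 days = 2592000 seconds):

```cql
-- Assumes the hypothetical videos table keyed by videoid.
-- USING TTL expires the inserted values automatically after 30 days.
INSERT INTO videos (videoid, userid, name)
VALUES (uuid(), uuid(), 'transient clip')
USING TTL 2592000;

-- TTL() shows the remaining seconds of life for a column's value.
SELECT TTL(name) FROM videos;
```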
A
A
There was like a whole routine: I'm going to delete data. Okay, let me turn off all this stuff, right? Oh, and then we'll do it at night, because god knows what it'll do to our users. The beauty of the TTL is that once the data has expired, it just disappears. And what I mean by disappears is the users no longer get to see it, and the data will eventually just not get copied as compactions happen. That's why it's less than a free delete.
A
It just doesn't move the data anymore, so it's a very, very low-overhead process, and it will not create a whole lot of problems for you. I've seen some crazy stuff with this. I had one user that was processing 250 terabytes a month and expiring it after 30 days, so they had like a 30-day ring buffer of 250 terabytes. It's pretty cool!
B
A
B
A
B
Next week we're going to talk about: why aren't there any joins, and what do you do about that? And then, how do you design your application, both from the logical side and the organizational side, on to the physical side? Like, how do you actually take these concepts we've been following for the last two weeks and make your application work with Cassandra? Yeah.
A
That's... and this is a top-down discussion. We hope that we can wrap everything up and make you a little more successful in your next application with Cassandra, because you've learned a lot about the parts, the bits and the pieces; let's just put them together. Our intent here is to make you really good at doing this, so that everyone thinks you're an ultimate badass. We want that. We want you to be the badass. Show everyone how cool you are with this really cool app.
A
You built this app that doesn't ever seem to go down. That's our job, and we would love to see you at Cassandra Summit. We will be there live, September 22nd through the 24th. We are going to do training with O'Reilly. O'Reilly is going to do certification for Certified Cassandra Developer, which is really cool; that'll be the first one.
A
Ever. There are tons of talks: 130 tracks, lots of big companies, small companies, interesting people talking about interesting things. Just be there. And if you want to get priority, here's a priority pass: use Rachel's code, or use mine, and there's a priority pass code for getting 25% off certifications. So we hope to see you there. Alright, we have a few minutes left, and Devin, I know you're on the line out there somewhere. Can you maybe pass us a couple of questions here? And just keep in mind...
C
A
Boy, that's a good, tough question. Possibly.
A
It could be, but I think actually what you're asking is: there wasn't the same number of columns that I inserted. You can have sparse amounts of columns. The most important thing is you have to include the partition key and the clustering columns when you insert data, so those have to be there, but...
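A sketch of what a sparse insert looks like, assuming a hypothetical time-series table:

```cql
-- Hypothetical sensor-readings table: partition key + clustering column.
CREATE TABLE readings (
    sensor_id    text,
    reading_time timestamp,
    temperature  double,
    humidity     double,
    PRIMARY KEY (sensor_id, reading_time)
);

-- A sparse insert is fine: humidity is simply omitted here.
INSERT INTO readings (sensor_id, reading_time, temperature)
VALUES ('sensor-42', '2015-09-22 12:00:00', 21.5);

-- But every insert must carry the partition key and clustering columns.
-- This would be rejected, because the clustering column is missing:
-- INSERT INTO readings (sensor_id, temperature) VALUES ('sensor-42', 21.5);
```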
B
Actually, it's very common to have a data model, for certain types of tables that we'll go into a little bit more next week, where you just have the partition key and a clustering column, and you actually don't have a value. The actual cell can be null, but the column name, the clustering column, needs to be populated. Yeah.
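A sketch of that kind of key-only table, where the row's existence is the data (names hypothetical):

```cql
-- Nothing but a partition key and a clustering column: useful as a
-- membership or lookup table, where having the row at all is the value.
CREATE TABLE user_videos (
    userid  uuid,
    videoid uuid,
    PRIMARY KEY (userid, videoid)
);

INSERT INTO user_videos (userid, videoid)
VALUES (99051fe9-6a9c-46c2-b949-38ef78858dd0,
        06049cbb-dfed-421f-b889-5f649a0de1ed);
```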
A
A
That's a really important one. So, we're at the top of the hour. I'm sorry we only got two questions; again, we will answer your questions. That last question was obviously a thinker; you're paying attention. Good, good class. And we will be back next week for our final installment of this Oracle-to-Cassandra business, and we'll be talking applications, so hopefully you'll be there. Thank you very much. Thank you.