From YouTube: This Week in Cassandra 2/26/2016
Description
Link to blog discussed in video: http://www.planetcassandra.org/blog/this-week-in-cassandra-2262016/
A: Oh, human beings, hello! This Week in Cassandra, Planet Cassandra, good times. This is our fourth week. Today we have, I believe it's pronounced, Eric Lubow, from the bayous of Louisiana, our special guest. Thanks for coming, buddy. Long time no see.
A: Look at you, alright! So let's look at what we have for this week. First up in the blog posts: how to write a distributed test for Cassandra. Hey, you know what, I'm pumped about this. Luke, what do you think?
C: Well, so, you know, I like reading through JIRAs and stuff. I see dtests mentioned all the time, and I mean, I knew it stood for distributed test, but I wasn't really, you know, sure. And I've noticed, you know, that pretty much anytime somebody submits a patch, one of the contributors will almost always ask for a dtest covering it.
C: You know, like when we did the recommendation engine for KillrVideo, Jon and I were writing some Python code, mostly Jon, and then I was, you know, trying to fix some stuff after the fact, so I had to get my toes in there, and you know, I don't really do Python, yeah.
A: The Unicode wars, and there's a lot of pain there. Well, there were just some things that were horribly broken in Python 2, like strings, for instance. In Python 3 it's all Unicode strings, so the thing that they really screwed up was they actually changed how the language behaved, so like, none of our stuff worked, and it's been years and nobody's using Python 3. So it's very...
B: Good, and it's not going to change, because enterprises, especially like us, who have tens of thousands of lines of Python code in production, can't just drop in Python 3 to test anything, and the drivers don't work, libraries don't work, the programming paradigm is completely different. It's very problematic. Yeah.
A
We
I
I
did
the
upgrade
for
when
I
was
still
managing
cql
engine
I
made
that
compatible
with
Python
3,
and
it's
like
it
wasn't
that
it
was
hard.
It's
that
it
was
annoying
like
you
have
to
use
like
the
six
library
which
is
like
compatibility
with
like
both.
So
instead
of
using
like
Python
strings,
you
got
to
use
like
the
six
strings
and
it
like
just
does
it
for
you
and
you're
just
like.
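The kind of shim six provides can be sketched in a few lines. This is a minimal illustration, not six's actual API (six exposes similar names such as `six.text_type` and `six.PY2`):

```python
import sys

# Minimal sketch of the kind of string shim the "six" library provides
# for code that has to run on both Python 2 and Python 3.
PY2 = sys.version_info[0] == 2

if PY2:
    text_type = unicode      # noqa: F821 -- only defined on Python 2
    binary_type = str
else:
    text_type = str
    binary_type = bytes

def ensure_text(value, encoding="utf-8"):
    """Return `value` as a text (Unicode) string on either Python version."""
    if isinstance(value, binary_type):
        return value.decode(encoding)
    return value
```

So `ensure_text(b"hello")` and `ensure_text("hello")` both give back the same text string, whichever interpreter you're on.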
B: Fine. And where it's really worth it is when you deal with other things that people would normally use Python for. Like, you know, one of the things that we're getting into is a bit of the graph database work, and I know DataStax is getting into that as well, and all the graph database drivers are written using Python 3, which makes it, you know, nearly impossible for anybody to release.
A: Right, we can move on, but we digress. All right, we'll talk about Cassandra, you're right. How about this next post: removing a disk mapping from Cassandra, over at The Last Pickle. Unfortunately... I believe this dude's name is pronounced Alain. I see him on the mailing list all the time, he posts every day, but I've never actually heard his name pronounced out loud. So if I'm butchering it, it's kind of the opposite of how I pronounce Eric's last name correctly.
B: Actually, if anybody's heard any of my recent talks, I explain what it's like to upgrade individual nodes, and those nodes in some cases, in my case, have had upwards of a terabyte of data on them. And when you want to do an in-place sort of swap of the nodes, you use rsync, and not just once or twice: over and over and over. You know, rsyncing the data over, shutting down the node, doing the final rsync so the commit log's written and the disk can get rsynced over. And then, when you spin up the new node, you just double check that everything's there by doing, say, hash comparisons, CRC checks is another way of saying it. And then, yeah, it's a headache, but with rsync it actually becomes a lot easier to deal with. So the guide that Mick put together, that's Mick from The Last Pickle, is quite helpful. Yep.
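The CRC verification step described here can be sketched without any Cassandra tooling. This is a hedged, hypothetical example of the kind of sanity check you might run on a pair of files after the final rsync pass (function names are made up for illustration):

```python
import zlib

def crc32_of(path, chunk_size=1 << 20):
    """Stream a file in chunks and return its CRC-32, so a
    multi-gigabyte SSTable never has to fit in memory."""
    crc = 0
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            crc = zlib.crc32(chunk, crc)
    return crc & 0xFFFFFFFF  # normalize to an unsigned 32-bit value

def files_match(src, dst):
    """True if both files carry the same CRC-32 -- the kind of
    double check described above after copying data to a new node."""
    return crc32_of(src) == crc32_of(dst)
```

In practice you'd run this (or `rsync -c`, or a tool like `cksum`) over each copied file and only decommission the old node once everything matches.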
A
What
else
we
got
here,
Patrick
McFadden,
we
work
with
him.
The
most
important
thing
is,
you
know,
and
Cassandra
data
modeling
the
primary
key.
This
is
kind
of
like
a
little
refresher.
Almost
it's
like
first
thing
like
just
seriously
like
kulit
primary
keys
are
important
kind
of
matter,
a
lot
nick
sandra
controlling
how
data
is
laid
out
in
the
cluster.
B: I just think one of the important things that he covers here is the basics: hey, you can select by a primary key, and that's a lot of times what you're going to be doing in, you know, your standard applications: you're going to be selecting data by your primary key. So knowing how to do it and knowing how to create that system is pretty important, all right, because your primary key selections are your fastest operation, so that should be your benchmark.
A: We talk about that. Like, Luke, we've given roughly one trillion talks on core Cassandra concepts, and that's what we talked about. We're just like, hey, you know, you need to get at your data by your primary key. People are like, well, how come I can't just, like, dude, joins? And it's like, well, if you have a trillion row table and you want to join another trillion row table, then you'd want to do random scatter-gather across all that, and it just doesn't...
B: You could see that by looking at CPU utilization, and that's when I started to ask: well, how did you choose your primary key? How did you choose your clustering key? Yeah. So, from whatever side you sit on, if you're working with Cassandra, this is an important not-just-beginner thing; it's a good reminder thing. Yeah.
A
Like,
for
instance
like
if
aging
your
example,
if
people
like
they
just
have
like
one
partition
for
like
they're
like
oh,
this
is
my
leaderboard
and
it's
like
well,
you
have
like
one
leader
board,
which
means
you're
always
going
to
go
to
the
same
server.
So
even
if
you
keep
expanding
your
cluster,
like
you're,
your
leader
board
is
pull
up,
getting
pulled
up
the
same
amounts
river
every
time
it's.
B
Right
it
was
spreading.
I
can.
I
can
give
you
a
very
specific
example,
because
we
did
something
similar.
You
know
where
that
simple
retort
analytics
company
and
one
of
the
things
we
do
is
collect
web
traffic.
B
You
know
we,
we
collect
web
traffic
data
and
we
decided
initially
that
we
were
going
to
store
all
raw
events
in
one
table
but
segmented
by
our,
and
that
was
a
really
early
mistake,
because
that
meant
that
for
one
hour,
every
single
piece
of
traffic
that
we
saw
got
written
to
the
same
node
and
then
it
would
move
and
we
were
like.
Why
is
this
happening?
Well,
we
didn't
really
think
through
that
it
should.
There
should
be
some.
It
should
be
like
a
compound
key.
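The effect described here can be simulated without a cluster. This is a hedged sketch, with made-up data and md5 standing in for Cassandra's Murmur3 partitioner: when the partition key is the hour alone, a whole hour of traffic lands on one node; a compound partition key like (hour, site) spreads it.

```python
import hashlib

NODES = 6  # hypothetical cluster size

def node_for(partition_key):
    """Map a partition key to a node. md5 is a stand-in here for
    Cassandra's Murmur3 token ring -- the distribution idea is the same."""
    digest = hashlib.md5(str(partition_key).encode()).hexdigest()
    return int(digest, 16) % NODES

# One hour of traffic across three sites (made-up events).
events = [("2016-02-26 14:00", "site-%d" % (i % 3), "event-%d" % i)
          for i in range(1000)]

# Partition key = hour only: every event in the hour hits one node.
hour_only = {node_for(hour) for hour, site, _ in events}

# Compound partition key = (hour, site): the same traffic spreads out.
compound = {node_for((hour, site)) for hour, site, _ in events}

print(len(hour_only))  # 1 -- a single hot node for the whole hour
print(len(compound))   # more than one node shares the load
```

In CQL terms this is the difference between `PRIMARY KEY (hour, event_id)` and `PRIMARY KEY ((hour, site), event_id)` — the inner parentheses are exactly the ones discussed next.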
A: That's an easy mistake to make, right? Because, in your case, you're talking about how a set of parentheses can change the key and completely change the behavior of the entire cluster. A set of parentheses can literally mean either you did it right and everything is awesome and you can use a hundred node cluster, or you did it totally wrong and you're just failing big time.
B
That's
exactly
what
happened
is
we
were
working
off
of
a
twelve
node
cluster
and
we
couldn't
figure
out
what
what
happened?
What
was
going
on
so
we
bumped
it
up
to
thirty
nodes
and
we
made
the
node
sizes
larger
and
we're
like.
We
can't
continue
to
operate
like
this,
and
we
called
in
one
of
your
fellow
datastax
folks
at
the
time
touchin
Harper,
very
smart
fellow,
and
he
just
showed
us
that
we
were
missing
parenthesis
and
this
would
be
a
lot
better
and
we
turned
our
30
node
cluster
back
down
to
six.
B
It
has
since
grown
to
70
plus
nodes,
as
the
company
has
grown,
but
just
that
one's
like
/,
a
parenthetical
mistake,
and
you
know
more
than
double
the
size
of
our
cluster,
so
understanding
those
keys
is
it's
a
beginner
thing,
but
it's
also
incredibly
important
for
growth,
because
it's
very
difficult
to
change
migrate.
Your
data
later,
you
know,
if
it's
basically
needing
to
know
beforehand
what
your
schema
should
look
like
and
trying
to
change
it
post
facto.
A: You know, this is actually kind of interesting. This is a case where doing stress testing with a real cluster makes a huge difference. People sometimes only benchmark against a single node, like, on my laptop I'm just running one Cassandra node, and that's an easy mistake to make, because you wouldn't necessarily see the effects of how your key affects the distributed environment if you're only working on your laptop.
B
So,
to
actually
tell
that,
I
believe,
on
your
side,
Jake
luciani
wrote
a
an
update
to
cast
endure
a
stress
that
has
schema
base
to
it
now,
so
you
can
actually
put
in
your
schema
spin
a
cluster
whether
that's
on
CCM,
the
Cassandra
cluster
manager
or
putting
it
somewhere
in
you
know:
AWS
whatever
it
is,
you
can
spin
up
a
cluster
with
your
schema
and
Hammer
it
and
see
what
happens
even
if
it's
just
a
three
node
cluster.
So
it's
there
are
tools
out
there
to
test
your
assumptions,
even
in
a
distributed
fashion.
Mm-Hmm.
A
And
I
love
breaking
stuff
in
a
distributed
fashion.
So
so,
let's,
let's
take
a
look
at
these
JIRA's
that
have
been
updated
because
there's
a
couple
more
that
I
definitely
want
to
look
at
here.
Allow
custom
tracing
implementation.
This
just
got
merged
into
trunk
and
I
am
ludicrously
excited
about
this,
like
maybe
overly
excited,
I,
have
first
quiet
fresh
first
glance,
you
probably
like:
why
does
he
care
so
much
about
this
I?
Think
it's
huge
yeah.
B: Yeah, I mean, I have an infrastructure of hundreds of nodes, and it works with message queues, it works with, you know, message processing that gets passed in between different applications, consumers and producers, and we had to write our own tracing system, because there wasn't really something like this at the time. The tracing system we wrote involved injecting changes into the messages and then reading them at different points, and we wrote our own receiver, which is effectively just ingesting data, turning it into JSON, and then we had to build a front end for it. Yeah, we did it because it was just easier to write.
A: Let me add a little background to this, yeah. The custom tracing implementation: the inspiration here is adding Zipkin support into Cassandra. So Zipkin comes from Twitter, it's Apache licensed, and it's an open source implementation of a paper that Google put out about their infrastructure tracing system, called Dapper. And effectively, in a distributed system, as Eric mentioned, you have hundreds of nodes, and it's really hard to figure out where performance problems are in a request. If you hit, you know, 50 nodes over the course of that request, it's really hard to find the exact point at which something is broken, and what Zipkin does is give you the ability to trace a request throughout the entire system, and now we'll be able to trace it through Cassandra. So that's going to go into Cassandra 3.4, and this is really, really cool, in my opinion. If, you know, Zipkin is integrated into your application, it's integrated into Cassandra, and it's all there: your databases and the tools that you use.
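The core Dapper/Zipkin idea can be sketched in plain Python. This is not Zipkin's actual API, just a hypothetical illustration: one trace id follows the request across every hop, each unit of work records a timed span pointing at its parent, and the spans are shipped to a collector that can reassemble the call tree.

```python
import time
import uuid

# Hypothetical sketch of Dapper/Zipkin-style trace propagation.
spans = []  # in a real system spans are reported to a collector service

def start_span(name, trace_id=None, parent_id=None):
    """Open a span; a new trace id is minted only at the edge of the system."""
    return {
        "name": name,
        "trace_id": trace_id or uuid.uuid4().hex,
        "span_id": uuid.uuid4().hex,
        "parent_id": parent_id,
        "start": time.time(),
    }

def finish_span(span):
    """Close the span, record its duration, report it."""
    span["duration"] = time.time() - span["start"]
    spans.append(span)

# One request that makes a (pretend) database call downstream.
root = start_span("http-request")
db = start_span("cassandra-query",
                trace_id=root["trace_id"],   # propagated, not re-minted
                parent_id=root["span_id"])
time.sleep(0.01)  # stand-in for real work
finish_span(db)
finish_span(root)
```

Because every span shares the root's trace id, a UI can show exactly which hop, the application or the Cassandra query, ate the time, instead of leaving you to instrument each piece by hand.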
B
The
message
starts
at
your
infrastructure
to
the
time
it
ends
takes,
say,
seven
seconds
and
just
making
up
numbers
and
if
all
of
a
sudden
it
starts
taking
10
you're
not
going
to
know
why
or
where
that
happened
in
the
near
have
to
trace
every
single
piece
of
the
application
by
hand
and
when
Zipkin
allows
you
to
do
is
basically
say:
hey!
Guess
what
I
think?
B: Yeah, as an ops guy, that is one of the rules, and if you talk to most ops guys, they will say a service or application does not exist unless it's monitored. This is a way of monitoring and instrumenting an infrastructure that is very tightly coupled to the way information flows through your infrastructure, which is a very difficult thing for most developers to conceptualize. But having a system that sort of understands it, just because it gets plugged in along the way, is incredibly valuable. Yep.
C
From
a
developer's
perspective,
you
know,
I
can
just
even
thinking
back
to
like
one
of
my
previous
jobs
like
having
that
information
available.
You
know
yeah,
it
could
be
an
ops
problem,
that's
causing
your
you
know
your
latency
or
your
message,
processing
time
to
go
from
600
milliseconds
to
49
milliseconds
or
it
could
be
a
code
change
right,
and
you
know
if
you
can
point
a
developer
to
hey
this.
It's
this
particular.
You
know
part
of
the
system
or
it's
this.
You
know
in
this
day
and
age,
everybody's
doing
micro
servers.
A: Yeah, having the metric behind you, it's a lot easier to have a conversation where you're like, hey, I have a metric, and it shows that this thing is slow, versus, hey, your microservice is slow, and then it becomes like a personal battle. People are like, what are you talking about? It's not me, obviously.
B
I
do
I'm.
The
cool
thing
is
that
it
might
and
to
your
point
Luke.
It
may
have
nothing
to
do
with
the
code
itself,
but
the
code
may
help
diagnose
a
problem
and
say
the
database
related
to
that
service
right
and
like
that,
just
having
the
you
know,
just
understanding
the
structural
integrity
of
your
system
through
something
like
tracing
is
it's
it's
mind-blowing,
Lee,
easier
to
diagnose
problems.
Yeah.
A
And
and
being
able
to
see
like
even
within
the
micro
service,
that
a
particular
query
took
a
long
time
and
it's
like
you
know,
you
can
like
look
at
a
high
level
or
you
can
like
really
drill
down
and
see
it
a
much
at
a
really
detailed
level.
So,
like
the
amount
of
information
you
can
get
onto,
this
is
incredible.
The
time
savings
huge,
the
fun
off
the
charts
would
all
right.
Let's,
let's
go.
We
got
support
for
group
by
in
the
select
statements
boom.
You
know
what
drop
in
the
hammer
Cassandra
open
source
I.
B: I mean, I think, for me, this is a game changer, but it really just depends on speed. You know, I definitely would like to see some benchmarks when it's done. Not like I'm trying to be a hater or anything, I just think it's important to understand what potential trade-offs you're making when you get something powerful like GROUP BY. You know, wide rows are one thing, but what do you do beyond that, you know?
A: Well, so here's the thing: I'm not a hundred percent sure, honestly, but I think you're limited to single partitions, so you're not going to be like, give me the whole database and let's just group by whatever. You'd say: okay, I can either take back, you know, 10,000 rows out of this partition and aggregate them in my application, or I can let the database do it and not transfer everything over the wire.
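The trade-off described here can be illustrated with plain Python and made-up data, no driver or cluster involved: the client-side version hauls every row over the wire just to collapse them into a handful of totals that a server-side GROUP BY could have returned directly.

```python
from collections import defaultdict

# Pretend these are the 10,000 rows of one partition, fetched over
# the wire: (clustering_key, value) pairs for five sensors.
rows = [("sensor-%d" % (i % 5), i) for i in range(10000)]

# Client-side aggregation: all 10,000 rows crossed the network
# just so we could sum them in the application.
totals = defaultdict(int)
for sensor, value in rows:
    totals[sensor] += value

# With server-side GROUP BY on the partition, the coordinator would
# return only the aggregated rows instead.
print(len(rows))    # 10000 -- values transferred without GROUP BY
print(len(totals))  # 5 -- rows that would have sufficed
```

Same answer either way; the difference is 10,000 rows versus 5 on the network, which is exactly why it matters whether the aggregation runs next to the data.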
A: That'll be a good one to watch, and that's CASSANDRA-10707. It's linked in the blog post, so definitely read the blog post, follow the link, don't listen to me. Wow. And the return of diagnostic tools: sstabledump replacing sstable2json, which I know your coworker is a huge fan of, right? Yeah.
B
Ruff
ruff
Bradbury
is
a
big
fan
of
this
SS
table
to
JSON,
see
one
of
the
things
that
I
think
it's
important
to
understand
when
you
build
systems
that
are
larger
than
just
one
or
two
machines
is
having
a
standard
communication
media.
So
whether
that's
tanner
communication
medium
is,
you
know,
XML,
which
I
guess
would
be
great
if
you
don't
like
yourself,
very
much
or
or
JSON.
If
you
move
to.
B
Want
you
to
process
this
data
in
whichever
form
it's
easiest
for
you
and,
let's
just
double
check
what
we
have,
let's,
whatever
create
a
materialized
view
for
ourselves,
let's
whatever
it
is,
you're
working
in
it
with
a
standard
set
of
tools
and
SS
table
to
JSON
is
a
thing
that
actually
allows
that
to
happen,
whereas
the
other
sorry
SS
table
the
original
version.
When
you
had
JSON
in
Cassandra
and
then
tried
to
dump
it,
it
made
it
very
difficult
to
process
I.
Think.
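As a toy example of why a JSON dump is the convenient "standard medium" here, this sketch filters rows out of sstable2json-style output with nothing but the standard library. The JSON shape below is made up for illustration; the real tools' output format differs between Cassandra versions:

```python
import json

# Made-up dump in the spirit of sstable2json / sstabledump output;
# the actual tools' JSON layout varies by Cassandra version.
dump = json.loads("""
[
  {"partition_key": "user-1",
   "rows": [{"name": "email", "value": "a@example.com"},
            {"name": "city",  "value": "NYC"}]},
  {"partition_key": "user-2",
   "rows": [{"name": "email", "value": "b@example.com"}]}
]
""")

# Once it's JSON, standard tooling applies: filter, reshape, build a
# little "materialized view" of just the email column per partition.
emails = {p["partition_key"]: r["value"]
          for p in dump
          for r in p["rows"] if r["name"] == "email"}
```

`emails` ends up mapping each partition key to its email value — the same double-check-what-we-have workflow described above, done with `json` and a comprehension rather than anything Cassandra-specific.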
B: Yes, no, sorry, we're not analyzing the raw SSTable. We're rewriting some of the older style data, all right, as we need it, into a form that's much more conducive to Spark queries. OK, gotcha. So it's very custom, and otherwise I would say we'd happily, you know, get it out there, but it wouldn't work. Yeah.
A: Feel free to write your own. I will. So it looks like we've got our JIRAs out of the way. Hey, Eric, are you guys hiring, by any chance? Maybe you're hiring?
B
All
you
should
mention
that,
actually,
we
are
hiring
here
at
simple
reach,
so
I've
already
told
you
guys
a
little
bit
about
the
size
of
our
infrastructure,
but
what
we're
looking
for
is
just
a
back
end
engineer,
somebody
who
liked
writing
Python
who
likes
right
and
go
that's
Python
to
know.
We
talked
about
Python
to
deuces
and
you
know
we're
we're
looking
for
somebody.
We
have
a
distributed
team.
B
So,
wherever
you
are,
if
you
you
like,
learn
working
with
large
amounts
of
data
and
amber
light
like
working
with
me
and
Russ,
the
other
author
of
the
book,
practical
Cassandra
then
reach
out
the
the
link
is
in
the
blog
post,
mmhm
lyst,
plug-in.
A
Alright
cool
we
there's
more
stuff
in
the
the
blog
post,
I
think
we're
going
to
wrap
it
up.
For
today,
though,
we've
got
some
CFPs
that
are
open
for
analytics
and
python
conferences
to
meet
up
information.
That's
in
here
so
definitely
check
that
out
and
go
to
your
meetups.
It
will
make
Lena
happy
if
you
don't
know
Lena.
She
is
our
community
wrangler
cat
Wrangler.
If
you
will
she's
very
good
at
getting
this
stuff
making
this
stuff
happen
so
huge
thumbs
up,
let's
see
how
anything
else,
you
guys
have
anything
nada.