Description
Speakers: Patrick McFadin, Chief Evangelist at DataStax & Al Tobey, Open Source Mechanic at DataStax
It's time to play "Stump the Experts", with Al Tobey, Open Source Mechanic at DataStax, and Patrick McFadin, Chief Evangelist at DataStax. Bring your urgent Cassandra questions to this session and have our expert panel answer them for you.
A: So it's only a one-day event too, which is really cool. I'm Patrick McFadin. I'm the chief evangelist, but I'm also a solution architect with DataStax, so I get to run into all kinds of crazy stuff.
B: I'm Al Tobey. I also evangelize for DataStax, and I've run, and broken, Cassandra in production.
A: So we had our first question in the hallway, in transit. Wow, we packed the room right off; hold on, thanks, Jeff. The first question was: how do I scan all the rows in my column family?
A: You can do it... aha, I didn't say it was pretty. No, it's not pretty, and it's not performant, but you can do it. There is actually a recipe in the Astyanax driver called "get all rows", and the idea is that it's not going to be in order; that's probably the thing I have to qualify. This is not going to be an in-order operation.
A: If you're using the random partitioner, it's random; it doesn't actually generate all the tokens up front. What happens is that whenever you create a row key, it gets hashed into a token value, and those token values are assigned to nodes, because each node owns a token range. So what you do is just iterate over that range.
A: So if you have 16 threads doing it, you can break the full token range into 16 sub-ranges and scan them in parallel, one range per thread.
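The recipe amounts to splitting the partitioner's token space into sub-ranges and scanning each one. Below is a minimal sketch of that idea using the DataStax Python driver rather than Astyanax; the keyspace, table, and column names are made up, and it assumes the Murmur3 partitioner's token range (RandomPartitioner spans 0 to 2^127 - 1 instead).

```python
# Hypothetical parallel full-table scan by token range (a sketch, not the
# speakers' exact recipe). Assumes keyspace "my_keyspace", table "events"
# with partition key "id", and the Murmur3 partitioner.
from concurrent.futures import ThreadPoolExecutor
from cassandra.cluster import Cluster

NUM_RANGES = 16
MIN_TOKEN, MAX_TOKEN = -2**63, 2**63 - 1  # Murmur3 token space

cluster = Cluster(['127.0.0.1'])
session = cluster.connect('my_keyspace')
scan = session.prepare(
    "SELECT id, payload FROM events WHERE token(id) > ? AND token(id) <= ?")

def scan_range(bounds):
    lo, hi = bounds
    # Rows come back in token order within the range, not in key order.
    return list(session.execute(scan, (lo, hi)))

step = (MAX_TOKEN - MIN_TOKEN) // NUM_RANGES
ranges = [(MIN_TOKEN + i * step,
           MAX_TOKEN if i == NUM_RANGES - 1 else MIN_TOKEN + (i + 1) * step)
          for i in range(NUM_RANGES)]

with ThreadPoolExecutor(max_workers=NUM_RANGES) as pool:
    for rows in pool.map(scan_range, ranges):
        for row in rows:
            pass  # process each row here
```

Each sub-range comes back in token order, which is why the scan as a whole is not an in-order operation.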
E: Lists are not efficient for many use cases because they are limited. And as for the docs, regarding whether lists are safe: when we have to delete from a list, what do I do? I read all the values, then find the index, then delete. And from the documentation, it says that this operation...
A: You and I talked about this before; you were going to send me the link outlining how it worked. I'd like to try it out, because if that's really a bug, I could fix it, right?
B: The values... I mean the index inside your collection, yeah. When you write the tombstone, then you write a new value, then you delete, and then you write a new one... if that happens really, really fast, you could get the same timestamp value ordering two writes at the same time.
A: But okay, the value is the value; it's the index inside the collection that isn't necessarily the exact original one.
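For reference, the read-then-delete-by-index pattern being discussed looks roughly like the sketch below, using the DataStax Python driver; the table and column names (posts, tags, post_id) are hypothetical. The gap between the read and the delete is exactly where a concurrent writer can shift the index.

```python
# Hypothetical list delete-by-index (a sketch of the pattern under discussion).
from cassandra.cluster import Cluster

session = Cluster(['127.0.0.1']).connect('my_keyspace')

# 1. Read the whole list to locate the element we want to remove.
row = session.execute("SELECT tags FROM posts WHERE post_id = 42").one()
idx = row.tags.index('obsolete')

# 2. Delete by index. Cassandra itself does another read-before-write here,
#    and if the list changed since step 1, this index may now point at a
#    different element than the one we looked up.
session.execute(f"DELETE tags[{idx}] FROM posts WHERE post_id = 42")
```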
G: We've got a nice little production cluster of 12 nodes, and what we're seeing is that sometimes, out of the blue, nodes start CPU spiking left and right. It looks like routing issues, but AWS hasn't quite been transparent about it. The problem is that one slow node really manages to bring the whole cluster down; as soon as we disconnect that node, the cluster behaves perfectly, but that slow node really brings the whole cluster to a crawl. How on earth do you debug that?
A: Okay, so a couple of questions: what client are you using?
A: Okay, I've seen that before, when all of your connections seem to be going through one server without you realizing it. That's one thing to look for, because you really have no awareness of what node you're connecting to unless you ask, so you can get a hot spot where everything goes through one coordinator. If that's the case, that would be the first thing to look for, because I've seen it where one node gets all the connections, and also where one node goes through GC and it looks like it's the whole cluster.
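As an illustration of one fix (not the client the questioner was using), here is a small sketch with the DataStax Python driver of a token- and datacenter-aware load-balancing policy, so requests spread across replicas instead of all funneling through one coordinator. The contact points, data-center name, and keyspace are placeholders.

```python
# Sketch: spread requests across the ring instead of hammering one node.
from cassandra.cluster import Cluster
from cassandra.policies import DCAwareRoundRobinPolicy, TokenAwarePolicy

cluster = Cluster(
    contact_points=['10.0.0.1', '10.0.0.2', '10.0.0.3'],
    # Pick a replica that actually owns the data, round-robining within
    # the local data center when more than one replica is available.
    load_balancing_policy=TokenAwarePolicy(
        DCAwareRoundRobinPolicy(local_dc='DC1')),
)
session = cluster.connect('my_keyspace')
```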
G: We haven't completely checked that out yet.
B: I've seen that in EC2 before, where what basically happens is nodes will be working fine one day, then the next day you get a couple of noisy neighbors, and all of a sudden your disks go from performing just fine to... they just kind of drop off and never come back again, and you have to replace the node.
A: You're always bringing up problems, all right. Well, those kinds of problems... computing is a hard problem, and these are hard problems to troubleshoot. But I would look at: is it really the entire cluster going bad? Is it the OS? Is it disk? Is it at that level, or is it increased latency from something else? And what size instance are you on?
B: Yeah, what I noticed is that we started out on the c1.larges and those were terrible. They would drop like flies; you'd lose one a week, or at least one every other week. The m1.xlarge is quite a bit better, and if you go up to the m2s, or just a little bit bigger, what happens is you get into a different class of machine at Amazon. You get into the newer, shinier part of their data center, and those machines don't have as much variability in them as the older instances do.
B: Amazon can't exactly evict all their customers off of the old gear to fix it. So that's why I would recommend, whenever possible, going with the bigger and newer instance types: not because they're shinier, but because you're going to get the better infrastructure, and then your database is going to be a lot more consistent.
A: I think it means that the guaranteed IOPS, the provisioned IOPS, are not that good either.
B: Well, IOPS are almost irrelevant; it's transfer speed that matters. How long a sequential operation can you run, and what's your max megabytes per second? Yeah, you can get some really good numbers with EBS, but they're not consistent, and they just don't keep up as well as what you can do with a stripe set on ephemeral disks.
A: ...worst-case scenario. So, just to level-set: hinted handoff, which this is all about, is a very normal part of Cassandra ring operation. When you have multiple nodes running, you have coordinators: whenever the client connects to a node, that node may not be the home for the data, and the data needs to go to other replicas, so that node will act as a coordinator.
A: If one of the replicas is down at that time, let's say you just rebooted it, or you restarted the server to do maintenance or something, which is totally fine, then the coordinator's job is to store what's called a hint. It just says: okay, I'll hold on to the data until you're back, and then, when the node comes back online, the hints get replayed.
A: That gives you consistency; that's the way it's supposed to work. The bad thing that can happen is, if your cluster... let's say you were using those c1.larges, which is like the worst idea ever, and you start hammering your cluster with a lot of extra load, then bad things start happening everywhere. Hints will start piling up in different places because nodes are blinking on and off, and you really have a bad situation on your hands.
B: ...condition, and you need to be able to remove those hints. I've had to do it before, where you go in and remove them. It's very rare, and you shouldn't do it unless you really have to recover that node rather than just rebuild it with a repair. But that's where hints are stored; that's how the storage engine does it, it actually stores them in a system table.
J: Does the hint contain the actual data, or does it just kind of say, hey, you need to talk to your replica?
A: You have to be partition-tolerant. What if you can't get to some nodes, but you can get to this one? Still, replaying hints is not something you want to count on all the time, because they're not going to last forever, especially if you're running at consistency level ONE. If you're doing really high-volume inserts, you want to be at CL ONE; that's where the best performance is.
B: ...and that's where hinted handoff really starts to pay off, because even if one of the replicas fails, the hints will get queued up and then replayed onto that node when it comes back online, so you're not constantly repairing your cluster because of single-node failures. You should be able to remove any node in your cluster at any time.
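As a hedged sketch of that write path, here is what issuing inserts at consistency level ONE looks like with the DataStax Python driver; the keyspace, table, and values are invented for illustration.

```python
# Sketch: high-volume writes at CL ONE, letting hinted handoff cover a
# replica that is briefly down. Table and values are hypothetical.
from datetime import datetime

from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

session = Cluster(['127.0.0.1']).connect('metrics')

insert = SimpleStatement(
    "INSERT INTO events (sensor_id, ts, value) VALUES (%s, %s, %s)",
    consistency_level=ConsistencyLevel.ONE)  # fastest acknowledgement

session.execute(insert, ('sensor-17', datetime.utcnow(), 42.0))
```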
A: Go ahead, question.
F: So, if you have a row you have deleted, is it possible to reinsert it with an old timestamp?
B: ...if you want to do, say... if you've invented your own version of vector clocks, or some of these other fancy distributed-systems things like CRDTs, you can actually put your own value inside the timestamp and use it, and then you get all the same conflict resolution, but using a timestamp that means something to your application. It's a big pattern if you're going to go there.
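A minimal sketch of what that looks like in practice, using CQL's USING TIMESTAMP through the DataStax Python driver; the table and the ordering value are hypothetical stand-ins for whatever vector-clock-like scheme the application maintains.

```python
# Sketch: supplying an application-chosen write timestamp.
from cassandra.cluster import Cluster

session = Cluster(['127.0.0.1']).connect('my_keyspace')

app_version = 1378944000000123  # your own ordering value (microsecond scale)

session.execute(
    "INSERT INTO documents (doc_id, body) VALUES (%s, %s) USING TIMESTAMP %s",
    ('doc-1', 'new contents', app_version))
```

Conflict resolution stays last-write-wins, but "last" is now judged against the values you supply, which is also what decides whether a re-insert beats an earlier delete's tombstone.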
L: What about heap space? I've experienced a lot of trouble with heap space.
B: It doesn't set the heap size inside; it's just the shell script, the actual command you run, and then it loads the Java program, right? Yes: when we start Cassandra, we set the heap size, typically to eight gigabytes, but most of those tools don't actually set the heap size, so they get the default JVM heap size, which I think is one gigabyte at the top. So sometimes you need to edit those scripts, find the java command, and add a -Xmx with a bigger max.
L: It's the same heap space problem: one node stopped working and the server went down, and when I restarted it, it was not able to start because of heap space and a lot of cache, because the cache was saved in the file system.
B: Okay, so you're going to look at a file in /etc/cassandra called cassandra-env.sh. There are two variables in there; it's documented in the file, there are comments. It's just a blob of shell, and you'll see there's a max heap size you can set, and you want to set that. I would say probably two gigs on a four-gig machine, but maybe three if you really need to.
N: What's the best approach if we need to do a significant amount of deletes from a table? We've realized that if you do this just by deleting objects, it significantly drops performance after even just a couple of thousand deletes.
B: Are you deleting across...?
N: The only thing that helped was dropping the garbage collection grace period. That helps, but of course it has the drawback that when a maintenance restart or something happens, these old deleted rows, the zombies, start appearing again.
A: In the database, delete operations are just writes, and my guess is that you are not keeping up with compaction when you're doing your deletes. When you write your regular data, you're probably keeping it at a certain pace, but whenever you're doing your deletes, you're turning it all the way up and running them as fast as possible.
A: When you do a delete, it's going to fill up the memtable, which needs to get flushed. I would almost bet you that if you look at your system while you're running a full-speed delete like that, your compactions are backing up and your heap will start filling up because the flush writers are backing up. Look at tpstats.
A: tpstats will show you very clearly that the flush writers have blocked threads, okay? I would almost guarantee that's what you're seeing: your disks are not keeping up with that load. I've seen people turn deletes on at full blast, and that's basically like a Cassandra stress test, because it's not your normal application mode, right? You're just pushing that data in there as fast as you can.
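One way to act on that is to pace the delete job rather than running it flat out. The sketch below is a rough, hypothetical example with the DataStax Python driver; the table, key, and rate are invented knobs, not recommendations.

```python
# Sketch: rate-limited bulk deletes so flushes and compaction can keep up.
import time

from cassandra.cluster import Cluster

session = Cluster(['127.0.0.1']).connect('my_keyspace')
delete = session.prepare("DELETE FROM events WHERE id = ?")

MAX_DELETES_PER_SEC = 500  # made-up pace; tune while watching nodetool tpstats

def throttled_delete(ids):
    for n, row_id in enumerate(ids, start=1):
        session.execute(delete, (row_id,))
        if n % MAX_DELETES_PER_SEC == 0:
            time.sleep(1.0)  # crude throttle
```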
B: A full-blast delete job is not going to try to be nice to anything else in the system, or even to anything else inside Cassandra. If you throttle it, you'll have a much better chance of keeping up with the compactions. And if you're on LCS, just make sure you're careful about your SSTable size, because you're going to be doing a lot of compaction.
B: The SSTable size in megabytes is in your schema; you set it there. I'll talk about it in my talk too. I usually recommend bigger, like 256 megabytes or 128 megabytes, but you might even want to go smaller depending on your total data size, and go with 32 or 64 if you don't have a lot of data per node; then you'll get a lot more efficient compaction.
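Concretely, that knob is the sstable_size_in_mb option in the table's compaction settings. A minimal sketch, with a hypothetical keyspace and table and a size chosen only as an example:

```python
# Sketch: setting the LCS target SSTable size in the schema.
from cassandra.cluster import Cluster

session = Cluster(['127.0.0.1']).connect()

session.execute("""
    ALTER TABLE my_keyspace.events
    WITH compaction = {'class': 'LeveledCompactionStrategy',
                       'sstable_size_in_mb': 128}
""")
```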
A: When I hear the words, and this could almost be a joke, "I'm doing a lot of writes and everything starts slowing down": compaction. Nine times out of ten, that's the issue. Let's see, so we had that one... you remember the order: you, and then in the back. Right, back first; go ahead, red checkered shirt.
O: When you run a repair, it creates a Merkle tree for each column family, and then the Merkle trees get copied to the node that started it, which compares them. I would like to know what percentage of the column family is represented by one leaf, and if one leaf doesn't match, how much data will be transferred and re-inserted by the nodes, because I feel sometimes it's a really big amount; a lot more than what I would guess is broken gets re-transferred.
A: Yes, the beginning of the repair operation builds the Merkle tree for comparison. So it's not transferring data yet, because it's building, and then once the Merkle tree is filled, it says: here, now, this is what I need. The Merkle tree creation is the heavy CPU part of that.
O: So it asks that node to go off for a while, calculate the Merkle tree, and then transfer the whole thing over, and then the node doing the repair does the comparison in memory?
A: Yeah, and after that it knows all the elements that are out of sync, and then it will start asking for the stuff it needs.
O: Yeah, but it just knows the range, covering multiple rows.
A: I don't know exactly how much it's grabbing. I do know that the network traffic is rarely in the Merkle tree; it's almost always in the streaming.
A: The heavier operation, and this is actually a big deal with streaming 2.0, which is in Cassandra 2.0, is how the streams get managed by the system. The other part of repair that people don't really get is that when you do a repair, what gets streamed in is immediately flushed to an SSTable, which will create a lot of compaction at that moment, a lot of the time.
A: You can change the type of repair you do, like -pr for a partition-range repair, or you can change the streaming: there's a stream throughput setting, and changing that to a really low value means it just trickles the data over very slowly on the stream.
M: For multi-data-center setups I've been using SSL; you can use SSL to connect them up, and the documentation on using SSL points to a keystore, along with a certificate and so on, for every single node, which seems like quite a bit of pain. If you want to add one node, you have to go around every single box and add that in there. Are there any recommended approaches for doing SSL with multiple data centers that make it easier?
B: I've done some experiments using tinc instead, which is a little VPN daemon for Linux. The documentation is terrible; I've been working on a blog post for almost six months just trying to figure out how to do it, so that I can give it to you guys and you can run with it. So that's one option if you really need encryption: do something like that, tinc or IPsec, if you're into that stuff. That's just me.
B: So there's that, or finding some kind of certificate management system that might help you with it. Something like certmaster, an open source Python project that generates certificates and passes them around for you. I'm not sure if it fits this application exactly; it's on my to-do list to try it, actually. So yeah, that is a pain point, but it's a pain point in every part of crypto.
B: ...as VPN devices do. Using Cassandra's built-in crypto for node-to-node traffic works; it's there and it's a feature, but you're going to get way better performance out of a hardware VPN device; you shouldn't have any real latency on hardware. And even some of the local Linux VPN software, OpenVPN or tinc, will use the OpenSSL stuff, and if you're on a Westmere or later, or a Haswell, you'll have the AES-NI instructions on the chip, so it'll actually be accelerated in hardware.
A: All right, well, compact storage, because composites are... I'm sorry: CQL is not something that's understood by Solr yet; that's a different type of structure. Compact storage creates a CQL-less column family, and the secondary indexes are how Solr finds data. The data that's being managed by the schema in Solr has to be understood from a column-family perspective, meaning it needs to know where all the columns are and the values in them, so those secondary indexes are created by the schema creation on the Solr side.
A: So it's just a different way of managing data on top of Cassandra. Solr should be super fast, but it has to index the data first, and that's how it gets at the data.
A: Yeah, you should really consider that column family as a Solr column family, mostly. You can use the data in there normally, with CQL or without, with Thrift, even some CQL stuff, but mostly Thrift. The idea, though, is that if you're going to be doing CQL-like operations, you should really denormalize into a different column family.
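For context, a COMPACT STORAGE definition is essentially the classic Thrift-style wide row. The sketch below is only a hedged illustration with invented names; consult the DSE Search/Solr documentation for the actual integration requirements.

```python
# Sketch: a Thrift-style wide-row column family declared via CQL.
from cassandra.cluster import Cluster

session = Cluster(['127.0.0.1']).connect('my_keyspace')

session.execute("""
    CREATE TABLE IF NOT EXISTS wide_rows (
        row_key  text,
        col_name text,
        value    blob,
        PRIMARY KEY (row_key, col_name)
    ) WITH COMPACT STORAGE
""")
```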
D: Small clusters, with virtual machines and storage area networks, or is that a completely separate world?
D: Right, the thing is, it's a shared array, and we're wondering if that would kill performance on the small cluster we are building.
B: So maybe do a RAID 10 set, do another RAID 10 set, do another RAID 10 set, even inside the same array, but assign each one to a different node. You can zone that all off in the switches and all that happy stuff, and then you're pretty close to having locally attached disk. And if it's just a science project, you know, or a dev cluster, then I guess go for it, but it's definitely not something I'd recommend.
A: As a solution consultant, I get onto phone calls with people who are in a really bad place; they've gone down this path and then maybe got to a point where they're having more than a little bit of trouble. We had a customer, and I can't name names, but we had a customer call us on a Sunday night and say: we just went live and there's a fire in our data center, help us now. We were there the next morning. The problem was they were using this really shitty shared storage system.
F: It's about repairs and operations and maintenance in our cluster. We have a cron job that runs a repair, and it takes a lot of time; it's quite heavy. It's really two questions: first, is there a way to orchestrate this, a better way to manage it? And the other is, when should we do it? Because I'm not sure whether one week is enough or it should be shorter.
B: ...days. We screwed that up, so they're at a year now, but we would do the repairs every month. Because, you know, when you have a 36-node or larger cluster and you're rolling through it doing one node a day, it took about 24 hours to do a node, so it would take about 36 days. So we just basically kept it going all the time, around and around the ring. In terms of orchestration, I've heard that certain products might be adding that.
F: But if it fails, will it retry or something?
B: Well, there's probably an upper limit. It used to be that 200 gigs was the recommended upper limit per node, and back in that day I was blowing way past it on all my clusters; my main, or oldest, cluster is at about 500 gigs per node right now. I've put five terabytes on a node; yeah, if you're on SSDs, five terabytes isn't a big deal. It's just, you know, if you need to scan over all your data, realize that your spindles are going to be limited. It's all about I/O, right.
A: The limit is when you can no longer keep up with what you have. If you have a lot of cool data, maybe your latency requirements are different; there are so many different variables. You can store a lot of data on there and it's not going to fail. It's just: what is your SLA? If you're looking for a 10-millisecond SLA on all your reads and you put 10 terabytes on a node, you're not going to get that. It's just not going to happen.
B: If you're sitting on a really hot SSD, you know, a high-end Intel or something like that, 32 gigs will probably do you fine, as long as the latency of...
A: Some of the new compaction strategies that are in 2.0.2, which is coming out next week, well, actually some of them have been there since 1.2, look at time-series data and never recompact it. Those will make it so you can really put a lot of data on a node, because compaction is really the name of the game. If it's sparse access, then...
C: Actually, yeah, that's actually an idea: throw a lot of storage at it and sort of build a cluster design with a huge amount of storage. The other thing I really...
A: ...the driver, which is alive and well. Punit is the guy who's maintaining it now; Ron is no longer doing it. Punit and I are working well together, and we're trying to blend the two things together. Netflix doesn't want to be in the driver business forever, but Astyanax is a very robust driver for what it does.
A: Oh, go ahead. An evil question? Oh man, we had to have one.
G: Yeah. There's a very well-hidden line of fine print that's relevant mostly to clusters where you just have one huge column family with size-tiered compaction: if you manage to run over 50% disk usage and you can't compact the largest SSTables anymore, then you're basically totally in trouble. We basically ran into that for the second time.
G: Is there any way to save the cluster? Because you can't even compact the stuff away; even if you delete the rows, it won't compact.
A: What I would do with that data is take those largest SSTables and move them. You know, expand the size of your cluster, like adding new nodes, take the existing SSTable to a different node and run the bulk loader, and it will restream that data into a new data structure across your cluster. That's kind of a bad way; I mean, it's a long way around, but it's doable. You could do stuff, I mean, if you're...
B: ...really strapped, like you can't get more hardware or something like that, you could do things. I suppose you could replace a node, start rebuilding it with repair, and then switch to LCS, so you get the cluster completely healthy except for the fact that you can't compact. What do we call it, shooting a node in the head?
B: Rebuild it, but put it on LCS and then repair, so you basically rebuild onto LCS, and then you don't have that problem anymore. You could do it without bringing your cluster down. That would be totally evil; I don't think that's supported, but it would probably work. I've done weird things in consulting, especially when you have a customer saying: no, you don't understand, our business is running on this, we...
A: ...can't bring it down, help us; make a deal with the devil and help us get out. So, I mean, that is doable, because it really comes down to the fact that the SSTables are your data. You have a lot of options because they're available, and that is all your data, and you can take them. There's a ghetto trick where you can take all those SSTables, put them on a different server, and reboot it, and it'll just read them, like: oh, I own all these files. So there are some options.