From YouTube: C* Summit 2013: (Re)-Building the Social Grid for Global Telcos @ 1/10th the Market Cost
Description
Speaker: Darshan Rawal, VP of Engineering at Openwave Messaging
Slides: http://www.slideshare.net/planetcassandra/1-darshan
Darshan Rawal leads the development of hybrid cloud-based messaging products for global Tier 1 telcos. Darshan has been working in Silicon Valley since 2000, building nimble, cost-effective products and services handling millions of users and billions of transactions per day. Prior to Openwave Messaging, Darshan held engineering positions at SS8 Networks, Yahoo, DE Shaw, and yp.com, and holds an M.S. in Software Engineering from Carnegie Mellon University.
So this is a rough outline of what I plan to talk about. The introduction, who we are, what we do, and what we build as a product, I think will be pretty clear by the end of this.
We are one of the few players using Cassandra in a fairly unique way, because pretty much everybody else is trying to run a hosted service, as opposed to us, who ship Cassandra as part of shrink-wrapped products.
I'll also talk about a single box, a bare-metal box, and how effective it has become over a year, along with some insights and learnings we picked up along the way, and finally a conclusion. Okay, so first, who we are: we are Openwave Messaging.
We have global Tier 1 customers around the world. If you think of the telcos around the world, there are probably 25 big world telcos, the big guys; there are probably four or five in North America, and on every continent we probably have four or five of them.
Our average deployment is roughly 6 million subscribers per platform, and the largest is actually 28 million, so these are reasonably big sizes, however you want to measure it. The use cases are the boring ones: email, SMS, MMS, voicemail and whatnot. On the face of it that sounds odd, because somebody said on TechCrunch that 2012 was the year of peak SMS, and SMS is getting slaughtered by iMessage and WhatsApp and the rest. The reality is that if you go around the globe, the messaging space is very fragmented, so let me shed a little light on that. Most of these operators have been our customers for eight-plus years, through a very interesting transition, and a lot of them are actually upgrading to Cassandra. Some of them are already live on Cassandra.
This is what we build; this is the product in a nutshell. The key is that we have been using, or I should say abusing, databases for a long while: pretty much every credible database you can think of, we have tried to hammer over the last ten years. In a nutshell, everybody who has a data plan with AT&T is, every 500 milliseconds, making a dip into our Berkeley DB instance.
It has been up with five nines for the last five years, through multiple upgrades. It's a simple directory product, but that is a pretty impressive achievement: you never see your data plan stop working, for example. The number of applications has grown, and at the heart of what we build is essentially the message store. We came up with this geo-redundant message store, spanning geographical redundancy mainly to answer some of the issues that came up with the Japan tsunami. People in Japan, especially, are very paranoid about failures.
They literally do not want to fail, even when a data center goes down, and the market started asking us to build a geo-redundant solution at a significantly cheaper price. There are obvious solutions out there in the market, with clustering and all that, which we didn't want to go with, so that's when we decided to go with Cassandra. The upper part of the diagram is what we call the protocol stacks.
The typical ones: IMAP, POP (POP is dying and whatnot), but pretty much your iPhone works with IMAP today, or ActiveSync, or whatever your protocol of choice is. That traffic is exploding through the roof, notification and queuing are going through the roof, and these systems, once you start looking at scale, especially for the footprint we work with, are pretty abusive, meaning we're talking 50,000 transactions a second on a few hundred boxes, not thousands of boxes like the internet companies. So we really build very dense applications.
One thing to highlight is that we are not so much a UI company, if you will; we work with partners to build the typical email application and UI, but we don't build our own. We have them interface on top of our message store under the hood. At the very left of the spectrum you have the typical key-value-pair kinds of stores.
The traditional things telcos have been using for quite some time, like LDAP and such, evolving all the way to the other side, where you have really unstructured data like blobs: media content, whatever it may be. Cassandra sits right there in the middle, and what it allows us to build is application-level geo redundancy, meaning any component above the message store doesn't care about geo redundancy; in fact, it doesn't even know about it. And this is an amazing spectrum.
There are customers who take the solution and deploy it within one data center, simulating WAN deployments for geo redundancy. There are people who take it and build it in a service-continuity model, meaning don't architect for full geo redundancy, it's too expensive, but grow into it as you go along. So people are trying and experimenting with this in very interesting ways.
This is an interesting sort of graph, even if it doesn't look very interesting on the slide. Either way, we actually started in late 2009 trying to use Cassandra. If you go to the GitHub repository, the filenames from back then have the word "incubating" in them; that's how early we started using Cassandra.
There were probably 20 employees at what is now DataStax at that point in time. We spent almost a year before customers actually started saying, "let's try this thing out; is it working, is it not working," stuff like that. And this is very unique to us: I must say we are early adopters of Cassandra, but our customers are not. We can't just try something out and go live. We basically have to go through a huge lifecycle of certification with the telcos, and they don't like to go live very quickly.
They will abuse it very heavily, and only then will they go live. So we started seeing some ramp-up in Q4 of 2011 with trials, literally in labs, and these trials are real trials. Some of the trial infrastructure is millions of dollars; just the lab is millions of dollars. They actually put in the high-priced SSDs, the real hardware that goes live.
They actually put the whole scale into their labs, and then late Q4 2012 is when customers really started to go live, and there's a huge hockey-stick growth there. Everything after Q2 2013 is the projected trials and go-lives. As you can see, there's roughly a two-quarter lag between trials and going live.
But if you look at it as a percentage of the Openwave subscriber base, and I can tell you this is about 100 million subscribers across all these telcos, we are expecting that a year from now we will have sixty percent of subscribers powered by Cassandra. That's a pretty powerful statement for the justification of Cassandra as a back-end technology. But there's one more interesting element I wanted to show, which people don't see, which is kind of the inside story.
Although we started that first year, the project was scrapped, for multiple reasons, but part of it was that we didn't really understand Cassandra well. We didn't truly understand what eventual consistency was; either we got too sold on the eventual consistency model, or we didn't understand it, or we used some of the ugly constructs of Cassandra which were not ideal.
So there was some shaking off that happened, and Q2 is when we really had our first GA release that we launched into the market. Obviously it takes a lot of time for telcos to say, "oh wow, these guys have something new." They won't change just because something is new; you really have to prove its mettle to them. Roughly this year we changed ownership of the company: we became a private, messaging-focused company, no longer part of a big group, with separate investment.
I especially want to talk about the customer emergency that happened two quarters ago, which kind of cemented the notion of Cassandra for us and why it is so useful. And finally, you'll hear about this very soon, but one of the largest telcos in the world, whose name I'm unfortunately not allowed to say because they don't allow us to do that, I can only say it's in the news right now around some merger in North America.
These guys drive us nuts, by the way; I mean they really drive you nuts. They ask you about 10 milliseconds going here and 20 milliseconds going there. They are very smart technical folks. They actually tell our sales guys and professional services guys, "we need to talk to the engineers; we don't want to talk to you before we sign the check." That's pretty straightforward, right? So they are very hardcore, hard-nosed, and for them to actually go live is a pretty big deal.
So I wanted to talk about this interesting thing that happened last year, while this telco was preparing to go live. As I said, last year there were lots of people trying it out in their labs for about two quarters. There is this one customer in North America, and again I'm not allowed to use their name, but they are a customer of ours. They were using a previous-generation product which uses a traditional EMC array of disks and an Oracle database.
They use all of that stuff; it's been humming along for seven years. They have no incentive to upgrade, nothing. They wanted to try out our new product, so they had a small system running in the lab and we were trying it out; not so big, about a million users give or take. And in December there was a file system corruption, and as luck would have it, the backups themselves were corrupted, so there was no way to recover some of these things. So what did we do?
The brilliant idea that came along was: why don't we take the lab system and put it live? Okay, let's put it live, except these machines had not been touched for two years; their firmware had not been upgraded for two years. Really bad stuff. Nobody had paid attention to this lab that was running. I'll spare you the details of the 3 a.m. calls and whatnot, but we went through the whole thing.
We took this Cassandra system and we put it live, meaning users who didn't have mail access started getting mail, and then we had to migrate the old mail into the system behind the scenes, while the system was live the whole time. And the system was, I want to point out, a rather undersized system.
We realized, first of all, that there was not enough RAM in the system; this was something like a seven-to-ten-node Cassandra ring we were playing with. And we realized that some firmware had to be updated. Okay: bring a machine down, update the firmware, bring it back up, fine. So we did the firmware upgrade. Then we needed to change the memory. Oh, the memory is running low; okay, fine, we need to add memory, stuff like that.
What I found most interesting was that it took us roughly 20 minutes to bring a node down, put the RAM in, and bring the node back up. And nothing happened; I mean the system just kept on running, literally not a single error. Now, you can hire a lot of, how should I say, technical consultants from the big guys like the EMCs and Symantecs and whatnot, but this was pretty dumb: all we really had to do was follow a bunch of steps.
We could give it to someone who had never worked with Cassandra before. We had to literally dumb it down into a script or an FAQ for the guy, but he could bring the machine down, take it off the rack, put a chip in, put it back up, and boom, it joins the ring, everything settles back in, the hinted handoffs drain, we see the spike of activity taper down, done. I think that is an amazing statement, to see it really under fire, because this was happening while customers were basically slamming the customer support lines, saying "my contracts are in my mail and I'm not able to get hold of them." This is pretty big stuff, right? I joke about this, but we build things that are commoditized: you never think about your email not being available; it's always just there. So to see Cassandra actually work through that was pretty good.
There is what I call, and I'll talk about this, the eventual consistency fallacy, in many ways. There is a price you pay for eventual consistency, which is that you don't pay now, but you pay later. We had to solve that by putting in more IOPS, meaning SSDs, but the good thing was that we could actually order the SSDs: we could calculate roughly that it would take seven days before we would totally blow away the Cassandra cluster, so we got the SSDs, brought a machine down, put the SSDs in, and brought it back up, all the while the system was running. I think that was a huge statement for us and a turning point for a lot of our customers: it suffered a calamity and it actually worked. And by the way, this is a very dense footprint, meaning this wasn't a really big system; we are talking tens of boxes, not hundreds of boxes, for millions of users.
So, without going into much detail, this is really the spectrum of our big data challenges. Take email: first of all, everybody has infinite expectations. Email just keeps on growing; nobody ever deletes mail, pretty much, because Gmail spoils all of us, so we never delete mail, and we use email for multiple purposes. At the very left end of the spectrum you have account data, which is your accounts and so on.
Then there is what we call our Achilles' heel: the metadata of our message store. The interesting part is that the metadata spans a whole spectrum. It's easy to solve the problem if every message is 140 characters, but you may have a message that is 10 megabytes and, at the same time, an SMS message that is literally 100 bytes.
You need to solve for that entire spectrum, and the metadata is there for that. Then there is the content; I didn't want to call it messages, because it's really content: people attach photos and everything gets uploaded right there. So on this end you really have CRUD operations, create, read, update, delete, really heavy CRUD, and the more important parts are the updates and the deletes, for which you really need to understand Cassandra's internals in order to optimize for those two operations. Most important are the counters, actually. For email, and especially for IMAP as a protocol, there are very strong connotations of counters, and counters represent time series in many ways. If you screw up some counter in the IMAP protocol, your iPhone will not be able to get mail, or it will get totally confused and try to redownload the entire mailbox. You will never get out of that with a really large mailbox, so you cannot mess that up.
Okay, the blob part: it really just keeps growing. It almost never gets updated; maybe it gets deleted sometimes, but people are doing that less and less, and really it just gets archived over time. So you cannot throw an infinite amount of SSDs at it; you have to use cheap commodity disks, and you have to worry about how you archive those kinds of blob sizes. More importantly, across the whole spectrum of data challenges we have to worry about geo redundancy, and tiered geo redundancy, meaning I can say I need my account data and metadata to be fully geo-redundant, but my blob data not to be geo-redundant. You need those kinds of toggles. Then obviously there is reliability: you can afford to mess up some of this data, but if you mess up this other data you have a real problem, and you really cannot talk about downtime. Some of our customers in Japan want to store five copies of the data; they go wild.
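To make that idea of tiered geo redundancy concrete, here is a minimal sketch of how such toggles can be expressed with per-keyspace replication settings in Cassandra. The keyspace names, data center names, and replica counts are hypothetical, and the product described in the talk predates CQL, so this is only an illustration using the DataStax Python driver:

```python
# Sketch only: per-keyspace replication lets metadata be geo-redundant
# while blob content stays in a single data center. All names are invented.
from cassandra.cluster import Cluster

cluster = Cluster(["10.0.0.1"])   # contact point is illustrative
session = cluster.connect()

# Account data / message-store metadata: replicated to both data centers.
session.execute("""
    CREATE KEYSPACE IF NOT EXISTS msg_meta
    WITH replication = {'class': 'NetworkTopologyStrategy',
                        'dc_east': 3, 'dc_west': 3}
""")

# Blob content: kept in one data center to save cross-DC bandwidth and disk.
session.execute("""
    CREATE KEYSPACE IF NOT EXISTS msg_blobs
    WITH replication = {'class': 'NetworkTopologyStrategy',
                        'dc_east': 3}
""")
```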
The one other important thing is JBOD. I'll show you how our footprint has changed on a single Cassandra node. This was our biggest problem last year: you deploy this on a bunch of disks, and if one disk goes down, Cassandra thinks the whole node is down, and when the node comes back up it tries to resync the whole thing. That actually creates a bigger problem, because it effectively drags the whole ring down in many ways for our kind of footprint.
So JBOD was a huge help for us. Counters we try to solve using atomic batches and an application-level affinity, and I intentionally have a dotted line over there, because you need to understand there is no free lunch with regard to the isolation part of ACID. Cassandra will never implement locking; as we heard Jonathan say today, they will never, ever, because it's fundamentally not designed for locking. So atomic batches were the first step on this trajectory, and I'll explain how big a change this was for us from a reliability perspective.
Obviously the replication factor and all that is good, you just store multiple copies of data, but I think the tiered knobs via the consistency level are the interesting part, because distributed systems go down all the time and networks have hiccups all the time. Having this mechanism where you can say "for this data I'm okay to occasionally get a stale copy, but for this data I want to be absolutely sure" is powerful; I cannot mess up the counter data, for example.
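As a rough illustration of those tiered knobs (not the speaker's actual C++/Thrift code), the per-request consistency level is where that choice shows up in a client. Table and column names here are made up:

```python
# Sketch: strict consistency for must-be-right data, relaxed for bulk reads.
from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

session = Cluster(["10.0.0.1"]).connect("msg_meta")   # illustrative

# Quota/counter-style data: pay the latency for QUORUM so reads see writes.
strict = SimpleStatement(
    "UPDATE mailbox_quota SET used_bytes = %s WHERE user_id = %s",
    consistency_level=ConsistencyLevel.QUORUM)

# Blob reads: a slightly stale answer is acceptable, so ONE is enough.
relaxed = SimpleStatement(
    "SELECT body FROM msg_blobs.message_blobs WHERE blob_id = %s",
    consistency_level=ConsistencyLevel.ONE)

session.execute(strict, (1048576, "user-123"))
session.execute(relaxed, ("blob-456",))
```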
From a schema perspective, we're really talking about flexible ordering: you have the natural ordering of things, in terms of how you order and sort, and then you want these arbitrary-size rows. There are enough articles about the data model, how you go about data modeling, with denormalization being your friend, but really the idea that you can have arbitrary-size rows is a powerful element for us, because if you think about it, Gmail, or any mail application, doesn't care how many messages you have in your inbox.
It always renders in under a second, and that's the metric we are shooting for. Most of these telcos want that infinite storage, but we want to meet that instant user latency; those are kind of the keywords for us. As I said, there are lots of details underneath, but this is atomic batches boiled down to a nutshell, and for folks who understand distributed systems this is a no-brainer.
If you have ever built your own server or your own client, you have done some of these things yourself. In Cassandra 1.1, any time you wanted to make a change it was usually: get a connection, do some change, free the connection; get another connection, do some change, free the connection. That's what you would do, and that's what our code used to do early on. By the way, I just want to mention one more thing, because we have been using Cassandra since it was 0.4.
We are still on Thrift today; we actually use Thrift from a C++ world. We don't use the various client libraries out there, so we care pretty much about the bits and bytes that go on the wire, and that's one of the reasons we care about some of these details. So this was the floor: you could do an application optimization to avoid contention on your client side by getting a connection and then firing three or four queries on it.
But if you look at all of these, there are still a lot of round trips going back and forth to Cassandra, and this matters because, on one side, with Cassandra you need to understand that denormalization is your friend, meaning you maintain your own secondary indexes and you keep multiple copies of data. But then what if something goes wrong? The answer we used to get for the last three years was that most of the time, most things don't fail. In our world, though, people always ask: but what about when they do fail? So we kept pushing for these atomic batches for two years, and finally we got them last year. Essentially it's as simple as this: you put all your writes into one batch, and it's an all-or-nothing batch from the database's perspective. From an IOPS standpoint on the back end it's exactly the same, and the number of operations may even be a little higher, but the sheer fact that the application doesn't need to worry about it improves the overall system footprint, because contention on the client side goes down and the number of client resources required goes down, which helps with garbage collection and all of that. The beauty is that this allowed us to do something very fundamental with our schema: now we don't need to worry about managing failure conditions, because the back end guarantees that it's all or nothing. I consider this a trajectory towards transactions.
Eventual transactions, if you will, of the kind an RDBMS used to give you: lock, update, and it's all-or-nothing. So we got atomic batches, and today Jonathan talked about compare-and-set, which is the next step in terms of getting there. These technologies are converging; Cassandra will never implement locking, but you get atomic batches, you get compare-and-set, and those are good enough for the kinds of operations you care about.
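A minimal sketch of what "put all the writes into one all-or-nothing batch" looks like from a client. The speaker's stack is C++ over Thrift, so this Python/CQL version is purely illustrative and the table names are invented:

```python
from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster
from cassandra.query import BatchStatement, BatchType

session = Cluster(["10.0.0.1"]).connect("msg_meta")   # illustrative

ins_msg = session.prepare(
    "INSERT INTO messages (user_id, msg_id, folder, subject) VALUES (?, ?, ?, ?)")
ins_idx = session.prepare(
    "INSERT INTO folder_index (user_id, folder, msg_id) VALUES (?, ?, ?)")

# A LOGGED batch goes through Cassandra's batch log: once the batch is
# accepted, all of its statements will eventually be applied; if it is never
# accepted, none are. The client no longer has to repair a half-written
# message + index pair after a failure.
batch = BatchStatement(batch_type=BatchType.LOGGED,
                       consistency_level=ConsistencyLevel.QUORUM)
batch.add(ins_msg, ("user-123", "msg-789", "INBOX", "hello"))
batch.add(ins_idx, ("user-123", "INBOX", "msg-789"))
session.execute(batch)
```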
And we really do care about these operations. These are bank-level counters; we cannot mess up these counters for our customers. We used to ask them, what if we allow you to go over your mailbox quota by, say, one or five percent and then do eventual cleanup of that? And they say no: we want you to guarantee that somebody can never abuse a quota, even on a single deposit, and the only way to do that is to have very accurate counters. So it's literally like a bank.
You cannot withdraw more than the amount of money you have in the bank; it's that level of consistency, all said and done.
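Compare-and-set (the lightweight transactions Jonathan previewed, which arrived after the Cassandra releases discussed here) is what makes that bank-style check possible without locking. The following is only a hedged sketch with invented table and column names, shown with the Python driver:

```python
from cassandra.cluster import Cluster

session = Cluster(["10.0.0.1"]).connect("msg_meta")   # illustrative

check = session.prepare(
    "SELECT used_bytes FROM mailbox_quota WHERE user_id = ?")
spend = session.prepare(
    "UPDATE mailbox_quota SET used_bytes = ? WHERE user_id = ? IF used_bytes = ?")

def reserve(user_id, msg_size, quota):
    """Optimistic check-and-reserve; assumes a quota row already exists."""
    while True:
        used = session.execute(check, (user_id,)).one().used_bytes
        if used + msg_size > quota:
            return False                      # over quota: reject the deposit
        result = session.execute(spend, (used + msg_size, user_id, used))
        if result.one().applied:              # the [applied] column of the CAS
            return True                       # nobody raced us; reservation holds
```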
At the end of the day, we also fight on footprint. That is one of the key metrics for these telcos, because they care about heating and cooling and all of that stuff: how many boxes do we use? So, these three things: Cassandra gave us atomic batches and JBOD.
We then made use of those concepts, and a couple of folks who are sitting here basically understood how Cassandra lays a column family out on a rotational disk, meaning: how can you minimize random seeks on a disk and lay data out sequentially, so that you find something and scan from there rather than doing random lookups?
You need to understand that level of detail, and then we built the application-level affinity to guarantee these counters, meaning guaranteeing that within the whole system there is only one writer touching a given counter at a time, basically achieving serialization. These three things combined are what give us the footprint I'm about to show. Again, I'm talking about a single box; it's very important to talk about a single box, because you can play games with numbers, but not with a single box.
It didn't happen overnight, but we basically moved to roughly twelve 900 GB 10K SAS drives. This is big, because now you're suddenly at something like 12 terabytes of data per box, and by the way, this is metadata, not the blobs; the blobs are massive. I'm not going to go into the memory and CPU; they basically remained the same for the most part. The blob side in 2012 used to be something like a 10-terabyte blob store in many ways.
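As a back-of-the-envelope check (my arithmetic, not a figure from the slides), that drive count lines up with the per-box number just mentioned:

$$12 \times 900\ \text{GB} \approx 10.8\ \text{TB per node (raw, before replication and compaction headroom)}$$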
Our metadata is now almost close to that in many ways, and we basically improved the blob storage by four times. And here is the kicker: this is a medium-scale deployment. This isn't the large-scale deployment that we plan to take live; this is really the mid-range, if you will. So, if I haven't said it enough, we go with really dense deployments; a single box really holds a lot for us.
So now you can see that without JBOD we could not put in twelve 900-gig 10K SAS drives, because with a single disk going down on a 40-machine system, you're talking about a disk going bad pretty much every day by mean time between failure. You cannot afford this notion of machines recovering all the time just because a disk is down. For the blobs, we actually tried with Cassandra, and we also tried with some other vendors, storage vendors we use that are optimized for this kind of object store, and there are things in Cassandra that do not allow us to use it for the blob store at this scale. Part of it is how it does replication, storing exactly three full copies of data. There are people optimizing that logic further and saying: if I have a 10-megabyte object, I don't want to store three copies.
I want to store some parity information instead of storing three 10-megabyte copies; you start trading off reliability for space and cost, and that's all well and good. So this setup allows us to do that, and it's a 36x metadata storage increase. I didn't put in the IOPS numbers, because it gets very interesting if you think about it: that one 300-gig SSD probably has more IOPS than those twelve 900-gig drives combined.
This is hot off the press, but this is the more important number. You keep pushing the needle on capacity within a box, but you also achieve per-node savings. If you do an apples-to-apples comparison in terms of protocols and applications, the same test suite run on our platform of a year ago versus this year's platform, we actually get forty to sixty percent IOPS savings overall. The reason is that you really need to understand three things. First, atomic batches and how they work fundamentally. Our engineers who are sitting here don't just take it on faith; if Jonathan says "oh yeah, we have atomic batches," the first thing they'll say is "show me the Java code." They have to go inside and look at exactly how it is implemented, whether it is deterministic enough or not. You have to understand that.
Then you need to understand how the column families are laid out: how you order the column families and how they actually get laid out on the disk, which is the more important aspect. We did have to do a major refactoring of our schema; the schema doesn't look anywhere close to what it was a year ago. And then add our application-level affinity on top. Now, some of these things are pretty specific and pretty unique to our position, because we build a product and give it to telcos, and it can never, ever fail.
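To illustrate the point about how column families get laid out on disk: in CQL terms, the clustering order is what turns a mailbox listing into one mostly sequential read within a single wide row instead of many random seeks. This is a hedged sketch with invented names, not the product's real schema:

```python
# Sketch: one wide partition per (user, folder); messages cluster newest-first,
# so rendering the first screen of an inbox reads the head of one partition.
from cassandra.cluster import Cluster

session = Cluster(["10.0.0.1"]).connect("msg_meta")   # illustrative

session.execute("""
    CREATE TABLE IF NOT EXISTS folder_index (
        user_id  text,
        folder   text,
        msg_id   timeuuid,
        subject  text,
        size     int,
        PRIMARY KEY ((user_id, folder), msg_id)
    ) WITH CLUSTERING ORDER BY (msg_id DESC)
""")

page = session.execute(
    "SELECT msg_id, subject FROM folder_index "
    "WHERE user_id = %s AND folder = %s LIMIT 50",
    ("user-123", "INBOX"))
for row in page:
    print(row.msg_id, row.subject)
```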
There are people who don't do this. Netflix uses a much different model than we do, because they can afford to lose some data and people won't cry out loud: if your pause position in a video is lost, it's not the end of the world, but if your mail is lost, that's considered pretty bad.
So these are some insights. I have two slides of insights; the first one is just general, and the second one is really a couple of technical points. First of all, it's a new paradigm. It will take time and investment, no matter what anybody says. It takes time to unlearn the old stuff and learn the new things.
It won't take as much time and investment as it did for some of the early adopters, but it does take time and investment to make it really production quality. I'm not talking about trying something out, then trying something else, then something else again, but really maturing it; that takes time and energy.
It's funny: people who are distributed-systems engineers understand that there is no free lunch, but we keep hearing these buzzwords come up again and again, the cool features. Some of the cool features I talked about, like atomic batches, have a price. There is a cost the Cassandra database has to pay to achieve atomic batches, and a cost it has to pay to do compare-and-set, and we need to understand that cost rather than fight it.
As I said, with some of our Japanese customers we had to literally show them how, as you increase the number of copies, the throughput of Cassandra goes down. We had to actually show them that even with asynchronous replication, it's about not paying right now but paying later. In a nutshell, the sizing for a Cassandra ring, once you have gone through all the tuning and all of that exercise, really boils down to IOPS, assuming persistent storage, at least for now. But not all IOPS are equal: you have to understand the size of the I/O, the rate at which you are writing, the rate at which you are reading, and how you can change some of those things.
As I said, you really have to understand the difference between a random and a sequential seek. And as with anything, eventual consistency is a double-edged sword. The best comparison I can make is credit card debt: you don't pay now, you pay later, and if you don't pay attention to it, it has runaway consequences. For example, in the case of the emergency we had in 2012, this customer didn't have enough IOPS to ever catch up; they would never catch up.
Literally, it was just a bunch of hard disks with too few IOPS, and the team had planned for SSDs, and the executives were fighting over whether or not to use SSDs costing twelve thousand dollars or something of that sort. Every single day that we lost, we were that much closer to the cliff, because past that point, no matter how much hardware you buy, you will never catch up and you will have a system outage. So, tuning: there are people who want to run it every six hours, or eight hours, or whatnot; it really depends on how often you want to do it. And the last one is an interesting one.
Every single thing that I have seen in the last three years with Cassandra, we have tried to do ourselves first, partly because we were early adopters. For example, when Cassandra didn't have compression, we implemented our own application-level compression, and then Cassandra came out with Snappy compression.
We implemented our own sort of illusion of atomic batches at the client level before atomic batches came along in the back end, and we implemented our own application-level affinity before compare-and-set came. So we keep adapting like that, but it has reached a point where, if those adaptations work at our scale, they will probably work for most other use cases.
What did I do when Jonathan kept giving me different options? In the very end, this is what Jonathan said: use MySQL for the counters, because compare-and-set wasn't there yet. So we had to throw a 100-gig SSD at a MySQL store, keep the counters there, store everything else in Cassandra, and try to make them coexist. But now you get to the other problem: how do you achieve consistency across these two databases? You have just exploded the problem even further than before.
These are some technical insights. Many of these I shared in last year's talk, and you can go look at that, but I'll just walk through a few that were standouts for us. The first one is the replication factor. Most lab systems don't show this; you can easily create a bunch of nodes on AWS, but as should be pretty clear, we cannot run anything like what we run on any cloud out there.
So we had this interesting issue where we had a five-to-seven-node Cassandra ring and we kept hammering it as much as possible, and what we realized is that at some point, as we kept increasing the number of application instances hammering the Cassandra ring, the TPS we got out of the cluster stayed essentially the same. It never changed, and it was a shock. Is there a network bottleneck?
What is going on? This is where a lot of those things come together: even though you are not paying synchronously in your application, you are still paying for the IOPS on the back end; those nodes have to do the I/O in the end. And the only way throughput really increases is if the ring size grows relative to the replication factor, meaning if you are storing three copies of data and you have a 30-node ring, you will see close to linear throughput growth.
But if you have a replication factor of three and a four-node Cassandra ring, forget about it; it's not going to scale linearly from there, because you are basically saturating the ring at that point, or close to saturation. That was the first thing.
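A rough way to state that intuition (my framing, not a formula from the slides): every client write fans out to RF replicas, so the per-node write load is about RF/N of the offered load, and the cluster's ceiling scales with N/RF:

$$\text{cluster TPS} \;\approx\; \frac{N}{RF} \times \text{per-node TPS}_{\max}, \qquad \text{so } \frac{4}{3} \approx 1.3\times \text{ versus } \frac{30}{3} = 10\times.$$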
The second is tombstones. It's amazing that Jonathan brought them up in the keynote; you see what happens with tombstones. We have been dealing with this for two years now, and some of our biggest telcos are actually hitting exactly this.
Then sizing: you need to plan for the perfect storm in Cassandra. You have maintenance going on in Cassandra; at that point in time a bunch of failures have happened, which means the surviving nodes are taking on the load for the rest of the ring; and at that same point in time some other nodes join the ring. The funny thing is that during that short period there is actually a spike of activity, purely because the ring is readjusting and rebalancing.
So you don't size for just the raw load plus whatever replication is going through; that is only the normal load. You need to size for that, plus you need to size for compaction, plus you need to size for recovery. It's really a three-part exercise, and you need to be careful about sizing some of these things.
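Stated as a rule of thumb (my phrasing of the speaker's three-part exercise):

$$\text{IOPS provisioned per node} \;\gtrsim\; \text{IOPS}_{\text{steady state}} + \text{IOPS}_{\text{compaction}} + \text{IOPS}_{\text{repair / recovery}}$$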
Counters I already covered. Super columns: deprecated, thank God for that. Please don't use super columns. As much as we used them, we will probably never, ever get rid of them, because schemas cannot be changed that often because of our customers; but don't use super columns, they are evil. And then the client interaction: this is what I was talking about with connection management and all of that. There are actually situations where you get this thundering-herd syndrome on the client, because you are contending on the connection pool.
You have not tuned your number of connections and threads and all of that on the client side, so you are contending on the pool; you are sending more transactions to the back end; the back end is backed up because of GC or whatnot; that translates into higher latency, which exacerbates your contention problem; and then you have a pile-up problem in your application. It takes a long while to clean up that syndrome; it really takes a long while. Thank you.
I hope that the async protocol that was talked about today might solve some of this, but you need to be aware that if you don't tune things, the back end will have GCs. It's a Java system; no matter what you do there will be some garbage collection, there will be periodic unpredictability in your latency, and because of that you can end up with contention on your client side.
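A hedged sketch of the kind of client-side tuning being described. These are knobs from the modern DataStax Python driver, used only to illustrate the idea; the speaker's own client was hand-rolled C++ over Thrift, and the numbers are arbitrary:

```python
# Sketch: bound the client pool so a GC-pausing node cannot absorb an
# unbounded pile-up of in-flight requests from this application instance.
from cassandra.cluster import Cluster
from cassandra.policies import HostDistance

cluster = Cluster(["10.0.0.1"])               # contact point is illustrative
cluster.set_core_connections_per_host(HostDistance.LOCAL, 2)
cluster.set_max_connections_per_host(HostDistance.LOCAL, 8)

session = cluster.connect("msg_meta")
session.default_timeout = 2.0                 # fail fast instead of queueing forever
```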
There's a lot happening on the UI side, because email as it stands today is non-functional in many ways; we are abusing email. We play in this space in everything but real-time collaboration: we don't do real-time video chat, we don't do any real-time stuff, not yet. There are cross-pollination opportunities between near-real-time and real time, but we don't play in that space. The audience is really consumers served by Tier 1 service providers. And the funny thing is, because we are global,
we have a significant North America footprint, an APAC footprint, a Europe footprint, and the consumers there have very different cultural semantics. You can't just change that; it's very different. For example, we think of iMessage and WhatsApp as the big thing that has happened in North America; in Japan it's LINE, or whatever the cool thing out there is. It's a cultural thing; people still think Yahoo is an amazing search engine in Japan, and Yahoo Japan is still popular.
The metric that really matters is TCO per feature, total cost of ownership. It's not just the footprint, not just capex plus opex. We have to worry about whether the person who is going to run the Cassandra ring is smart enough, how much training we need to invest in, and how we can dumb some of these things down. That's why making Cassandra easy to use, as Jonathan says, is very important: it brings down the TCO per feature, and that's an important aspect.
For example, we implemented our own, what we call, hierarchical storage: as mail gets old, we move it to lower-cost machines, larger Cassandra rings but lower-cost machines with archival characteristics; they are not transactional. And tiered geo redundancy, obviously, because believe me, after the Japan tsunami nobody wants to put up with any kind of downtime. They showed us a graph from one of the biggest telcos in Japan, and I can tell you about the telcos in Japan:
they actually put out notices before December 24th, telling consumers: please stop using messaging, please, please stop for a while. Literally, they tell people not to use messaging, because even if they get the load down by ten percent, that's a huge saving for them. And I think Cassandra, given our history with it for this long, has worked out to be pretty good gear to allow us to win at this game. We are starting to see that adoption trickle down, and we are starting to see some of the hardest people to convince coming around on the feasibility of a Cassandra back end. But they still have a whole list of asks. I always have a running list that I keep sharing with Jonathan; I shared a huge list in 2011 and a lot of those got solved. I have another ten things I want solved, around backup and so on.
You would normally get the typical "we are hiring" slide here, but I didn't want to do that. So instead, these are some of the challenges we are working on right now. Migrations: I don't want to talk about the exact semantics, but everybody sitting in this room has an account with that email provider, and we are going to migrate off that email provider onto our system for a really big telco in Europe, and we are going to migrate over
a 10-gig link, migrating six thousand messages per second, constant, for four months, constant, while the actual traffic ramps up, with zero failures. That's the goal; that's the challenge we're working on. From a scale perspective, everybody wants to go to infinite storage, but the key is you need to do it with sub-second latency for all types of content. You can solve the problem for a 160-byte message, but what do you do for genuinely large messages? Where do you do streaming?
Where do you draw those lines? Those are the kinds of problems we are working on. APIs: I didn't even go into the detail of our core strategy, which is opening up the telco infrastructure, because if you think about it, telcos are the best possible platforms out there, and messaging is just one component of that. You see "sign in using Facebook" everywhere; why don't you see "sign in with Bank of America"?
There is nothing about AT&T that means you would rather trust your credentials to Facebook than to AT&T; from a security perspective, as a consumer, it's a no-brainer. And finally, with those APIs, we hope to create a platform-as-a-service through these telcos, but at infrastructure-as-a-service TCO, meaning think about AWS kind of scale with our footprint and optimizations, but opening up this platform. And finally, think about it: we have these stories of a mailbox app being bought for whatever it was, with a million users.
So that's a trade-off you need to make: you can only use it for use cases where the lifespan of an object is pretty small, and you are taking a chance there. When we write, we make sure we won't have issues with repair. For example, we can write with quorum in a single data center, or local quorum across data centers, to guarantee that it is written, and then we can afford to reduce that grace window to a small period of time.
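A hedged illustration of that trade-off, since the knob being described sounds like the tombstone grace period: if every write is acknowledged at (local) quorum, a table holding only short-lived objects can afford a much smaller gc_grace_seconds so that deletes and expirations purge quickly. Names and numbers are invented:

```python
# Sketch: short-lived queue/notification entries with a shrunken grace window.
from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

session = Cluster(["10.0.0.1"]).connect("msg_meta")   # illustrative

# Default gc_grace_seconds is ten days; an hour is plenty when quorum writes
# make a permanently missed delete very unlikely.
session.execute("""
    CREATE TABLE IF NOT EXISTS pending_notifications (
        user_id text, note_id timeuuid, payload blob,
        PRIMARY KEY (user_id, note_id)
    ) WITH gc_grace_seconds = 3600
""")

# Write at LOCAL_QUORUM and let the row expire on its own after ten seconds.
write = SimpleStatement(
    "INSERT INTO pending_notifications (user_id, note_id, payload) "
    "VALUES (%s, now(), %s) USING TTL 10",
    consistency_level=ConsistencyLevel.LOCAL_QUORUM)
session.execute(write, ("user-123", b"wake up the idle IMAP session"))
```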
But again, you are playing a game of probability here. If ten seconds is the maximum lifespan of this object, in many ways, and you have guaranteed that the object was written to multiple machines, then multiple failures have to occur within that ten-second window for you to end up unable to recover. So it's an interesting tuning exercise that you need to take care with. And yes, deletes are bad, in essence.
As I was just explaining, there are people who try to run it even multiple times during the day, if possible. Last year, when the emergency happened, that's when we really learned; I don't remember the exact sequence, but it was repair, compaction, repair, that kind of cycle, multiple times during the day, just to get it all stabilized during that crisis period.
We literally deploy a Cassandra ring across two data centers that are three hours apart in North America, for one of our biggest customers with 20-plus million users, and it's amazing, because one side of the ring starts compaction and runs maintenance and all of that while it's actually peak time on the other side. So it's a very interesting exercise to balance those two.