Apache Cassandra Cassandra Day New York 2014, 11 Jul 2014

From YouTube: Using Spark Streaming for High Velocity Analytics on Cassandra

Description

1.) With each device comes an implicit contract with the end user: you give us the data, we give you the results. Now. Not tomorrow. Not even fifteen minutes from now.

2.) The flip side of getting data in real time is that users expect results in real time.

3.) In return for being wired into the internet 24x7, customers demand a similar level of responsiveness and even better availability.

__

Spark Streaming and Cassandra form the ideal combination of high velocity CEP and analytics with a high velocity and always on database.

Today's solutions don't scale to the Internet of tomorrow. The always-on nature of the emerging Internet of Things space means you need to process information at previously unseen scale and, more difficult, make sense out of that data.

Cassandra is the leader in large scale, high velocity, time series data workloads. While the Hadoop world has been stuck with legacy "batch analytics" technology, Cassandra users have been increasingly focused on the "now". Fast answers to easy questions about your data, at any velocity, and any scale. But Cassandra has always been weak on the "complex questions" problem. DataStax integrated with Hadoop to overcome this limitation, but it was always an awkward fit. Slow batch analytics on top of fast moving data really doesn't do you much good.

But Spark, and in this case, Spark Streaming, make high velocity streaming analytics at scale easier than ever, similar to how Cassandra pioneered high-velocity data management at scale.

Hadoop is the right choice for batch analytics. Until recently, nobody really knew what the right solution is for real-time processing. We believe that Spark and Cassandra are the clear answer.

---
Tupshin has been helping Cassandra users and DataStax customers build large scale, high velocity applications for years. As both a Solutions Architect and the lead Field Strategist for DataStax, he has seen deployments of every scale and in every sector. Recently specializing in the financial services space, he has worked with numerous banking industry customers to build and refine their mission critical, enterprise scale, operational data stores based on Cassandra and DataStax Enterprise. After 18 years in the Bay Area start-up scene, he recently packed up and moved to New York City.

Al is a father, technologist, musician, and open source advocate working for DataStax. While attending Central Michigan University as a music major, Al got into MUDs, C, and Linux, eventually ending up with a career as a sysadmin. Over the last 15 years, Al has worked on everything from kernel changes to modern web applications, mostly from inside operations teams. These days he goes by the title Open Source Mechanic, which means he tries to do interesting things with Cassandra and other open source software.

A

All right, good afternoon. This is a session nominally on the Internet of Things, but it's actually appropriate for a much wider set of use cases: pretty much anything that involves time series data, and how those use cases can be addressed with a combination of Cassandra and Spark, and in particular Spark Streaming.

A

First, just a quick show of hands: how many people have actually used Cassandra in any way? Okay, good. How many people have used Hadoop? Okay. And are familiar with Hadoop, for those that did not raise your hands? Okay. And how about Spark?

A

That's a much smaller set, and that's because it's much newer. But my prediction, and the direction that I see this industry going, is that Spark is going to start eating the lunch of a lot of the Hadoop use cases and deployments out there, and hopefully this will explain a little bit why.

A

So, the story of the Internet of Things revolution. I say "revolution" in quotes, and you'll see why. A little Tale of Two Cities reference here. Everything is starting to get connected. However, we're talking about networks that are separated by geography and separated by unknown intermediaries.

A

Everything will fail at some point, so any solution that you devise for a project that needs to be highly available needs to be robust against interruptions in service. That's really where Cassandra shines. Failure tolerance is not optional.

A

The CAP theorem is essential to everything that we're doing. It basically says that you cannot be both strongly consistent and strongly available at geographic scale. There are some nuances around that, and I'm happy to talk about them later, but it really goes to the core of everything distributed-systems-wise. A little fun here. Not a big fan personally, so I'll just move past that one. Hi, Eric.

A

And this goes to the complexity of Hadoop in particular. This is actually a representational diagram of what a not uncommon Hadoop deployment looks like.

A

Yeah, it's really just not that good.

A

This is what we did to Hadoop. We took the complexity that you saw in all those layers, and we took advantage of Cassandra's multi-data center awareness and said: hey, why not just deploy additional functionality in one or more of those data centers? So, with much less configuration and much less deployment work than you'd ever have to do with a traditional Hadoop, you basically install DataStax Enterprise, which includes Cassandra, and just turn on the Hadoop flag.

A

Start them up, and now you have a set of Hadoop instances to do your batch analytics and deep data mining over that data.

A

Thank you. All right, so we brought a lot of simplicity to Hadoop, but Hadoop is still very batch oriented. It's targeted at full data set mining, working with terabytes of data at a time. It's cumbersome to do anything except that large-scale processing.

A

So I'm going to pause there. Any comments or questions about where we're coming from, from a Cassandra-meets-Hadoop-meets-anything point of view?

A

Okay, so, Internet of Things. Would you like to talk about the Internet of Things? Okay. So Internet of Things is both a meme and a marketing ploy, but it's also a bit of a reality. Wikipedia basically says it refers to uniquely identifiable objects and their virtual representations in an internet-like structure.

A

Basically, that just says you have devices, they're connected, they're sending data to you, and you're doing stuff in response. Pretty basic. But really, what is it? It's the act of connecting, and it actually predates the World Wide Web. The first toaster was connected to the internet in 1990; the first web page was deployed in 1991. Internet of Things is not new. That's really the reason why I'm calling it a quote-unquote revolution, and it really shouldn't be surprising to anybody.

A

It's been where we've been headed for the last two decades, literally, with everything getting increasingly connected, whether that's MIT projects to connect devices in your dorm rooms or early prototypes of connected refrigerators.

A

And now the predictions at this point are that by 2020 we're going to have 20 billion devices connected to the internet.

A

So is it old news? Well, yes, it is old news, but it's really not just hype. We've got many successful commercial products right now that are in this space: fitness trackers, home security and automation, cars, industrial equipment, medical devices, sensors.

A

Pretty much everything is getting connected to the internet. So is it just hype? No, it's really not. So now I'm going to turn it over to you, Al. Let's talk about what Cassandra is, just first.

B

So the big difference, and the reason why I believe that IoT is becoming a real thing, is that it's a problem of scale. I have friends who work for smart meter companies. If you see the newer meters on a lot of houses, they actually have a wireless transceiver in them, and a lot of them have IPv6 addresses. The power companies are early adopters of IPv6, specifically because they couldn't get enough address range in IPv4 to do what they wanted.

B

They wanted to use commodity IP, so they moved to that, with millions of homes wired up. This is a problem of scale: how many samples do they want to take every second or every hour off of a meter? The more frequently you take samples, or the higher the resolution of the data, the more useful it is for things like trending.

B

So now, if a power company is looking at data across all of their meters and doing analytics on that data, they can start to do things like predicting load spikes and start to do power grooming across the grid. So that's just one very real, already existing use of the same technology we're talking about.

B

And so Cassandra comes into play because, when you're gathering all these metrics, you've got to have somewhere you can put them in real time, and be able to scale out that storage linearly as the number of devices grows, as your sample rate grows, and as you change your product over time and add more data in. The more data you have, the more insights you get, and the happier you can make your customers. This is where the cool stuff comes from.

B

Well, we have customers anywhere from one node, developing small applications on a single host, all the way up to many thousands of nodes. The biggest rings I think we've seen are right around a thousand nodes. Our goal is to hit 10,000 sometime this year, or at least test it and make sure it works. So that's our goal for DataStax.

B

So Cassandra is an eventually consistent system, and the reason why is that scaling ACID, the way most people think about it anyway, across multiple systems is an incredibly difficult problem.

B

So Cassandra chooses eventual consistency, which means you do your write to your coordinator, it says "okay, I got it," and it takes care of replication in the background. It's tunable in that you can say "don't tell me it's done until it's done," or you can say "tell me when you have the data and I can go off and do other things." It's tunable in that sense. And then we have things like LWT, our lightweight transactions, which are based on Paxos, that let you say "no, really, I need to know that this is on every node and that they all agree on the value before you tell me that it's done."

B

So from your application you can make those decisions, and that's really what we're about with Cassandra: putting that decision in your hands.
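
As a rough illustration of the tunable trade-off described above, here is a minimal Python sketch. It is not Cassandra's coordinator code: the `Replica` class and the `coordinator_write` function are hypothetical names invented for this example, and the write is synchronous for simplicity. The only detail taken from Cassandra itself is the quorum size, which is floor(RF / 2) + 1 for a replication factor RF.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    """Hypothetical stand-in for one replica node."""
    name: str
    up: bool = True
    data: dict = field(default_factory=dict)

    def write(self, key, value):
        # A replica that is down cannot acknowledge the write.
        if not self.up:
            return False
        self.data[key] = value
        return True


def required_acks(level, replication_factor):
    """Number of replica acknowledgements needed for a consistency level."""
    if level == "ONE":
        return 1
    if level == "QUORUM":
        return replication_factor // 2 + 1
    if level == "ALL":
        return replication_factor
    raise ValueError(f"unknown consistency level: {level}")


def coordinator_write(replicas, key, value, level):
    """Send the write to every replica; report success only if enough acknowledged."""
    needed = required_acks(level, len(replicas))
    acks = sum(1 for replica in replicas if replica.write(key, value))
    return acks >= needed


replicas = [Replica("a"), Replica("b"), Replica("c", up=False)]
print(coordinator_write(replicas, "meter-42", 17.5, "ONE"))     # one ack is enough
print(coordinator_write(replicas, "meter-42", 17.5, "QUORUM"))  # two of three acks
print(coordinator_write(replicas, "meter-42", 17.5, "ALL"))     # fails while "c" is down
```

With one of three replicas down, the same write succeeds at `ONE` and `QUORUM` but not at `ALL`, which is the decision the application gets to make per request.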

A

So, the emphasis on a fundamental trade-off: I want to stress how important that is when you're dealing with a single system that spans the globe. Or, as a thought experiment: what if we had a single cluster that spanned Earth to Moon, or Earth to Mars? You're talking about latencies that are inherently not interactive, and at that point you just can't treat a remote data center as ever technically up or down.

A

So if you are going for strong consistency, which says I will always have the same view of my data from any location, you can no longer guarantee that even at global scale. Once you're talking about hundreds of milliseconds for round trips, your users aren't going to wait for that to achieve complete or immediate consistency for every operation, so you actually have to adopt some kind of eventual consistency model.

A

Some products just go strictly strong consistency everywhere and basically don't scale out to geographically disparate locations.

A

Hold on, I will get back to that. Thanks for derailing me. Now, one sec. Okay, so something like [inaudible] really doesn't try to address that directly. They're master at one location only. Couchbase is another strongly consistent system locally, but they actually adopted eventual consistency for replication between clusters, which is their method of doing remote geography.

A

So, basically, if you're doing globally spanning or even widely geographically diverse locations for your applications, then you need some kind of eventual consistency model in order to achieve anything resembling usability.

B

I'll just finish the Cassandra story, and that plays into Spark. So the reason why I was talking about IoT in terms of the database being able to take the sheer volume of metrics you want for gaining analytic insights is, with Cassandra, multi-data center replication.

B

If you have sites in multiple geographies, what you can all of a sudden do is ingest those metrics as close to the customer as possible, which provides lower latency and is more highly available. If you have internet cuts in the middle of the country, you don't have to lose data because all of a sudden your data center on the West Coast can't get data from the East Coast. You have data centers on either coast, and Cassandra makes that possible through data center replication that is eventually consistent.

B

And then you can layer things like GSLB DNS on top of it and get automatic load balancing across all of your clusters, and you can take data centers offline and all this stuff, so you're 24 by 7.

B

Does anybody know what GSLB is? Okay. I don't remember what the acronym means anymore (global service load balancing); I just think of it as anycast DNS. It uses a trick. The internet's built on something called BGP.

B

BGP manages all the routes across the internet and it manages all the failures, so that if a router fails somewhere between you and your destination, it will automatically take care of setting up new routes to route around it. This is all automatic, and the whole internet works this way. And there's a trick you can play with DNS where you actually advertise the same address from multiple data centers.

B

Then, when you go to do that DNS query, so when you type www.datastax.com into your browser, it's going to send out a UDP packet, and if it goes to that address, it's going to go shortest hop to the local data center. This is how CDNs work, by the way, if you didn't know that. It goes to the closest data center, that DNS server replies with a local address, and then your TCP connection goes to that data center.

B

You can do all this stuff on top of Cassandra, with your application on top of Cassandra, so that you can have the same data in all your locations, whether it's ingest or egress, and have your application be highly available and spread across multiple geographies. So you can do things like take data centers offline and survive internet cuts, all these different things, using technology that we've been using for years on the internet.

A

So Spark emerged out of both Berkeley and MIT as an alternative to Hadoop, recognizing that some of the fundamental limitations of Hadoop's MapReduce model were really holding it back. Besides some very questionable architectural choices, the initial driver was to address the limitations of a flow that is fundamentally two phases: you map over a very large set of data, then you reduce it within each region.

A

Then you merge data across those different splits. That is powerful, but it's not flexible. You can't do exploratory analytics that way very easily, or really at all, in any kind of reasonable time frame. So the addition that Spark brought to that world is the DAG, the directed acyclic graph. It basically means that you have a lot more control over how the execution of a batch process works.

A

If you treat your data set as a graph, even if you're talking about relational tables, you're talking about traversing between portions of those tables. A directed acyclic graph approach says you can go from here, then maybe fan out to n locations within your data set, and then from each of those you can go back to the original location or on to some others and just keep doing additional processing.

A

So you have a whole pipeline of tasks. Now, MapReduce or Hadoop addressed this in some ways by allowing you to chain MapReduce jobs, but that's a pretty clumsy approach compared to actually encoding the entire directed approach to your data mining directly into your query.

A

Also, Spark was designed, as I said, to address a number of the architectural challenges and limitations of Hadoop. Hadoop is sort of notorious for having single points of failure, in terms of name nodes and various complexities in deployment, and pretty much every additional API and system that's been layered on top of Hadoop has had a bunch of those same challenges as a result.

A

Spark adopts much of the Hadoop ecosystem. It allows you to use things that you're familiar with, like HDFS as a file system. You can even talk to HBase as a database. You can use ports: Hive, for example, is now called Shark on top of Spark, and Pig is now called Spork on top of Spark. Yes. But those are also a bit limited in that they come from the MapReduce ecosystem and mindset.

A

So while Spark has incorporated and adopted them to some extent, in many cases there are either existing or upcoming replacements. Shark is going to get basically replaced with a better SQL implementation called Spark SQL.

A

The Spork and Pig approach is basically a functional programming approach, and it's actually built into the native Scala query language of Spark itself, which we'll show just a hint of at the end of this.

A

The other thing, from a performance point of view, is a very intelligent use of caching. Spark incorporated what are called RDDs, resilient distributed datasets (thank you), which carry with them the chain of history, or provenance, of how they were created. So if you have an RDD that is generated from a query, it can be cached, stored, and used later.

A

But if some intermediate RDD is destroyed because, say, a node goes down, you can always rebuild it. So you have the ability to have caching but not be tied to a fully consistent cache that always has to be kept in sync with every permutation and every node that might be changing its state. And finally, Hadoop itself did a little bit with streaming, but streaming got maybe more popular with other products like Storm.

A

A lot of people are doing that too, to pull data into Hadoop or Cassandra or many other systems. Spark incorporates streaming, and that's really what we're going to be doing a lot of talking about, into your batch execution platform to make it a first-class citizen with respect to processing your data.
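
The lineage idea behind RDDs mentioned a moment ago (a cached result that remembers how it was built, so it can be rebuilt after a loss) can be sketched in a few lines of plain Python. This is an illustration only and does not use Spark; `LineageDataset` and its methods are hypothetical names chosen for the example, and a lost cache is simulated with `evict()`.

```python
class LineageDataset:
    """Toy dataset that caches its result but can always recompute it from its lineage."""

    def __init__(self, compute):
        self._compute = compute   # function that (re)builds this dataset from its parent
        self._cache = None
        self.computations = 0     # how many times the data was actually built

    @classmethod
    def from_source(cls, items):
        items = list(items)
        return cls(lambda: list(items))

    def map(self, fn):
        # The child keeps a reference to its parent through the closure: that is the lineage.
        return LineageDataset(lambda: [fn(x) for x in self.collect()])

    def filter(self, predicate):
        return LineageDataset(lambda: [x for x in self.collect() if predicate(x)])

    def collect(self):
        if self._cache is None:
            self._cache = self._compute()
            self.computations += 1
        return self._cache

    def evict(self):
        """Simulate losing the cached copy, for example because a node went down."""
        self._cache = None


source = LineageDataset.from_source([1, 2, 3, 4])
squared = source.map(lambda x: x * x)
evens = squared.filter(lambda x: x % 2 == 0)

print(evens.collect())   # builds squared, then evens
squared.evict()          # the intermediate result is lost
evens.evict()
print(evens.collect())   # both are rebuilt from lineage, same answer
```

Nothing has to keep the caches consistent with each other; a missing piece is simply recomputed from its parent when it is next needed.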

A

So to my mind, Cassandra is chocolate and Spark is peanut butter. They go together really, really well.

B

So putting this in context for those of you who've worked with Cassandra before: how many times have you sat down, created a couple of tables, and went, "crap, now I've got to join"? You want to join, and you email me or the mailing list, and they say, "oh, you have to put them in the same table, materialize it." I could reference a certain comic that's been going around big data circles, but I won't. "Go have fun with yourself" is kind of the answer a lot of times.

B

What Spark does, and the reason why we're so excited about it sitting on top of Cassandra, is that we didn't want to get in the business of writing a distributed query planner and execution engine. It's an incredibly complicated topic. The Spark guys have been at it for, what, seven years now; it's almost as old as Cassandra. So they've already done all of this work, and it turns out it's a really good fit for Cassandra.

B

So what Spark is going to be giving us is that query execution engine, a distributed directed acyclic graph, which means you can just have a bunch of jobs that are connected, and it takes care of figuring out where to run them, when to run them, and what order to run them in, and it parallelizes them across nodes in a way that Hadoop never really did very well. Hadoop was a simple first-generation system, and it's kind of showing its age these days.

B

Spark is the next generation, where you have this arbitrary directed graph of things to do.
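
To make "figuring out what order to run them in" a little more concrete, here is a small Python sketch of scheduling a directed acyclic graph of jobs into stages, where every job in a stage could run in parallel. It uses the standard library's `graphlib.TopologicalSorter` (Python 3.9+) rather than Spark's scheduler, and the job names are made up for the example.

```python
from graphlib import TopologicalSorter


def execution_stages(dependencies):
    """Group jobs into stages; each stage depends only on jobs in earlier stages.

    `dependencies` maps a job name to the set of jobs it must wait for.
    """
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises graphlib.CycleError if the graph is not acyclic
    stages = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # everything runnable right now
        stages.append(ready)
        sorter.done(*ready)                 # mark the whole stage as finished
    return stages


# Hypothetical pipeline: load, clean, fan out to two aggregations, then join them.
jobs = {
    "clean": {"load"},
    "by_device": {"clean"},
    "by_hour": {"clean"},
    "report": {"by_device", "by_hour"},
}

for number, stage in enumerate(execution_stages(jobs), start=1):
    print(number, stage)
```

The fan-out after `clean` ends up in a single stage, which is the kind of parallelism a chained sequence of two-phase jobs does not express directly.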

B

So one of the great use cases for us is time series. Most of the IoT stuff that we're seeing out in the field is time series data. Your thermostat reports every few seconds or every minute what the temperature in your house is, what the setting is, what the humidity is: all these different dimensions of data that go into a time series schema. To get meaning from that, you need a query engine.

B

And so Hadoop jobs have been the way we've done this up until now, and we feel the pain. We hear the pain from our customers. We hear it on the mailing lists. There's pain everywhere we're dealing with Hadoop.

A

It's really the time to get your answer. In the increasingly connected world, and frankly with just the increased competition, everybody wants answers from their data faster and faster. When Hadoop first became available, people were so happy that they could get really good, meaningful results out of their data overnight. That was good. Then various efforts accelerated things, so it's now reasonable to get MapReduce jobs that run in minutes or hours.

A

But never really with second latency, or few-second latency. So if you need to be doing ad hoc queries, you're going to be sitting there waiting. If you need to be able to take immediate action in response to an event, and that's where streaming comes in, then you're not going to be able to do that.

A

It's the time lag that Hadoop introduces that really, really kills you.

B

"I don't want to report on data." Oh yes, you do. What we're saying is that the meaning turns into a report that you're going to show to your users, or maybe there are more interesting things you do. The bigger the scale of the data, the larger the query. If I'm talking about a particular user, usually a single user will fit in a single partition on a single node. That's fine. All my thermostat data for my whole life would easily fit on the hard drive in my computer.

B

But if I want to see the behavior of user populations across everybody in this room, then all of a sudden we don't fit on one hard drive. If we do it across New York City, it's not going to fit on one node, and that's where it really starts to get interesting.

B

The difference between those two levels of reporting is that when I'm doing population reporting, that's where we start to see power in terms of generating efficiency inside of businesses, building efficiency, ecological efficiency, things like that.

A

So, logging events at scale. This is one of the more seminal graphs that you see in the Cassandra world, from two years ago now.

A

Netflix was testing to make sure that it could meet their scalability needs, and they scaled it up to, in this case, 300 nodes and a million writes per second. But the numbers there didn't matter that much. What mattered was that they had a predictable increase in performance no matter how many nodes they added. As Al says, we're now up to a thousand plus. I actually think there are some production clusters in the 2,000 range, but we're definitely headed further and further out horizontally with the sheer number of nodes in a single cluster.

A

We can add data centers. I've seen deployments up to 12 data centers spanning the globe, with hundreds of nodes in each data center. At no point are we failing to achieve that linear scale-up that you really need if your data and your use cases are likely to scale up as well. Any system that doesn't allow you to scale out horizontally, or doesn't allow you to do it linearly, is going to be an obstacle to your growth if you're a startup.

A

It's going to be an obstacle to your initiatives if you're an enterprise that needs to start throwing more and more data into it. You have to look for something that's not going to cap you before you reach the capacity that you need.

A

This is actually really what Al was talking about, gathering meaning at scale, but this is how Spark does it, and this is how Spark Streaming does it. In the upper left, you see a live input data stream. This could be the stream of events coming from your Nest thermostat.

A

It could be the stream of user interactions coming from a Netflix console, saying "I paused the stream, I'm unpausing the stream." It doesn't really matter. It's just events that are coming in and are then processed in real time.

A

What Spark Streaming does is divide those streams into what are called DStreams, discretized streams, which chunk the stream up into small windows of time (five-second windows are typical) in order to achieve efficiency in batching and to enable various windowing algorithms that you couldn't do if you were just looking at a single data point at a single point in time.
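
The chunking step itself is simple enough to sketch in plain Python. The function below is an illustration of the idea, not Spark Streaming's API: `discretize` is a hypothetical name, events are assumed to be `(timestamp_in_seconds, payload)` pairs, and the five-second default just mirrors the window size mentioned in the talk.

```python
from collections import defaultdict


def discretize(events, batch_seconds=5):
    """Group (timestamp, payload) events into fixed, non-overlapping time batches.

    Returns a list of (batch_start, payloads) pairs ordered by batch start time.
    """
    batches = defaultdict(list)
    for timestamp, payload in events:
        # Floor the timestamp to the start of the batch it falls in.
        batch_start = int(timestamp // batch_seconds) * batch_seconds
        batches[batch_start].append(payload)
    return [(start, batches[start]) for start in sorted(batches)]


events = [(0.5, "pause"), (4.9, "unpause"), (5.0, "pause"), (12.3, "unpause")]
for start, payloads in discretize(events):
    print(start, payloads)
```

Each small batch can then be handed to the same kind of processing used for a stored data set, which is what makes batch-style algorithms usable at ingest time.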

A

The key, though, is that you're able to apply virtually all the same types of algorithms that you apply in a MapReduce or DAG (directed acyclic graph) Spark job at ingest time, at streaming time, before or while the data gets into your database. That's what allows you to do immediate anomaly detection, and what allows you to take immediate, proactive action.

A

If you see, say, that a hard drive is failing, for example, it's what gives you the immediacy that allows you to compete with your competitors.

B

So Spark Streaming, really, I see it as filling a couple of different niches. One is when you want to do real-time reaction, basically what we called CEP before. You want to have a stream of events coming in, and you want to do things on the data as it arrives and make decisions at that instant.

B

The other one is roll-ups, kind of in real time. You have a bunch of metrics data coming in, you want to go over a window, and you want to basically save averages and maybe throw away the high-resolution data for storage efficiency. But the other big thing, which is kind of a side effect of having the system, is that we don't have to write all these stupid ingest systems from scratch anymore. Who's tired of writing ingest systems?
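
For the roll-up case, a minimal Python sketch of "go over a window and save averages" might look like the following. It is plain Python rather than Spark code, `roll_up` is a hypothetical name, and samples are assumed to arrive as `(timestamp_in_seconds, metric_name, value)` tuples.

```python
from collections import defaultdict
from statistics import fmean


def roll_up(samples, window_seconds=60):
    """Average each metric per fixed time window.

    Returns a dict keyed by (metric_name, window_start) with the mean value,
    which is what would be stored instead of the high-resolution samples.
    """
    buckets = defaultdict(list)
    for timestamp, metric, value in samples:
        window_start = int(timestamp // window_seconds) * window_seconds
        buckets[(metric, window_start)].append(value)
    return {key: fmean(values) for key, values in buckets.items()}


samples = [
    (0, "temp", 20.0),
    (30, "temp", 22.0),
    (61, "temp", 24.0),
    (10, "humidity", 40.0),
]
print(roll_up(samples))
```

Whether the raw samples are kept alongside the averages or dropped is a storage decision outside this sketch.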

B

I mean, I am. I've written a whole bunch of them and it sucks. It's always tricky, getting the high availability right is tricky, and there haven't been many good toolkits. A lot of people are using Storm. Storm's okay, but it comes with ZooKeeper, which is not okay. Well, it's a provably reliable system; it's just that operating it is kind of a pain in the butt. So that's the other part of Spark Streaming that I find really interesting, having built some of this stuff before.

B

So I kind of like, because I come out of an operations background, to think about this in terms of just managing our own data centers, kind of dogfooding this stuff. We've had for a long time monitoring systems that are basically just canned reports. It's just a whole bunch of canned reports. Is the host up or down? You open your page in your monitoring system.

B

You see all the little graphs and stuff, but that's all old data that has been sitting on a hard drive for at least a few seconds, and you can't do much with correlation across different metrics after the fact to make really intelligent decisions about what's happening in your network. This is kind of the holy grail of monitoring: to be able to do things like what Spark Streaming allows you to do.

B

Say I have this metric coming in for disk space utilization, another one coming in for CPU, and another one coming in for network utilization. Can I start to derive insights across those in real time and start to let my network administrators know when a problem is coming, before the storm hits?

B

So you need a couple of different parts. You need ad hoc querying, obviously: something's happening, and I need to go look at some metrics. That's more Spark.

B

You need to be able to drill down from, say, this high-level aggregate that we did across multiple different metrics and go see the raw metrics to make decisions, and obviously alerting comes out of that. Aggregation we already talked about. One of the most popular questions that I get all the time (I didn't introduce myself: I'm the evangelist for Apache Cassandra at DataStax) is, how do I do top-k in Cassandra?

B

It's one of the most popular questions we get, and the answer now is to do it through Spark. But in the case of Spark Streaming, you can actually build these top-k lists as the events come in and just serialize them to Cassandra, or actually wherever. And then the ML stuff I'll let you talk about; it's more his space.
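
Building a top-k list as events come in can be sketched with a running counter that is updated once per incoming batch. This is a plain Python illustration, not Spark Streaming or Cassandra code: `RunningTopK` is a hypothetical class, it keeps exact counts for every key (a real deployment might prefer an approximate structure), and writing the result to a database is left out.

```python
from collections import Counter


class RunningTopK:
    """Keep exact running counts and report the k most frequent keys so far."""

    def __init__(self, k):
        self.k = k
        self.counts = Counter()

    def update(self, batch):
        """Fold one batch of event keys into the running totals and return the current top k."""
        self.counts.update(batch)
        return self.top()

    def top(self):
        # Sort by count descending, then by key, so ties are broken deterministically.
        ranked = sorted(self.counts.items(), key=lambda item: (-item[1], item[0]))
        return ranked[: self.k]


leaderboard = RunningTopK(k=2)
print(leaderboard.update(["alice", "bob", "alice"]))
print(leaderboard.update(["carol", "carol", "carol", "bob"]))
```

Because the list is maintained at ingest time, the stored result is already the answer, and no query-time scan over the raw events is needed.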

A

Sure. And that clustering and the top-k is actually quite a tricky problem to solve if you're just doing it from the database point of view. Certainly from a Cassandra point of view, there's no good mechanism to do it unless you have pre-calculated in some way at least your candidate set of top candidates from whatever set of data you're looking at. Being able to do far more than just counters, to actually keep running totals at ingest time, to use heuristics and k-means-type algorithms to process and group those, and then retrieve an efficient list of your top leaderboard, for example, in the gaming industry, anything like that: ingest time is the right time to do that.

A

On the machine learning side, there are so many use cases. There are robots or cars now doing more with visual recognition and learning things about their environment. There's the actual predictive analytics that is a big part of so much machine learning, which allows you to detect when patterns that have previously led to failures are starting to occur, before failure ever comes. So you're getting out in front of the curve of the events that you're looking for, so you're not just reacting to them quickly, which is better than you could do with Hadoop, but actually reacting to them before they ever happen. In the case of predictive analytics, that's a really, really powerful capability that's starting to emerge through stream-based processing and enrichment.

A

So, taking action. I'm actually going to fast-forward. No? Yeah.

A

Okay, all right. So just a quick point here: take action at scale. Best practice with Cassandra has long been to have a bank of application servers co-located with your Cassandra data centers, and this is key to the high availability story. Hurricane Sandy wiped out Outbrain's New York data center. They kept going without any problems because they had full availability through multiple other data centers, and all they had to do was route their traffic around.

A

If they had had application server interdependencies between those data centers, that would have been a huge problem. So basically, in the architecture and topology here, the blue is Cassandra and the green is Spark, and if you add in your own application servers, it's another bank of machines co-located with each of those.

A

You have things coming in from the internet or any other source into Spark, being processed and enriched there, and written back down into Cassandra, where they can be the basis for future or current queries, both from Cassandra CQL and from Spark's more powerful querying methods, such as Shark for SQL.

A

The other pattern is getting data into Cassandra first, because durability is key. Cassandra's consistency levels, being tunable, allow you to guarantee that at least n nodes have received a message before acknowledging success back to the customer. If you want to rely on that durability, you can still write into Cassandra first.

A

Then various methods, and we're going to make that even easier moving forward, will allow you to get that data immediately back out into Spark, so it can then do the enrichment and write back into Cassandra.

A

The multi-DC, though, is really the key. In this case you're just fanning it out to any location where you need to be. Now, people choose multi-DC with Cassandra for multiple reasons, including just bare high availability, and for that the data centers could be a few hundred miles apart. That would be sufficient.

A

But often you need them to be in remote locations to ensure low-latency response times for the devices or the people that are interacting with them. Cassandra is one of the very, very few products out there that's able to scale out and do master-master absolutely everywhere. Thankfully, our Spark architecture allows us to do the same thing, by having Spark co-processors, as I'm thinking of them now, alongside each Cassandra node.

A

You're able to do that processing in place at ingest time, no matter where your data is, and then use Cassandra's built-in replication to get it to any location. So, oh yeah, right: anywhere is the true key there. So, things that go together.

A

Well, really, it's Cassandra and Spark, and Spark Streaming.

B

Spark Streaming.

A

That, in particular, I feel is something that you're going to be seeing together far more than you ever saw Cassandra and Hadoop, or Cassandra and pretty much any other technology. They go together so well that together they make up a very complete solution.

A

We're going to be making some announcements at the Spark Summit in a little over a week. There's going to be some code dropped, hopefully, and we're going to be doing some demos at that point. But I'm happy to open it up to questions at this point about anything related to Internet of Things, Cassandra, or...