Description
Speaker: Robert Strickland, Director of Software Engineering
We hear a lot about lambda architectures and how Cassandra and Spark can help us crunch our data both in batch and real-time. After a year in the trenches, I'll share how we at The Weather Company built a general purpose, weather-scale event processing pipeline to make sense of billions of events each day. If you want to avoid much of the pain learning how to get it right, this talk is for you.
Alright, it's eleven after, so I'll go ahead and get started. I appreciate you all joining me today; there have been a lot of great talks today, and I hope I can live up to the epic label for this track. I have my doubts, we'll see. As you can see, this is the one and only impressive-looking slide in my deck. I'm going to talk about lambda architecture, specifically our experience at The Weather Company over the last year as we struggled our way through building and scaling our analytics platform built on Cassandra and Spark.
If you've heard me talk in the past, you know that I typically take a deep dive into some specific technology or try to simplify a complex subject. This talk is a little bit different; I could subtitle it "How to Avoid All of the Land Mines We Stepped On."
So, a little background for those who don't know me: I lead the analytics team at The Weather Company, based in the beautiful city of Atlanta. Yes, more Atlanta people; any other Atlanta people? A few. The few, the proud. I'm responsible for our data warehouse and our analytics platform, as well as the team of engineers who get to work on cool analytics projects on massive and varied data sets. I'll talk more on that in a minute.
So why am I qualified to talk about this? I've been around the community for quite a while, since 2010 and Cassandra 0.5 to be exact, and I've worked on a variety of Cassandra-related open source projects. Basically, if there's a way to screw things up with Cassandra, I have done it. If you're interested in learning more about that, you can pick up a copy of my book, Cassandra High Availability, where there's some of that. And if you're ever in the Atlanta area, I'd love for you to come join us at the Atlanta Cassandra Users Group.
So, as I said, I work for The Weather Company, which is the caretaker of a number of well-known brands such as The Weather Channel, weather.com, Weather Underground, and Intellicast, among others. At heart we are a data company, and in fact we serve around 30 billion API requests daily, providing weather data to the likes of Google, Apple, Facebook, Yahoo, Samsung, and other large customers. By way of comparison, that 30 billion number is about 60 times the number of tweets tweeted daily.
We have 120 million active mobile users, plus many more on weather.com and our other digital properties, and that gives us the third most active mobile user base. In fact, we have the number one weather app on all platforms, and in some cases the number two as well, all of which adds up to a lot of data: 360 petabytes of traffic runs through our systems daily. That's a very, very large number; it's actually 10 and three-quarters exabytes a month. Try writing that number on the board, if you can even figure out what that number is. At the end of the day, most of the weather data you consume on a daily basis actually originates from us.
For better or for worse, sometimes. So now to our use case. As you can imagine, all those API calls, page views, mobile sessions, and weather observations add up to a lot of secondary data. My team's objective is to extract meaning from that data using repeatable processes and tools that allow collaboration and shorten the time from discovery to insight.
Since we have many data scientists and analysts at Weather who are embedded in the various business units, we had to be able to support self-serve data science and exploration. This, frankly, was one of the more challenging aspects of our mandate. And lastly, we needed to support enterprise reporting and business-user self-service using off-the-shelf BI tools.
In our first architecture, events came in through a RESTful service layer and onto Kafka. These would be dequeued either by Spark Streaming or by a custom ingestion pipeline that we built, and events would land largely unchanged in Cassandra, where we would do batch analysis via Spark. We would use Spark SQL's ODBC support to serve data to various end customers, and for batch sources we would repurpose our existing Informatica team to load the data into Cassandra.
So let's take a closer look at the data model we used for the system. As I mentioned, all events landed in the same table. It's a typical time-series model, and you'll notice that most of the fields are just metadata. Can you guys see that? Okay, all right. The data we really care about is stored as schemaless JSON in the event data field, and our partitioning is also typical for a Cassandra time-series model: we store the data in one-hour time buckets, further partitioned by event type for even distribution across the cluster.
We also have timestamp as a clustering column to get events written in time order. And lastly, we decided to use the new date-tiered compaction strategy, since it seemed to be the right choice for time series. Since it was new at the time, we had little operational experience with it, which turned out to be a significant issue later.
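To make the shape of that take-one model concrete, here is a minimal sketch of what such a single events table looks like in CQL, issued through the DataStax Python driver. The keyspace, table, and column names are illustrative, not the exact production schema.

```python
# Minimal sketch of the "take one" single-events-table model.
# Keyspace, table, and column names are illustrative, not the production schema.
from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect("analytics")

session.execute("""
    CREATE TABLE IF NOT EXISTS events (
        time_bucket timestamp,   -- one-hour bucket
        event_type  text,        -- in the partition key for even distribution
        event_ts    timestamp,   -- clustering column: events stored in time order
        event_data  text,        -- schemaless JSON payload (the part we later regretted)
        PRIMARY KEY ((time_bucket, event_type), event_ts)
    ) WITH CLUSTERING ORDER BY (event_ts ASC)
      AND compaction = {'class': 'DateTieredCompactionStrategy'}
""")
```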
So what went wrong?
First of all, batch loading many terabytes of data into Cassandra turned out to be pretty silly and quite expensive for such large data sets. We were going to need huge numbers of nodes to handle the load, and the need would just keep on growing over time, unbounded, which is not a knock on Cassandra at all. It would have handled it perfectly fine, but we would have broken the bank in the process. We also learned that bulk loading Cassandra using Informatica is currently a no-go.
We gave it, as we say, the old college try, including a custom process that used the Cassandra bulk loader under the hood. I will say that Informatica is working on native Cassandra support, and we've been talking to them about that, but for the moment this is really an exercise in futility, for reasons I'll explain shortly.
It also became apparent that we didn't need to maintain a RESTful service layer and Kafka setup, which added significant cost and complexity to our system.
One of the most difficult issues we encountered was the lack of a modern, supported, open-source Hive driver for Cassandra. We needed the Hive driver to support Spark SQL queries over ODBC, and there's a lot there; if you're not familiar with that you can go read up on it, but suffice it to say it was necessary. We considered taking over the "cash" project, which is an old driver that was later integrated into DSE, but ultimately decided to abandon that idea in favor of an alternate architecture.
It's worth noting that DataStax Enterprise users have access to the proprietary Hive driver bundled with the distribution, so you won't run into that issue. Unfortunately, we found that date-tiered compaction is effectively broken. It falls down for all but the narrowest of time-series use cases and should be used with extreme caution.
Since Jeff Jirsa already covered this in detail in his earlier talk on the subject, which I hope you saw, I'll leave it to you to check that out online later, or to review the ticket; there's a lot of information there.
If there's any point in this list that seems obvious after the fact, it's that storing data as schemaless JSON was a terrible idea. First, because our analysis jobs had to parse JSON to extract the data we cared about, every job became much more expensive, and we had to read more data than we cared about on every job. That's partly because all event types were in the same table, making it expensive to analyze by event type.
Another side effect of all events being in one table was that we couldn't tune by event type. We later learned that some events require special tuning due to their volume, and the one-size-fits-all model prevented this.
So now, take two. You'll notice several key differences here. First, Kafka has been replaced by Amazon SQS, and second, we're using S3 for long-term storage.
We're still using Spark for processing and Cassandra for short-term storage, otherwise I probably wouldn't be here, and Informatica becomes the mechanism for loading batch data into S3, both in raw form and in Parquet format. I'll elaborate more on why we made these choices in a minute.
First, let's talk about the data model. Instead of dumping everything into a single table, every event gets its own table. This allows us to individually tune each table based on the workload, and we extract the payload and apply schema at the point of ingestion. There are a number of reasons for this. First, because we have to read the data on ingest anyway, this is a logical time to go ahead and map the data to real Cassandra fields, which of course makes subsequent analysis a lot easier, because we only have to deal with parsing and mapping one time.
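As a hedged sketch of what "schema at ingest" means in practice, here is a hypothetical per-event-type table plus an ingest function that parses the JSON payload once and maps it onto real columns. The event type, field names, and keyspace are assumptions for illustration, not our actual schema.

```python
# "Take two" sketch: a hypothetical per-event-type table with real columns
# instead of a JSON blob; the payload is parsed once, at ingest time.
import json
from datetime import datetime

from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect("analytics")

session.execute("""
    CREATE TABLE IF NOT EXISTS pageview_events (
        time_bucket timestamp,
        platform    text,
        event_ts    timestamp,
        user_id     text,
        page        text,
        duration_ms int,
        PRIMARY KEY ((time_bucket, platform), event_ts, user_id)
    )
""")

def hour_bucket(ts):
    """Truncate a timestamp to its one-hour bucket."""
    return ts.replace(minute=0, second=0, microsecond=0)

def ingest(raw_json):
    """Parse the raw event once and write it to real Cassandra columns."""
    evt = json.loads(raw_json)
    ts = datetime.utcfromtimestamp(evt["ts"])  # assumes epoch seconds in the payload
    session.execute(
        "INSERT INTO pageview_events "
        "(time_bucket, platform, event_ts, user_id, page, duration_ms) "
        "VALUES (%s, %s, %s, %s, %s, %s)",
        (hour_bucket(ts), evt["platform"], ts, evt["user"], evt["page"], evt["duration"]),
    )
```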
This gives us real-time access to events while keeping the overall cluster size to a reasonable number, and we get data locality for this data and, of course, Cassandra's quick access times, which makes Spark jobs much easier and much faster. For most other use cases (and there are exceptions, of course) we store the data in S3.
Fortunately, it's also easily accessible from Spark, and it's super simple to share both internally and externally. At Weather we have a bunch of different AWS accounts, for better or for worse, as well as a number of our own physical data centers, and S3 lets us easily share data across all those environments. And lastly, Hive has open-source support for S3 via the HDFS layer, which means that we can leverage ODBC to connect to the S3 data through Spark SQL's Thrift server.
But what really made the decision was when one of our ops guys suggested that perhaps there was no need for Kafka, since SQS already has a RESTful interface, and he was right. Initially it seemed crazy, and it actually took me a couple of days of marinating on it to decide that it was worth doing, but it has turned out to be one of our better decisions.
We set up the queues such that there's a queue per event type and platform, which allows us to use Amazon's built-in monitoring to get a granular view of the events coming into our pipeline. That's a lot of things we would have had to build ourselves that we didn't have to build.
so
also
so
because
of
the
issues
you
ran
into
a
date
tiered
compaction.
We
decided
to
switch
to
size
tiered
compaction,
mainly
because
it's
the
most
mature
compaction
strategy
and
it's
effectively
bulletproof
been
using
it
since
point
five.
Five years now; that's pretty bulletproof. The two primary issues with size-tiered compaction are its spatial complexity and the potential for single partitions to span a number of SSTables. We can manage the spatial complexity, and partitions spanning SSTables isn't really a big deal for analytics use cases, where you're doing lots of large token-range scans.
I also want to take a minute to give another shout-out to Jeff for his time window compaction strategy. The implementation is exactly what most people want and expect from date-tiered compaction: SSTables are grouped by timestamp into configurable buckets, then simply dropped when all the data expires. Simple, straightforward, and low overhead for time-series data. We were this close to building it ourselves, and then I saw his implementation.
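For reference, changing a table's compaction strategy is a single ALTER statement. The snippet below shows both the size-tiered switch we made and, for comparison, what a time window configuration looks like; the table name and option values are illustrative, not our production tuning, and at the time of this talk TWCS was still an external strategy (it was later merged upstream).

```python
# Compaction strategy examples (table name and option values are illustrative).
from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect("analytics")

# What we switched to: mature, effectively bulletproof size-tiered compaction.
session.execute("""
    ALTER TABLE pageview_events
    WITH compaction = {'class': 'SizeTieredCompactionStrategy'}
""")

# For comparison, a time window configuration: SSTables are grouped into
# one-day buckets and dropped whole once all of their data has expired.
# (With the external jar of that era you would reference its fully qualified
# class name instead of the short built-in name used here.)
session.execute("""
    ALTER TABLE pageview_events
    WITH compaction = {'class': 'TimeWindowCompactionStrategy',
                       'compaction_window_unit': 'DAYS',
                       'compaction_window_size': 1}
""")
```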
Overall, this second architecture has been a great success for us. Of course, there are numerous details to all this, and I can't cover all of them in a 35-minute talk, but let's talk about a few of the more critical ones. First, make sure to use at least version 2.1.8, and at this point precisely version 2.1.8, if you're planning to run Spark.
I've listed several critical issues here, which you should look up if you plan on using anything other than 2.1.8, all of which cost us a lot of time and frustration as we worked through them.
One thing I didn't show on the diagram is that we actually run two separate Spark clusters. The first is co-located with Cassandra for our heavy analysis jobs. We keep it tightly controlled, so we can predict the load, and it gives us efficient access to Cassandra data. The second cluster handles our self-service clients. It's not co-located with Cassandra, and in fact we generally prefer using it against our S3 data rather than Cassandra. Obviously, its load is much less predictable, so the second cluster allows us to isolate it from production jobs.
So, just as with traditional Cassandra use cases, proper data modeling is critical, but there are key differences when you're designing for large-scale parallel processing compared to efficient single-query patterns. Let's examine some of those differences. First, the partitioning strategy you use could be considered the opposite of the normal Cassandra modeling ideas. For those still learning Cassandra, when I refer to a partition, I'm speaking about the first part of the primary key declaration, that is, the first field or set of fields in parentheses, which determines the distribution of data across the cluster.
When it comes to choosing your partition key, you want to model primarily for good parallelism. This means that your most common queries should produce results evenly distributed across the nodes, rather than single-partition queries, which is what's commonly suggested for non-analytics workloads. You'll also want to avoid shuffling if possible. As you may know from prior experience with either Hadoop or Spark, the shuffle phase, where all similar output keys are sent across the cluster to be grouped together, is typically the most expensive part of any distributed processing job.
Any query that groups data by something other than the partition key will end up shuffling data, so the key advice here is to partition based on your most common grouping. For example, if you know most of your aggregations will be by user, then user should be your partition key.
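As a quick illustration of partitioning for your most common grouping, here is a hypothetical sessions table keyed by user, with a month bucket added so that individual partitions stay bounded; all names are assumptions for illustration.

```python
# Hypothetical table partitioned by the most common grouping (user), with a
# month bucket so individual partitions stay a reasonable size. Per-user
# aggregations then line up with how the data is laid out across the cluster.
from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect("analytics")

session.execute("""
    CREATE TABLE IF NOT EXISTS sessions_by_user (
        user_id     text,
        month       text,        -- e.g. '2015-09', bounds the partition size
        session_ts  timestamp,
        duration_ms int,
        PRIMARY KEY ((user_id, month), session_ts)
    )
""")
```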
So let's talk about indexes. Secondary indexes have been a controversial and often misused feature since their introduction in 0.7. The conventional wisdom says to avoid them under normal circumstances, because they require the entire cluster to participate in answering your query. But in a model where we want our queries distributed, this becomes a desirable feature. It allows Spark to push down a query predicate to Cassandra, so the database is actually doing the filtering of the unwanted results.
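Here is a minimal sketch of what that push-down looks like from PySpark with the open-source Spark Cassandra connector; the keyspace, table, and column names are assumptions, and it presumes a secondary index already exists on the low-cardinality column being filtered.

```python
# Predicate push-down sketch (names are illustrative). Assumes the
# spark-cassandra-connector package is on the Spark classpath and that a
# secondary index exists on the filtered low-cardinality column, e.g.
#   CREATE INDEX ON analytics.pageview_events (platform);
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(appName="pushdown-sketch")
sqlContext = SQLContext(sc)

df = (sqlContext.read
      .format("org.apache.spark.sql.cassandra")
      .options(keyspace="analytics", table="pageview_events")
      .load())

# The equality predicate on the indexed column can be pushed down to Cassandra,
# so the unwanted rows are filtered out before they ever reach Spark.
ios_views = df.filter(df.platform == "ios")
print(ios_views.count())
```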
This can significantly reduce your processing overhead in Spark, because it never has to look at unwanted data. It's worth noting, however, that low cardinality is still the rule: if you're trying to index on a high-cardinality value, you still need to build your own index. For now, that is; keep an eye out for global indexes in 3.0. And I will cite an unnamed source from The Last Pickle who says that 50,000 is the cardinality you're looking for; that's the sweet spot. A picture is worth a thousand words in this case.
So let's look at the difference between the behavior of a secondary index query using a traditional client access model versus a Spark query. You can see here that the client request goes to the coordinator, which must query the index on all the other nodes and assemble the final result set.
Since it's implemented using the secondary index API and stores its Lucene indexes in the same manner, it uses the same access pattern I described earlier, so creating an index in this manner is straightforward and will look familiar to anyone who knows Lucene. Once you have the index (did I skip that too quickly? there you go), you can execute queries like these as simple CQL statements. Since the queries run through the server-side CQL parser, any CQL-based client can use them.
Okay, so I lied: I have one more impressive slide. Seriously, this is a really big issue, so it deserves a really big and memorable image to remind you how important it is, and that's because it takes only one really wide row to ruin your day. In our case, we had a bug in a mobile client that produced the same key for hundreds of millions of records every day. A bug like this, unchecked for even a relatively short period of time, can literally bring down your cluster.
The lesson there is that you can end up with a wide row even when you have a model with theoretically good distribution characteristics. The key to avoiding serious problems like this is early detection. While there are many important stats to monitor, wide rows can be detected by keeping an eye on max partition bytes for each table. If this number gets significantly higher than the mean partition bytes, that's a sign of trouble.
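One rough way to watch for this (a sketch, not our actual monitoring stack) is to compare the maximum and mean compacted partition sizes that nodetool cfstats reports for each table; the table name and the alert ratio below are arbitrary examples.

```python
# Wide-partition detection sketch: compare max vs. mean partition size from
# nodetool cfstats output (2.1-era field names); the 100x ratio is arbitrary.
import re
import subprocess

out = subprocess.check_output(
    ["nodetool", "cfstats", "analytics.pageview_events"]).decode()

max_bytes = int(re.search(r"Compacted partition maximum bytes:\s+(\d+)", out).group(1))
mean_bytes = int(re.search(r"Compacted partition mean bytes:\s+(\d+)", out).group(1))

if mean_bytes and max_bytes > 100 * mean_bytes:
    print("WARNING: possible wide partition in pageview_events "
          "(max %d bytes vs mean %d bytes)" % (max_bytes, mean_bytes))
```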
Another dilemma you'll run into with an event pipeline is that you're likely to have many missing fields in your event payloads, but you don't want to construct a dynamic CQL statement for every single query. It's easy to be tempted to just write null values, but the problem is that nulls are actually deletes. That's important, of course, because deletes in Cassandra create tombstones, marker columns that cover any previously written data. It takes work for Cassandra to deal with a bunch of unneeded tombstones, so in a nutshell, it's very important to avoid writing nulls.
So how do you leverage prepared statements, and you will want to, without having to create a separate prepared statement for every permutation of populated fields, which is what we originally did? Our answer was to create a library that dynamically builds and binds prepared statements on the fly, based on actual observed combinations of populated fields. We guessed that the number of practical combinations would be much smaller than the number of theoretical combinations, and this has actually proven correct. Your mileage may vary. It has also increased performance and dramatically reduced our code complexity.
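Our library is internal, but the core idea fits in a few lines: cache one prepared statement per observed combination of populated fields and never bind a null. A minimal sketch, with hypothetical names:

```python
# Sketch of dynamic prepared statements keyed by the set of populated fields,
# so missing fields are simply omitted rather than written as nulls/tombstones.
from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect("analytics")
_stmt_cache = {}

def insert_event(table, event):
    # Keep only fields that actually have values: no nulls, no tombstones.
    populated = {k: v for k, v in event.items() if v is not None}
    cols = tuple(sorted(populated))
    key = (table, cols)
    if key not in _stmt_cache:
        cql = "INSERT INTO %s (%s) VALUES (%s)" % (
            table, ", ".join(cols), ", ".join("?" * len(cols)))
        _stmt_cache[key] = session.prepare(cql)
    session.execute(_stmt_cache[key], [populated[c] for c in cols])
```

Because only the combinations that actually show up in the event stream ever get prepared, the cache stays small in practice, which is exactly the bet described above.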
So now I want to shift gears a bit and focus on an often overlooked aspect of data analysis: exploration. My definition of this goes beyond Pig scripts or SQL engines; it's really about collaboration between data scientists, engineers, and business users. For too long our tools have catered to one group or another, but one of the primary objectives on my team has been to break down those walls. So let's take a look for a moment at what many may recognize as the old data warehouse paradigm.
It's a linear process that requires a number of steps, and often weeks or months, before we can begin looking at our data in a way a human brain can understand it. In this model there's so much work involved in getting data into palatable form that very little of it actually makes it that far; we lose so much in the process.
So a major part of what we're trying to achieve is a reduction in the amount of time it takes to start visualizing the data, and by that I mean from weeks to minutes. We want instant gratification, which effectively results in our data analysis becoming more collaborative and agile, rather than a waterfall process like the old model.
So along comes Zeppelin. Zeppelin is an open-source Spark notebook; I've seen a couple of other people talk about it today. It's similar in many ways to the Databricks Cloud interface, and I can say without hesitation that it has completely transformed the way my team functions. It has also become an extremely popular way for other teams to interact with the data.
Once you arrive at something you like, you can then run it on a schedule right there in the same UI. So here's an example notebook, and I'm not brave enough to do it live, because I've seen how that goes. Here's an example notebook where I'm using Spark SQL to query some data received from mobile devices. This data is stored in Cassandra inside a map structure, yet I'm able to query it in place, complete with aggregations, and visualize it in a bar chart.
Here's another example where I'm retrieving some historical weather data for a specific time bucket and location, then extracting mapped values the same way I did before into a data frame and registering it as a temp table. I'm then able to plot multiple values to see the relationship between them.
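The actual notebook paragraphs are on the slides, but a hedged PySpark equivalent of that pattern (with illustrative keyspace, table, column, and map key names) looks something like this: register the Cassandra table as a temp table, pull values out of the map column, and aggregate in place.

```python
# Sketch of querying a Cassandra map<text, double> column in place with
# Spark SQL (keyspace, table, column, and key names are illustrative).
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(appName="notebook-sketch")
sqlContext = SQLContext(sc)

df = (sqlContext.read
      .format("org.apache.spark.sql.cassandra")
      .options(keyspace="analytics", table="weather_obs")
      .load())
df.registerTempTable("obs")

sqlContext.sql("""
    SELECT location, avg(data['temp_f']) AS avg_temp
    FROM obs
    WHERE time_bucket = '2015-09-22 00:00:00'
    GROUP BY location
""").show()
```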
Lastly, there's a CQL parser that lets you run direct queries against Cassandra and perform the same sorts of visualizations. This has largely replaced our usage of cqlsh for anything other than administrative tasks. Obviously, there are other tools that can do this sort of work, but most are very costly and almost none support native Cassandra and Spark queries.
So as we wrap up, I wanted to spend a couple of minutes addressing a common question I hear from people who are looking to use Cassandra as part of an analytics project. The question is whether you should use Apache Cassandra and put all the open-source parts together yourself, or use DataStax Enterprise, where everything is bundled into a nice package.
I'll attempt to answer this by asking a series of questions, but before I do that I want to make clear that I am personally an open-source fan, and I don't care whether you buy DSE or use the open-source version; I am not paid by DataStax. With that disclaimer, the first question is: do you have an open-source culture? Is your shop more likely to buy an enterprise license so you can get support, or to use an open-source project?
Second, do you have on-staff Cassandra experts, and by this I mean people who really know what they're doing, not just someone who's used it before? This is critical if you're planning to go it alone at scale, so you either need them in-house or you need to go find some people. Is your team willing and able to actively contribute to the community? Success with a complex open-source system, especially multiple complex open-source systems tied together, depends on community engagement and a willingness to dig in and really understand how things work.
If you just want to install and go, and get help from experts when you have questions, an open-source analytics stack is going to be extremely frustrating. Are you willing to accept a moderate degree of risk? Obviously there's risk involved in any software project, but the combination of well-tested software, tighter integrations, and first-class support certainly mitigates that risk to some degree. Next, do you have a strong need or desire to be able to use the latest features?
DSE is always a bit behind, because they try to make sure you have stable software that's well tested. The trade-off is that you'll often have to wait for the latest features. Waiting for features is not actually a terrible idea in practice anyway, as we've certainly been burned by being pioneers more than once, but it does mean that you can't just open the hood and patch something any time you want. The value of that capability depends on you and the culture of your team.
The next one is critical for us, and that is the need to control tool versions. We have a complex environment with lots of dependencies across teams and many Cassandra clusters; moving to DSE would remove a tremendous amount of our flexibility in tool choice. But again, this may or may not be important to you.
Lastly, you may not have the budget for licensing, or it may be a drop in the bucket. Perhaps your business model requires scale, but you don't have the corresponding funds to pay for the software.
If that's the case, my suggestion is to make sure you fully embrace the community. For most businesses, if you can't answer most of these questions with a yes, then DSE probably is the right choice. So that's all I have. Thank you for your time today. If all of this sounds like something you'd like to be involved in, and you'd like to live in a city with a great tech industry, a high standard of living, and low income tax, we're hiring, so please come talk to me.
Versus Kafka? Well, I would compare our custom ingestion pipeline to Spark Streaming, as opposed to Kafka, or to something like Apache Mule or one of the other projects, whereas I would compare Kafka to SQS. Kafka is just a queue feeding the system, so I sort of enumerated the reasons for Kafka versus SQS.
As far as the custom ingestion pipeline, it's honestly a source of great debate internally: should we have done that, or should we have used Mule or something else? That's still a source of debate, and we may throw it away; we may use Spark Streaming there. There are advantages and disadvantages to owning that piece, and for right now it works really well, so we do that. The prepared statement thing is one of the reasons.
[Audience question about the event data model.]
So I gave you that model just to be very clear: that first model I showed you was bad. That is not the way we do our data modeling for the custom event types; every event gets its own model based on what we believe to be the right access pattern. Now, we do get that wrong sometimes.
Almost there... there we go. So if you look, that's actually not the flow. The flow is: it goes into our event pipeline immediately from the queue, then from the queue into Cassandra, and then the process to get it from Cassandra back out as Parquet into S3 is asynchronous. That's a batch job, a Spark job, that runs overnight right now.
It's a time window, so all the data is written with a TTL, and the TTL we defaulted to is about 30 days that we keep the data, but it depends on the event types. For some data we may keep it longer; for others we may only care about it for a couple of days and then we're done.
Right, so the bottom line is that we put data into Cassandra because it's an append-only, log-structured storage engine, so we can pipe data really, really fast into Cassandra, and we can read it very efficiently from Spark because of the natural partitioning, the collocation, and just the speed at which Cassandra can read the data, whereas with Parquet it's cheap, but it's also slower.
So S3 is slower, Parquet is slower, and so the idea (and we get it right sometimes, and sometimes we don't and have to fix it) is for the bulk of our really heavy-hitting analysis work to take place while the data is in Cassandra, and then only move it to Parquet when it's more of a long-term mining kind of thing. One other thing we do is put aggregate data into Cassandra and keep it forever.
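Putting those pieces together, a hedged sketch of the write-with-TTL plus nightly archive-to-Parquet flow looks like the following; the 30-day TTL, keyspace, table, bucket, and paths are illustrative, and the actual job is scheduled rather than run inline like this.

```python
# Sketch of the short-term / long-term split: events written to Cassandra with
# a TTL, and a nightly Spark job that copies a day of data to Parquet on S3.
# TTL value, keyspace, table, bucket, and paths are illustrative.
from datetime import datetime

from cassandra.cluster import Cluster
from pyspark import SparkContext
from pyspark.sql import SQLContext

session = Cluster(["127.0.0.1"]).connect("analytics")

# Hot path: write with a roughly 30-day TTL so Cassandra only holds recent data.
ts = datetime(2015, 9, 21, 14, 30)
bucket = ts.replace(minute=0, second=0, microsecond=0)
session.execute(
    "INSERT INTO pageview_events "
    "(time_bucket, platform, event_ts, user_id, page, duration_ms) "
    "VALUES (%s, %s, %s, %s, %s, %s) USING TTL 2592000",
    (bucket, "ios", ts, "user-123", "/forecast", 1200))

# Nightly batch: read a day of data from Cassandra and archive it as Parquet.
sc = SparkContext(appName="nightly-archive")
sqlContext = SQLContext(sc)
df = (sqlContext.read
      .format("org.apache.spark.sql.cassandra")
      .options(keyspace="analytics", table="pageview_events")
      .load())
(df.filter((df.time_bucket >= "2015-09-21") & (df.time_bucket < "2015-09-22"))
   .write
   .parquet("s3n://weather-analytics-archive/pageview_events/2015-09-21/"))
```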
[Audience question about event volume and cluster size.]
As I said, it's billions a day, but the actual number is sort of constantly changing and growing. Like every week, somebody calls me up and says, "I want to send you a hundred and fifty million records a day," so if I told you now, it would be different next week, and I'm not even really sure what the answer to the question is at the moment. But as for the size of the cluster...
We use fat nodes for our analytics cluster, so we have 20 fat nodes right now, very, very large i2-series nodes in AWS, and that handles the bulk of our initial processing. That's not all the Cassandra we have at Weather; we have lots of other Cassandra clusters, but for our analytics we're using 20 fat nodes.
Amazon Kinesis? Yeah, so we have not looked at Amazon Kinesis, primarily because we're trying to be as platform-independent as possible. Now you might say, well, you've got a lot of Amazon up there, how come you can't use Kinesis? But the reality is, these things are very, very easy to replace with similar technologies on other platforms.
[Audience question about graph processing and Titan.]
Our graphs are actually micrographs; they are individual graphs for each user, and so Titan is not really useful in that case, because Titan is for large graphs. We don't have a large-graph use case yet, but it's coming, and so we don't have the answer for that. I'm hoping Titan will mature enough to the point where it will get there before we do. Thanks. Thank you.