From YouTube: Webinar | Big Data Analytics with Cassandra and Spark
Description
Apache Cassandra is the leading distributed database in use at thousands of sites with the world’s most demanding scalability and availability requirements. Apache Spark is a distributed data analytics computing framework that has gained a lot of traction in processing large amounts of data in an efficient and user-friendly manner. The joining of both provides a powerful combination of real-time data collection with analytics. After a brief overview of Cassandra and Spark, this class will dive into various aspects of the integration.
A
Willie Sutton was a bank robber in the 1930s, '40s, and actually into the '50s, and within the first month of the creation of the FBI Most Wanted list, Willie managed to get up on that list. He was actually captured a number of times; he managed to escape a few times, actually, and then was finally captured in 1952.
A
The thing that's interesting about Willie is they asked him this question after his career of robbing banks: why do you do that? Why are you robbing the banks? And his answer was, "That's where the money is." So we're going to keep Willie in the back of our mind for a little while as we talk through some of these other pieces of the puzzle, and we'll come back to him towards the end.
A
We've been talking about doing that for, like, 30 years. I still never understood what the toasters would have to say, but nonetheless it's the idea that we're going to have all these connected devices, across the enterprise and out in the world, connecting back to some location to deliver data. So we see this in a whole bunch of places. You get things like your thermostats, like Nest and some of the others who are getting in that space. You see it with connected cars.
A
We see it on manufacturing floors giving telemetry data, smart metering, etc. In all of them, the idea is: we've got our sensors or devices out there on the internet, and they are sending data back to some central system. That central system clearly can be distributed around the world.
A
But from a logical standpoint, that data is coming into this one system, and now we have to think about what that system should do. It needs to be able to receive from these various places, and it also needs to be able to answer some relatively straightforward questions. Those could be questions like "what did that silver toaster say last?", or it could be something a little bit more fancy, like "what was the average number of pieces of toast per, I don't know, toaster type?", or something along those lines.
A
If we are at all successful in what we're doing, we're going to see more toasters coming on the scene, and our system is going to need to handle the fact that we are successful here. The growth of the number of sensors out there is something our system is going to need to tolerate. And then the other part is that we are certainly going to be in the realistic world, where faults will happen.
A
Nodes
will
go
down
because
of
Hardware
reasons
or
something.
You
know.
We
had
this
situation
almost
a
year
ago,
where
Amazon
had
to
reboot
a
whole
bunch
of
servers
for
some
regular
maintenance,
and
we
hear
about
some.
You
know,
folks
that
were
hosting
things
on
Amazon.
Their
systems
became
offline,
while
other
folks,
like
Netflix,
were
rolling
just
fine
and
we
we
now
talk
about
the
Amazon
outage
and
not
the
Netflix
outage.
A
Apache Cassandra is a distributed NoSQL database. It is sort of the love child of the BigTable paper by Google and the Dynamo paper by Amazon. BigTable is really the data structure: we're talking about tabular data with these sort of column families, and you're able to have sparse storage of it.
A
In other words, we don't store the nulls that don't happen for certain things, and it's a very flexible data structure. And then the Dynamo paper was all about resilience and fault tolerance, and having this concept of no master. That's an interesting element that comes into play with Cassandra, and one we make a lot of use of: the fact that all the nodes are equal. As a result, in that sort of democratized world, we don't need to worry about a couple of things.
A
Similarly,
since
all
nodes
are
the
same,
that
means
that
all
nodes
are
answering
questions
and
queries
to
client
applications,
as
well
as
serving
up
the
data
and
owning
some
part
of
the
data.
To
answer
those
questions
in
the
system
means
that
if
we
double
the
number
of
nodes
in
the
system,
we
really
are
not
only
doubling
the
amount
of
storage,
but
we're
also
doubling
the
number
of
questions
that
we
can
answer
at
the
same
time.
So
we
usually
talk
in
Cassandra.
We
usually
talk
about
transactions.
A
We
merge
the
idea
of
reads
and
writes
into
sort
of
one
bucket.
There
certainly
are
ways
in
which
the
read
path
and
the
right
path
are
a
little
bit
different,
but
we
usually
talk
about
transactions
per
second.
So
when
we
scale
out,
we
can
get
more
transactions
and
more
data.
One
of
the
things
baked
into
Cassandra
won't
talk
too
much
about
that
on.
This
talk
is
really
this
multi
data
center
idea,
and
so
we
can
have
our
data
centers.
A
Can
the
data
centers
can
either
be
geographically
disparate
because
of
of
either
two
things?
One
is
fault
tolerance,
and
so
we
can
say
that
if
there
was
a
flood
in
in
in
say,
New
York,
like
a
hurricane
sandy,
then
a
data
center
they're
going
out
the
application
can
can
go
to
another
data
center,
say
on
a
you
know,
high
up
on
a
mountaintop,
far
away
from
a
flood
plain,
and
so
that's
one
reason.
A
Another
reason
for
geographic
distribution
could
be
the
idea
of
wanting
locality,
so
I'm
going
to
do
the
guys
in
New
York,
we'll
talk
to
the
New
York
server
and
the
guys
in
Los
Angeles,
we'll
talk
to
the
Los
Angeles
server.
But
for
this
talk
we
actually
might
be
interested
in
the
third
reason
for
having
this
and
that's
that's
having
workload,
isolation
and
so
those
those
data
centers
can
be
virtual
data
centers.
In
other
words,
they
could
actually
be
in
the
same
location
and
we've
just
isolated
them.
A
So
that,
if
you're
doing
a
slightly
different
workload,
you've
got
different,
Hardware
dedicated
to
it
and
then
last
a
little
bit
on
the
Cassandra
query.
Language
Cassandra,
when
it
first
came
out,
was
admittedly
not
the
easiest
thing
to
use
a
lot
of
the
easy
use
cases.
A
very
straightforward
use
cases
still
required
you
to
get
down
into
the
guts.
A
Admittedly,
the
the
harder
more
exotic
ones
also
required
you
to
get
down
in
the
guts,
but
you
were
probably
willing
to
go
that
far
with
something
that
that
custom,
and
so
we
came
up
with
this
thing-
called
the
Cassandra
query,
language
which
looks
a
lot
like
sequel,
but
is
a
little
bit
deceiving.
Basically,
it
covers
all
the
operations
that
you
would
do
in
Cassandra
in
these
normal
sort
of
use.
Cases
makes
them
very
simple
and
they
look
like
the
SQL
analog
now.
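For instance, with a hypothetical table and column names invented for the toaster example, a simple CQL statement is nearly indistinguishable from its SQL analog:

```sql
-- Hypothetical schema; the table and column names are made up.
CREATE TABLE readings (
    sensor_id int,
    reading_time timestamp,
    temp_f int,
    PRIMARY KEY (sensor_id, reading_time)
);

INSERT INTO readings (sensor_id, reading_time, temp_f)
VALUES (1, '2015-07-01 12:00:00', 73);

SELECT temp_f FROM readings WHERE sensor_id = 1;
```

The deceiving part is what's underneath: the PRIMARY KEY here determines data placement, so the WHERE clause is efficient only when it restricts the partition key.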
A
So
if
we
think
just
briefly
on
sort
of
how
Cassandra
works
on
writes,
we've
got
our
little
nest.
Thermostat
here
so
NASA
actually
is
using
Cassandra
and
he's
going
to
connect
to
the
cluster.
Now
when
he
connects
to
the
cluster,
he
actually
sees
the
whole
set
of
nodes
here
and
he
can
make
intelligent
decisions.
A
There's
all
sorts
of
load,
balancing
decisions
that
he
can
make
we'll
do
something
simple,
let's
say
he's
doing
around
robin
and
say
he's
going
to
share
his
workload
among
these
five
as
he
goes
around
and
so
he's
going
to
just
tell
one
of
them:
hey
if
73
degrees,
and
so
that
node
may
actually
not
own
the
data
and
if
he
doesn't
that's
okay,
he's
happy
to
be
the
coordinator
who
we
called
the
coordinator,
I.
Think
of
him
more
as
like
the
maitre
d,
and
he
says
I'll,
take
care
of
you.
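A toy sketch of that client-side round robin (node names are invented): each write goes to the next node in rotation, and whichever node receives it plays coordinator.

```python
from itertools import cycle

# Hypothetical five-node cluster; the names are made up for illustration.
nodes = ["node1", "node2", "node3", "node4", "node5"]
next_node = cycle(nodes)

def send_reading(reading):
    """Round-robin load balancing: pick the next node in turn.
    The node that receives the write acts as the coordinator."""
    coordinator = next(next_node)
    return coordinator, reading

# Seven writes wrap around the five-node rotation.
assignments = [send_reading({"temp_f": 73})[0] for _ in range(7)]
```

Real drivers offer smarter policies (e.g. token-aware routing straight to an owning replica), but round robin is the simplest to picture.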
A
I
can
wrap
that
question
to
who
who
needs
to
know
it
or
who
needs
to
have
that
information,
and
so
internally.
He
will
pass
that
on
to
the
node
that
owns
the
data
that
person
or
that
node
will
respond
when
he's
done
and
the
coordinator
or
the
maitre
d
is
able
to
reply
back
to
the
to
the
thermostat
or
to
the
client
so
few
things
going
on
there
that
we
can
actually
start
controlling
here
and
that
comes
under
this
large
umbrella
of
tunable
consistency.
A
So
consistency
is
one
of
the
parts
of
acid,
but
it's
frequently
one
of
the
things
that
that
we
take
for.
We
don't
really
take
advantage
of
in
in
the
sequel
space,
but
we
end
up
having
to
pay
for,
and
so
the
one
of
the
things
that's
been
done
in
Cassandra
is
relaxing
consistency
and
allowing
people
to
tune
it
to
the
needs
that
they
have,
and
so
by
doing
that,
we're
in
a
position
where
we
can,
where
we
can
get
different
scale
out
characteristics,
fault,
tolerance,
characteristics,
etc.
A
So
the
first
thing
to
note
is
the
data
in
Cassandra
is
replicated.
You
set
that
when
you
set
up
your
table
and
we're
kind
of
distributed
by
token
range,
so
each
node
is
responsible
for
some
portion
of
the
of
the
range
of
tokens
for
rows,
and
so
all
rows
have
a
primary
key,
which
maps
into
that
token
and
based
on
the
ranges,
tells
you
where
that
data
should
be
located.
So
if
we
keep
it
simple,
and
we
said
that
we
had
a
hundred
tokens,
we
don't
we
have
a.
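Sticking with that simplified hundred-token picture, here is a toy sketch of token ownership (this is not Cassandra's real partitioner; the node names and hash are invented): the primary key hashes to a token, and the token range tells you which node owns the row.

```python
import hashlib

# Toy model: 100 tokens split evenly across five hypothetical nodes,
# so node1 owns tokens 0-19, node2 owns 20-39, and so on.
NUM_TOKENS = 100
NODES = ["node1", "node2", "node3", "node4", "node5"]

def token_for(primary_key: str) -> int:
    """Hash the primary key into the 0..99 token space."""
    digest = hashlib.md5(primary_key.encode()).hexdigest()
    return int(digest, 16) % NUM_TOKENS

def owner_of(primary_key: str) -> str:
    """Each node owns a contiguous slice of the token range."""
    return NODES[token_for(primary_key) // (NUM_TOKENS // len(NODES))]
```

Replication then just means the row also lands on the next few nodes after the owner, up to the replication factor.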
A
Okay, now, when we do writes, we have options as to how many of those replicas need to acknowledge that the data was delivered before we tell the client that it's all done. By combining reads and writes and these consistency levels, we're in a position to ensure some things about how the data is distributed. This becomes a core part of the application and the design decisions that we make, and it's one of the things that's somewhat interesting about Cassandra. So the first one, the upper left, is probably the most relaxed.
A
Basically,
what
you
tell
the
maitre
d
is
hey
as
long
as
you
have
it,
even
if
you
don't
own
it
as
long
as
you
know
that
that
you
acknowledge
that
you've
received
this
data,
then
you
just
tell
me
the
clients
and
I'll
be
happy.
Now
what
the
coordinator
is
going
to
do
in
the
background
is
he's
going
to
make
sure
that
let's
say
we
have
our
F
of
a
replication
factor
of
three
it's
pretty
common.
A
The
coordinator
is
always
going
to
make
sure
that
all
three
replicas
get
the
data,
it's
just
a
matter
of
when
he
tells
the
client
that
things
have
been
done
and
you
can
some
assurances
around
it.
So
the
upper
left
is
the
most
relaxed.
Okay,
just
as
long
as
the
coordinator
has
it
it's
fine.
The
lower
left
is
just
just
make
sure
one
of
the
replicas
has
it,
but
if
one
of
the
replicas
has
it
I
know
that
it's
durably,
you
know
in
the
data
in
the
system
somewhere
and
I'm.
A
Happy
in
the
upper
right
is
actually
probably
the
most
common,
which
is
quorum,
ensure
that
a
majority
of
the
the
nodes
are
actually
seeing
the
data,
and
then
we
can
do
some
interesting
a
little
bit
of
interesting
things
with
respect
to
guaranteeing
certain
data,
consistency,
things
and
then
in
the
lower
right.
It's
actually
the
most
stringent
we
say:
hey
I,
need
to
make
sure
that
every
node
has
actually
written
the
right
that
I
just
got
now
the
lower
right.
A
One
has
actually
got
some
challenges
to
it,
because
if
any
node
is
down,
then
that
right
won't
succeed.
So
we
lose
some
of
our
availability,
which
is
one
of
the
things
that
we're
trying
to
keep
very
high
here
is
the
high
availability
and
so
quorum
becomes
a
really
popular
choice
for
really
two
main
reasons.
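A quick sketch of why quorum is the popular middle ground (simplified; real Cassandra has more levels, such as LOCAL_QUORUM): with a replication factor of three, QUORUM needs only two acknowledgements, so one node can be down and you still succeed, and a quorum read is guaranteed to overlap a quorum write because R + W > RF.

```python
def acks_required(level: str, rf: int) -> int:
    """Replica acknowledgements needed before the coordinator replies."""
    return {"ONE": 1, "QUORUM": rf // 2 + 1, "ALL": rf}[level]

RF = 3  # replication factor of three, as in the talk
w = acks_required("QUORUM", RF)  # 2
r = acks_required("QUORUM", RF)  # 2

up_replicas = RF - 1                               # one replica is down
quorum_ok = up_replicas >= w                       # QUORUM still succeeds
all_ok = up_replicas >= acks_required("ALL", RF)   # ALL fails

# Any read quorum must intersect any write quorum whenever R + W > RF,
# which is what makes quorum reads see the latest quorum write.
overlap = r + w > RF
```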
A
Now,
that's
the
that's
the
right
side
of
things
now
on
the
read
side
of
things.
So
now
we've
got
our
client
here
on.
The
left
is
doing
this
query.
This
is
looks
like
SQL.
It's
actually
cql,
it's
very
similar
in
that
intersection
of
space
between
them
and
again
he's
going
to
contact
the
coordinator.
It
could
be
anyone
and
he's
going
to
ask
this
question.
A
Give
me
the
user
ID
for
the
for
the
user,
whose
name
is
PB.
Cup
fan
all
right,
so
the
coordinator
says
no
problem.
Let
me
go
get
that
and
consult
with
some
number
of
the
replicas
based
on
the
consistency
level.
So
let's
say
that
he
did
it
on
quorum,
and
so
the
coordinator
is
asking
two
of
the
nodes
for
what
they
have
and
they
each
give
him
the
value
that
they've
got
now.
They
may
be
different
right,
we
didn't
say
we
said
before
very
active
system.
A
We
said
that
you
could
end
up
with
only
needing
to
ensure
that
one
or
two
or
a
subset
of
the
nodes
got
the
rights.
So
when
you
ask
the
question,
you
may
get
different
answers
and
the
way
we
do
that
in
Cassandra
is
that
when
we
do
the
writes,
we
put
a
timestamp.
So
what
the
coordinator
has
to
do
in
this
case
is
he's
not
trying
to
see
if
the
values
are
the
same.
This
is
not
a
majority
votes
majority
wins
system.
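A minimal sketch of that reconciliation (the values and timestamps are made up): the coordinator simply returns the reply carrying the newest write timestamp, rather than taking a vote.

```python
# Each replica replies with (value, write_timestamp); numbers are invented.
replies = [
    ("72F", 1000),  # stale replica that missed the latest write
    ("73F", 1005),  # replica that saw the latest write
]

def reconcile(replies):
    """Last-write-wins: the newest timestamp decides, not a majority."""
    value, _ts = max(replies, key=lambda reply: reply[1])
    return value

latest = reconcile(replies)
```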
A
We've got this scalability, so that if we get successful, then we're able to scale without having to massively re-architect things, and that makes it a really good choice for Internet of Things. Okay, so now Cassandra is really positioned as being the place where the data is, but like I alluded to, Cassandra doesn't do everything. It's great, it's awesome, we love it, but it does come up short, by design, on doing certain things.
A
They're
primarily
lookups,
like
I
mentioned,
so
we
talked
about
needing
a
partition
key
and
sometimes
the
predicates
aren't
based
on
the
partition
key
and
that's
another
challenge
that
we
hit
with
Cassandra
and
and
essentially
Cassandra
is
not
designed
to
be
this
full
table
scan
kind
of
place.
So
we
need
to
have
something
else,
helping
us
out
here
all
right.
So
we
talked
about
the
peanutbutter.
Let's
talk
about
the
chocolate
so
SPARC
distributed
computing
frameworks,
then
let's
say
wengie
a
version
1.1
GA
last
spring:
it's
got
this
generalized
execution
model
can
make
reuse
of
data.
A
That
is
already
pre-calculated.
People
talk
about
SPARC
being
an
in-memory
system,
it's
actually
more
than
that.
It
actually
does
work
off
of
disk,
but
I
think
the
thing
that's
really
important
to
keep
in
mind
is
that
spark
is,
are
really
good
at
reusing
memory
and
very
efficient
there,
and
so
that
pays
a
lot
of
dividends
for
folks
and
then
there's
this
easy
abstraction.
This
thing
called
data
frames
and
a
whole
bunch
of
other
tools
that
are
built
on
top
of
this.
A
So
if
we
look
at
the
SPARC
eco
sister
or
the
SPARC
sorry
the
product,
it's
got
a
number
of
elements
that
are
all
integrated
here
in
one
place,
we've
got
a
basic
core
and
then
we've
got
a
sequel
engine,
a
streaming
engine
and
then
there's
a
few
other
pieces
that
I
won't
talk
as
much
around
machine
learning,
graph
analytics
and
in
our
and
statistics.
So
this
is
the
general
high-level
view
of
the
stack
of
spark.
A
Now
spark
is
a
relatively
simple
architecture
to
understand:
there's
only
two
roles
of
the
nodes
in
the
cluster:
there
is
a
master
which
we've
talked
about
a
little
bit
before
and
and
some
workers
we've
got
this
like
I
mentioned
efficient
memory.
Caching
and
this
generalized
model,
which
is
really
nice,
and
this
abstraction
called
data
frames.
Now,
if
I,
given
this
talk
about
six
months
ago,
we
would
be
talking
about
resilient,
distributed
data
sets
or
rdd's.
A
lot
of
similar
things
can
be
said
about
them.
A
Data
frames
are
the
sort
of
Noori
envisioned
version
that
actually
gives
a
lot
more
capability,
but
conceptually
speaking,
they
share
a
lot
with
the
concept
of
an
RDD,
so
the
SPARC
master
he's
the
one
who
used
submit
jobs
to,
and
he
assigns
resources
to
a
particular
job.
The
SPARC
worker
is
the
one
who
now
has
tasks
that
he
needs
to
do,
and
so
he
assigns
work
to
the
local
executors
running
on
this
machine,
and
then
the
executor
is
actually
the
little
bit
of
code.
A
Now we'll talk about sort of how the two pieces fit together. The way that Cassandra fits in with Spark is on the bottom: we can put a Cassandra source, or a Cassandra database, underneath the Spark core engine, and we've got this piece called the DataStax Spark Cassandra Connector. That is basically the interface code between the Cassandra database and the Spark engine. That's an open-source technology, or package, that we've got out there.
A
We
contributed
out
there's
others
who
are
contributing,
but
data
facts
is
doing
the
lion's
share
of
that
work.
And
what
that
does.
Is
it
surfaces,
Kassandra
tables
up
to
the
spark
engine
as
a
data
frame
and
by
spite
by
integrating
it
at
the
bottom,
by
integrating
it
at
that
lower
level?
We're
able
to
take
all
the
elements
above
it,
the
sequel
engine,
the
streaming
engine,
the
machine
learning
engine
and
we're
able
to
just
reap
the
benefits
of
that
from
this
data
from
this
concept
and
this
this
data
structure
of
the
data
frame.
A
If
we
dig
in
a
little
bit
and
take
a
look
at
how?
What's
really
going
on
under
the
covers,
I
mentioned
this
spark
executor,
so
the
struct
executor
is
the
workhorse
he's
doing
the
processing
of
the
data
that
is
sitting
in
the
in
the
data
set
or
the
data
frame
he's
taking
it
piece
by
piece,
he's
working
it
through
the
the
dag
of
execution
and
and
processing
it.
A
So
the
way
that
it
is
inside
of
that
executor,
we
need
to
take
a
look
at
how
he's
grabbing
data
from
Cassandra,
so
each
one
of
those
executors
has
a
connection
to
the
to
the
Cassandra
database,
but
he's
making
through
the
through
the
java
driver.
So
data
stacks
and
other
thing,
data
stacks
does
is
builds
a
whole
bunch
of
the
open
source
drivers
for
Cassandra.
This
one
happens
to
use
the
Java
driver,
but
there's
a
number
of
other
drivers
available,
and
he
makes
a
connection
to
the
cluster
makes
it.
A
Each
of
these
executors
will
then
pull
across
a
part
of
the
data
frame
or
part
of
the
full
data
set
and
bring
it
across,
and
so
what
we
do
is
we
split
these
up,
rdd's
and
and
and
data
frames
each.
They
split
up
the
full
token
range
into
pieces
and
a
spark
executor
will
operate
on
all
of
the
data
for
a
subset
of
the
tokens.
So
the
just
in
this
picture,
the
first,
the
first
slice
might
be
tokens
one
to
a
thousand.
A
The
second
one
could
be
a
thousand
one
to
two
thousand
etc,
and
we
work
our
way
through
the
entire.
The
entire
data
set.
That
way
we
can
work
on
these
in
parallel.
So,
conceptually
speaking,
we
we
sort
of
have
this
situation
where
each
spark
node
these
orange
nodes
on
the
large
marbles
on
the
left
are
each
making
a
connection
to
a
sister,
Cassandra
node
on
the
right.
This
is
a
simplified
version.
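A toy sketch of that partitioning scheme (the range bounds and slice size are made up): split the full token range into contiguous slices and hand each slice to an executor to scan in parallel.

```python
def split_token_range(start, end, slice_size):
    """Yield contiguous (lo, hi) slices covering [start, end] inclusive."""
    lo = start
    while lo <= end:
        hi = min(lo + slice_size - 1, end)
        yield (lo, hi)
        lo = hi + 1

# As in the talk's picture: the first slice is tokens 1..1000,
# the second 1001..2000, and so on; each goes to one executor.
slices = list(split_token_range(1, 5000, 1000))
```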
A
There's
really
no
reason
why
the
spark
cluster
and
the
Cassandra
cluster
have
to
be
the
same
size,
but
for
simplicity,
we'll
just
display
them
this
way,
and
so
they
buddy
up
and
so
the
top
spark
node
there
might
work
with
the
top
Cassandra
node
and
they're
going
to
be
reading
the
data
that
is
owned
owned
in
that
space.
Now,
once
you
see
this
picture,
you
sort
of
rapidly
come
to.
Why
are
there
two
different
clusters?
A
Why
don't
we
co-locate
the
spark
process
and
the
Cassandra
process
on
the
same
node
and
get
these
sort
of
hybrid
nodes
will
run
the
Cassandra
and
the
spark
on
the
same
thing
that
allows
for
us
to
do
local
reads
and
writes,
and
we
can
skip
some
of
the
network
performance,
Network
costs
of
doing
spark
here
and
and
save
that.
So
we
end
up
with
this
sort
of
spark,
Cassandra
hybrid.
A
So
now
that
we've
done
that,
let's
take
a
look
at
things
that
we
can
do
now
with
spark
that
we
couldn't
do
before
with
Cassandra.
So
the
first
one
is
joined.
So
this
is
a
silly
little
example
where
maybe
I
have
some
metadata
about
each
of
my
sensors,
that
is
in
this
table
called
metadata
and
I
have
another
set
of
data
that
actually
has
the
temperature
readings
and
I
might
want
to
know,
say,
tell
me
what
the
temperatures
are
or
last.
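The talk's actual example uses Spark dataframes; as a conceptual stand-in (the table and column names are invented), the join pairs each reading with its sensor's metadata by key:

```python
# Conceptual stand-in for a Spark join between two Cassandra tables.
metadata = [
    {"sensor_id": 1, "location": "kitchen"},
    {"sensor_id": 2, "location": "garage"},
]
readings = [
    {"sensor_id": 1, "temp_f": 73},
    {"sensor_id": 2, "temp_f": 68},
    {"sensor_id": 1, "temp_f": 75},
]

def join_on_sensor(metadata, readings):
    """Inner join on sensor_id, like DataFrame.join in Spark."""
    by_id = {row["sensor_id"]: row for row in metadata}
    return [{**by_id[r["sensor_id"]], **r}
            for r in readings if r["sensor_id"] in by_id]

joined = join_on_sensor(metadata, readings)
# Each reading row now also carries its sensor's location.
```

The point is that this is exactly the operation Cassandra alone refuses to do; Spark supplies the join on top of the Cassandra tables.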
A
The second example we've got is aggregates. Now I might be interested in the yearly and monthly average, or, I'm sorry, maximum temperature for each of my sensors. So this is actually going to do that kind of full table scan, grab the data, process it, and give us this summary table, sort of an OLAP-style query.
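A conceptual sketch of that max-per-sensor-per-month aggregate (the field names and values are invented), standing in for a Spark groupBy with a max:

```python
from collections import defaultdict

# Stand-in for a Spark groupBy().max() over temperature readings.
readings = [
    {"sensor_id": 1, "year": 2015, "month": 7, "temp_f": 73},
    {"sensor_id": 1, "year": 2015, "month": 7, "temp_f": 81},
    {"sensor_id": 2, "year": 2015, "month": 7, "temp_f": 68},
]

def monthly_max(readings):
    """Maximum temperature per (sensor, year, month) group."""
    groups = defaultdict(list)
    for r in readings:
        groups[(r["sensor_id"], r["year"], r["month"])].append(r["temp_f"])
    return {key: max(temps) for key, temps in groups.items()}

summary = monthly_max(readings)
```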
Now, one of the other pieces in the Spark ecosystem: there's a lot of use of distributed file systems, like HDFS, and they may be external to the cluster.
A
That
we've
got
well
they're,
certainly
external
to
Cassandra,
which
does
not
have
an
HDFS
in
it,
and
so
we
might
want
to
do
a
join
between
the
two.
So
we
have
our
HDFS
data
is
say,
the
data
of
temperatures
from
2014,
maybe
they're
stored
in
CSV
ES,
or
something
along
those
lines
and
we'd
like
to
join
it
with
the
data
that
we
have
the
hot
data
that
we've
got
in
Cassandra
the
operational
data-
and
maybe
this
is
a
query.
This
example
is
trying
to
say
in
this
first
line.
A
In the first line, you define the HDFS dataset; in the second line, we are defining the Cassandra dataset; and in the third line here, we're joining the two together and then doing a little filter to find out which sensors are hotter this year in this month than last year in this month. Simple, but I just wanted to give an example of sort of a year-over-year kind of query. And then the last example is super simple: this one's just saying, hey, what if I wanted to restrict my query based on something that's not the partition key?
A
In
other
words,
I'd
like
to
know
every
time
there
was
a
temperature
reading
above
a
hundred
degrees
and
and
and
that's
just
over
the
entire
data
set
and
again
SPARC
is
well-suited
to
scrub
through
all
of
that
data,
because
we're
integrating
in
with
SPARC,
we
can
latch
into
the
entire
SPARC
ecosystem
of
tools.
One
of
the
pieces
there
is
through
ODBC
and
JDBC,
because
SPARC
sequel
has
that
interface,
and
so
we
can
tap
then
into
things
like
tableau
and
Pentaho
and
I
put
R
here.
A
There
is
a
spark
our
capability,
but
there
is
also
our
ODBC
in
our
JDBC
and
you
could
leverage
those
just
for
getting
an
extract
of
data,
and
then
there
are
some
notebook
style
tools
like
Apache
Zeppelin.
That's
that's
an
incubating
project
in
Apache
that
we
could
also
tap
into
so.
The
spark
ecosystem
enables
a
number
of
tools
that
are.
In
addition
to
the
tools
that
we've
got
in
the
in
the
Cassandra
ecosystem,
so
I'll
do
a
quick
word
on
spark
streaming
I'm.
A
Conscious
of
the
fact
we
got
a
late
start
here,
so
this
will
be
relatively
quick.
Spark
streaming
is
one
of
the
big
things
that
we're
seeing
a
lot
of
folks
use
with
Cassandra.
The
idea
here
is
that
you've
got
a
number
of
data
coming
in
from
the
ether
in
and
they're
going
to
come
into
some
custom
into
a
receiver,
and
so
we
could
think
of
this.
A
...Cassandra. I'll skip going into the details here, but there are some relatively basic commands that you would need to run to set this up, and it's relatively straightforward: setting up what the receiver is, setting up what you're going to do on each window, and then telling it to keep running until you decide to stop.
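A toy sketch of that streaming pattern (the window size and readings are invented): a receiver feeds readings in, and you process them one fixed-size micro-batch, or window, at a time.

```python
def windows(stream, window_size):
    """Group an incoming stream into fixed-size micro-batches,
    like the per-window processing in Spark Streaming."""
    batch = []
    for item in stream:
        batch.append(item)
        if len(batch) == window_size:
            yield batch
            batch = []
    if batch:        # flush the final partial window
        yield batch

incoming = [71, 72, 75, 74, 73]        # readings arriving from the ether
per_window_max = [max(w) for w in windows(incoming, 2)]
```

In real Spark Streaming the windows are time-based rather than count-based, and each windowed result would typically be written straight back into a Cassandra table.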
Just a quick plug on DataStax Enterprise: one of the things DataStax Enterprise delivers is Cassandra at its core, but we have these other capabilities.
A
We've
got
a
this
SPARC
integration
built
into
the
into
that
platform,
as
well
as
our
integration,
with
Apache
Solr
on
the
on
the
search
side
of
things.
So
now,
if
we
come
back
to
our
Internet
of
Things
use
case,
we
can
see
we're
SPARC
and
Cassandra,
really
feuer,
where
Cassandra
fits
us
in
this
system
of
being
able
to
receive
all
this
data
and
scale
for
all
the
toasters,
we're
going
to
add
to
our
system
and
how
SPARC
really
helps
us
address.
C
Again, as Brian mentioned, I'll keep this very brief; we got started a little bit late, so we'll go through this quickly. We're going to cover the DataStax implementation methodology. KPI is a partner of DataStax, and as a partner will also be at the Cassandra Summit; hopefully all of you can join us there, September 22nd through 24th. KPI has done quite a few of the DataStax implementations for both retail and financial services customers throughout the US.
C
So
here's
an
overview
of
the
methodology
we're
going
to
cover
we'll
keep
it
at
a
high
level,
not
get
too
technical
just
to
cover
these
things.
To
ensure
the
success
of
implementations
that
we've
done
in
the
past
so
initially
to
get
started.
We
have
we
referred
to
as
a
requirements
phase
and
here
a
high-level
we're
going
to
do
our
use
case
requirements
for
data
modeling,
which
is
very
important.
C
Security
and
encryption
requirements,
service
level
agreements,
SL
A's
operational
requirements
to
allow
us
to
monitor
and
manage
a
system,
and
then
the
searching
and
analytics
requirements
as
well.
Some
of
the
key
points
here
is
and
I
think
Brian
said
this
earlier.
Get
the
data
model
right
is
going
to
be
key
to
our
success
here
and
leverage
whatever
you
can.
C
So,
if
there's
an
existing
database
where
you
can
get
access
to
the
query
logs
go
and
do
that
that's
going
to
help
you
define
your
data
model,
you
want
to
be
able
to
define
specific,
create,
read,
update
and
delete
requirements.
Those
are
essential
for
the
requirements
phase.
Then
also
security
is
important
both
from
an
authentication
perspective
and
an
authorization
perspective,
and
you
can
see
the
areas
that
we
cover
their
encryption.
C
SLAs are a must-have and are highly recommended, even if you just define them for your own project. The lack of SLAs has led to a lot of project failures, and this always comes up during our implementation process; we work with the customer to help define those. We have to understand we're building a mission-critical system.
C
So
we
have
to
make
sure
we
define
the
operational
monitoring
and
the
management
of
the
system
early
on
in
the
process
and
to
make
sure
we
build
those
those
requirements
and
then,
as
we
talked
earlier,
data
stacks
search,
we're
going
to
be
defining
our
requirements
here.
But
ultimately
we
need
to
determine
the
fields
that
will
be
searched
on
and
returned
in
the
searching
process,
and
you
can
see
some
examples.
We
give
you
as
well.
C
Data
analytics
has
requirements
as
well
they're
important
to
capture
at
this
time.
The
key
ones
that
we
see
out
there
that
need
to
be
incorporated
are
statistical,
algorithms
required
data
sources,
the
data
movement
and
modifications,
security
and
access,
and
then
there's
the
other
analytical
requirements,
and
we
just
have
to
make
sure
we
have
enough
detail
to
perform
a
good
design.
C
Then there's our replication strategy, our table design, and you can see the components within that that are necessary for us to put this design together. And then, again, any relationship between tables needs to be noted here; joining is not technically feasible within Cassandra itself, but, as Brian mentioned in his demonstration, that can be overcome and accomplished. So again, we want to identify this stuff early in the design process and incorporate it here.
C
Next is application development, where the projects leverage frameworks to encapsulate, abstract, and represent database components as application objects; we're doing that a little bit differently here, and designing the application against DSE as much as possible up front will help in the overall application development and functionality component. And then the last piece is the data movement design: the batch and real-time data integration between the systems, ETL, change data capture, data pipelines. These are all the essential things in the design phase for data movement in our operational design.
C
We
want
to
do
as
much
tooling
and
techniques
as
possible,
so
we
want
to
deploy
new
nodes
configure
and
upgrade
nodes
in
the
cluster.
We
will
be
able
to
back
up
and
restore
operations.
We
want
to
know
how
to
do
cluster
monitoring,
ops,
Center,
use,
repairs,
alerting,
disaster
management
processes.
We
want
to
have
all
those
things
in
place
and
KP.
I
highly
recommend
putting
a
playbook
in
place
to
manage
the
operational
design
process.
C
Here's
some
of
the
search
design
stuff
we
talked
about
earlier
search,
searchable
terms,
returned
items,
tokenizer,
x'
filters,
the
multi
document
search
terms.
These
are
all
things
we
identify
and
incorporate
into
our
design
phase,
and
then
you
can
see
the
analytics
component
as
well.
We
do
do
a
design
phase
for
analytics
this
early
in
the
process.
C
When you're doing application development, based on your organization, you can use an agile or waterfall methodology; all work well with this process. Here we want to cover the deployment and configuration of the management mechanisms. It's key in a distributed system like this to automate as much as possible, so try to leverage OpsCenter, Docker, Vagrant, Chef, Puppet, all those types of components. And then, obviously, unit testing is more complex with a distributed system compared to a single-node system; be prepared for that.
C
We're
going
to
be
looking
for
specific
defects
such
as
race
conditions,
that's
those
can
only
be
observed
when
we
at
production
scale
or
somewhat
a
scale
of
the
actual
system.
So
we
always
recommend
unit
testing
should
be
executed
over
a
small
cluster,
but
it
contains
more
than
a
single
node
and
then
there's
other
things
that
you
can
use
to
automate
some
of
your
testing
and
launching
things.
Those
are
always
recommended
as
well.
C
It's critical to enable the project team to identify actual issues prior to going to production at scale. In financial services it's very common that the testing environment being used is an exact clone of the production environment, for that reason alone: they want to identify as much up front as possible.
C
Here's
a
good
example
of
an
operational
readiness
checklist
and
you
can
kind
of
bullet
through
these.
Most
people
need
to
be
expecting
this
type
of
stuff.
These
are
things
you
need
to
be
ready
to
do
and
cute
prior
to
going
to
production
that
way
when
you're
in
production,
you
feel
comfortable
doing
it
and
everything's
in
place
to
to
execute
on
our
last
up.
Here's
the
scale
and
enhancements-
and
these
are
to
highlight
the
normal
operational
mode
of
an
application
built
on
data
stacks
up
right.
C
So
we're
always
looking
for
how
do
we
scale
this,
and
how
can
we
enhance
this?
Whether
it's
tuning
performance,
more
features
more
functions?
These
are
the
things
they're
always
looking
for
in
this
face
here
and
then
you'll
always
have
to
prepare
for
all
the
eventualities.
Things
are
going
to
happen
and
you
can
address
this
by
adding
nodes
to
expand
capacity
to
the
system
when
it's
needed.
C
These
are
all
options
you
have
with
the
DAC
product
and
in
capabilities
that
you
have
to
plan
to
take
advantage
of
as
you
grow
and
also
the
final
thing
is
to
scale
with
data
stacks.
The
Enterprise
Solution
is
a
nice
to
have
and
it's
what
the
products
for
that's
what
it's
known
for
and
its
dominance
out
there
in
the
field
today.
These
are
just
put
some
a
reference
architecture.
Examples
we
put
in
place.
Brian
gave
some
great
examples
either
some
other
ones
to
look
at
that
are
commonly
used
in
the
field
today.
C
Quickly,
our
prepare
it.
The
key
things
we
want
to
push
today
is
one
of
the
things
we've
been
very
successful
with
a
lot
of
customers
appreciate
is
scheduling
a
Lunch
and
Learn
reach
out
to
us.
Let
us
know
you're
interested
in
what
we
do
is
a
KPI,
we'll
team
up,
Athena
stacks
will
come
out
to
your
location
and
it's
clearly
an
educational
two
hour
session,
based
on
your
knowledge
and
what
experience
you
have
a
Cassandra,
it
can
be.
You
know,
introduction
all
we
up
to
some
more
advanced
data,
modeling
or
performance
tuning,
if
necessary.
C
My
contact
information
is
included
here,
feel
free
to
reach
out
to
me
at
any
time
and
we'll
get
back
to
promptly.
The
other
thing
I
want
to
remind
everybody
of
is
the
Cassandra
summit.
It's
in
September
kpi
will
be
in
booth
one
one
one
should
be
easy
to
find
and
there's
a
piece
right
there.
So
at
this
point,
I
am
going
to
change
back,
are
actually
Devon.
If
you
could
change
back
Brian
to
the
presenter,
and
we
can
take
questions
at
this
time.
I
would
appreciate
that.
Thank
you.
A
Sure, so the question was about OpsCenter. DataStax produces an OpsCenter tool; that's our visual monitoring tool. You can use it on an open-source Cassandra cluster, but with some limited capability there; with DSE, DataStax Enterprise itself, it allows visibility into various metrics, going from the sort of Cassandra level, understanding things like read requests and write requests and seeing them spiking.
A
You can get down to things like the Java virtual machine and understand how it's doing, and get all the way down into lower-level things like disk access and, you know, operating-system sorts of metrics. You can do some charting for that to see the trends as things are going on. Again, this is a distributed system, so sometimes it's good to see an aggregated view, and we provide that, or you can do an individual view.
A
You know, if you have 100 nodes in the cluster, the individual view could be a little bit busy, but it certainly can help to dive in at that level. That's on the monitoring side. On the management side, there's some alerting that you can do; that's a relatively standard kind of thing to have in a monitoring tool. We can also do some of the common housekeeping kinds of things, to deal with backups and restores.
A
We can do things with some of the anti-entropy, so there are some maintenance pieces that we would do in the database in the regular care and feeding, upgrades, etc. And then, lastly, there are a number of what we call services that run through the OpsCenter tool, that do things like evaluate whether or not you are subscribing to some of the best practices in your configurations, and sort of just alert you.
A
Sometimes there are plenty of good reasons not to follow a best practice, and so we just want to make you aware that you are doing something non-standard. That's great if that works for you, and it's not great if you didn't know that you were doing it by accident. And there are a number of other services, a performance service, etc., where we can kind of dive a level down. So OpsCenter is our visual monitoring tool. I did sort of skip back, or skip relatively quickly, through the DataStax slides, this DataStax slide here.
A
I think the answer there, unfortunately, kind of fits into that "how long is a piece of string" kind of question. It really depends on what you're doing in your Spark job. With Cassandra we can actually talk a little bit more concretely about the sizes and the sort of prerequisites, or the recommended configurations. Spark jobs can go from relatively simple to extremely complicated, and so I think it really matters.
A
The basic sort of configuration we talked about for Cassandra is, I believe, four cores and 32 gigs of RAM. I would prefer to double that, but it's certainly not a requirement, and there are a number of cloud instances that we talk about and recommend on Amazon and Google and Azure and all the rest of them; you certainly can go out there and do that. When you bring Spark into the mix, a few more cores and some more RAM really does start to help the picture.
A
If you're doing analytics that's going to end up building big models and doing a lot of number crunching, that can put a real stress on both the computation and the RAM, and so it sort of floats. The general rule of thumb would probably be more like six to eight cores, and probably drifting up to 48 or 64 gigs of RAM, is sort of where you want to be.
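The rule of thumb above can be captured in a tiny sketch. This is purely illustrative, not an official sizing guide: the numbers are the figures mentioned in the talk, and `suggest_node_spec` is a hypothetical helper, not part of any DataStax tooling.

```python
# A rough rule-of-thumb sketch (illustrative only, not an official guide):
# the talk suggests ~4 cores / 32 GB RAM as a baseline for Cassandra alone,
# drifting toward 6-8 cores / 48-64 GB when Spark analytics run alongside.

def suggest_node_spec(runs_spark: bool) -> dict:
    """Return an illustrative per-node hardware starting point."""
    if runs_spark:
        # Spark executors compete with Cassandra for CPU and heap,
        # so budget extra cores and RAM for the analytics workload.
        return {"cores": 8, "ram_gb": 64}
    return {"cores": 4, "ram_gb": 32}

print(suggest_node_spec(runs_spark=False))  # baseline Cassandra node
print(suggest_node_spec(runs_spark=True))   # Cassandra + Spark node
```

As the speaker says next, treat any such number as a starting guess to validate against the real workload.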
A
But, you know, I think the best sort of approach to this is really to make a good guess and then actually just try it out and see if the workload works. One of the things in terms of sizing, and I'll take this opportunity with Cassandra to talk a little bit about cluster sizing for just a quick second, is that we don't always do it by data size. A lot of times we do the sizing of the cluster by your query SLAs.
So, are you trying to get high concurrency? Say you want, you know, 50,000 clients all asking questions at the same time; that's going to change the cluster compared to if you only need 5,000. And similarly, are you trying to get very low latencies? You know, "I need to get under 100 milliseconds to do my write," versus "I'm okay having a one-second SLA." (I'm sorry, what? Yeah.) Those are the other sorts of things that we take into account when we're coming up with the cluster.
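The SLA-driven sizing idea above can be sketched numerically. Everything here is an assumption for illustration: the per-node throughput figure is a made-up placeholder, not a Cassandra benchmark, and `nodes_for_concurrency` is a hypothetical helper; real sizing comes from load-testing against your actual queries.

```python
# Illustrative sketch only: sizing a cluster from a concurrency SLA rather
# than from data volume. The per-node QPS figure is an assumed placeholder.
import math

def nodes_for_concurrency(target_qps: int,
                          per_node_qps: int = 5_000,
                          replication_factor: int = 3) -> int:
    """Estimate node count to sustain a target query rate."""
    # Never shrink below the replication factor: we need at least
    # that many nodes to hold RF distinct copies of each row.
    return max(replication_factor, math.ceil(target_qps / per_node_qps))

print(nodes_for_concurrency(50_000))  # high-concurrency SLA -> 10 nodes
print(nodes_for_concurrency(5_000))   # modest SLA -> floor of RF = 3 nodes
```

The same shape of calculation applies to latency SLAs: tighter latency targets usually mean more headroom per node, i.e. a lower assumed `per_node_qps`.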
B
A
In terms of how long it can be disconnected: there are sort of a couple of things mixed into that. The architecture is designed to tolerate the fact that that node is not there relatively indefinitely. Now, while it's not there, if we go back to our replication-factor story of data being replicated around, a couple of things are happening. First of all, I'm only storing some of the data in only two places, and so I'll start to worry about things.
A
If I lose another node, now I only have it in one place, and my fault tolerance is starting to degrade a bit. So I do want to address this. If I know, for instance, that it's going to be a while, I may need to rebalance; in other words, give that token range that's now missing to one of the active nodes. So if I had, say, ten nodes and one went down, I could give that range to one of the other nine. I don't need additional hardware necessarily, but I do need to plan for it.
A
You know that it's going to be taking a little bit more strain on each of the remaining nine. I could also spin up another node to bring it back to ten and do it that way after a little while.
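The arithmetic behind the "ten nodes, one down" example is simple, and a quick sketch makes the extra strain concrete (this assumes an even token distribution; `share_per_node` is a hypothetical helper, not a Cassandra API):

```python
# Quick arithmetic for the "ten nodes, one down" example: when the dead
# node's token range is handed to the survivors, each remaining node
# owns a proportionally larger share of the ring.

def share_per_node(total_nodes: int, failed_nodes: int = 0) -> float:
    """Fraction of the token ring each surviving node owns (even distribution)."""
    survivors = total_nodes - failed_nodes
    return 1.0 / survivors

before = share_per_node(10)                  # 10% of the ring each
after = share_per_node(10, failed_nodes=1)   # ~11.1% each
print(f"extra load per survivor: {after / before - 1:.1%}")  # → 11.1%
```

So redistributing one node out of ten costs each survivor roughly an extra ninth of its previous load, which is why no new hardware is strictly required.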
So, we get network hiccups all the time; that's part of the world we live in, and so the architecture is fine with losing a node for, like, a second or a couple of seconds, which is, you know, a normal network-hiccup thing, and what we do is the coordinator will just store a little hint.
A
That's what we call them: hints. It stores basically what it would have told that node had it been up, and we'll hold onto that for a little while. There's actually a configuration for how long that will be; it's usually a few minutes, or maybe tens of minutes, and then at that point it'll say, look, I'm just not going to bother, because it's just piling up and it's going to be too much of a drain, and we'll deal with that node when it does come back online.
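For reference, the window the speaker describes is a setting in cassandra.yaml. A sketch of the relevant line is below; treat it as an illustration, since the exact name and default vary by Cassandra version (many releases default to three hours rather than the minutes mentioned here):

```yaml
# cassandra.yaml (illustrative excerpt): how long a coordinator keeps
# accumulating hints for a down node before giving up. Past this window,
# the returning node must be caught up with a repair instead of hints.
max_hint_window_in_ms: 10800000   # 3 hours, a common default
```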
A
Now, after, let's say, there was a machine where the power supply blew up, and so the node died for two days: when you bring it back, there's an operation that we call repair. It's not quite that it was broken; it's more of what we call anti-entropy. I need to get this node to have the latest data from the other replicas.
A
The other replicas can stream the data across to this new node, or the node that's rejoining, and then it comes up to par; and once it's up to sort of the current state, then it's able to join and take care of things. And so that operation of bringing a node back in, if it's been offline for a while, actually looks a lot like bringing a node in cold and having it join entirely fresh. So in terms of tolerating it, how it goes is you can handle it for a long time.
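The anti-entropy idea can be sketched as a digest comparison. Real Cassandra repair builds Merkle trees over token ranges; this toy version, with hypothetical helpers `digest` and `ranges_to_stream`, just hashes whole ranges to show why replicas can agree cheaply and stream only the data that differs.

```python
# Toy sketch of anti-entropy: replicas compare cheap digests per range and
# stream only the ranges that disagree. (Cassandra itself uses Merkle trees
# per token range; this simplification hashes each range wholesale.)
import hashlib

def digest(rows: dict) -> str:
    """Hash a range's rows so replicas can compare them without sending data."""
    payload = repr(sorted(rows.items())).encode()
    return hashlib.sha256(payload).hexdigest()

def ranges_to_stream(healthy: dict, returning: dict) -> list:
    """Ranges whose digests disagree must be streamed to the returning node."""
    return [r for r in healthy
            if digest(healthy[r]) != digest(returning.get(r, {}))]

healthy = {"range-1": {"k1": "v1"}, "range-2": {"k2": "v2"}}
returning = {"range-1": {"k1": "v1"}, "range-2": {}}  # missed writes while down
print(ranges_to_stream(healthy, returning))  # → ['range-2']
```

Only `range-2` needs to move, which is what makes repair after a long outage tractable: matching ranges cost one digest exchange, not a data transfer.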
A
So, sure, there are a couple of things you can do with the commit log. First of all, you can actually set up how much commit log space you allow. We didn't cover this in the talk, so just to bring other people up to speed: when each replica goes to do a write, what it does first is write to a little space on the disk called the commit log, which is how it knows that the write has been recorded somewhere.
A
Then it will go through the rest of the path: it stores it in memory for a while and has an in-memory table, so it's actually a much cheaper operation to query data that has most recently been active. And then, after a while, that will grow to a size where it needs to be flushed to disk, and that's when we start getting these things called SSTables; those are the on-disk version.
A
And so, when you do a read, you kind of combine what's on disk with what's in memory. If the machine were to somehow just go down, say you kicked the plug out, then when it starts back up it can go to the commit log and replay everything that's in the commit log to get itself back up to the current state, and then it sort of brings itself online and says, now I'm ready to talk to people. So the commit log is this safety mechanism, and it can grow. One of the things you can do is set a limit on it.
A
In other words, there was that in-memory table I was talking about, and eventually it will overflow and go on to disk as SSTables. We don't want to remove the only disk copy we have until we make sure that we actually have the SSTable disk copy. Once that's done, we kind of mark those commit log segments as: go ahead, get rid of it.
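The write path just described, commit log first, then memtable, then an SSTable flush that frees the commit log, can be mirrored in a miniature sketch. `ToyNode` is a hypothetical illustration of the ordering guarantees only; real Cassandra is far more involved (segmented logs, compaction, bloom filters, and so on).

```python
# Miniature sketch of the write path: append to a commit log first, buffer
# in a memtable, flush to an "SSTable" when the memtable fills, and only
# then discard the covered commit-log entries. Illustrative only.

class ToyNode:
    def __init__(self, memtable_limit: int = 2):
        self.commit_log = []   # durable record of every write
        self.memtable = {}     # recent writes, cheap to query
        self.sstables = []     # immutable "on-disk" tables
        self.memtable_limit = memtable_limit

    def write(self, key, value):
        self.commit_log.append((key, value))  # durability first
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        self.sstables.append(dict(self.memtable))  # new SSTable on disk
        self.memtable.clear()
        self.commit_log.clear()  # safe: the data now lives in an SSTable

    def read(self, key):
        # A read combines the memtable with the SSTables, newest first.
        if key in self.memtable:
            return self.memtable[key]
        for table in reversed(self.sstables):
            if key in table:
                return table[key]
        return None

node = ToyNode()
node.write("a", 1)
node.write("b", 2)   # fills the memtable and triggers a flush
node.write("a", 3)   # newer value lives in the memtable
print(node.read("a"), node.read("b"))  # → 3 2
```

Notice that after the flush the commit log is empty again: replaying it on restart only ever needs to cover writes that have not yet reached an SSTable, which is exactly the recovery story described above.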
A
But it's clearly a question from somebody who's done some operational work with Cassandra. It's a good question; it's just relatively targeted to the operational use. It's really good.
B
Right, perfect. All right, any of the questions that we didn't get to, we'll try to get to in the blog post. So thanks again, everyone, for joining. We will be sending out a recording of this video, along with the slide decks, within the next 48 hours. Sorry for the technical difficulties, and thanks again for coming. Thank you.