Description
Speaker: Mohammed Guller, Application Architect & Lead Developer at Glassbeam
Learn how Cassandra can be used to build a multi-tenant solution for analyzing operational data from Internet of Complex Things (IoCT). IoCT includes complex systems such as computing, storage, networking and medical devices. In this session, we will discuss why Glassbeam migrated from a traditional RDBMS-based architecture to a Cassandra-based architecture. We will discuss the challenges with our first-generation architecture and how Cassandra helped us overcome those challenges. In addition, we will share our next-gen architecture and lessons learned.
So, a couple of things before I get started. As I'm going through the presentation, if you have any questions, feel free to ask right away; you don't have to wait till the end. The other thing is, sometimes I talk very fast, and once you combine that with my heavy Indian accent, it could be hard to understand me. So if that's happening, just raise your hand and ask me to slow down, okay?
So let me first introduce myself: I'm an application architect and a lead developer at Glassbeam. I'm lucky to have that role because I'm passionate about both things; I enjoy designing new things and then building them out, so it's fun to get to do both. Before I joined Glassbeam I was working on my own startup. We built two products. The first one was an idea-discussion platform that allowed people to discuss new ideas, new product ideas, new business ideas, rate them, and have qualitative discussions on those ideas. The other product that we built was TrustStrikes, which was a social recommendation engine solving the same problem that Yelp is solving, but leveraging your social network.
Okay, so it looks like maybe around twenty to thirty percent. How many of you have just started learning Cassandra? Okay, so the majority of the people are in that other segment. Okay, what about IoT? It's a pretty hot word; a lot of people talk about it. How many of you are actually working on IoT? Okay, so roughly twenty percent. And how many of you have read about it? Okay, and what about the rest of the folks? So it looks like some people are lazy and probably have not been reading much about IoT. It's a really hot word; a lot of companies are talking about it, you hear about it in the news, and we'll go through some of that during my presentation.
Okay, the other thing is, in terms of your background, how many of you are on the technical side? And by technical I mean development or operations. Okay, so it looks like a big chunk of the audience. And how many of you are on the business side, product management, marketing? A few, okay. So it looks like most of the crowd is technical; that's great. So before I get into the meat of my presentation, I want to set the stage by defining the problem. For those of you who are familiar with IoT, you know that it's in the news a lot these days, and the data from IoT is exploding.
There are devices that are generating huge amounts of data. According to a study done by Cisco and IDC, it is estimated that by the year 2020 there will be tens of billions of connected devices, roughly ten for each individual. That basically means we'll have smartphones, smart glasses, smart shoes; everything is going to be smart. That's essentially what it means.
The same study also points out that forty-two percent of the data is going to be generated by machines. To put that into perspective, the right-hand side of this slide shows how the size of the data being generated has grown over a period of time. Until the 1980s, most of the data was getting generated by apps; it was structured data, and not a huge volume.
It would take years before it got to a stage where there were terabytes of data. Then the internet took off in the 1990s, and suddenly you had people sending emails, sharing pictures, posting videos, and that data eclipsed all the data that was generated previously. But the next big wave is the data that's going to come out of IoT. That's going to be huge, a lot more than what people have been generating so far, and this data presents new challenges.
The three main ones, and some of you may have heard about these from other places too, are volume, variety, and velocity. Volume is in terms of the amount of data: there are instrumented devices that can generate terabytes of data on a daily basis. By variety, I mean that earlier, going back to the older days, apps generated structured data.
Now you have this machine data, which is not just structured but can also be unstructured or multi-structured; later in my presentation I will describe what I mean by multi-structured. And then the third key attribute of IoT is velocity. When humans are generating data, it's at human speed; when machines, or the Internet of Things, are generating data, it's at machine speed. It's going to come at a much faster pace, which means the kind of technology needed for consuming that data has to be totally different.
But IoT also presents new opportunities, so it's not just that you have new challenges. Multiple groups across an organization can benefit from that data, and I've listed some of the key groups that do. The first one is remote support. By leveraging the machine data, a support organization can be proactive instead of reactive. So instead of waiting for the customer to call and say, hey, my system is not working now...
Support can actually analyze the data that they are getting from the machines that are out in the field, see if something is not working, and then proactively take steps to fix it. Imagine how happy a customer would be if he gets a call from support and the support guy says: Mr. Customer, I see that one of the components in your system is having some problem that's going to crash the system in a few days, so we are sending you a replacement part.
Imagine what the reaction would be when the customer hears that. The second benefit that support gets from the machine data is that they can actually lower the mean time to resolution. Those of you who have tried to solve problems remotely know that you need a lot of information when you're trying to fix something, and with machine data, all that information is there. That means support can be a lot quicker in resolving whatever issues the customer is facing. As a result of being proactive and being able to resolve issues much faster, you make the customer happy, customer satisfaction goes up, and your support costs go down, right? Because instead of taking thirty minutes, if you can do the same thing in ten minutes, the cost goes down significantly and productivity goes up. The second group that benefits from machine data is marketing.
They can actually see how the product is getting used in the field and what the adoption curve is. I'll give you an example. Let's say you have released a product that has 20 features. How do you know which features are getting used and which are not? If you're getting data from the machine itself, then it's pretty easy to analyze that and see what's going on. Similarly, let's say you've released multiple products over a period of time. How do you know which products have been deployed out in the field? A lot of times customers will buy something but not really deploy it. If you have access to the machine data, then you can see, over a period of time, what the adoption curve is for different models.
Similarly, engineering gets insights from how the product is getting used, and they can build better products for their customers. And then the last group to benefit is sales: they can discover upsell and cross-sell opportunities by analyzing the machine data. A good example would be, let's say you are a storage vendor selling storage to your customers. If you can see in the machine data that some of your customers are already at eighty percent capacity, you know that those customers are going to need more storage, right?
So there's one more thing I need to define before we continue: the Internet of Complex Things. So far we have talked about IoT; the Internet of Complex Things is a subset of IoT, and it includes systems that provide complex functionality. I've shown some examples here on the slide. In a data center, this would include your legacy servers, storage systems, routers, switches, and security devices. In a hospital setting, this would be the equipment that surgeons use for surgery, for example. In a lab environment, it could be whatever the lab technician is using. These are all getting internet-connected now. Similarly, there are industrial devices that companies like GE are making that are all connected and sending tons of data back to the product manufacturer. In the automobile sector you have these new cars, for example the cars that Tesla makes; they're heavily instrumented.
So what does Glassbeam do? We offer a SaaS-based product that allows our customers to do analysis of their structured or unstructured machine data. As input, we take whatever operational data the devices are generating in the field; the customer sends it to us, and on the other side we basically provide a bunch of apps that allow them to do whatever analytics they want to do on that data.
If you look closely at what's shown here, you'll notice that the data actually has multiple sections, and each section has a different layout. There's another thing that's very different in each section, and that's the frequency at which the data changes. If you look at the first section, which I've marked as static, it has key-value pairs, and that data very rarely changes across the lifetime of a product. In some cases it might be every year; in some cases it might be never.
If you look at the next section, the config section, it has a different layout, and that information is going to change more frequently. The third one is the statistical information; again, it has a totally different layout than the config section, and that one is going to change a lot more frequently. In some cases it might be 10 times a day, in some cases it might be every minute, in some cases every second. And then the last one is the logs.
To do this, we created our own language, which we call SPL; that's our core IP. Even though the name says Semiotic Parsing Language, it doesn't just let us do parsing, but a lot of other functionality as well. It allows us to specify the parsing rules: how exactly to parse multi-structured or unstructured data, where to store it, how to store it, what kind of search capabilities we want to provide on that data, and what kind of analytics transformations to apply.
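SPL itself is proprietary and the talk doesn't show its syntax, so as a rough illustration of the kind of per-section parsing rules such a language encodes, here is a minimal Python sketch. The section names and file format are hypothetical, not Glassbeam's actual format:

```python
# Hypothetical multi-structured machine log: each section has its own layout.
SAMPLE = """\
[static]
serial=X100
model=Array-9
[config]
disk1,500GB,online
disk2,500GB,online
[stats]
2015-09-22T10:00:00 cpu=12 mem=40
2015-09-22T10:01:00 cpu=15 mem=41
"""

def parse(text):
    """Split the input into sections, then apply a per-section parsing rule."""
    sections, current = {}, None
    for line in text.splitlines():
        if line.startswith('[') and line.endswith(']'):
            current = line[1:-1]
            sections[current] = []
        elif current is not None:
            sections[current].append(line)
    out = {}
    # static: key=value pairs that rarely change
    out['static'] = dict(l.split('=', 1) for l in sections.get('static', []))
    # config: CSV-like rows that change occasionally
    out['config'] = [l.split(',') for l in sections.get('config', [])]
    # stats: timestamped key=value samples that change constantly
    out['stats'] = [
        {'ts': parts[0], **dict(kv.split('=') for kv in parts[1:])}
        for parts in (l.split() for l in sections.get('stats', []))
    ]
    return out
```

Calling `parse(SAMPLE)` yields one structure per section, each parsed with a different rule, which is the essence of what a parsing-rule language has to express.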
So this is a sixty-thousand-foot view of our solution. Customers send us their unstructured machine data. We have an SPL defined for each customer, and using that SPL we are able to extract meaning out of that data, and then on the other side we have the apps that our customers use. And this is what the first-generation architecture looked like for our product.
The input is the same as you saw in the previous slide, and then we have a parser that applies the SPL rules to the incoming data, extracts the data, and puts it into an SQLite database. Once that part is done, an ETL process kicks off, takes that data, and puts it into a data warehousing platform, which in this case is Vertica. Then we have another ETL program that kicks off at that point, takes a subset of the data from Vertica, and puts it into MariaDB, and then you have a web app that the users can use. This worked; it's actually not a bad architecture.
This is per product. So the question was: do you have to define a different SPL for each vendor? Yes, the SPL is per product. So the first challenge that we ran into was that ingestion speed was slow, because we were using a traditional RDBMS, which is read-optimized, so the writes were not as fast as we would have liked. Ingestion speed was the number one problem.
The second problem was that it was difficult to make schema changes, and it was happening quite often. Sometimes we would parse the data, and then later on the customer would say: there's some important stuff that you guys missed, can we start parsing this too? So we had to reparse the data, and when we did that, we also had to make schema changes.
Now, schema changes are okay if you have a small amount of data, but once you have a few terabytes of data, it's not that easy; it was hard. And then reloading the data was painful too. Let's say you have been parsing data for six months, and then the customer comes in and says: we need you guys to parse some additional files that we were not sending earlier. So we would have to go back, reparse everything, and reload that data, and that would take weeks and weeks, which was not acceptable.
The other thing was that it was costly to scale this infrastructure. What I showed on the previous slide is not a multi-tenant solution; we were building this for each customer, so every customer had an instance of what's shown here. That meant every time we got a new customer, we had to deploy new infrastructure with the same set of tools, and it was painful to deploy it again and again; managing multiple instances was an operational headache. So we decided to rewrite everything from scratch.
Okay, let's redesign everything. This is what the next-generation architecture looks like. There's a lot of information here, so let me take a few minutes to go over each item. The input is still the same: we get streaming data and files in unstructured or multi-structured format, and then we have SPL.
The first change was that we rewrote the program that did the parsing. It's written in Scala now, and it's an order of magnitude faster than the previous parser that we had. But what that also meant is that now we needed a data store that could keep up with the parser, so we had to replace the data store layer. We have Cassandra, we have S3, we have SolrCloud, and we have Postgres.
Typically what happens is, as we are getting data, once it gets parsed, the parsed data goes both into Cassandra as well as Solr; it gets written to both locations. The raw data, in its original format, gets stored on S3, and a subset of the data that goes into Cassandra also gets extracted into Postgres. As I go through the apps, I'll explain why we do that.
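As a minimal sketch of that fan-out write path, with the real Cassandra, Solr, and Postgres clients stubbed out as in-memory sinks and the subset predicate made up for illustration, the ingestion step looks roughly like this:

```python
class Sink:
    """Stand-in for a real store client (Cassandra, Solr, Postgres)."""
    def __init__(self, name):
        self.name, self.rows = name, []
    def write(self, record):
        self.rows.append(record)

def ingest(record, cassandra, solr, postgres, subset_pred):
    # The parsed data goes to both Cassandra and Solr.
    cassandra.write(record)
    solr.write(record)
    # Only a subset is extracted into Postgres (used for ad hoc BI queries).
    if subset_pred(record):
        postgres.write(record)

cass, solr, pg = Sink('cassandra'), Sink('solr'), Sink('postgres')
for rec in [{'type': 'stats', 'cpu': 90}, {'type': 'log', 'msg': 'ok'}]:
    ingest(rec, cass, solr, pg, subset_pred=lambda r: r['type'] == 'stats')
```

Both primary stores receive every parsed record, while Postgres only gets the rows matching the extraction predicate; the raw file would additionally be archived to S3 before parsing.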
The data layer is fronted by a middleware, so customers never get access directly to the database. Even our own apps never access the database directly; they all go through this middleware layer, which we call the Info Server, and it provides a set of REST APIs. All the data is accessed through those APIs. So let me quickly go over the apps. The first app is the Log Vault.
That's what allows customers to get access to the raw data. If they want to go back and look at the data that they've been sending to us, and be able to filter by date or time or whatever else, they can do that, and that's all backed by S3. The Explorer app provides search engine functionality.
If a customer wants to do full-text search on some data, they use our Explorer app, and that's where SolrCloud comes into the picture; we're using the Solr and Lucene engines on the back end to do that work. Workbench, our BI tool, allows customers to do ad hoc analytics, and that's the reason why some of the data gets extracted into Postgres; I'll go into more detail on why we do that later on. Standard apps are the out-of-the-box analytics that we provide to our customers, so they don't have to create anything.
The rules and alert engine is an interesting one; it allows our users to create complex rules. To give you an example, let's say you are a storage vendor again, and you want to know when a certain customer has reached eighty percent capacity utilization. You can create a rule saying: okay, if a customer crosses such-and-such threshold, generate an alert for me. So you don't have to be constantly monitoring the system.
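The actual rule syntax isn't shown in the talk; as a toy illustration of the threshold-alert idea from the storage example (the field names and record shape are invented), a rule can be sketched as a small predicate-plus-message pair:

```python
import operator

def make_rule(field, op, threshold, message):
    """Build a rule that fires an alert when a metric crosses a threshold."""
    ops = {'>=': operator.ge, '>': operator.gt, '<': operator.lt}
    def evaluate(record):
        if field in record and ops[op](record[field], threshold):
            return message.format(**record)
        return None  # rule did not fire
    return evaluate

# Example: alert when a customer reaches 80% capacity utilization.
capacity_rule = make_rule('capacity_pct', '>=', 80,
                          'ALERT: {customer} is at {capacity_pct}% capacity')

records = [{'customer': 'Acme', 'capacity_pct': 85},
           {'customer': 'Initech', 'capacity_pct': 40}]
alerts = [a for a in map(capacity_rule, records) if a]
```

Evaluating the rule over incoming records replaces constant manual monitoring: only the record that crosses the threshold produces an alert.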
So one of the questions I get asked is: why did you guys choose Cassandra? There are so many options; last I heard, there were 150 NoSQL databases, but I met a friend today and he said there are a lot more than that. So why Cassandra, why not something else? This goes back to the challenges that I mentioned earlier with IoT data: the three key attributes were volume, variety, and velocity, and Cassandra let us handle all three of those challenges very elegantly.
First, the capability we liked in Cassandra was the linear scalability. It allows you to easily scale from gigabytes to terabytes, so you don't have to build an infrastructure up front for handling terabytes of data. You can start with a very small cluster and, as you start getting more and more data, you can keep adding nodes and easily scale. That addresses the volume challenge. The other one was variety: as I showed you, the multi-structured document had different characteristics, different layouts, and different change frequencies.
A
Not
a
scientist
supports
dynamic
schema,
which
makes
it
really
easy
to
consume
that
kind
of
data.
So
we
compared
to
what
we
were
doing
earlier
in
the
are
DBMS
life
became
lot
more
easier,
it's
much
easier
to
model
that
multi
structured,
it
and
casaya,
and
then
the
last
one
is
velocity
again.
Cassandra
is
right,
optimized,
so
your
rights
are
extremely
extremely
fast
and
that's
what
we
need
it.
So these were the three main reasons why we chose Cassandra, but it actually provided one more benefit that helped us on the operational side: it allowed us to build a multi-tenant architecture. Earlier, we had separate infrastructure for each customer; now we have one infrastructure for all the customers. There's one Cassandra cluster, one keyspace, and one set of column families for all the customers, and I don't think it would have been possible to build something like this using any other technology.
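The talk doesn't show the actual schema, but a common way to get one set of column families serving every tenant is to lead the partition key with a customer id, so each tenant's rows live in their own partitions and every query is tenant-scoped. A sketch that just builds the CQL as plain strings (the table and column names are made up, not Glassbeam's):

```python
def stats_table_ddl():
    # One table for all tenants: customer_id leads the partition key,
    # so each tenant's data is physically isolated by partition.
    return (
        "CREATE TABLE device_stats ("
        " customer_id text, device_id text, ts timestamp,"
        " metrics map<text, double>,"
        " PRIMARY KEY ((customer_id, device_id), ts))"
        " WITH CLUSTERING ORDER BY (ts DESC)"
    )

def stats_query(customer_id, device_id):
    # Every read includes the tenant id, so cross-tenant scans never happen.
    return (
        "SELECT ts, metrics FROM device_stats "
        f"WHERE customer_id = '{customer_id}' AND device_id = '{device_id}'"
    )
```

With this shape, adding a tenant is just new rows, not new infrastructure, which is the operational win the talk describes.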
So what do we store in Cassandra? It's our main data store: all the data that comes in, once it's parsed and we've extracted meaning out of it, goes into different column families within Cassandra. We also store the metadata, so if somebody else wants to see what's in the data, they can get a lot of information out of that metadata column family. Our apps are pretty flexible.
For example, what color scheme to use, or whether certain data should be shown in the form of a bar chart or pie chart or some other kind of graph, all of that is driven through configuration, and that configuration is stored in Cassandra as well. So it makes it really easy to change the layout of the apps.
We also keep statistics about the usage of the apps. Let's say a customer bought a license for 100 users; by looking at the stats, we know exactly whether a hundred people are using it or only 50. And if only 50 people are using it, that means there's some problem, so then you try to figure out what you can do to make sure that everybody is using it, right?
You can also see how exactly your apps are getting used, where people are spending time, and what the flow is, so that you can optimize things and make the user experience a lot better. And then the last thing stored in Cassandra is the journal. For all the data that comes in, as we are processing it, we keep a journal; that way it's easy to go back, audit, and see what happened as the data came to us.
So, a few words of wisdom here. It might not be anything new for those of you who have been working with Cassandra, but for people who are just beginning to learn Cassandra, just starting their journey, I thought this might be useful. The first thing is that the data model is important, so you need to be really, really careful about how you're going to model your data; of course, the RDBMS model is not going to work.
You need to understand how Cassandra stores data and what kind of queries you are going to run. Everything is driven by your queries, basically, since there are no joins; you can't really do joins in Cassandra. Sometimes you do have to, and then you'll have to do it in the application, which is painful. So spend time understanding what the query patterns are, and use that to create all the column families and design your data model.
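A minimal sketch of that query-first approach, using plain dicts in place of column families (the table and field names are invented for illustration): the same event is written once per query pattern, so each read is a single lookup with no join.

```python
# Two query patterns -> two "tables" over the same events (denormalized).
events_by_device = {}   # pattern 1: "all events for a given device"
events_by_type = {}     # pattern 2: "all events of a given type"

def record_event(device, etype, ts, payload):
    # Write the event into every table that serves a query pattern.
    events_by_device.setdefault(device, []).append((ts, etype, payload))
    events_by_type.setdefault(etype, []).append((ts, device, payload))

record_event('dev1', 'error', 1, 'disk fault')
record_event('dev1', 'info', 2, 'rebuild started')
record_event('dev2', 'error', 3, 'fan failure')

errors = events_by_type['error']          # pattern 2: one lookup, no join
dev1_events = events_by_device['dev1']    # pattern 1: one lookup, no join
```

The trade-off is extra writes and storage in exchange for reads that never have to combine tables, which is exactly the trade Cassandra's write-optimized design favors.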
The other thing is: avoid queries that return large amounts of data.
One reason not to do that is that it's slow. Cassandra is really fast, extremely fast, if you're doing point reads, but if you try to read, let's say, millions of rows, it's not a good use case. The other thing is, if you're reading large amounts of data, it can cause a lot of problems, garbage collection and things like that, and if you don't have enough memory, you can also get out-of-memory errors.
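One common mitigation, when a large result genuinely has to be read, is to pull it in bounded pages rather than one huge read; the real Cassandra drivers do this with a fetch size and paging state, but the shape of the pattern can be shown over a plain in-memory list:

```python
def paged_read(rows, page_size):
    """Yield results one bounded page at a time instead of one huge read."""
    for start in range(0, len(rows), page_size):
        yield rows[start:start + page_size]

rows = list(range(10))
pages = list(paged_read(rows, page_size=4))
```

Each page is small enough to hold in memory, which sidesteps the garbage-collection pressure and out-of-memory errors that a single giant read can cause.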
Okay, so what are the other lessons that we learned as we went through that experience? One thing was that ad hoc queries are difficult, and this is kind of by design, I guess. In Cassandra, everything is driven by query patterns: you know your queries, you design the column families; and by definition, in the case of ad hoc queries, you don't know what the queries are.
The other thing: you'll see a lot of BI tool vendors here, like Tableau, Pentaho, and everybody else; they have announced support for Cassandra. If you go on their website, you'll see a sign that it's supported, but the amount of capability that they provide is nowhere close to the capabilities that they provide for a traditional RDBMS. And then the last thing is about performance. Some of the performance-related stuff is pretty obvious, right? Your performance depends on your cluster size.
Obviously, if you say, I'm going to store two terabytes of data in just two nodes, that's probably not a good thing; you need to distribute the data. So it depends on your cluster size, and on the node characteristics: how much memory you have and what kind of disk and memory you have on each node. Those are some of the basic, hardware-related things, but there are a few additional things, and those are related to your data model and data.
So whatever numbers you see published by somebody may not apply to you. Performance also depends on how exactly your column family looks: how many clustering keys you have and what kind of data you're storing in those keys. That can impact performance by a huge degree. So as you start deploying or building this, make sure you test it with your own column families and your own schema; put in some data that's reflective of the kind of data that you'll be storing there, and then do your own benchmark.