Apache Cassandra Cassandra Summit 2013, 26 Jun 2013

From YouTube: C* Summit 2013: Ground Traffic Control - Logistics with Cassandra

Description

Speaker: Jesse Young, Director of Research at Zonar Systems
Slides: http://www.slideshare.net/planetcassandra/2-jesse-young
Come learn about how Zonar Systems uses Cassandra for logistics use cases such as tracking fleets of school buses and other fleet management services. Zonar uses Cassandra because of its ability to scale horizontally, its continuous availability and operational ease. This talk will cover details about the implementation and our 3-year journey that got us here, including the challenges along the way.

A

I'm Jesse Young. I'm the vice president of software development at Zonar Systems. Today I'm going to talk to you about ground traffic control, or logistics data with Cassandra. Really quickly, today we'll go through an overview of who Zonar Systems is.

A

As we know, with Cassandra we need to know what kind of data we're dealing with. I'll talk about some of the technical challenges that we've had at Zonar and why that's led us into using Cassandra. I'll give a couple of examples of how we're using Cassandra today and where we're going in the future, and then the road to Cassandra and how we got to be using Cassandra.

A

So, a quick overview of Zonar: we're a Seattle-based company. We deal with heavy fleet telematics. Now, what is that? Heavy fleet is any vehicle that's over 10,000 pounds or carries over eight passengers. Those are our specific target customers. We do deal with lightweight vehicles, Ford F-150s, etc., but our main customer base is those heavy fleets. Fleet telematics is just the collection of all the data from these fleets: GPS data, fault codes, any kind of data that's really going to help those fleets function.

A

Really, what we are is a hardware-enabled software-as-a-service company. What that means is we create a hardware device. It's a GPS receiver with a GSM modem in it. We do engine diagnostics and connect up to the engine computers of all these vehicles as well, and then we offer a SaaS-based application for this. So we host all of the customers' data. We offer a web front end for those customers to get the data.

A

Then we also offer a nice API for those customers to get the data and bring it into their back-end software. We know that APIs are huge, and that's something that we've always focused on. We're an open source company, similar to DataStax and Cassandra users. We started out and really noticed that customers needed access to their data, and we wanted to make sure that we offered that to the customers. So what kinds of data are we really dealing with over at Zonar? Well, first and foremost, we started as a safety inspection company.

A

We weren't dealing with GPS data to start. We were doing DOT-required pre- and post-trip inspections. These are inspections that drivers are required to do to make sure these heavy fleet vehicles are safe to be driving out on the road with all the rest of us commuters. We've now been tracking GPS data for the last eight years.

A

This is a lot of data: latitude, longitude, heading, speed, all sorts of extra information that we're collecting about these vehicles just so that we can offer the best possible reporting to our customers. We're getting into vehicle diagnostics, so we're actually tapped into the engine computer of these vehicles and we're collecting all sorts of information.

A

Oil temperatures, stop-engine lights, check-engine lights: really anything that's out there on that engine computer, we're able to collect, and we're pulling that information and doing all sorts of fun analytics for our customers on it. We're even starting now to get into photos with our inspection device, an Android tablet. When these drivers are out doing their pre- and post-trip inspections, we know that a picture's worth a thousand words, so rather than the driver trying to explain exactly what they're seeing is wrong with the vehicle, they're going to be able to take a photo.

A

This is just opening up a world of opportunities for us, so there's just all sorts of data that we're trying to collect and give quick, fast reporting on. So, some of the technical challenges we've had at Zonar. Again, we've been around for about 13 years. We've got over a hundred database servers now. That's over 3,000 databases, due to the way that we've sharded these databases. It's a large amount of data. It now constitutes over a hundred terabytes of data.

A

So what does that start bringing in for us? The big data issue. We've got all sorts of problems that we're trying to deal with, and although your typical RDBMS is capable of handling those, it's not always the best solution. We need fast data replication, and I need to be able to do it easily. You know, I can do all sorts of replication with all the RDBMS tools out there, but they're not easy. They're not easy to administer.

A

I need to maintain fast inserts and fast retrievals from the same data store. I don't need to have an OLTP database and a data warehouse; I need to be able to do all of these things in one fast place. I need to be able to horizontally scale easily. We have this done fairly well with our RDBMSs now, but it's not as easy as just adding a couple of nodes, putting in some tokens, and being done with it. And last but not least is easy to administer.
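The "putting in some tokens" step refers to giving each node a position on the ring. The following is a minimal sketch of how evenly spaced tokens can be computed, assuming Cassandra's RandomPartitioner (token space of 2**127) and manually assigned tokens. The function name is illustrative and not something shown in the talk; clusters on a different partitioner or on virtual nodes would not assign tokens this way.

```python
def initial_tokens(node_count):
    """Return evenly spaced initial tokens, assuming RandomPartitioner."""
    if node_count < 1:
        raise ValueError("node_count must be at least 1")
    ring_size = 2 ** 127  # size of the RandomPartitioner token space
    return [i * ring_size // node_count for i in range(node_count)]


if __name__ == "__main__":
    # Example: token positions for a ten-node ring.
    for node, token in enumerate(initial_tokens(10)):
        print(f"node {node}: initial_token = {token}")
```

Each value would go into that node's `initial_token` setting, so every node owns an equal slice of the ring.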

A

We want a system where we're not having to have constant DBAs. We don't want to have to continue adding system administrators and systems engineers just to scale out horizontally. So our typical RDBMS just starts not being relevant anymore. It's got its place and it's a great solution, but we needed something better. So we got our big data solution in Cassandra.

A

As we all know, Cassandra has built-in data replication. This makes it very easy to continuously add nodes, add data centers, and add extra rings if we need more performance. We need that fast data insert; this is something Cassandra's done very, very well. Same thing with fast retrieval of data: I can insert the data and pull it out at the same time and have very, very high throughput. It's very easy to administer. We've done this at Zonar for the last couple of years.

A

Now we have a ten-node ring that's running. My administrators very rarely have to do anything with it, and the only solution they typically need is to restart the Cassandra service, or maybe restart the server itself. Never any other issues beyond that. One of the other things that we happened to come upon with Cassandra was the need for TTLs, and Cassandra supports this very well. We're dealing with some DOT compliance data and other data that our customers are required by law to keep around for a specific amount of time, but after that required amount of time, some of them don't want that data around.

A

If you're putting that into a typical RDBMS, you're having to do delete statements or partition the data a certain way. This causes a lot of problems, and then you have to do vacuum statements and all sorts of fun stuff like that. We've just been able to avoid a lot of that with Cassandra by using TTLs. So, some quick examples of how we're using Cassandra right now. We've got our photos that we're just starting to collect.
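As a rough illustration of the TTL approach described above, the sketch below builds a CQL `INSERT ... USING TTL` statement, where the TTL is given in seconds. The table name, column names, and one-year retention period are hypothetical and not taken from the talk, and the code only builds the statement text; it does not connect to a cluster.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def ttl_seconds(retention_days):
    """Convert a retention period in days to the seconds CQL expects."""
    return retention_days * SECONDS_PER_DAY


def insert_with_ttl(table, columns, retention_days):
    """Build a CQL INSERT with bind markers that expires after retention_days."""
    column_list = ", ".join(columns)
    markers = ", ".join("?" for _ in columns)
    return (
        f"INSERT INTO {table} ({column_list}) VALUES ({markers}) "
        f"USING TTL {ttl_seconds(retention_days)}"
    )


if __name__ == "__main__":
    # Hypothetical table and columns, kept for one year.
    print(insert_with_ttl("inspections", ["vehicle_id", "inspected_at", "report"], 365))
```

Once the TTL elapses, Cassandra expires the row on its own, so no separate delete or cleanup job is needed for that data.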

A

Now, really, for this we need cheap storage. As we start collecting millions and millions of photos, we don't want to put that on big, expensive SANs or anything that's just going to be cost-prohibitive. With Cassandra we're able to use our commodity hardware, the data gets replicated, and it's relatively cheap. We can grow this capacity easily over time by just adding additional nodes. I don't have to buy all of that initial infrastructure right up front. I can start out with three nodes, six nodes, and continue to grow it.

A

This is another area where those TTLs just make a lot of sense. Certain photos don't need to be around forever, so I don't want to have to continuously figure out what needs to be deleted. We set the TTL when we store it, and then the photos are gone when they're gone. Another big use case that we've got is elevation data. We're starting to get really big into analytics with the data we use.

A

One of those key factors for us was elevation data: knowing where these vehicles are traveling and what elevation they were at. This is a really large data set that our engineers have had to work on. The data gets loaded once into the system, and then it's just read heavily. We might update this once every year if we have to, but elevations aren't changing around the world very often, at least, and so we just needed those very heavy reads.

A

We found, when we were looking into the solution, that it was just going to be a really quick key-based application. We needed scalable performance. We know that within the first year we're going to need to do bursts of up to six thousand reads per second and over a hundred and fifty million reads per day, and that's just in the first year. As we continue to add more devices to our system, that's even more reads and more performance that we're going to need to be able to deliver.
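To put the read figures above in perspective, here is a small arithmetic sketch that converts the daily total into an average rate and compares it with the quoted burst rate. The helper names are illustrative; the two input numbers are the ones stated in the talk.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def average_rate(events_per_day):
    """Average events per second if the daily total were spread evenly."""
    return events_per_day / SECONDS_PER_DAY


def burst_to_average(burst_per_second, events_per_day):
    """How many times larger the burst rate is than the daily average."""
    return burst_per_second / average_rate(events_per_day)


if __name__ == "__main__":
    reads_per_day = 150_000_000
    burst_reads_per_second = 6_000
    print(f"average: {average_rate(reads_per_day):.0f} reads/s")
    print(f"burst/average: {burst_to_average(burst_reads_per_second, reads_per_day):.1f}x")
```

Spread evenly, 150 million reads per day is about 1,736 reads per second, so a 6,000 reads per second burst is roughly 3.5 times the average load.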

A

We've got another application that we've jumped into, which is Z Pass, and what this is is tracking bus ridership. So we need to know when people are getting on vehicles and getting off of the vehicles. We need to know that these vehicles are being utilized to their fullest capacity. If you've got a bus driving around with five passengers in it, you're not being very economical with your vehicle or the fuel that you're using. So for this we needed to be able to read and write very heavily at the same time.

A

These are very small bursts of traffic throughout the day, typically two big peaks. A lot of people don't ride the bus at midnight or early in the morning; it's typically a couple of times throughout the day. So we needed a way for millions of users to actually access this data, and we needed to be able to do at least 20 million writes per day just in the first year. Again, we know that as we continue to add vehicles and passengers, there are even more writes that are going to happen over time.

A

We needed to be able to scale horizontally: the never-ending story for everybody. You want to be able to easily scale horizontally. So that was something that we really knew Cassandra's been able to do, and one of the fun things for us was that we just took a look at Twissandra. We've got a very basic app that reflects a Twitter-type feed. Twissandra is a big example that's constantly given out there. We were able to look at that code, adapt it, and use it to help us do some rapid development.

A

So, the road to Cassandra usage. It's kind of been a long road for us, and I'm hearing this reflected over and over again. The talk by Accenture kind of talked about this; it was great to hear. There are many resources out there for you to start using Cassandra. One of our system architects, Josh Hansen, really started with DataStax, or with Cassandra, early on, way back in the days of Riptano. He found this little application that was out there, open source, and just went to the first summit.

A

That's one of the key ways that you can continue learning about Cassandra and ways to utilize it, and you're all here, so you're already well onto that track.

A

Training and consulting has been really big for us: just getting some of those experts in there and being able to train us, consult, and show us the right way. We've used DataStax a number of times just to help us do that and do some data modeling. We've had Matt Dennis come over to our offices a few times and really help us, and not only during those consulting times; he's available in the DataStax and Cassandra community through multiple ways.

A

There's the Planet Cassandra community that's out there; that's really good. I'd highly suggest you go to the meetups. Those have been just great to get more informed with the local community and people that can help you locally. Twitter is another great resource for everyone to use. We've got IRC. The IRC community is great for Cassandra.

A

We've had plenty of times where we've asked questions on how to do specific things, or about issues that we've come across, and had some great help there on IRC.

A

One last thing that I'd like to point out, too, is a way that you can rapidly develop with Cassandra, specifically using DSC and AWS, and we found this to be really important for us. Using the AWS AMIs, we're able to quickly use a couple of Python scripts with about two or three lines each, and we can bring up a DSC ring in five minutes. That could be a six-node ring or a twelve-node ring, and we can just really rapidly bring those systems up, test them, and take them down.

A

Now, it's nice from a management leadership position: I don't have to acquire servers for my developers to use to test everything out. They can quickly do it and tear down the system, and it's very, very cost effective for us. Specifically, with that we can load up a hundred or two hundred gigabytes of test data and really start testing queries against it in under 30 minutes. It's just a really huge thing for us.
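For a sense of scale, the load figures above imply a minimum sustained ingest rate across the test ring. This is only back-of-the-envelope arithmetic, assuming decimal units (1 GB = 1,000 MB) and a full 30-minute window; the function name is illustrative.

```python
def min_throughput_mb_per_s(gigabytes, minutes):
    """Lowest sustained rate (decimal MB/s) that loads `gigabytes` within `minutes`."""
    return gigabytes * 1000 / (minutes * 60)


if __name__ == "__main__":
    for size_gb in (100, 200):
        rate = min_throughput_mb_per_s(size_gb, 30)
        print(f"{size_gb} GB in 30 min needs at least {rate:.1f} MB/s")
```

Because the load finishes in under 30 minutes, the actual rate would be somewhat higher than these lower bounds of roughly 56 and 111 MB/s.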

A

That's primarily it. We are hiring, just like every other company. If you find geospatial data in Cassandra very interesting, please come talk to me. We're starting to really get into some fun analytics with this data and just looking forward to doing everything else with it. Thank you. Any questions?

A

Volume of our data.

A

Average volume per node? Josh, do you know? 500 gigs. Yeah, it's one of those things where we really wanted to move all the GPS data into Cassandra initially, but we also found it's easier to start moving new applications, or bring new applications up, on Cassandra, and that really paves the way to start migrating data over.

A

It just depends; it's random. I would say certain manufacturers are better than others. We've got a very nice relationship with the manufacturer right now, and it's pretty easy. The heavy fleet vehicles are more standardized than, say, your lightweight vehicles, like OBD-II. Those manufacturers vary very, very largely. On the J-bus protocol, which is the heavy fleet protocol, it's a little bit more standardized, with some manufacturers having very custom engine fault codes and things like that.

A

It could be large.

A

Right, yeah. We can do it either off of previously stored data, or we can do that lookup as it's streaming in from the GSM network and store that data at the same time.
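The question is not audible in the recording, so the following is only a sketch under the assumption that "that lookup" is the key-based elevation lookup described earlier in the talk. It shows one lookup function serving both modes: a stored batch of points and a live stream. The grid-cell key scheme, the cell size, and the in-memory dictionary standing in for the key-value store are all hypothetical.

```python
def grid_key(lat, lon, cell_degrees=0.01):
    """Quantize a coordinate to a grid cell; the cell size is an assumption."""
    return (round(lat / cell_degrees), round(lon / cell_degrees))


def enrich(points, elevations):
    """Yield each (lat, lon) point with its elevation attached.

    `points` may be a stored list or a live iterator; `elevations` is a
    dict standing in for a key-based store.
    """
    for lat, lon in points:
        yield {
            "lat": lat,
            "lon": lon,
            "elevation_m": elevations.get(grid_key(lat, lon)),
        }


if __name__ == "__main__":
    # Made-up elevation value for a single grid cell.
    elevations = {grid_key(47.6062, -122.3321): 56.0}
    stored_points = [(47.6062, -122.3321)]

    print(list(enrich(stored_points, elevations)))        # batch over stored data
    print(list(enrich(iter(stored_points), elevations)))  # same call on a stream
```

Because `enrich` is a generator, a streaming caller can write each enriched point as soon as it is produced rather than waiting for the whole batch.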

A

Now, some of you can't see: that's Josh Hansen. He's our system architect. He's the brains, so...

A

We're doing the analytics on our own, really, right now. What we're really doing is joining multiple types of data. With this elevation data, I'm really big into fuel analytics right now. So as these vehicles are driving down the road, how can we help our customers save more money by enhancing their fuel economy? That could be things such as looking at the RPM ranges that the vehicle's running at as it travels down the road, which is where elevation becomes very interesting.

A

If the vehicle is driving up a steep elevation and in a high RPM range, can we get the driver to downshift and actually save a little bit more fuel? It's just a joining of multiple types of data that we're really starting to get into. Most of it's all in Cassandra. It's the easy way to scale and get that read access very quickly.
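The kind of join described above can be sketched as follows: elevation samples are combined with engine RPM readings to flag stretches where the vehicle is climbing steeply at high RPM. This is an illustration only. The record layout, the 5% grade threshold, and the 1,800 RPM threshold are hypothetical and are not Zonar's actual rules.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    distance_m: float   # distance travelled along the route, in meters
    elevation_m: float  # elevation joined in from the elevation data set
    rpm: int            # engine speed reported by the engine computer


# Hypothetical thresholds, for illustration only.
STEEP_GRADE = 0.05
HIGH_RPM = 1800


def grade(prev, curr):
    """Rise over run between two consecutive samples."""
    run = curr.distance_m - prev.distance_m
    if run <= 0:
        return 0.0
    return (curr.elevation_m - prev.elevation_m) / run


def steep_high_rpm_indexes(samples):
    """Return indexes of samples reached on a steep climb at high RPM."""
    flagged = []
    for i in range(1, len(samples)):
        climbing = grade(samples[i - 1], samples[i]) >= STEEP_GRADE
        if climbing and samples[i].rpm >= HIGH_RPM:
            flagged.append(i)
    return flagged


if __name__ == "__main__":
    trip = [
        Sample(0, 100, 1500),
        Sample(100, 101, 1900),  # high RPM but nearly flat
        Sample(200, 108, 2000),  # steep and high RPM
        Sample(300, 116, 1400),  # steep but low RPM
    ]
    print(steep_high_rpm_indexes(trip))
```

Only samples that meet both conditions are flagged, which is what makes the join useful: neither the elevation data nor the RPM data identifies these stretches on its own.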

A

Not so much in Cassandra yet. It is one of the challenges we're having to deal with, and it's not going to be too hard. Typically, we store data with both a collected timestamp and an insert timestamp to deal with those types of requests.
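The question being answered is not audible, so this is a hedged sketch of why keeping both timestamps is useful, assuming the concern is data that arrives late or out of order. The record layout and helper names are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Reading:
    vehicle_id: str
    collected_at: datetime  # when the device recorded the reading
    inserted_at: datetime   # when the reading reached the data store


def in_collection_order(readings):
    """Order readings by when they happened, not by when they arrived."""
    return sorted(readings, key=lambda r: r.collected_at)


def late_arrivals(readings, max_delay):
    """Readings stored more than max_delay after they were collected."""
    return [r for r in readings if r.inserted_at - r.collected_at > max_delay]


if __name__ == "__main__":
    t0 = datetime(2013, 6, 26, 8, 0, 0)
    readings = [
        Reading("bus-1", t0 + timedelta(minutes=5), t0 + timedelta(minutes=5, seconds=2)),
        Reading("bus-1", t0, t0 + timedelta(hours=3)),  # buffered, uploaded later
    ]
    print([r.collected_at for r in in_collection_order(readings)])
    print(late_arrivals(readings, timedelta(minutes=10)))
```

The collected timestamp answers "what happened when", while the gap between the two timestamps shows which records arrived after the fact.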

A

Nothing for our IT applications. The company's about 200 people or so now. The development team is about 30 to 35 developers on staff. We do use some outside consulting just to help weather the storm of all the new applications we're working on.

A

Yeah, it's been something where we've had to keep some of that data in our RDBMS system. With DataStax 3.0, we were kind of excited to get more of those security privileges in there, lock the data down better, and be able to migrate more of it into Cassandra.

A

It's ramped up. We'd started using Cassandra early on just through development cycles. It did have a little bit of a learning curve, and some developers are still learning to utilize it the best way, but it was relatively quick for a lot of developers to start accessing it, especially for reads. The PHP developers typically aren't writing data into Cassandra. It's very easy for them to use phpcassa and really just start reading the data and treating it like a typical data store.

A

With our first application, a lot of developers were writing data within a month or less. So we've been on the road to Cassandra for, again, three or four years now, but we've had a ten-node cluster running in production for about two years now.

A

Any other questions? All right. Thank you very much, everyone.