Description
Speaker: Brian O'Neill (Lead Architect, Health Market Science)
Cassandra provides facilities to integrate with Hadoop. This is sufficient for distributed batch processing, but doesn't address CEP distributed processing. This webinar will demonstrate the use of Cassandra in Storm. Storm provides a data flow and processing layer that can be used to integrate Cassandra with other external persistence mechanisms (e.g. Elasticsearch) or calculate dimensional counts for reporting and dashboards. We'll dive into a sample Storm topology that reads and writes from Cassandra using storm-cassandra bolts.
A
Welcome everyone to this week's edition of our college credit webinar series. Today we are discussing CEP distributed processing on Cassandra with Storm. This is a technology mix that we see more and more of in the community, and I am very excited to say that today we have two speakers with us. We have Brian O'Neill, who is the lead architect at Health Market Science and an MVP for Apache Cassandra, very well known in the Cassandra community. Also joining him today we have Taylor Goetz. Taylor is an expert in Storm, has been in the Storm community since it was open sourced, and has been leading the charge around Cassandra and Storm integration. A couple of housekeeping items: we will take Q&A at the end of this session, so if you have any questions for Taylor and Brian, please use the Q&A tab inside of WebEx and put your questions in there, and I will ask them at the end. Also, we are recording today's session.
B
No problem. Alright, so I'm very excited to do this presentation. The last webinar we did was "Create your first Java application"; with this one we're taking it up a notch. If you're familiar with Cassandra and you're familiar with the CRUD operations on it, this takes it to the next level: once you've got your data in Cassandra, what can you do with it, and what kind of analytics can you perform?
B
How do you integrate it with other systems in your enterprise? So without further ado, we're going to get started. The quick agenda: I'm going to go through the use case, talk about what complex event processing is and what you would use it for, and sort of get motivated to deploy Storm. Then I'm going to hand it over to Taylor, and he's going to do a little bit of background on Storm.
B
He'll cover how to deploy it and what the cluster looks like, go through some code, and do a demo. Then, if we have time, we're going to come back and talk about new things out in the Storm community like Trident, which is a higher-level API that we've recently adopted; we're migrating all of our topologies to Trident. So first, our use case.
B
So when we take a look at it, like I was saying, there are thousands of feeds and those schemas can change over time. It's quite a bit of data, but to us the major motivating factor for selecting Cassandra was the variety of data. So what we do is take those thousands of feeds, dump them into Cassandra, run all sorts of analytics and things, and then produce a master file that we deliver to our clients, or provide web services integrations for them.
B
So if we have Cassandra in place, what didn't that cover? Well, you want to be able to search that data.
B
So here's the agenda again: it's just the use case, then I hand it over to Taylor, and then we come back for a little Trident if we have time. And here's the doctor data that we've got. The industry that we're in is called master data management, for all those feeds, and like I said, we deployed Cassandra to handle the variety of the data and a bit of the volume, but that didn't cover all of our requirements.
B
So now that we're here, we need to supplement Cassandra with a few other things. Specifically, we need to be able to search unstructured data, so fuzzy matching on addresses, for example, and geospatial kinds of queries. We also need real-time analytics and reporting: the data that was in Cassandra was great, but we needed dimensional counts and aggregate counts. How many doctors of this specialty are in this area that have sanctions against them, for example? And then also transactional processing, since we have a web front-end.
B
About two years ago we looked at the different search-capable engines that were out there, and I think everybody knows the two that are most prominently deployed are Solr and Elasticsearch. Solr is great; we actually chose that first, because about two years ago we didn't think Elasticsearch was mature enough. We've since changed that opinion, so Elasticsearch is one of our integration points now. It tends to scale like Cassandra does: it handles replication and distribution underneath the hood, rather than Solr, where you had to do it manually.
B
Then web services: we chose Dropwizard for that one. And then we also wanted to do reports, and for most of the reporting mechanisms — although DataStax has got some good integration with JasperSoft — most reporting people still go off of relational databases. So if we take this as our use case, let's walk through a couple of things. This is what we wanted to get to, but we're going to go through a couple of things that we did wrong before we got here, and the first was Hadoop.
B
It's not a knock on Hadoop — we love Hadoop — but its focus is a lot around batch processing, and when we looked at our problem, we wanted to be able to reflect changes as they happen, transactionally, and we couldn't do that well: just spinning up a Hadoop job took longer than we wanted the change to take to be reflected in the user experience, so that wasn't going to work. And then, in order to effectively kick off those Hadoop jobs, you also needed to track what changed in the system.
B
So, what we did wrong, part 2: we moved to an AOP triggers approach. We liked that it took all the burden off of the clients. We actually used an AOP extension that we open sourced, called cassandra-triggers, that would watch data as it changed in Cassandra and update some wide rows. That was great — that worked really well — but then we realized, as we integrated more and more systems and wanted more and more wide rows...
B
...that became a huge burden on the writes in Cassandra. In addition to that, guaranteeing the execution of those triggers — making sure you had guaranteed processing of those triggers — was extra overhead. So we turned towards complex event processing. Just to give a bit of background and throw a definition on the slide: complex event processing is a matter of treating the events in your system like streams, and then discovering different things from those streams of information.
B
So if we take our use case and frame it as a complex event processing problem, the events in our system are the CRUD operations, as they either happen in the system or are about to happen in the system. If you take that frame of mind, complex event processing can become an ETL tool and/or an analytics tool, and what that means is that you can take a complex event processing engine and apply it as a data processing pipeline.
B
So if we go back to the original picture — here's where we want to get, where our users are happy — and we take a complex event processing engine: you have the CRUD operation coming in, and you can take that piece of data and transform it before you write to your system of record — in our case our system of record is Cassandra — and then continue to pass that event down the line.
B
Down the line you could do dimensional counts — aggregate the different dimensions on the data that's coming through — you can enrich that data by touching other systems and pulling in extra metadata, and then write it to a fuzzy index. This turns out to be really powerful, and we really like what's going on: it pulled all of that complex data flow that we were embedding in different applications and clients out into something that was manageable and tangible that we could reason about.
B
So that's one of the powerful pieces of Storm: these topologies — and I'll let Taylor go into the details — the topologies actually articulate your data flow in the system. That's great and allowed us to do some really cool stuff. So I will at this point hand it over to Taylor, and he'll take you through the details. Okay, thanks.
C
Thank you, Brian. So, just to start out with a quick overview of Storm: Storm is a distributed real-time computation system, so it does complex event processing type work. It was open sourced by Twitter in September of 2011. That worked out well for us at Health Market Science, because about that time is when we started moving away from batch processing into a more transactional, real-time processing model.
C
Okay: Storm is fault tolerant, it's distributed among multiple nodes, and I'll get into more of the architecture a little later.
C
Storm supports a model of guaranteed processing, so that when you're processing data streams, if an event within your stream fails to process for some reason, it can be replayed, so that processing is guaranteed. And in terms of CEP, Storm operates on one or more streams of data, so you can add any number of inputs into your distributed computation.
C
The anatomy of a Storm cluster: the diagram you see is a typical development cluster that we use here at HMS. If you're familiar with Hadoop, then this layout will look fairly familiar. There's a master node — in Storm language that's called Nimbus — and then there are slave nodes, which are essentially your worker nodes. In our development clusters we also colocate Cassandra nodes on our slaves. There's also an additional node, and for our clusters...
C
...that's where we run ZooKeeper, and what ZooKeeper does is maintain the state of the cluster. Supervisors are Storm daemons that actually run the tasks that make up your data processing — I'll get into more of how that works in subsequent slides — but Nimbus's job is to assign tasks to the supervisor nodes, and there can be multiple workers and multiple tasks. The master also, through ZooKeeper, keeps track of the health and state of those daemons.
C
So if one daemon were to go down — say you lost a node — the master would take the workers and reshuffle them among the remaining slave nodes. Storm components: the source of data entering into your computation comes from spouts. Spouts are essentially stream sources — sources of data — and I'll expand on each one of these later. Bolts in a Storm topology are your units of computation, so they do operations on, or react to, the data...
C
...that's passing through the system. And then a topology is a combination of any number of spouts and any number of bolts, and it defines the overall computation, or computation network. To expand on spouts a little bit, as I mentioned earlier: spouts represent a stream of data. Examples of that could be queues — like a JMS queue, Kafka, Kestrel, et cetera — or something like the Twitter firehose, or sensor data.
C
I know The Weather Channel uses Storm to process weather information. So a Storm spout connects to some sort of source stream or queue and emits tuples, which represent the events in CEP language. Tuples are the primary data structure in Storm, and they're basically just a set of named key-value pairs.
C
Storm bolts are responsible for receiving tuples from spouts or other bolts, and they operate on, or react to, the data. Bolts are typically something you provide, or that is provided for you, and they can perform functions like filtering, joining, aggregation — that kind of thing. They can also do database writes and lookups, which is where Cassandra will come in a little later. And then bolts can optionally emit additional tuples.
C
So if you have a spout that's emitting tuples, the spout will send those to one or more bolts, and then the bolts can either just react to that data or they can emit additional tuples — I'll get into more of that later on. Storm topologies represent the data flow between spouts and bolts, and the routing of tuples between spouts and bolts. The routing is basically a simple subscription model: when you define your topology, you define groupings, and what groupings do is determine how tuples get routed between bolts and spouts within your topology.
C
But when we do stuff like filtering and aggregations, it's important that sometimes the same field content goes to the same bolt, so a fields grouping helps accomplish that. Basically, the way that works — those of you who are familiar with Cassandra will understand the concept of a distributed hash — is that Storm hashes the values of the fields that you specify, and that determines which bolt the tuple gets passed to.
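The hashing behavior behind a fields grouping can be sketched in a few lines of plain Python (this is an illustration of the mechanism, not Storm's actual implementation or API):

```python
# Illustrative sketch: a fields grouping routes a tuple to a task by hashing
# the values of the chosen fields, so tuples with the same field values
# always land on the same bolt instance.

def fields_grouping(tuple_, fields, num_tasks):
    """Pick a task index from a hash of the selected field values."""
    key = tuple(tuple_[f] for f in fields)
    return hash(key) % num_tasks

# Two tuples with the same "word" value route to the same task,
# regardless of their other fields.
t1 = {"word": "nathan", "count": 1}
t2 = {"word": "nathan", "count": 7}
assert fields_grouping(t1, ["word"], 4) == fields_grouping(t2, ["word"], 4)
```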
C
So, Storm and Cassandra: some of the use cases that we looked at were being able to write Storm tuple data to Cassandra — examples of that would be computation results or pre-computed indices — and also to read data from Cassandra and emit Storm tuples. That kind of thing would enable us to do dynamic lookups: you have a tuple come in and, based on the data in that tuple, do a lookup or a fetch against Cassandra, and then emit additional tuples based on those results.
C
So in storm-cassandra we have the two bolt types that I mentioned earlier. There's the basic Cassandra bolt, and what that does is take tuples passing through a Storm topology and persist the data to Cassandra. And then we have the Cassandra lookup bolt, which does the opposite: it pulls data out of Cassandra and emits tuples based on that data.
C
The storm-cassandra project, which is open source on GitHub, provides generic bolts for reading and writing Storm tuples to and from Cassandra. The way we did that was to come up with the concept of mappers: the storm-cassandra project is essentially a framework that defines generic bolts, and then you provide a tuple mapper or a columns mapper that is specific to your use case or your data model.
C
So the tuple mapper interface tells the Cassandra bolt how to write an arbitrary tuple. Given a Storm tuple, you map to a column family — basically you tell it which column family you want to write to — you map to a row key, which allows you to determine what your row key is based on the content of the tuple, and then you map to columns, which is mapping how the data in the Storm tuple gets stored in Cassandra columns. Then the columns mapper works in the opposite direction.
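The mapper idea can be sketched as follows. This is a hypothetical illustration in Python — the method names and the `write_tuple` helper are made up for the sketch, not the storm-cassandra Java interface — but the division of responsibility is the same: the mapper answers "which column family, which row key, which columns", and the generic bolt does the writing.

```python
# Hypothetical mapper sketch: given a tuple, decide the column family,
# the row key, and the columns to persist.

class WordCountMapper:
    def map_to_column_family(self, tup):
        return "word_counts"

    def map_to_row_key(self, tup):
        # Row key derived from the tuple's content.
        return tup["word"]

    def map_to_columns(self, tup):
        return {"count": str(tup["count"])}

def write_tuple(store, mapper, tup):
    """What a generic Cassandra bolt would do with the mapper's answers.
    `store` stands in for Cassandra: {column_family: {row_key: {col: val}}}."""
    cf = store.setdefault(mapper.map_to_column_family(tup), {})
    cf.setdefault(mapper.map_to_row_key(tup), {}).update(mapper.map_to_columns(tup))

store = {}
write_tuple(store, WordCountMapper(), {"word": "nathan", "count": 56000})
# store == {"word_counts": {"nathan": {"count": "56000"}}}
```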
C
The current state of the storm-cassandra project: right now we're working hard on version 0.4.0, which is a work in progress and is currently using the Astyanax client. We have a couple of out-of-the-box mapper implementations: a basic key-value column mapper — basically for a hashmap-type data model — and one for valueless columns, which is a common data modeling pattern in Cassandra, and I'll demo that.
C
So the first demo I'm going to do is a word count, and this is sort of one of the canonical examples used for both Storm and Hadoop. The idea is that you have a spout that's emitting random words, and from there it goes through a fields grouping to a count bolt. The count bolt keeps a count of each word and how many times that word has been seen. The importance of using the fields grouping there is: if the count bolt is parallelized across multiple nodes in the cluster...
C
...you want to make sure that the same word goes to the same bolt; otherwise your counts will get out of whack. Then the count bolt emits the counts of each word in real time as they are incremented, and shuffles them to the Cassandra bolt, which is responsible for persisting the count for each word. So now let's get into a demo of that.
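The flow just described — spout emits words, count bolt keeps running totals, Cassandra bolt persists the latest count — can be simulated in-process with plain Python (a sketch of the data flow only, not Storm code):

```python
# Minimal in-process simulation of the word-count topology: a "spout" emits
# words, a "count bolt" keeps running totals, and a "Cassandra bolt" stand-in
# persists the latest count per word.

from collections import Counter

def run_wordcount(words):
    counts = Counter()   # state held by the count bolt
    persisted = {}       # stand-in for the Cassandra column family
    for word in words:                   # each word is one tuple from the spout
        counts[word] += 1                # count bolt updates its total...
        persisted[word] = counts[word]   # ...and emits to the persistence bolt
    return persisted

result = run_wordcount(["nathan", "mike", "nathan", "nathan"])
# result == {"nathan": 3, "mike": 1}
```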
C
Okay, now the topology is running, so it's actively spitting out random words — actually, I think in this demo they're names. If I list that column family again, we'll see the words and a corresponding count. So for the word "Nathan", you can see that as of that query the word had been emitted 56 thousand times. Then there's "Mike Jackson" and a couple of other names, and if I run it again, we'll see that the count incremented much more; now we're up to 172,000 instances of that word.
C
So in our topology — as I showed you before I pulled it up — we have a word spout which sends words to a count bolt; the count bolt keeps track of those, and then, when it updates a count, it sends a message to the Cassandra bolt to persist that count. Basically, in this line we're setting up the Cassandra bolt and using a default tuple...
C
...mapper that's using strings for persistence — we do support different types of serialization, but for these demos we're mostly just sticking to strings. And as I mentioned before, Storm supports guaranteed processing, and part of that is: when a tuple gets emitted from a spout, if you've defined your topology to use guaranteed processing, then the whole tuple tree must be acked. The Cassandra bolt supports different ack strategies; in this case I've used ack-on-write, and what that means is...
C
...as soon as the data gets written to Cassandra successfully, the tuple will be acknowledged, and once it's acknowledged, the spout won't be triggered to re-emit that tuple. The next lines down here are where we build our topology: we're creating a new topology builder, we're setting the spout — the word spout — and this last number over here is the parallelism.
C
The next demo is for distributed RPC calls in Storm. There are cases where you don't have open-ended streams of data like the one I just showed. DRPC in Storm allows you to create a request/response type interaction and essentially distribute that processing out to your cluster. The way that works in Storm is: you have a DRPC client that passes arguments to a DRPC server, and the server...
C
...sends out tuples via a DRPC spout — a special kind of spout — and from there it goes into the topology that you define. It keeps track of that by request ID, which is ultimately how the result will get returned to your client. So the spout emits into your topology, the result from your topology goes back to the DRPC bolt, the bolt notifies the DRPC server when the computation is completed, and then the result gets returned to the DRPC client.
C
For example: if I have three followers and I tweet a URL, that URL's reach would be three. But then let's say someone else tweets the same URL: depending on the number of their followers, the reach would be the distinct union of my followers and that other user's followers. To do that at Twitter, where there are massive numbers of users, would take a lot of database lookups and a lot of performance if you're hitting just a single database.
C
So essentially our input to that is a URL, and that goes to the tweeters bolt, which takes a URL and does a lookup of how many users tweeted that URL. All of the records of that tweet then get shuffle-routed to another bolt — the followers bolt — and that takes the user ID of each follower and emits it out as a tuple. From there it goes to a partial uniquer, which uniques those, and finally it sends the count back.
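The reach computation those bolts perform can be sketched in plain Python (the lookup tables and sample data here are made up for illustration; in the real topology these lookups are distributed across bolts):

```python
# Sketch of the reach computation: look up who tweeted the URL, expand each
# tweeter to their followers, unique the union, and return its size.

tweeters_of = {"http://example.com": ["alice", "bob"]}          # hypothetical data
followers_of = {"alice": ["carol", "dan"], "bob": ["dan", "erin"]}

def reach(url):
    seen = set()
    for tweeter in tweeters_of.get(url, []):          # tweeters bolt
        for follower in followers_of.get(tweeter, []):  # followers bolt
            seen.add(follower)                        # partial uniquer
    return len(seen)                                  # final count

# "dan" follows both tweeters but is counted once, so reach is 3, not 4.
assert reach("http://example.com") == 3
```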
B
Alright, so that was some pretty cool Storm detail, and both of us are out there on GitHub, so if anybody has any questions — all of this is out there, including the examples. If you clone storm-cassandra, you'll see the examples in there that have all this stuff. If you have any questions, just hit us up and we can answer those. So next, partner, we're going to talk a bit about Trident.
B
So there were a bunch of patterns, I think, that were cropping up when people started using Storm — underneath the hood we said we had both a batching and a non-batching version of the Cassandra bolt. Things are easy when you're only taking one event into account, happily processing it and moving along to the next; when you start batching stuff and you start considering fault tolerance, things get a little complicated. So Nathan — Nathan Marz at Twitter — saw all those patterns and decided, hey...
B
...we can all benefit from creating a higher-level abstraction here and hiding some of the under-the-hood stuff that's required to implement exactly-once semantics and transactional integrity. So Trident, in a nutshell: he took state management — which, like I said, gets complicated when you consider fault tolerance and batching — and provided an API on top of it. He created additional primitives: where the primitives in Storm are the bolts, the spouts, and the topologies, all operating on tuples, in Trident...
B
...you have functions and state objects that are operating on Trident tuples. We were early adopters of transactional topologies, which have since been deprecated, because I think what Nathan has shown is that all the power we had in transactional topologies — if anybody's been using them — is available in Trident, but from a much higher level of abstraction. So all the classes that were in storm-cassandra that dealt with transactional topologies are now gone, in favor of Trident versions of those same classes.
B
Trident is not something different from Storm — it sits on top of Storm. When you deploy a Trident topology, it actually compiles down into a Storm topology and runs. Trident has a couple of different primitives, and one of those is the operations that you can perform. You can think of what Trident gives you as this: it takes a stream of tuples and partitions that stream into a set of batches, and then it provides operations on the tuples within those batches. I've just called out a couple of the kinds of operations that you can perform.
B
You can see how it's a higher-level abstraction. One of the use cases of a bolt that Taylor called out is filtering. So instead of subclassing from a bolt and implementing my filtering as a concrete class, I can just use a function that implements a filter: I implement a quick function that says "should I keep this tuple?", and my responsibilities in the filter are to either emit the tuple or drop it. Otherwise, there's the generic function...
B
...where I take in the tuple and I emit it again with additional fields in it, or emit additional tuples. And then there are aggregation functions: I can do a combine, which is pairwise combining of data, or I can do a reduce, which is an iterative accumulation — you basically keep a count, and then you're passed tuples and adjust that count. Or, more generically, I can implement the aggregator interface, which is sort of "bring your own aggregation" to the table.
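The three kinds of operations just described — filter, function, and iterative reduce — can be illustrated over a single batch of tuples in plain Python (Trident's real API is Java; this is only an analogue of the semantics, with made-up sample data):

```python
# Rough analogue of Trident operations over one batch of tuples.

from functools import reduce

batch = [{"word": "the"}, {"word": "storm"}, {"word": "a"}, {"word": "trident"}]

# Filter: "should I keep this tuple?"
kept = [t for t in batch if len(t["word"]) > 3]

# Function: emit the tuple again with an additional field.
enriched = [{**t, "length": len(t["word"])} for t in kept]

# Reducer: iterative accumulation across the batch,
# e.g. total characters of the kept words.
total = reduce(lambda acc, t: acc + t["length"], enriched, 0)
# kept has "storm" and "trident", so total == 12
```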
B
So this is a sample topology in Trident. It looks very similar to a Storm topology, but you can see there are functions here, like count and split, that I just plug in, and then I basically say what my input is and what my output is, right in the topology. So very, very similar — and really the benefit of Trident is this: on the Trident state side, Nathan figured out a couple of cool things about how to write state so that you can handle losses of nodes and batches, and the way that he did it...
B
We should give a shout-out to the community member on GitHub who's got an implementation of the Trident persistent map for Cassandra. We've been using it internally, and that's been working out well — good performance and flexibility — so feel free to reach out to him to coordinate and collaborate.
B
So with respect to Trident state and how it applies to transactional integrity: there are two different kinds of spouts that you can have in Trident. One is called the transactional spout, and the guarantee there — your obligation as a spout implementer for a transactional Trident spout — is that the batch contents never change. We use Kafka here as our main spout for all of our topologies.
B
So there's a transactional Trident Kafka spout that is guaranteed to emit the same batch — the same tuples in the same batch — each time. For reasons I won't go into now, that's not so great if you lose a node: if you have multiple partitions emitting, you could lose a partition, and then you wouldn't be able to have the same tuples from that partition in the same batch. But save that for another time. Alternatively, you can have an opaque spout, where the batch contents can change.
B
In that scenario you have to be very careful with your state, and there are additional obligations on the state object if you want to maintain transactional integrity, so I'll go through that now. These are the contracts established by Trident for spouts: in one case the batch contents can never change, and in the other they can change, with implications for the state. So, matching the spouts, there is transactional and opaque state management. In transactional state, the transaction ID is stored with the value in your...
B
...state object when you persist it to the database. Each batch has a unique transaction ID, and like I said, that trick works because every time I update the state, I write my current transaction ID along with the value, and I skip the update if that transaction ID has already been written. If the batch contents never change, that's great — that works. It doesn't work, however, if the batch contents can change.
B
So in that case, your obligation when you create a state object is to write the previous value, the last transaction ID, and then the current value. That means that if a batch gets replayed, I can replace the value that's in the database with what I've calculated from the new batch, starting from the previous value. So with that, you can get exactly-once semantics from Storm with combinations of the spouts and states.
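The opaque-state bookkeeping just described can be sketched as follows. This is an illustrative model in Python, not Trident's API: the store keeps a (previous value, last transaction ID, current value) triple, and a replayed batch — even one whose contents changed — is rebased on the previous value, which is what preserves exactly-once totals.

```python
# Sketch of opaque state: store (prev_value, txid, value) per key.

def opaque_update(state, key, txid, batch_sum):
    prev, last_txid, curr = state.get(key, (0, None, 0))
    if txid == last_txid:
        # Replayed batch: rebase on the value from before this transaction,
        # discarding whatever the failed attempt had written.
        state[key] = (prev, txid, prev + batch_sum)
    else:
        # New transaction: the current value becomes the new "previous".
        state[key] = (curr, txid, curr + batch_sum)

state = {}
opaque_update(state, "count", txid=1, batch_sum=10)  # (0, 1, 10)
opaque_update(state, "count", txid=2, batch_sum=5)   # (10, 2, 15)
# Batch 2 is replayed with different contents; the total stays correct.
opaque_update(state, "count", txid=2, batch_sum=7)   # (10, 2, 17)
assert state["count"] == (10, 2, 17)
```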
B
We should probably include a few more slides, since it can get a little trickier, and give a couple of demos of that — maybe we can come back and do another session once we've ironed it all out. Okay, so in order to give some time for questions, here's the short shout-outs slide. Here at HMS — I didn't go through the full architecture topology that we have in place — we have a couple of bolts that are out there now: the storm-cassandra bolt, which we've been talking a lot about...
B
...and we also have storm-elasticsearch. Combining those two, you can have a data processing pipeline that writes to Cassandra and also enables fuzzy search on the other end, if you tack the Elasticsearch bolt onto the end of your topology. And then internally — we haven't released it to the public yet — we also have a storm-jdbc bolt, which allows you to tie in a relational database. So if you go back to that picture from the beginning slides, you can tie in your favorite relational database and enrich the data...
B
...that's coming through your pipeline with any kind of metadata that might be stored in a relational database, and/or you can write to the relational database that supports your reporting tools. And then Taylor's got two other ones out there: storm-jms and storm-signals. Taylor, on storm-signals...
C
Sure — storm-signals basically allows you to communicate with your spouts and bolts out of band. So, for example, let's say you have a topology running and you want to send it some sort of signal to change its behavior, or something like that; storm-signals just gives you a very simple way to do that. And everything that we demoed today is all part of the storm-cassandra project; you should be able to get up and running with those examples in about five minutes. We're around if...
B
...you can't — just let us know. Yeah, and just a couple of closing remarks. What we found — and I've always believed this — is that Cassandra provides great primitives that allow you to do distributed storage. The data model that's provided by Cassandra, and the way that it automatically partitions the key space across hosts, is perfect, absolutely perfect, and I...
B
...think Storm is a natural fit to that, because what it provides is primitives for distributed processing that you can do a lot with and combine in all kinds of creative ways. So those two together create a pretty good basis on which you can build.
A
So, you know, this is one in a series of webinars; we always archive them. We want to help the community by putting out some great content, and Taylor and Brian have put out some great content today. Next up is Christos Kalantzis from Netflix, who is going to talk about his transition from being a DBA of relational databases to now leading the charge of database engineering by moving to Cassandra — he'll address other NoSQL databases as well. And then make sure you mark your calendars for Valentine's Day.
A
Well, the V is not for Valentine's — it's for vnodes. That should be a very hot topic; that's by Patrick McFadin. And then on February 28th, Aaron Morton will take a look at an introduction to Cassandra, and we'll be focusing on the recently available 1.2. Some resources: three weeks ago we released Planet Cassandra, which is the community site. The community is doing a good job centralizing stuff fast, so that's sort of your one-stop shop for everything Cassandra related.
A
This webinar will be available there. If you're interested in more deep-dive training around Cassandra, there's the training page. And then Brian and Taylor are going to be talking at this one — NYC* on March 20th. If you're on the East Coast, definitely try and make it down for this technical day around Cassandra.
B
You could if you wanted to — we just preferred the abstraction that Storm provides. But I should say that there's been a lot of activity — you can ask on the discussion list — around other DSLs, domain-specific languages, that would allow you to articulate processing flows and would compile down to, or eventually become, Storm topologies. There's been a lot of discussion about making an Esper-like language that would compile down into Storm.
B
We've also seen some people working on Drools, so that would leverage Storm on the back end. I can imagine what that is: bundling the Drools engine so that it deploys inside bolts and can read the rules language that Drools has. So that's kind of a hard question to answer, but I think a lot of people are looking at Storm — it certainly has the most momentum.
B
I forget — it's got something like 6,000 followers and 500 forks on GitHub. So it certainly has the most momentum. And I think everybody knows Nathan's got a Cascading background — Cascading is another one, but Cascading compiles down to Hadoop. So he took the best of Cascading and made it real time. Okay.
A
Great, that's good to know. And yeah, just to reiterate, we're seeing this combination of Storm and Cassandra more and more out in the Cassandra community. Going back to your — I think it references that sort of architectural slide up front — in a multi-datacenter scenario, will ZooKeeper only look after the local slaves, or will it be aware of all the slaves in all the data centers?
C
That depends on the setup — and actually, ZooKeeper doesn't look after the nodes per se. ZooKeeper essentially just houses the state of the system, and it's Nimbus that communicates with the nodes through ZooKeeper. How that's set up is basically however you want to configure it, so Nimbus finds out about additional nodes through ZooKeeper.
A
Yeah, thank you very much. This one: can you talk about this data pipeline's use in production — your volumes, hardware, and the unique value Cassandra provides versus other NoSQL options? So, I'm assuming you can integrate Storm with other databases — can you focus on the unique value of Cassandra? Sure.
B
Yeah, so — and this goes for Elasticsearch too — for our architecture, what we want out of it is linear scalability. We've got a legacy system here with upwards of probably a hundred — we've got hundreds of machines here. So linear scalability is really important to us, and Cassandra nailed that, absolutely nailed it. So for every other piece of our architecture...
B
...we want it to scale linearly like Cassandra does — and I think Storm does that, and Elasticsearch does that. When you look at Cassandra versus others: you're going to start to get an impedance mismatch if your processing framework scales linearly but your database under the hood can't support that.
A
Okay, great, thank you very much. We're right at the top of the hour. I really appreciate you taking the time today, guys — a great topic — and this has been a great session to educate the community. Stay tuned, everybody, for more upcoming webinars, and we will see you next time. Thank you very much, David.