Description
Speakers: Brian O'Neill and Taylor Goetz
A successful Big Data platform combines distributed processing and polyglot persistence into a single cohesive infrastructure. Over the past few years, Health Market Science has transitioned from traditional relational databases and enterprise systems to a massively scalable Big Data platform that combines Cassandra and Storm to ingest thousands of feeds of data from the health market industry to produce a single high-quality masterfile. Hear how we applied event processing and NoSQL to deliver real-time analytics, while accommodating structural change over time, and fuzzy/geospatial search.
Brian O'Neill: Welcome to "The Big Data Quadfecta." I'm Brian O'Neill; I work for Health Market Science, where I'm the lead architect, and this is Taylor Goetz, the development lead. We're going to walk through our use case, why we chose the technologies we did and how we're applying them, and then we're actually going to do a deep dive on how we integrated Storm and Cassandra. So hopefully everybody will enjoy it.
Okay, so "quadfecta": I actually had to look that up. What had happened is (we lose the video there) I'd done a blog post on a "trifecta," where we put Kafka, Elasticsearch, and Cassandra together. That seemed to be working well, and then we added Storm to that. So I didn't know what to call the new thing, and I said "quadfecta." Then, when I looked it up, it turned out to be an Urban Dictionary term.
Here we go, okay, so there's the beer pong; we'll pick up from there. Alright. Continuing on that theme: sometimes I think big data these days is sort of in vogue, and there's a lot of hype around it, and it took us a lot of time to figure out what big data actually was. We figured, and we had heard, that if we get big data, it'll solve all of our problems.
It'll make coffee for us in the morning and everything. But if you do some googling and try to look for a definition, you'll eventually stumble upon the three V's (sometimes there's four, but we focus on the three): the volume of data, the variety of data, and the velocity. All three of those are in play in our solution to service our customer demands. So now I'm going to go through our use case.
We work for Health Market Science, and what we do is take thousands of these data feeds and put them together into a master file. That's called master data management. In our master file we have every practitioner in the United States and every healthcare organization, and then we affiliate all those different entities together over time. That's important to a lot of different industries, one of which is pharmacies.
They figure out whether that prescriber was actually eligible to do that, and the goal there is to eliminate a lot of the fraud, waste, and abuse in the system. Plus, in general, our customers use our data to gain a lot of insights into the space out there. We do analytics on top of the data to figure out influence networks: if a physician is a critical person whose opinion counts, he may influence other physicians.
So here are the details of our numbers. On the physician and practice side, we have about six million unique practitioners in our database. That's not too bad as a number, but think about the 2,000 sources that we bring in that comprise the data: we maintain those two thousand sources over all time. For compliance reasons, it's important for a pharmacy to be able to say: "I allowed this prescription six months ago."
"Was that prescriber eligible six months ago to prescribe that?" So we're not just keeping the current time but any point in time, so that you can go and search this and find out what the characteristics of an entity were, and what data contributed to that entity, at that point in time. On the claims side, we also have, I think, the largest claims database out there. Anytime you submitted a claim for insurance, it's probably in our database, de-identified. So that's a lot of data.
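The "any point in time" idea above can be sketched as an append-only log of timestamped facts per entity. This is a minimal illustration of the concept, not HMS's actual data model; all names (`PointInTimeStore`, `add_fact`, `as_of`) are made up for the sketch.

```python
class PointInTimeStore:
    """Append-only store of timestamped assertions per entity."""
    def __init__(self):
        self.facts = {}  # entity_id -> list of (timestamp, field, value)

    def add_fact(self, entity_id, timestamp, field, value):
        # Never update in place; just record another assertion of fact.
        self.facts.setdefault(entity_id, []).append((timestamp, field, value))
        self.facts[entity_id].sort()

    def as_of(self, entity_id, timestamp, field):
        # Latest value of `field` asserted at or before `timestamp`.
        latest = None
        for ts, f, v in self.facts.get(entity_id, []):
            if f == field and ts <= timestamp:
                latest = v
        return latest

store = PointInTimeStore()
store.add_fact("dr-42", 20130101, "eligible", True)
store.add_fact("dr-42", 20130601, "eligible", False)
# Was the prescriber eligible back in March, before the June change?
print(store.as_of("dr-42", 20130301, "eligible"))  # True
print(store.as_of("dr-42", 20130701, "eligible"))  # False
```

Because nothing is ever overwritten, the same store answers both "what is true now" and "what was true six months ago."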
On the claims side, we have a billion claims annually, and we maintain five years of that, so that's five billion. The numbers get big; I think it's nine or ten terabytes on the claims data side, and then our physician data is growing, so I forget what the numbers are there.
Some of the challenges we deal with: some of the data has schemas associated with it, and some is unstructured. We actually harvest data off the internet, so some of those two thousand feeds come off the internet. In addition to that, those schemas change over time. You might get an input from a government agency that's maintaining state license information for practitioners, and one month it comes in with six columns; the next month it comes in with ten columns. Our database actually has to deal with that.
So, if you take a look, what I'm going to be talking about is the platform that we've built, and here's where I think, when you put this technology stack together, you get a lot of powerful capabilities. When you throw Storm, Elasticsearch, Cassandra, and Kafka all together, you get a lot of these great capabilities, which allows you to deliver multiple products. If you take our business needs at the top, we actually deliver all these different solutions
with a single platform that can ingest the data, put it into a standard format, make it searchable like Google, and then answer all sorts of different questions. One of the things that we do as well is expense management. A lot of the large pharmaceutical companies have to report on how much they spend on each doctor, so that the government can verify that they're not bribing doctors. They'll hand us all of their expenses, and it just goes right into the engine.
We run it like anything else, match it up against our master file, and then output answers for them. I think when you're architecting a system like that, the flexibility in the system is critical. We could have built one-off apps for each one of these business needs, but by building it as a platform, we can basically take these four components, stitch them together, and now we have a new product, a new offering. That's really powerful.
Okay, so just on the data center: I want to put this one out there because this is more of a crowdsourcing question. We have about three-quarters of a petabyte of raw storage, and we are entirely virtualized; we use VMware, and we have a SAN. All three of those characteristics were there before we chose Cassandra, so we ended up rolling out Cassandra onto this infrastructure, and we're looking at it now and asking: should we have gone physical? Should we be going physical?
Okay. So if you look at the platform that we're about to describe, on how we stitched everything together, this is what it is. Is everybody familiar with master data management? Has anybody heard the term MDM? Hands? Okay, not a few. Alright, so this is our version of MDM. On the left we have data sources coming in. As I mentioned, some come in from the web; we go out and harvest information, so we have something called the harvester that brings it in. We also get data coming in from government sources.
Sometimes this is actually a CD coming in the mail; we have to pull it in, load it, and grab a flat file. And then we also have customer information, like the expenses I mentioned. When that comes into our system, it could be in any format, and it could change over time: we could put that CD in the drive and suddenly they're using an Excel sheet.
The government organization that maintains these oncologists just decided to use Excel instead of the text file they were using, so it has to be really flexible. What we do is process that in the first step here: we standardize and validate. We apply a layout to that data to add a bit of structure, but really, at this layer, all we want to do is tag fields. The data comes in in whatever format they want, and then we tag which columns are what. We don't want to apply too much structure; we don't want to make it so that we can't flex to accommodate whatever kind of data is coming in. So at that point we're just standardizing and validating, and then we push it into NoSQL, which is Cassandra. After that, we have to take these two thousand streams of data and find out which ones are talking about the same thing, the same practitioner. One might have an address;
another one might have state license information, so we need to match it up and then consolidate it. If a hundred streams all have data about the same doctor, they might have conflicting addresses. So we take that data in and run it through pluggable rules, some logic that our data scientists figure out, to decide which of those addresses we should choose and associate. Then we rank the addresses and put them on what we call the profile.
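The match-and-consolidate step described above can be sketched as: group fragments on a match key, then pick the winning value with a pluggable ranking rule. This is a toy illustration under invented assumptions (the match key, field names, and the year-based rule are all made up; the real rules are the data scientists' logic):

```python
def consolidate(fragments, rank_address):
    """Group fragment records by a match key and pick the best address
    using a pluggable ranking rule."""
    profiles = {}
    for frag in fragments:
        key = (frag["last_name"].lower(), frag["npi"])  # simplistic match key
        profiles.setdefault(key, []).append(frag)
    result = {}
    for key, frags in profiles.items():
        addresses = [f["address"] for f in frags if "address" in f]
        result[key] = {
            "sources": [f["source"] for f in frags],
            "address": max(addresses, key=rank_address) if addresses else None,
        }
    return result

# Example rule: prefer the address seen most recently (higher year wins).
fragments = [
    {"source": "state-license", "last_name": "Smith", "npi": "123",
     "address": {"line": "1 Main St", "year": 2011}},
    {"source": "claims", "last_name": "Smith", "npi": "123",
     "address": {"line": "9 Oak Ave", "year": 2013}},
]
profile = consolidate(fragments, rank_address=lambda a: a["year"])
print(profile[("smith", "123")]["address"]["line"])  # 9 Oak Ave
```

Swapping the `rank_address` callable is the "pluggable rules" part: the pipeline stays fixed while the selection logic changes.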
Once we've done that, we can push it into our index and our relational storage. We've added quite a bit of structure at this point: we've consolidated all these data sources, we've picked what data we should use, and we've pushed it along. Now we can move on to analytics. Once we have that structure, we can run the analytics on top of the data, and we can present that data through dashboards and reports.
We then allow our customers to interact with that through both the user interface and web services. So that is master data management to us. And here's the crux of the problem, which is what drove us to choose Cassandra; I mentioned this a little bit. We have multiple streams: picture a couple thousand of these arrows, all data streams coming in, and the F's represent an entity in what I call fragments.
So we get fragments of the entity that are split across streams; they come in at different times, and then we have to associate them together through matching. Meanwhile, the schemas underneath us might change. So that was the use case, quickly; hopefully everybody captured that, and we're going to move on to the design. We had a relational database at the base of our technology stack: you probably saw in the other slide, we have Oracle RAC, a massive Oracle system. What it meant for us to make this shift to big data was to swap out the system of record, which was Oracle and relational, and instead plug in Cassandra. You saw from the design slide that the flow is different now: Cassandra is the first place we write to, and that is the single source of truth. Everything else is then derived from the data that's in Cassandra. The relational database to us is really just another means of presenting; exactly what it is, is one point in time.
As I mentioned, we store all points in time of all the data, and that's in Cassandra. We then materialize a single point in time into Oracle so that people can view it. So this is an important slide. One of the things that drove us to use Cassandra, in addition to the flexibility and the scalability, was that I think Cassandra nailed the primitives that you need. I was asked earlier today what the hardest thing was about getting to use Cassandra coming from a relational world. I come from the relational world: 180 relational databases, massive modeling, and lots of tables to deal with. Then I had to come to Cassandra, and you're tempted to apply the same kind of modeling techniques that you use in the relational world. When you come over to Cassandra, it actually isn't that big a shift if you just don't apply those relational concepts; just ignore them, and focus on this thing in the middle, which is a map, a map of maps basically, when you take a look at a column family.
The row key is your first key into it, and that goes to another map, which is the columns themselves: a key, which could be a composite, that goes to a value. If you just think of that as your data structure and use it to the best of your ability, that thing can distribute across lots of different nodes. That's a really, really powerful concept.
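The "map of maps" view of a column family can be written down directly. This is a toy model of the shape, not Cassandra's storage engine: row key into the outer map, column name (possibly a composite) into the inner map, resolving to a value.

```python
from collections import defaultdict

# A column family as a map of maps: row key -> map of column name -> value.
column_family = defaultdict(dict)

# Row key is the first key in; a composite column name is the second.
column_family["dr-42"][("address", 0)] = "9 Oak Ave"
column_family["dr-42"][("address", 1)] = "1 Main St"
column_family["dr-42"][("license", "PA")] = "MD-7781"

# A wide-row "slice": all columns whose composite name starts with "address".
addresses = {k: v for k, v in sorted(column_family["dr-42"].items())
             if k[0] == "address"}
print(addresses)  # {('address', 0): '9 Oak Ave', ('address', 1): '1 Main St'}
```

Rows distribute across nodes by the outer key; columns within a row stay sorted by the inner key, which is what makes slice queries over wide rows cheap.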
Once we had nailed the storage primitive, which I thought Cassandra did really well, what other primitives did we need to add to the system? Well, the one we needed was queueing. As we have all these streams coming in, some of them are dropping millions of records on our door in one day, and some of them are trickling in. We needed a queueing primitive: the push and the pop of a value.
In addition to that, we also needed to be able to process these changes in the data and do something with them: that's the emit and process of a tuple. I think Storm nailed those primitives. And then, finally, since Cassandra is mostly opaque and has no idea what kind of bytes you're putting into the system, it cannot apply any kind of smarts to that data.
So at some point you need to apply some smarts so that you can query this data quickly, and that's the index: indexing fields by type. We do stuff like this: our customers want to search for Dr. Bob Smith; meanwhile, in our system, it's Robert Smith. So at what point did we know that Bob was really Robert?
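One way to make "Bob" find "Robert" is to key the index on a canonical form of the name. A minimal sketch, assuming a nickname lookup table (the two entries here are made-up samples, not the actual normalization logic):

```python
# Normalize first names through a nickname table, then index records
# by the canonical (first, last) pair.
NICKNAMES = {"bob": "robert", "bill": "william"}

def canonical(name):
    first, last = name.lower().split()
    return (NICKNAMES.get(first, first), last)

index = {}

def index_record(record_id, name):
    index.setdefault(canonical(name), set()).add(record_id)

index_record("rec-1", "Robert Smith")
# A search for "Bob Smith" normalizes to the same key as "Robert Smith".
print(index[canonical("Bob Smith")])  # {'rec-1'}
```

The point is that the normalization happens at index-build time and again at query time, so both sides meet at the same key.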
Okay, so along the way, if you look at the themes building across some earlier presentations today: idempotent operations. How many times have you heard that term today? Thirty? Fifty? It's really important, and it's as simple as this: I shipped an operation to one node, I lost communication with that node, and I have no idea whether it did it or not.
So I need to put it over here and run it again, but my system, my architecture, my design had better be able to handle that case. In addition to that, Nathan Marz of Storm gave a great little presentation at Strata on immutable data, and we adopted that model, where your database is really just an assertion of fact at a time. Don't go change anything; don't do updates.
Don't do deletes. It's just assertions of facts at a point in time, and that makes idempotent operations a lot simpler. The anti-patterns to that are transactions and locking: just don't do it. One more thing, and I think there was an earlier presentation on Storm today, but this is important.
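The combination of "assertions of fact" and idempotence can be sketched like this: every write carries a unique assertion ID, so a replay after a lost acknowledgment rewrites the same fact instead of double-applying it. The class and method names are invented for the illustration.

```python
class FactLog:
    """Append-only store of facts; writes are idempotent because each
    assertion carries a unique id, so a replay overwrites itself."""
    def __init__(self):
        self.facts = {}

    def assert_fact(self, assertion_id, entity, field, value, ts):
        # No updates, no deletes: the same id always writes the same
        # fact, so replaying after a lost ack cannot double-apply.
        self.facts[assertion_id] = (entity, field, value, ts)

    def history(self, entity, field):
        return sorted(
            (ts, v) for (e, f, v, ts) in self.facts.values()
            if e == entity and f == field)

log = FactLog()
log.assert_fact("a1", "dr-42", "address", "1 Main St", 1)
log.assert_fact("a1", "dr-42", "address", "1 Main St", 1)  # replay: harmless
log.assert_fact("a2", "dr-42", "address", "9 Oak Ave", 2)
print(log.history("dr-42", "address"))  # [(1, '1 Main St'), (2, '9 Oak Ave')]
```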
You just told me that I have to be entirely idempotent, and there are certain things, certain anti-patterns, that are hard to achieve with that kind of design. One of those is counting. As things are happening, I'm trying to count, and if I replay something, I'm going to count it twice. That's bad.
So Nathan came up with an elegant hack, I'll call it, to tolerate replays but also allow counting and state transformation. Somebody in the earlier presentation mentioned Trident as well; if you take a look at Storm you'll get more into this, so I'll go quickly. What he did is this: the stream of data that's coming in (and this doesn't just apply to Storm; you can implement this design in anything), you can batch it up.
You can parallelize all the processing of that data, but then, when you go to update your state, you sequence it. So here, if we have three batches of data that we want to process, and let's say we just want to compute a sum for each one of those batches: the first batch has a total of four of something I'm counting. Batch number three has 13 of those things in it, and batch two has six. If I look at the state transformation as I go, provided I've sequenced those batches: when I process batch one, I'll have a total of four. When I process batch three, I'm going to wait until I've processed batch two; right here, you've brought everything back together. So that one waits, batch two comes in and adds its six to get to ten, and then batch three can apply its 13.
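The sequencing trick above (process batches in parallel, but apply state updates in batch order and skip already-committed batches) can be sketched in a few lines. This is a toy illustration of the Trident transactional-state idea, not Trident's actual API:

```python
class SequencedCounter:
    """Batches may arrive out of order or be replayed, but commits are
    applied strictly in txid order, so replays are never double-counted."""
    def __init__(self):
        self.total = 0
        self.last_txid = 0
        self.pending = {}  # txid -> partial sum waiting for its turn

    def commit(self, txid, partial_sum):
        if txid <= self.last_txid:
            return  # replay of an already-committed batch: ignore
        self.pending[txid] = partial_sum
        # Batch 3 waits here until batch 2 has been applied.
        while self.last_txid + 1 in self.pending:
            self.last_txid += 1
            self.total += self.pending.pop(self.last_txid)

counter = SequencedCounter()
counter.commit(1, 4)    # total: 4
counter.commit(3, 13)   # arrives early: held until txid 2 lands
counter.commit(2, 6)    # total: 4 + 6 = 10, then batch 3 applies: 23
counter.commit(2, 6)    # replay: no effect
print(counter.total)    # 23
```

The partial sums for each batch can be computed in parallel anywhere in the cluster; only this final, cheap commit step is serialized.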
Okay, here's what we did wrong; we got two of these slides. And this is not to knock Hadoop, I know there's a lot of Hadoop fans here, but as I said, we switched to using Cassandra as our system of record. We load all the raw data into Cassandra; we do not load it into HDFS. So we were running our Hadoop jobs against Cassandra, and we found that to be incredibly inefficient.
I don't know if anybody here is doing that; we've been trying to talk to as many people as possible to find out if anybody's doing that successfully with Cassandra as input. I know a lot of people are writing to Cassandra, using it as the output of a Hadoop job, but as input it wasn't working for us. The kinds of times that we were expecting to see just weren't there, so I'd be interested if anybody wants to talk after, or just yell at me right now. All right.
In addition to that, we also needed to track what changed, in order to know what went into that next Hadoop job. Tracking what changed added a layer of complexity that we didn't want, and we'll see how Storm fixed that for us. Okay, and this is what we did wrong, sort of; I'll put a "sort of" on this one.
One of the reasons that we chose Cassandra in the first place is the cleanliness of the code. If you guys have been in there, you know Jonathan does an amazing job; that is a great codebase. There's some legacy stuff, like super columns are still in there and things like that, but it's a really great code base. We were looking at HBase versus Cassandra, and we went down into the code and asked: what could we do with this?
Can we extend the functionality here? And the answer is yes, absolutely; it is very clean. When you saw Edward's talk earlier, if you were up there for IntraVert, you're able to just go in there and build on top of it with just a few hours of getting oriented, which is really neat. So maybe that's a bad thing, when you want to go in there and build your own trigger capability.
We weren't willing to wait, and we wanted this trigger capability, because what we were seeing was lots of different clients accessing Cassandra directly. We knew we needed to maintain some wide rows, some indexes, so that people could get at the data through other dimensions, and we did not want every client accessing Cassandra to have to know about those different dimensions. We hadn't yet gotten to the point where, as Eric was saying, everything went through a nice common service-oriented-architecture API. So we implemented our own trigger functionality. It's out there: you can go to GitHub, go to cassandra-triggers. I do not recommend doing this, but you could. It worked well initially, and to be honest, it's actually still in production right now; it's been there for a year, and it's doing well. But what happened is we started leveraging that too much. We said we need one more wide row, we need one other thing, and then our entire business process was starting to be articulated via triggers.
So we started to get this sick feeling in our stomachs that our business process should not be a side effect of writing to Cassandra. That's not good. Okay, what we did right: we did eventually look at REST APIs, and I think if you were in Edward's talk earlier, you know. We wrote Virgil, which is a REST layer on top of Cassandra, and then we got it right: okay, everything's got to go through services and an API. That's a really smart thing to do.
Let's do that. So we put Virgil out there; you can go and use it. It's actually working; ours has been in production for a long time. But I'm really excited about what we saw about IntraVert. If you want that HTTP and REST capability, but you also want to be able to plug in and do your own thing, IntraVert looks like it's going to be a really powerful thing to use.
Okay, so I've spent a lot of time. How am I doing on time? Oh man, okay, I'll go faster. Okay, so Kafka is great; I think a lot of people have heard lots of good things, way better than JMS. We were using ActiveMQ and Oracle Advanced Queuing, and they topple over once you start flooding them with messages. So we took a look at Kafka, and it fitted into the architecture, as I'll show in a few slides.
Likewise with Elasticsearch. Two years ago we were doing the comparison between Elasticsearch and Solr, and we figured Solr was the way to go, because Elasticsearch was too young. But if you've tried to scale Solr, you know it can be a bear: you're doing master-slave kinds of relationships, directing queries to certain nodes and updates to others. It's not very pleasant. So when Elasticsearch incorporated and became a little bit more mature, we took another look at it and said: that's the way to go.
I'm going to leave talking about Storm to Taylor, but basically, as I said before, it offers the primitives that we needed for distributed processing, and it has become our backbone. I think some of the questions around polyglot persistence are: how do I manage and control the data flow from one system to another? This is not the only thing we've got: we've got MySQL databases and Oracle databases, and we've got Elasticsearch and Cassandra.
So we need to control that data flow and make sure that all the systems get updated appropriately. That's what Storm has become to us. But the interesting thing there (and I still long for the day when everything was a trigger) is that all CRUD operations now have to go through this data flow. So I'll just go quickly through the system that we have.
The great thing here, talking about high availability, is that as long as you've done everything in an idempotent fashion, you can tolerate things going down. If you lose part of your Storm topology, no problem: Storm will replay the tuple someplace else, and it'll get run. If you lose a node in your Cassandra ring, no problem:
you've got a replication factor there. If you lose all of Storm for some time period, no problem; or if something was going wrong in your system, no problem, just rewind your offset in Kafka. So all of these things play really well with one another; we were really happy when it all came together. Just real quick, what's coming next: we have a desperate need for a graph database, as I mentioned.
We maintain all these affiliations between all these entities: all the practitioners, how they relate to one another, and all the organizations. So we're going to look at Titan, and I'm looking for anybody that has experience running Titan against Cassandra. Awesome, we'll talk. Okay, so we were looking at Neo4j, but obviously it'd be better if we can manage just the one persistence mechanism. All right, I'm going to leave this to you; you've got the slides. Okay, and here's Taylor. Taylor is the awesome author of storm-cassandra, so go for it.
Taylor Goetz: Okay, how many people here have heard of Storm? Awesome. Is anyone using it? Great. I'm going to go into a little bit of a discussion of how we use Storm and some of the open source software that we've developed for Storm. Storm is a real-time distributed computation system. What that means is that it processes data in real time, and that contrasts with a batch system like Hadoop, where you have a clear start and end to jobs. In Storm, your computations are open-ended; they run forever.
One of the things that Brian mentioned with Hadoop was that we used it for a while to do some of our batch processing. What we also found out was that we needed to react in real time. In some cases we may get data dropped off as a batch, so we could have a Hadoop job that would deal with putting that data into our system of record, but we also wanted to support real-time data coming in. So what we were finding was...
So what does a Storm cluster look like? This is actually a diagram of one of our development clusters, but Storm has an architecture very similar to Hadoop. There's Nimbus, which is the master node that takes care of assigning tasks out to the worker nodes; ZooKeeper is used for cluster coordination. Storm actually doesn't do a whole lot of writing to ZooKeeper,
so it's pretty easy on your ZooKeeper cluster. And then supervisors are your worker nodes; they're the ones that actually run the tasks that operate on your data. In our development environment (and I wouldn't suggest a cluster like this in production; this is what we use for a small development environment to test applications in a clustered environment), we co-locate things. In a production environment we probably wouldn't co-locate Storm and Cassandra, and we'd have a multi-node ZooKeeper cluster.
The next primitive is a spout, and a spout emits streams of tuples. Bolts are your unit of computation: that's where you do your operations on tuples, do calculations, that sort of thing. And then, if you combine spouts, bolts, and streams, they come together in what's called a topology. A topology is a combination of any number of spouts and bolts and defines your overall computation.
To dive in a little deeper with Storm: spouts, as I mentioned before, represent a stream of data, which can be queues like Kafka, JMS, or Kestrel, the Twitter firehose, sensor data like weather, that sort of thing. Spouts emit tuples into the topology. Bolts receive tuples from spouts or other bolts and operate on or react to that data. They can perform functions,
filters, joins, aggregations, and in some cases database writes and lookups, and we'll get into more of that a little later. They can also optionally emit additional tuples. Storm topologies define the computation of your application as a data flow between spouts and bolts, plus the parallelism of components. When I say a spout or a bolt, those operate as tasks in your topology, and tasks can be parallelized, so they can be fanned out across multiple nodes in your cluster. Groupings define how that data routes to individual tasks.
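The difference between a shuffle grouping and a fields grouping can be sketched outside Storm entirely. This is an illustration of the routing idea, not Storm's implementation; the simple character-sum hash is a stand-in for a real stable hash:

```python
def shuffle_grouping(tuples, num_tasks):
    # Round-robin: even load, but no guarantee about which task sees which key.
    return [tuples[i::num_tasks] for i in range(num_tasks)]

def fields_grouping(tuples, field, num_tasks):
    """Route every tuple with the same value of `field` to the same task,
    which is what lets a downstream bolt keep per-key state like counts."""
    tasks = [[] for _ in range(num_tasks)]
    for t in tuples:
        # Stable hash, so 'dr-42' always lands on the same task index.
        tasks[sum(ord(c) for c in str(t[field])) % num_tasks].append(t)
    return tasks

stream = [{"entity": "dr-42"}, {"entity": "dr-7"}, {"entity": "dr-42"}]
routed = fields_grouping(stream, "entity", num_tasks=2)
# Both 'dr-42' tuples end up in the same task's list.
assert any(task.count({"entity": "dr-42"}) == 2 for task in routed)
```

Shuffle grouping is the right choice for stateless, parallelizable work; fields grouping is the right choice whenever a task must see all tuples for a given key.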
The storm-cassandra project that Brian mentioned on GitHub started a little over a year ago. What it boils down to is that we have several Storm components that are bolts (and they're also Trident functions; if I have time, I'll talk a little bit about Trident later). We have these generic Cassandra bolts: one that allows you to write to Cassandra and one that allows you to look up.
A tuple mapper tells the Cassandra bolt how to write a tuple to an arbitrary data model. Your data model can be anything you want; all you really need to do is implement that interface. Given a Storm tuple, it maps to a column family, maps your row key, and then maps the data into Cassandra columns. The columns mapper interface does the exact opposite: it tells the Cassandra lookup bolt how to transform a Cassandra row into a Storm tuple.
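As a rough illustration of that mapper contract (the real storm-cassandra interfaces are Java; every name below is invented for the sketch), a mapper answers three questions about a tuple: which column family, which row key, and which columns.

```python
class EntityTupleMapper:
    """Illustrative mapper: tells a writer bolt where a tuple goes."""
    def column_family(self, tup):
        return "entities"

    def row_key(self, tup):
        return tup["entity_id"]

    def columns(self, tup):
        # Everything except the key field becomes Cassandra columns.
        return {k: v for k, v in tup.items() if k != "entity_id"}

def write_tuple(store, mapper, tup):
    # What a generic writer bolt does: it knows nothing about the data
    # model; the mapper supplies all three answers.
    cf = store.setdefault(mapper.column_family(tup), {})
    cf.setdefault(mapper.row_key(tup), {}).update(mapper.columns(tup))

store, mapper = {}, EntityTupleMapper()
write_tuple(store, mapper, {"entity_id": "dr-42", "name": "Robert Smith"})
write_tuple(store, mapper, {"entity_id": "dr-42", "state": "PA"})
print(store["entities"]["dr-42"])  # {'name': 'Robert Smith', 'state': 'PA'}
```

The bolt stays generic; swapping the mapper retargets the same bolt at a completely different data model.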
The current state of the project: we are on 0.4.2, and that should be released soon, maybe tonight, depending on whether I go to the drink-up or not. We're using the Astyanax client. For a while we were using both Astyanax and Hector and allowed you to switch between them at runtime, but when we started looking at supporting composite keys and composite column names, it became a lot easier to settle on a single client.
Another interesting feature of Storm is DRPC. I talked about how you can parallelize and horizontally scale your topologies. What DRPC does is take that power and add a DRPC server, a specialized spout, and a specialized bolt that does the coordination. What you can do is create a DRPC client that issues a query, and that query gets parallelized out into your topology; then, using transactional mechanisms,
it will actually return you a response. So you get all the power of a Storm topology with, essentially, a remote procedure call. One example of that is the reach computation. Let's define reach using Twitter as an example: if I tweet out a URL, and I have X number of followers, and each one of those has followers, the reach is the total number of people that have been exposed to that URL.
Let's say we are Twitter, and we have massive numbers of users and followers and a huge graph. What you can do with Storm and DRPC and our Cassandra implementation is break that down into several simpler functions that we can then parallelize. In this example, we have a DRPC spout that's just going to receive a URL; the URL is our query.
We want to find out how many users have been exposed to that. It gets passed down to a Cassandra bolt, and that Cassandra bolt is sitting on top of a wide row of valueless columns: the row key is just the URL, and the column names (it's valueless, so the values don't matter) are the users who have tweeted it. What that does is emit the URL and the user for each one of those, and send it to the followers bolt.
What the followers bolt does is, for each user it receives, look that user up and emit all the followers of that individual user. So now we've found that out and we pass it on. We do a fields grouping based on the ID and the follower, and that goes to a partial uniquer bolt, which gathers up partial batches of counts, and then, finally, to a count aggregator.
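Collapsed into one function, the reach pipeline above looks like this. This is a sequential toy with in-memory dicts standing in for the Cassandra wide rows; in the real topology, each step is a bolt and the follower lookups and uniquing are fanned out in parallel.

```python
# Stand-ins for the wide rows: who tweeted each URL, who follows each user.
TWEETED = {"http://x.co": ["alice", "bob"]}
FOLLOWERS = {"alice": {"carol", "dave"}, "bob": {"dave", "erin"}}

def reach(url):
    # Step 1 (Cassandra lookup bolt): wide row of users who tweeted the URL.
    users = TWEETED.get(url, [])
    # Step 2 (followers bolt): emit each user's followers; in Storm this
    # fan-out is parallelized, with a fields grouping for uniquing.
    exposed = set()
    for user in users:
        exposed |= FOLLOWERS.get(user, set())  # set union = uniquing
    # Step 3 (count aggregator): the final tally.
    return len(exposed)

print(reach("http://x.co"))  # 3: carol, dave, erin (dave counted once)
```

The set union is why the topology needs the partial-uniquer stage: dave follows both alice and bob but must only be counted once.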
This is a notional topology; it's not exactly what our solution looks like, but it gives you some idea of how we're using Storm. I mentioned earlier that we treat all incoming data as real-time data. So when we do batches, we read that data and write it to a Kafka queue, and real-time data also comes in the same way: it gets some transformation, goes into a Kafka queue, and then enters one of our topologies.
This is a small subset of the previous one; it's an example of what part of our loading process looks like. We have entity fragments coming in, and you can think of an entity fragment as a smaller piece of a larger entity: it might be address information or state license information. Fragments come in and they're written to Kafka, and we use a transactional Kafka spout to route them into the topology. There are a couple of things going on here.
We shuffle that so we can parallelize amongst multiple fragment functions. Several streams come out of the fragment function, which we group by, for example, fragment ID and entity ID. We do that so we can maintain counts and do analytics kinds of things: one simple count would be the number of fragments we have per entity, the total number of entities processed, the total number of fragments processed. That goes to a persistent aggregate function, and those counts,
B
Through a Trident state implementation, those counts get written to Cassandra. Then we also route another stream to a lookup function, and what that does is look things up in sort of reference tables, lookups to enhance the data. Then it goes to a Cassandra write function, which writes it to Cassandra and passes it on downstream.
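The shuffle, fields grouping, and counting described above can be modeled as a short Python sketch. This is a toy simulation, not Trident code; the field names and fragment records are assumptions for illustration:

```python
from collections import defaultdict

# Toy model of the loading flow: fragments are fields-grouped so the
# same entity always lands on the same counting task, then the partial
# counts are merged by a "persistent aggregate" step (which would
# write to Cassandra via Trident state in the real topology).

def fields_group(key, n_tasks):
    # Fields grouping: hash the grouping field to pick a task, so one
    # entity's fragments always reach the same task.
    return hash(key) % n_tasks

def run(fragments, n_tasks=2):
    per_task_counts = [defaultdict(int) for _ in range(n_tasks)]
    totals = {"fragments": 0, "entities": 0}
    seen_entities = set()
    for frag in fragments:
        task = fields_group(frag["entity_id"], n_tasks)
        per_task_counts[task][frag["entity_id"]] += 1  # fragments per entity
        totals["fragments"] += 1
        if frag["entity_id"] not in seen_entities:
            seen_entities.add(frag["entity_id"])
            totals["entities"] += 1
    # Persistent aggregate: merge each task's partial counts.
    merged = defaultdict(int)
    for task_counts in per_task_counts:
        for entity, c in task_counts.items():
            merged[entity] += c
    return dict(merged), totals

counts, totals = run([
    {"entity_id": "dr-1", "type": "address"},
    {"entity_id": "dr-1", "type": "license"},
    {"entity_id": "dr-2", "type": "address"},
])
print(counts, totals)
```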
B
A
Was that comprehensive? Okay, just one more note, if we can go back here real quick. Some of the things we're talking about are, I think, an evolving area: how to best integrate Cassandra with Storm. You know, we've got this function out there, but the exact composition of the topologies, I think, is something that we don't yet have patterns for. So take a look at this topology, and at some of our other topologies.
A
There are multiple write functions, right? We write to Cassandra here, we write to Cassandra here, and we write to Cassandra there. One of the patterns we've been evaluating is whether you instead construct the mutation as you go through the topology and do one single write at the end. That takes advantage of some of the mutation's properties, so that you can eliminate a little of that eventual consistency: when you write the thing, all the indexes are there immediately.
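The "single write at the end" pattern can be sketched as follows. This is a minimal Python illustration under assumed names (`Mutation`, `STORE`, the column keys); a real implementation would accumulate a Cassandra batch mutation as the tuple flows through the bolts:

```python
# Sketch of the pattern discussed above: instead of each stage writing
# to Cassandra independently, each stage appends to one mutation that
# is applied once, so the row and its index entries appear together.

class Mutation:
    def __init__(self, key):
        self.key = key
        self.columns = {}

    def add(self, column, value):
        self.columns[column] = value
        return self

STORE = {}  # stand-in for Cassandra

def apply_mutation(m):
    # One write at the end: data and index columns land together,
    # instead of trickling in from separate write functions.
    STORE[m.key] = dict(m.columns)

# Each "bolt" enriches the same mutation as the tuple flows through.
m = Mutation("entity:dr-1")
m.add("name", "Dr. Example")         # write function 1
m.add("geo", "40.0,-75.1")           # write function 2 (enrichment)
m.add("index:name", "dr. example")   # index entry built alongside
apply_mutation(m)                    # single write at the end
print(STORE["entity:dr-1"])
```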
A
Actually, we did, yes. And that's what I mean: I vacillate. Some days it's "everything is now in Storm; thou shalt go through the data flow for every CRUD operation that comes in," and that gives you some sanity. Other days I'm like, no, that's terrible, because now I've got to go through the data flow for every CRUD operation.
B
There are definitely trade-offs everywhere. The indexing solution we started with didn't scale for us from a business standpoint. Now, you could argue that doing everything through Storm involves a lot of network traffic, but in general Storm is extremely performant.
B
Because Nimbus is essentially the master. Basically, with Storm you have a master, which is responsible for assigning work out to the nodes, and then you have supervisors. We run both of those, and it's standard practice to run them under supervision; they're designed to fail fast, so if they go down they can be picked back up. If Nimbus goes down, it'll come back up, and if you lose the machine that Nimbus is running on, your topology will continue to flow and work.
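As an illustration of "run under supervision," here is a minimal supervisord config that restarts the fail-fast daemons when they exit. The paths and program names are assumptions; adjust them to your install:

```ini
; Hypothetical supervisord entries for the Storm daemons.
[program:storm-nimbus]
command=/opt/storm/bin/storm nimbus
autorestart=true

[program:storm-supervisor]
command=/opt/storm/bin/storm supervisor
autorestart=true
```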
A
A
A
B
My understanding is that that's coming up. There is replication, you can operate them as a cluster, and you get some partitioning, but I'd have to look at the current version of Kafka. I think that's coming in a new version that provides some enhanced fault tolerance. Yeah, I believe so.
A
Yeah, yep, yeah. It's a great point about Storm.
B
Topologies basically get packaged up as a jar file, a fat jar file, not unlike what you do with Hadoop. They get submitted to the cluster, and Nimbus will pass out the jar file to all the workers. As far as redeploying, yeah, changes to rules would be a redeployment of your topology. Well, yeah, in some ways.
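Submitting a topology as described is a one-line operation with the Storm CLI; a hedged example, where the jar path, main class, and topology name are made up:

```shell
# Package the topology as a fat jar, then submit it to the cluster;
# Nimbus distributes the jar to all the workers.
storm jar target/my-topology-0.1.jar com.example.MyTopology my-topology
```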
A
B
In some cases, but you can build things in. You can do anything you want in a bolt or spout, so you can build in something like that. There's also another project up on GitHub, storm-signals, and what that lets you do is send out-of-band signals to components of your topology, so they can essentially communicate with one another.
A
I guess in a previous presentation, and we've done the same thing: in our world, as tuples are flowing through the system, as the data is flowing through, we have a bolt that we call the processing engine that actually calls out to Ruby; it hands the tuple to the Ruby script.
A
The Ruby can do whatever the hell it wants to transform the data, do anything, and then it hands it back. That thing loads the Ruby from a URL, so we can replace the logic at runtime right out from underneath it. It has a five-minute expiration, so if we update the Ruby, it'll clear its cache, pull the new Ruby in, and the new logic will take effect immediately. In the earlier presentation they were using Drools for the same thing.
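The hot-swap mechanism described above (fetch the script, cache it, re-fetch after a TTL) can be sketched in Python. The `fetch` function and the injected clock are assumptions so the idea is testable; the real bolt would fetch Ruby source from a URL and execute it:

```python
import time

# Sketch of hot-swappable logic: the bolt caches the script it fetched
# and re-fetches once the TTL expires, so updated logic takes effect
# without redeploying the topology.

class CachedScript:
    def __init__(self, fetch, ttl_seconds=300, clock=time.time):
        self.fetch = fetch      # e.g. lambda: urlopen(url).read()
        self.ttl = ttl_seconds  # five minutes in the talk
        self.clock = clock
        self.cached = None
        self.fetched_at = None

    def get(self):
        now = self.clock()
        if self.cached is None or now - self.fetched_at >= self.ttl:
            self.cached = self.fetch()  # cache expired: pull new logic
            self.fetched_at = now
        return self.cached

# Simulated clock and "remote" script to show the expiry behavior.
fake_now = [0.0]
versions = iter(["logic-v1", "logic-v2"])
script = CachedScript(lambda: next(versions), ttl_seconds=300,
                      clock=lambda: fake_now[0])
print(script.get())   # fetches v1
fake_now[0] = 60
print(script.get())   # still cached: v1
fake_now[0] = 301
print(script.get())   # TTL passed: fetches v2
```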
A
B
Yeah, depending on what your topologies are doing, it's very easy to overload certain systems, especially with Kafka. For example, Kafka could probably outpace writes to an Oracle database. So what you can tell Storm to do: Storm has guaranteed processing, so you can say a tuple must pass through the entire bolt tree before it gets acknowledged, and if it doesn't, replay it, when you're using that type of topology.
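The ack-or-replay guarantee can be modeled with a short Python toy. This is not Storm's implementation (which tracks acks per tuple tree via ackers); it just shows the contract, with a hypothetical flaky bolt to force one replay:

```python
# Toy model of guaranteed processing: a tuple must be acknowledged by
# every bolt in the tree, or the spout replays it.

def process(tuple_, bolts, max_replays=3):
    for attempt in range(1 + max_replays):
        # Each bolt returns True (ack) or False (fail); the tuple is
        # fully acked only if the entire bolt tree acks it.
        if all(bolt(tuple_) for bolt in bolts):
            return attempt  # number of replays that were needed
    raise RuntimeError("tuple failed after replays")

calls = {"n": 0}
def flaky_bolt(t):
    calls["n"] += 1
    return calls["n"] >= 2   # fails the first time, then succeeds

replays = process("fragment-1", [lambda t: True, flaky_bolt])
print(replays)  # first pass failed in flaky_bolt, so one replay
```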