From YouTube: Scale Unlimited: Fuzzy Entity Matching at Scale
Description
Speaker: Ken Krugler, President of Scale Unlimited
Early Warning has information on hundreds of millions of people and companies. When a person wants to open a new bank account, they need to be able to accurately find similar entities in this large dataset, to provide a risk assessment. Using the combination of Cassandra & Solr via DSE, they can quickly find and evaluate all reasonable candidates.
So I realized too late that I had a really lousy title for my talk. It should have been something cool like: "So, Mr. Smith, you want to open a bank account?" That's really what this talk is about, but the underlying issue, or the technical name for it, is fuzzy entity matching.
So we've got the obligatory "why am I up here talking, and why should you listen to me?" I've got a boutique big data consulting company; boutique sounds way better than small. It's just five of us, but we help clients with issues around big data problems: Hadoop workflows that we implement using Cascading, which is an open-source API; work with Cassandra and with Solr on the search side; machine learning; etc. We do consulting on that, and we also do training. I actually created the training materials for DataStax for their DSE Solr search class, and occasionally I teach that class for them. I also teach classes on Hadoop and machine learning and other things like that.

I enjoy teaching so much that I volunteer teaching high school programming classes, which, if you've dealt with high school students, means I really like teaching.
All right. So when I go to a talk, I really enjoy having a concrete use case as the context. As the person's talking about things, I can get a sense of what they're trying to tell me and why it matters. So in this case we're going to start with the problem.
Let's say I'm going to a bank and I want to open an account. In this case, this guy who is sitting across the table from me is pleasantly chatting with me, but what's really going on in his head is: should I open an account for you? Would that be a big mistake? What would I regret about doing that? So it's about calculating the applicant's risk, trying to figure out: should I do this? Should I open the account for them?
So what's happened is that I, as the applicant, have provided a bunch of information: my name, date of birth, maybe my social security number, details like that. What they need to do is be able to look at my account history. They want to be able to find every single account that I've owned or had control over, and then whether there were problems associated with that account. That's a key part of it. So really it's about matching up who I say I am, the details I provided, with everybody that they know something about. Does that make sense?
They have to have data on all the bank account activity, but there's no way in hell Wells Fargo is going to give Bank of America their customer list and their account status, and vice versa. That's just not going to happen; these are competitors. So what do you do? Well, the solution to this problem is a company called Early Warning Services.
A
They
provide
it's
a
joint
venture,
so
they
provide
this
commonplace
for
the
banks
to
send
their
data
to
where
they
know
it's
not
going
to
get
passed
around
to
each
other
right.
It's
the
trusted
third
party,
so
joint
venture
of
the
biggest
five
US
banks
plus
then
like
a
800
plus
other
financial
institutions,
send
information.
So
they
have
data
on
pretty
much
everybody
in
the
US.
Who
has
a
bank
account?
So
that's
the
key
they've
got
the
data,
but
now
the
problem
is
you've
got
that
data.
So in general, what do I mean by fuzzy matching? If I've got something that I'm looking for, like this blue triangle over here, I want to get everything that I think is equivalent, and nothing that's different. That's my goal. Now, the key point is: how do I define what is similar or what is dissimilar? In this first row here, I've arbitrarily defined that rotation doesn't matter. The triangle is rotated; it's still the same triangle.
I've said that if the color is close, it doesn't matter; so here it's a slightly darker blue for that second triangle up there, and I'm saying, okay, that's equivalent. And I'm saying if it's slightly smaller, that's fine too. Now notice that I'm using terms like "slightly," so right away you get into this problem where it's not a step function. It's not like down here, where I say: okay, if it's a circle versus a triangle, forget it, it's not the same.
Whether a structural change like this to the triangle is significant or not: we're arbitrarily coming up with aspects, or attributes, that we decide are important, that matter and then have to match, or have to match closely enough, or that, if they don't match, we're done.
So why is it hard? Well, as we talked about, there are fuzzy areas. Like in this case right here: what if it's a really light blue color? Is that a match? I don't know. Or what if we've got some attribute like a really thick border around it; would I consider those two things equivalent? Or what if the triangle is really small compared to the other one?
All right, so again, we've got these slopes, and the problem is, when you decide that if it goes past this level then it's not the same, you're creating a step function, essentially. And then, if somebody's just a little bit past that edge, they fall off it, and now it's not similar: you've removed them from your list, which is a problem.
So that's one problem: lots of areas of grey. The second problem is you can't just use Cassandra. The fundamental problem is: Cassandra is great if I've got a row key. Then it's super fast for me to look up that row key, and if I look for that row key and it's not there, then I know I don't have it. So it's fast and accurate. But I don't have a row key; it's a fuzzy matching problem.
So fundamentally, I can't apply the key feature of Cassandra to the problem. And the third thing is that it's computationally intensive. Imagine I'm comparing two things that have, I don't know, 100 attributes, like strings, and I'm doing string edit distance to do the comparison between them. That's a computationally intensive problem; that can chew up some serious CPU cycles.
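To make that cost concrete, here is a minimal sketch (my illustration, not the speaker's implementation) of Levenshtein edit distance. The standard dynamic-programming version is O(m·n) per pair of strings, which adds up fast across 100 fields and hundreds of millions of records.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via dynamic programming: O(len(a) * len(b))."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,               # deletion
                curr[j - 1] + 1,           # insertion
                prev[j - 1] + (ca != cb),  # substitution
            ))
        prev = curr
    return prev[-1]
```

Even this small quadratic cost, multiplied by many fields and many candidate records, is why comparing against every record is off the table.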
All right, so we've been talking about triangles; now let's go to people. In this case right here, these are all pictures of people who are speakers at the Cassandra Summit, and there's me over there, a really bad headshot that somebody did a couple years ago; I'm getting a new one on Saturday. So I've got information on all these people, and now I'm being asked about a specific person. I'm being asked about me: am I in this group over here? And the thing is, I want to quickly find all the good matches.
One of the challenges here is that it's not like I can search until I find a good match. I need to find all the good matches, which means I've got to go through everybody; there aren't any real shortcuts here, or so it appears. Now, a point to make here: we're not doing batch matching. A lot of people, when you talk about fuzzy matching, are thinking about batch, which could be a self-join: I've got this long list of people with entries that are close, and I've got to deduplicate them. Or let's say I've got a list of people, I get data from another source, and I need to merge those together, but again there's going to be a bunch of duplicates or near-duplicates in there. How do I do that merge? I gave a talk a couple months ago at Hadoop Summit about doing that kind of batch fuzzy matching at scale; it's a different kind of problem.
So what's a good match? What we're doing here for people is: we've got attributes, basically fields with values, and in this case we've got name, address, city, state, zip. So, are these two people the same? If you looked at this right here, would you say these two are the same? Everybody says yes, because you look at it and go: oh, the zip code matches.
Okay, most people look at it and go: yeah, probably. But there are some issues here. For example, the fact that this one doesn't have a zip code and this one does; you look at that and go, whatever. Washington versus WA: that's an alias. The fact that there's no middle initial here is okay; sometimes you don't get the middle initial. Not having an apartment number: okay, sometimes you get the apartment number, sometimes you don't. But the fact that it's 3220 versus 220 there: is that a typo from when people were entering it? It's amazing how noisy data is. In your mind, or at least in my mind, when you think about bank data, you're like: wow, it's going to be spot-on. No. You get typos, weirdly enough, because you've got humans involved, entering data that people scratched onto forms in illegible ink. So you get typos.
So you look at these two things and you're like: well, you wouldn't be a hundred percent, but you'd say yeah, probably. Two things there. One is normalization: being able to know that Washington and WA are the same thing. They're different strings to a computer; knowing that they're the same thing is an issue of normalization. Similarly with Bob versus Robert: you could treat that as a normalization problem. The second thing, though, is: how do you know what features to focus on? What's important when it's the same, or when it's different?
So a typical approach here for calculating similarity, and there are many, many ways to do this; God knows how many research papers or PhDs have been done on similarity. But this is a common approach for record similarity, where you say: for each field, I can calculate a degree of similarity between them, and often how you calculate that winds up being field-specific. Like zip code: if the primary part matches, we're pretty close to being 1.0.
If the sub-part doesn't match, it's not so significant. And then, for each field, you give it a weight: how significant is it if it matches or doesn't match? Zip code matching is more significant than state matching, because zip code is more specific than state, so it has a higher weight. So then what you can do is give each field a weight such that the sum of those weights equals one.
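A minimal sketch of that weighted scheme (my illustration with made-up field functions and weights, not Early Warning's actual match logic): per-field similarity functions, field weights summing to 1.0, and an overall score that is the weighted sum.

```python
def zip_sim(a: str, b: str) -> float:
    """Field-specific similarity: the primary 5-digit part dominates."""
    if not a or not b:
        return 0.5  # missing data is neither a match nor a mismatch
    if a[:5] != b[:5]:
        return 0.0
    return 1.0 if a == b else 0.9  # zip+4 differs: still nearly a match

def exact_sim(a: str, b: str) -> float:
    return 1.0 if a == b else 0.0

# Weights sum to 1.0; more specific fields get higher weights.
FIELDS = [
    ("name",  exact_sim, 0.3),
    ("zip",   zip_sim,   0.4),
    ("state", exact_sim, 0.3),
]

def record_similarity(r1: dict, r2: dict) -> float:
    """Weighted sum of per-field similarities, in [0.0, 1.0]."""
    return sum(w * sim(r1.get(f, ""), r2.get(f, "")) for f, sim, w in FIELDS)
```

With the weights summing to one, a perfect match scores 1.0 and every mismatched field pulls the score down in proportion to its weight.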
Ask the question: does it scale? The answer is no. The issue is, again, for a single person like me being matched against hundreds of millions of potential people, you have to do that comparison for every single record, and we talked about it being computationally intensive. It works for a couple hundred, maybe even a couple thousand people; it does not scale to hundreds of millions. So fundamentally, what you need to do here is figure out a way to do fewer comparisons.
How do you do fewer comparisons? That's where search comes in. So, search: who here has experience with search in general? Hands up, anyone? Quite a few people. How about specifically Solr? A couple of people, all right, great. How about using DSE with Solr? A few of you still, okay, all right. If I accidentally just say Solr in here, it's because I'll do that; it's an open source project that provides search, and we'll go into it in more detail in a little bit.
But what I'm going to talk about here is search in general. Search is fast, and similarity is what search is all about. It takes a query; say I'm looking for "scale unlimited blog". Let's say that's my query: three words. It actually turns that into a document, just like any other document that I've got in my search index. It turns it into a document, where a document is essentially a feature vector: a multi-dimensional vector where there's one dimension for each unique word.
So if my search was "scale unlimited blog", I'd wind up with a feature vector that has three dimensions to it, and the magnitude of each weight is based on something called TF-IDF: term frequency, inverse document frequency. What it really means is: how often does this word occur in this document? That's the term frequency part, how significant the word is in the document. And IDF is inverse document frequency.
It's how significant this word is across all documents. As an example, the word "the" is not very significant, because it's in every single document, so the fact that my query has the word "the" doesn't matter. Its document frequency is really high, so its inverse document frequency becomes very low. So it's sort of: how significant is a word across my entire corpus? Basically, you get a weight for each word in this feature vector, and that actually gives you a vector, because each dimension has a magnitude.
Okay, so that basically gives me this three-dimensional vector here; in this case, for the three-word query, it's three dimensions, so I've got three values, and each value has a weight, or magnitude, set according to term frequency times inverse document frequency. Think of it as: here's this term vector that I've got for this document; I can compare it to some other document's term vector, and there's an angle between them. Vectors have angles between them.
If that angle is zero, which means these two term vectors point in exactly the same direction, then the cosine of that angle is one: great, they're exactly the same. Cosine similarity gives me 1.0 for exactly the same. If they're 90 degrees apart, cosine will give me 0: there's nothing in common between them. And if they're opposite each other, so the angle is 180 degrees, it's minus 1: it's actually showing negative correlation between them.
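A toy version of what the search engine is doing under the hood (a sketch, not Lucene's actual scoring code): build TF-IDF term vectors for a few tokenized documents and compare them with cosine similarity.

```python
import math
from collections import Counter

def tfidf_vectors(docs: list) -> list:
    """One TF-IDF weight per unique term, per document."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))  # document frequency
    vectors = []
    for doc in docs:
        tf = Counter(doc)  # term frequency within this document
        vectors.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return vectors

def cosine(u: dict, v: dict) -> float:
    """Cosine of the angle between two sparse term vectors."""
    dot = sum(u[t] * v.get(t, 0.0) for t in u)
    norm = (math.sqrt(sum(x * x for x in u.values()))
            * math.sqrt(sum(x * x for x in v.values())))
    return dot / norm if norm else 0.0
```

Note how a term that occurs in every document gets `log(n/n) = 0` weight, which is exactly the "the word 'the' doesn't matter" behavior described above.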
However, this cosine similarity is search similarity; it's not the same as the match similarity that we were talking about earlier, obviously. Match similarity had per-field weights and maybe edit distances and all kinds of crazy stuff. Search generally doesn't have the same level of sophistication, because it's really trying to be fast. So how do you deal with that?
You use search to narrow down the set of candidates that you then have to do match similarity with. That's the key point: if I can use search to say, out of the hundreds of millions of people that I know about, what are the hundred that are likely to be similar, to be good candidates? And if I have it down to a hundred, then I can do that match similarity and it's okay. So I'm using search to narrow down the candidate set.
But the issue here, as we said, is that search similarity isn't going to be exactly the same as the match similarity score, so the candidate set I get won't be ranked the same way. So what do you do? Well, you throw a bigger net. Say that, at most, I'm going to get ten matches; I expect I'm only going to get at most 10 matches, or maybe 10 matches is all I care about.
If I get more than that, I'm dealing with somebody who's creating a whole bunch of bogus IDs, and I don't want them anyway. But let's say I get at most 10 matches. Well then, I could say: okay, let's find a hundred candidates, or maybe a thousand candidates; I scale up that 10 by some amount, and that's my candidate list that I then do match similarity on. Does that make sense? So basically we're using that super fast search to narrow down the set of candidates to the level where I can actually do this match similarity on what's left.
So it's a two-step process. Here's the information being provided by some person: they've got a name, social security number, and maybe a date of birth; maybe I'll have address as well, but let's say we've got these three attributes from the form. So I'm going to turn it into a query, and one of the interesting things you can do with Solr, and other systems, is that you can add weights to the fields.
A
You
can
put
weights
in
there
that
are
similar
to
the
weights
you
put
on
fields
when
you're
doing
that
match
similarity
which
fields
matter
more.
So
in
this
case
right
here,
I'm
saying
if
the
social
security
number
matches,
let's
boost
the
importance
of
that
by
10.
If
the
date
of
birth
matches
that's
boosted
by
five,
if
the
name
matches,
let's
boost
it
by
three,
so
essentially,
I
can
provide
some
hints
to
search
where
I'm
using
weights
for
fields
that
are
similar
to
the
weights.
I
use
when
I'm
doing
matching.
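A boosted query along those lines might look like this; the `field:value^boost` form is standard Lucene/Solr query syntax, but the field names and boost values here are illustrative, not the actual schema from the talk.

```python
def build_boosted_query(ssn: str, dob: str, name: str) -> str:
    """Build a Lucene/Solr OR-query where each clause carries a boost."""
    clauses = [
        f'ssn:"{ssn}"^10',   # exact SSN match is the strongest signal
        f'dob:"{dob}"^5',    # date of birth is next
        f'name:"{name}"^3',  # name is the weakest of the three
    ]
    return " OR ".join(clauses)
```

Any clause can match on its own, but a candidate matching the heavily boosted fields will score much higher and float to the top of the results.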
So your search system then goes out and hits the index, and it comes back with its list of what it thinks the top candidates are. I say: give me ten, give me a hundred, whatever, and it'll come back with those, ordered. That's the first step. Then the second step is that I run the matching logic against those results, and the matching logic is going to come up with different numbers, based on its own way of comparing things.
So it's essentially re-ranking things: it reorders them, and then you can say, well, anything less than some number, say 0.7, I don't consider a match. So I'm going to be left with some subset of these things. Make sense? So it's two steps: search, then re-rank using the match similarity.
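Putting the two steps together (a sketch of the shape of the pipeline; `solr_search` and `record_similarity` are hypothetical callables standing in for the real search client and match logic, and the threshold and net sizes are made-up numbers):

```python
MATCH_THRESHOLD = 0.7  # scores below this don't count as matches

def find_matches(applicant: dict, solr_search, record_similarity,
                 expected_matches: int = 10, net_factor: int = 10) -> list:
    """Two-step match: fast search for candidates, then expensive re-rank."""
    # Step 1: throw a bigger net than the number of matches we expect.
    candidates = solr_search(applicant, rows=expected_matches * net_factor)
    # Step 2: re-rank with the expensive match similarity, then filter.
    scored = [(record_similarity(applicant, c), c) for c in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(score, c) for score, c in scored if score >= MATCH_THRESHOLD]
```

The expensive comparison now runs against at most `expected_matches * net_factor` records instead of hundreds of millions.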
One of the things I should have said at the beginning of this talk is that there are some details about the implementation for Early Warning that I obviously can't talk about. Like, I'm not going to say:
"Oh, you know, if you put an umlaut over a vowel in your name, we can't match the name at all." And that actually isn't the problem, so don't bother trying that. But obviously I can't go into the details of exactly how their match logic works. I'm giving you a very simple approach; they've got way more sophisticated things, but you get the basic idea here.
You can take the zip code and say: let's break it into a zip field and a zip-plus-4 field, and we'll search on just the zip, or on the zip-plus-4 with a higher weight if we've got it. That way, if the person only gives you their zip and you have zip-plus-4, it still works, versus not finding it at all. If it didn't find it at all, that means your search similarity is being skewed relative to your match similarity. So the more normalization you can do, the better your search results can correlate with your match results.
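That zip/zip-plus-4 split might look something like this at index time (my illustration; the field names are made up):

```python
def split_zip(raw: str) -> dict:
    """Split a raw zip into separate 'zip' and 'zip4' index fields,
    so a 5-digit query can still hit a record stored with zip+4."""
    raw = raw.strip()
    if "-" in raw:
        zip5, zip4 = raw.split("-", 1)
        return {"zip": zip5, "zip4": f"{zip5}-{zip4}"}
    return {"zip": raw, "zip4": None}
```

Queries then hit the `zip` field always, and the more specific `zip4` field, with a higher boost, only when the applicant supplied it.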
Now, why do you care about N? Well, if N is too big, it's like this case right here, where I'm throwing this big red net, and the stuff up in purple is the stuff I actually care about. So if N is too big, we've got a performance issue, potentially, because we're putting more load on the search system to give us back this bigger set of results, and then I still have to go through that bigger set of results and run the matching logic against each one.
And if N is too small, you can miss the record that matters: the guy is off, you know, dribbling a basketball, you don't know about it, and so you're like, oh sure, let's open an account for this person. So how much you care depends on your use case, and the key here is tuning the search to mimic the match similarity. As we talked about before, you can do things like weighting the value of matching different fields when you make your search query; that's a common technique for getting these more in alignment, along with the normalization.
So, as I said, I mentioned Solr and that I would talk about it. Solr is an open source project out of the Apache Software Foundation. It's built on top of Lucene, which is a low-level information retrieval engine. The key things about Solr, and Lucene, are that it's highly scalable, up to billions of documents, and pretty darn fast even at that level, and you can customize it and configure it to your heart's content. So often you can make Solr do your bidding.
What you do in Solr is you have a schema. Lucene itself is almost like Cassandra, in a way, in that you can have these documents, and every document can have different fields in it, and you can put anything in there that you want; it's schema-less. Solr puts a schema on top of that. So you have fields, and you say what type they are, and by saying the type of each field, you say how it gets analyzed and how it gets searched.
A
So
this
is
the
schema
that
you're
putting
on
top
of
the
data
that's
going
into
the
index.
So,
for
example,
I
can
have
a
name
field,
they're
a
type
text,
and
I
can
define
the
text
field
type
to
say.
If
you
see
robert
treat
it
as
bob-
and
I
can
do
things
like
that,
I
can
do
synonyms.
I
can
control
how
it
gets
tokenized
how
it
gets
broken
up
into
individual
words,
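The kind of normalization such a text field type does can be sketched like this (a toy stand-in for Solr's analyzer chain, with a made-up synonym table, not Solr's actual code):

```python
import re

# Toy synonym table, analogous to a Solr synonyms file (illustrative only).
SYNONYMS = {"robert": "bob", "wa": "washington"}

def analyze(text: str) -> list:
    """Lowercase, tokenize on word characters, then map synonyms,
    so 'Robert' and 'bob' index to the same term."""
    tokens = re.findall(r"\w+", text.lower())
    return [SYNONYMS.get(tok, tok) for tok in tokens]
```

Because both the indexed data and the query go through the same analysis, "Robert" in a query matches "Bob" in a stored record.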
So: DSE search with Solr.
What is it? It's not part of the open source Cassandra project; it's an extension that is specific to the commercial product you get from DataStax. What it does is let you say: for this table in Cassandra, I'm going to have a Solr index that is automatically kept in sync with it. So if I write to Cassandra, my index gets updated; if I delete from Cassandra, my index gets updated. And likewise, since my Cassandra data is replicated and distributed across multiple servers, my Solr index is replicated and distributed across multiple servers.
Now, as I mentioned, it's leveraging the existing Cassandra replication, so you get reliability. A node goes down, you can still search, because you've got copies of the data on other nodes, just like with Cassandra; a node going down isn't a problem. And the data is distributed, so if I need more capacity, I can just scale up my cluster and I've got more search capacity. You also get replication between data centers, which, for enterprise solutions with Solr, is a beautiful thing. I can have a data center on the east coast, and so on.
There's something called a secondary index that you can use. Well, there's a hook there, a handy-dandy hook that's effectively undocumented, but it's there, added by a DataStax engineer, that lets them hook into it and do this thing where, as documents, as rows, get modified or added to Cassandra, the hook is in there and it says: okay, I'm going to queue up a Solr index change for every Cassandra row change. Okay, so that's how it keeps it in sync.
But a key point is that it's way slower than the Cassandra writes. A Cassandra write, you know, it's whatever, pick a number: twenty thousand writes per second per node on regular hardware, pick your number. Solr is nowhere close to that, for a couple of reasons. One is that when Solr writes, it does analysis on the data: it's sitting there tokenizing it and normalizing it and doing all this work, versus a Cassandra write, which is just: okay, here's some data in the memtable, here's some more data in the memtable.
Solr actually has to do work. The other problem with Solr is that when you write a document, like a row, into Solr, you need to have all the data to write that row. So what that means is that when you do a Solr update, when you update a single field, you know, a single column, in your Cassandra row, what it has to do behind the scenes is read that row, to get the whole row, so it can build that Solr document and write it.
So it's violating the "never read before you write" rule: when you're writing, it always has to read before it writes, to build the full Solr document. How much slower is it? I don't know. I use something like 10x as a rule of thumb, but I think it varies a lot based on your configuration, how much work Solr is doing, etc.
Well, the secondary index: the interesting thing is that the secondary indexing hook is not at the level where you're writing to some Cassandra node and then that gets distributed out to other nodes. It's actually on the individual node. So you're going to wind up doing this indexing process three times if you have three copies of the data; it's actually happening on each node.
So when a record gets flushed by TTL, you essentially have a row mutation, and then the regular process happens: yes, it will replicate into the Solr index; yes, you'll get the effects. The goal of the whole system is that if you have this in a Cassandra table, if you make a request to Cassandra and you get this, then you get the same thing on the Solr side; it's keeping those things in sync. So if a TTL triggers and you've got some data getting deleted, it will get deleted from Solr.
The Solr index is also saved on the same Cassandra nodes. The index isn't saved in a Cassandra table; the index is saved to disk, the local drive. But what's interesting is that they do leverage the Cassandra tables for other things. Like that Solr schema I was talking about, with the fields and the types: that essentially is stored in a Cassandra table, and that's how they replicate it out to all the nodes. So they leverage that a lot.
So the question is: how do you associate a column with specific Solr behavior? That Solr schema that I showed you: what happens is it associates the field names in there with the column names in Cassandra. So if I want to do a certain thing with the text in Solr, I just create a field in Solr that does whatever I want it to do, and I make sure the name of that field matches the Cassandra column name. All right, I'm going to keep going.
We can definitely talk afterwards if you have more questions. So, here are some performance numbers. Using mock data, we wrote 170 million records into an eight-node cluster. The writing took two and a half hours, so definitely much slower than if we were just slamming records into Cassandra without Solr. When the writing to Cassandra finished, we were about fifteen percent of the way done with indexing it; so the indexing is running behind, and when the indexing runs behind, it also slows down your Cassandra writes.
There's actually this back-pressure support that tries to prevent you from overloading the system, so it'll slow down your writes, which is why the writes took longer than they normally would. And then afterward, once the Cassandra writes finished, it actually took about another 12 hours for the index to be totally in sync. Now, there are definitely things you can do to reduce that amount of time. For example, you can offload work from Solr into whatever is generating the data you're putting into Cassandra.
After it does that, there's this workflow: we build our workflows using something called Cascading, which is kind of a Java API on top of Hadoop, so you can stay sane when you're working with things like joins and whatnot. And then, after it's done that, there's this thing of getting it into Cassandra, and we still use Hadoop for that. In Hadoop there are map and reduce tasks, you know, the two phases of MapReduce, and in reduce you can control the level of parallelism.
You can say: I want to run 3 reduce tasks in parallel; I want to run one; I want to run 100 in parallel. So the reduce task is actually where we talk to Cassandra, because we have better control over the amount of parallelism, like how hard we're hitting our Cassandra cluster. It's kind of interesting, because the actual Java driver for Cassandra also lets you have some level of parallelism in it.
So the question is just where's the right place to do it. And as I mentioned, the bottleneck, though, is the Solr indexing.
There is an approach that we tried which is kind of interesting. You can write to a Cassandra table where the table isn't hooked up to Solr: you just have a regular Cassandra table, you write into it, and it's super fast. And then, after you do that, you say: oh, now we want to associate this Solr index with the Cassandra table, and it starts doing the indexing in the background. Turns out that's currently single-threaded, so after about 18 hours we killed it; it was slower.
I mean, the Hadoop cluster resources got released faster, but the overall time to get a full index in there was significantly longer. All right, so we've got this Hadoop cluster writing things into this Cassandra-plus-Solr setup, and then we've got this actual API that's making Solr queries against it. Makes sense, right?
So, ingest performance: we talked about how you want to do writes without reads; that's the fundamental thing. But the problem here is, if you're adding an entry, what we tried first is: we've got to see if we already have this entry in there. So we do a Solr query, and if we've got the entry, then we just add the data to it.
Instead, we derive a hash from the searchable fields. What that means is: if all the searchable fields are the same, the hash is going to be the same, and we know we've got that same entry in there. Otherwise, it's going to be different. So what we wind up with is: you know, if the address changed, we'll get a different entry in there, but we just expand our net a little bit more to handle those duplicates, and then our ingest performance is way better.
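A sketch of that idea (my reconstruction, not their actual schema): derive a deterministic key from the searchable fields, so writing the same entry again is just an overwrite, with no read-before-write check.

```python
import hashlib

SEARCHABLE_FIELDS = ("name", "dob", "ssn", "address")  # illustrative field set

def entry_key(record: dict) -> str:
    """Deterministic key: same searchable fields -> same key, so re-writing
    the same entry overwrites it instead of needing a read-first check."""
    canonical = "|".join(str(record.get(f, "")).lower() for f in SEARCHABLE_FIELDS)
    return hashlib.sha1(canonical.encode("utf-8")).hexdigest()
```

A record with a changed searchable field (say, a new address) hashes to a different key and lands as a separate row, which is why the query-time net gets widened to absorb those near-duplicates.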
So if we try to add essentially the same entry but with different account data, which we don't search on, then we'll say: oh, we already have this one in there; we can just update the account data. If, on the other hand, any of those fields have changed, the date of birth, etc., then we get a different hash and we'll get a different row. Okay, so, summary. This is for ad-hoc kind of de-duping fuzzy matching, not batch level. I think my slides from Hadoop Summit are posted somewhere.
If you care about similarity at scale, like doing the batch-level dedupe, go look at those. The key is: we use search to get a small set of candidates that we then run that expensive match similarity against. The pain, like with all this stuff, is always the data prep: the normalization, cleaning up the data, dealing with messy data, that's the problem. And using essentially Cassandra plus Solr in DSE makes this actually pretty darn easy, architecturally, to handle.