Description
Speaker: Anastasia Zamyshlyaeva, VP Platform - Product Management
Cassandra's flexibility and scalability make it an ideal foundation for a modern data management architecture. Come hear how Reltio is using Cassandra, in combination with graph technologies and Spark, to deliver a new breed of data-driven applications.
In this presentation you'll find out:
- How we ended up selecting Cassandra
- The unique characteristics of data-driven applications
- The best practices we learned by combining Cassandra, graph technology, Spark and more
How many of you are using Cassandra for operational enterprise software, software for business people? Oh, pretty good group of people. But usually enterprise software is not a common application for Cassandra; it traditionally relies more on relational databases. At Reltio, though, we use Cassandra to build very powerful data-driven applications, and in this session I'll tell you why.
What do we mean by data-driven applications? For our customers, this means being right faster: being right faster with reliable data, relevant insights and recommended actions.
You can't be right if your data is wrong: garbage in, garbage out. It is impossible to have relevant insights if you are not factoring in the huge amount of information that builds the complete picture, and recommended actions come from learning from data: user behavior, data patterns; in other words, applying machine learning. In this session I'll be focusing on the first part of the Reltio equation, reliable data. This is the one where Cassandra plays a major role. But first, let me start with some introductions.
If you focus on the smallest details, you never get the big picture right, like in this picture. When I show you the first fragment, you can try to guess what it is. You can even make some decisions on top of it. Then, when I show you the next fragment of the data, your thoughts can go in a completely different direction. But the truth is that reality could be completely different still, and this is what is happening within the enterprise.
Enterprises have all this data, but business people are making decisions with just a smaller picture in front of them; they're limited to the application they're using. Let's dive into this in more detail with a company that most of you are familiar with. This is the enterprise company from the TV show The Office: Dunder Mifflin, for those who do not know, a paper and office supplies enterprise company. They have a sales department whose goal is to find customers and sign contracts to supply paper.
Dunder Mifflin bought an application for the supply team that allows them to effectively deliver paper to various customers in different regions. Of course they have a marketing department. They have support, and a lot of other departments. So in this picture there are just five of them: five departments, five different applications that perfectly address the needs of each department. But these five applications each have their own database, each their own data store. That's why all the data is isolated and kept in silos, while at the same time information about the same object could be stored in various places.
So, for example, on the company website there could be information about John Smith, the customer account he's maintaining; there could be information in the sales department, in supply, everywhere across the enterprise: information about the same account, showing data from different perspectives. And data can be updated in one place while the other applications have no idea about it. To bridge this gap, the IT team comes into play. They're asked to synchronize data from one application to another; so, for example, if the sales team signs a contract, they want to automatically create an account on the website.
Actually, companies are spending huge budgets on this kind of activity. A lot of the time they're trying to introduce keys, so as to listen to what is happening in one system, and when something happens there, they take the data, transform it into the format that the other application understands, and save it. And at the same time, what if that application is not available?
This is an actual diagram of such an enterprise system. You can see a lot of applications with their own databases, and there is synchronization between the various applications. The synchronization can be done with special tools that companies buy, tools that focus on bringing data from one place into another.
Can you believe it? Yeah, really, this is what's happening within the enterprise. And after this, the logical question is: is the data up to date? Another logical question: is the data correct? What if one application provided incorrect data? Then it just spreads across the whole enterprise, and the whole enterprise now has incorrect information. How do you roll it back? How do you understand what the current state of the object should be? The other logical question: is the data complete? Every application is limited to its data structure; of course, there's some flexibility to expand the schema.
To address this problem, fifteen years ago a new type of application appeared, to unify data from multiple sources. They're called master data management systems. Their goal is to consolidate data from multiple sources, bring it together, address any conflicts, blend it, and provide a unified view of data across the enterprise.
Since this type of application appeared fifteen years ago, they are traditionally based on relational databases, with the problems that relational databases have, such as a fixed structure. What if you have a new attribute that you want to bring in? Then you need to do an ALTER TABLE, which locks the database. Not cool. It's close to impossible to bring in big data, for example. Think about how we would realistically bring in information about all the emails across an enterprise, or about the click streams of clients on the website: almost impossible, because hardware is crazy expensive for such systems.
Cassandra doesn't have a very powerful data model that we would benefit from, and it doesn't have other bells and whistles such as a powerful query language or bulk operations, but it does what it promises. It has high performance, fault tolerance, linear scalability, multi-data-center availability: all the things that you hear a lot at this conference. And we use commodity hardware. That's why we've chosen Cassandra as our primary data store.
According to DataStax's recommendation, you need to think about how you want to use the data, how you want to expose it, what queries you want to make, and this will drive your modeling. In our case, the case of a platform, we have no idea what objects we'll be working with. Will it be organizations, individuals, products? Or maybe it will be a database of cats and dogs; anything could be there.
So here on the left side you can see various UIs that were generated automatically out of metadata, which help maintain doctors, hospitals, affiliations, hierarchies, historical data, and we provide insights and recommended actions for the sales or marketing team in these UIs. In another scenario, the same cloud, just a different configuration, we have a configuration for oil wells and other equipment, and with this application we are targeting a completely different set of users.
These are users who are interested in having a 360-degree view of oil and gas production across hundreds of thousands of wells, worldwide or in one country. Another application, another set of metadata that drives all the UIs and APIs: somebody wants to manage an asset catalog for movies, for songs, for TV shows, and blend all this information with social media, for example to do some sentiment analysis on top of various movies, and this is a different application again.
In Reltio you can store entities: organizations, individuals. You can store relationships, such as John's spousal relationship. You can store graph information, John's social graph; transactions, how John went to the website and what links he clicked, so that maybe we can predict what he's interested in; and we store historical information. Cassandra is the foundation for all of this. In the remaining session, let me focus on the challenges that we had using Cassandra while building it.
So the first challenge that we had was modeling complex documents, and the additional complexity that we have on the Reltio side is that every attribute, even a simple string, can come from multiple sources. That's why we need to support multi-valued attributes for everything, because sources, different systems, can't agree on having a certain value. An additional complexity is that we want to support very complex structures, and again these complex structures can come from multiple sources.
So let's take an example. We want to build an application that maintains individuals with emails, addresses, names, and this is the kind of business object that our users are configuring. There's no need for them to go down into Cassandra and all the underlying structures; they want to work with individuals with that structure, and we drive everything by ourselves.
One value for a simple attribute goes into one column, and the column name is metadata-driven, so in this case Name.1, and the value goes into the cell. You can see that even for simple attributes we have an index, and this is because data is coming from multiple sources and even simple attributes can have multiple values. Multi-valued attributes work in a very similar way to simple ones: each value goes into a separate column, the column name is metadata-driven and unique, and the value goes into the cell.
This approach allows us to update each element independently of the others, so we can just update Email.1 without touching the data that is stored in any other cell. All right, let's go into a more complex scenario where we have nested attributes. In this case, it is very important not to mix data between the nests. For example, we don't want to say that California is the billing address and New York is the shipping address when it is the other way around; we really need to preserve this.
That's why every top-level attribute has its own ID: the California shipping address is one, and the New York billing address is another. Then for nested elements, such as state or type, we also introduce IDs, because the data can come from multiple sources. This is how we model nested structures. What does it give us? We can build nested structures of any depth, we can have any number of attributes on each level, and we can update each element independently.
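As a rough sketch of this layout (the attribute names and the 1-based ID scheme below are invented for illustration, not Reltio's actual format), a nested business object can be flattened into independently addressable cells:

```python
def flatten(doc, prefix=""):
    """Flatten a nested document into (column_name, value) cells.

    Each list element gets a 1-based ID appended to the attribute name,
    so every value, at any nesting depth, is addressable on its own.
    """
    cells = []
    for attr, values in doc.items():
        if not isinstance(values, list):
            values = [values]
        for i, value in enumerate(values, start=1):
            name = f"{prefix}{attr}.{i}"
            if isinstance(value, dict):
                cells.extend(flatten(value, prefix=name + "."))
            else:
                cells.append((name, value))
    return cells

individual = {
    "Name": "John Smith",
    "Email": ["john@example.com", "jsmith@example.com"],
    "Address": [
        {"State": "CA", "Type": "Shipping"},
        {"State": "NY", "Type": "Billing"},
    ],
}
for cell in flatten(individual):
    print(cell)  # e.g. ('Address.1.State.1', 'CA')
```

Because each cell has a unique, ID-bearing name, updating Email.2 or Address.1.State.1 never touches any other cell, which is the independence property described above.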
We can retrieve just a subset of attributes; for example, in this query I'm interested only in addresses, and we can do that with such a model. We can support thousands of attributes that each have a lot of values, and with such a model we store only those attributes that have actual data, so this saves us disk space. The Reltio platform takes care of transforming this kind of structure into documents and will handle any complexity that you might have in the data. So this is the smart logic that Reltio puts on top of it.
As you can imagine, this is a tree structure, right? This is how we started, and we were quite happy with Thrift. Then, per DataStax's recommendation, we were supposed to move to the Cassandra client that uses CQL. Our first impression was that it would be close to impossible to support our very complex structure, which can have thousands of attributes where each attribute can be multi-valued. We even did some experiments.
We tried to model what we have right now with a static schema, and the result was performance degradation. Then we started thinking: does CQL really mean static, or can we continue having wide rows? And CQL does support wide rows. So this is the definition of the same schema I was talking about before: we have a column family for entities, which has the entity ID as the partition key.
We have the attribute name and the attribute value to model the same structure, and we can actually continue making the same requests as we did before, for example retrieving just some of the information, such as addresses. So basically everything that you could do with the Thrift API, you can continue doing with CQL3.
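A minimal in-memory sketch of that wide-row idea, assuming a layout where the entity ID plays the role of the partition key and the metadata-driven attribute name acts as a clustering column (the table shape and attribute names are assumptions for illustration, not the actual schema):

```python
from bisect import bisect_left, bisect_right

class WideRowStore:
    """Toy model of a wide row: one partition per entity, cells kept
    sorted by attribute name, like a clustering column in CQL."""

    def __init__(self):
        self.partitions = {}  # entity_id -> sorted list of (attr_name, value)

    def put(self, entity_id, attr_name, value):
        row = self.partitions.setdefault(entity_id, [])
        names = [n for n, _ in row]
        i = bisect_left(names, attr_name)
        if i < len(row) and row[i][0] == attr_name:
            row[i] = (attr_name, value)        # update one cell independently
        else:
            row.insert(i, (attr_name, value))

    def slice(self, entity_id, prefix):
        # Rough equivalent of a CQL range over the clustering column:
        # WHERE attr_name >= prefix AND attr_name < prefix + '\xff'
        row = self.partitions.get(entity_id, [])
        names = [n for n, _ in row]
        return row[bisect_left(names, prefix):bisect_right(names, prefix + "\xff")]

store = WideRowStore()
store.put("e1", "Address.1.State.1", "CA")
store.put("e1", "Address.1.Type.1", "Shipping")
store.put("e1", "Name.1", "John Smith")
print(store.slice("e1", "Address."))  # only the address cells come back
```

The prefix slice is what makes "give me only the addresses" cheap: it is one contiguous scan within a single partition, which is exactly the access pattern wide rows were designed for.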
Another scenario where we use Cassandra: we use it for hybrid graphs. I already walked you through how we can model entities with infinite attribution, but having information about entities is not enough.
It's not enough to know my name and my email address to know who I am. It is important to know what relationships I have, and what the strengths of those relationships are. That's why relationships are very important, and we support them through Cassandra. So here you can see various entities, such as organizations, employees, products and individuals, which can have infinite attribution, and we can also define in metadata the relationships between these objects.
So, for example, I can say that Dwight is an employee of Dunder Mifflin, and that John buys copy paper, and this is what you define in metadata. This is what drives the UI; it drives all the APIs. So from Cassandra we use the basic foundation to actually save our data, and on the Reltio side we have metadata-driven graphs with a very rich model for entities and relationships with infinite attribution, and we do partitioning and effective joins to provide high-performance graph operations.
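To make the metadata-driven idea concrete, here is a toy sketch (the metadata format and all names are invented for this example, not Reltio's): relationship types are declared as data, and edges are stored per source entity, roughly like a partition per vertex:

```python
# Entity types and relationship types declared as metadata, not code.
metadata = {
    "entities": ["Organization", "Individual", "Product"],
    "relationships": [
        {"name": "EmployeeOf", "from": "Individual", "to": "Organization"},
        {"name": "Buys",       "from": "Individual", "to": "Product"},
    ],
}

edges = {}  # source entity ID -> list of (relationship name, target ID)

def add_edge(rel_name, source_id, target_id):
    # Only relationships declared in metadata are allowed; the same
    # metadata can also drive UI forms and API shapes.
    names = {r["name"] for r in metadata["relationships"]}
    if rel_name not in names:
        raise ValueError(f"relationship {rel_name!r} not declared in metadata")
    edges.setdefault(source_id, []).append((rel_name, target_id))

add_edge("EmployeeOf", "dwight", "dunder-mifflin")
add_edge("Buys", "john", "copy-paper")
print(edges["dwight"])  # [('EmployeeOf', 'dunder-mifflin')]
```

Keeping all outgoing edges of a vertex under one key is what makes neighborhood traversals a single-partition read, which is the partitioning point made above.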
So that's why it is very important to build an effective deduplication mechanism, and without factoring in all the information, a high-performance, high-quality dedupe mechanism that needs no user intervention is impossible. Like in this example: I have two John Smiths with slightly different names.
So potentially these could be the same entity, but we can't guarantee it, and if we start making decisions just on top of attributes, we can end up with too many false merges. Then, after that, data stewards need to go and unmerge and understand what happened: pretty inconvenient. But what if we have additional information that we can start factoring in? For example, we have information about their addresses, and we know that they live within a two-mile radius; this really increases the chances of these two records being the same.
And what if I tell you that the one John Smith and the other John Smith have the same daughter, Stephanie? Then all this information together raises the probability of these records being the same to close to a hundred percent. So it's very important to factor in all the information to build an effective deduplication mechanism.
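The reasoning above can be sketched as a scoring function; the weights, thresholds and similarity measure here are made-up assumptions purely to illustrate how each extra signal shifts the match probability:

```python
import math

def name_similarity(a, b):
    # Crude token-overlap similarity; real matchers use fuzzy comparators.
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb)

def geo_bonus(miles_apart, radius=2.0):
    # Full bonus inside the radius, decaying quickly outside it.
    return 1.0 if miles_apart <= radius else math.exp(radius - miles_apart)

def match_score(name_a, name_b, miles_apart, shared_relative):
    score = 0.4 * name_similarity(name_a, name_b)
    score += 0.3 * geo_bonus(miles_apart)
    if shared_relative:          # e.g. both records link to daughter Stephanie
        score += 0.4
    return min(score, 1.0)

# Attributes alone leave the decision ambiguous.
print(match_score("John Smith", "Jon Smith", 50.0, False))
# Add a two-mile radius and a shared daughter, and it is close to certain.
print(match_score("John Smith", "Jon Smith", 1.5, True))
```

The point is not the particular numbers but the shape: attribute, geo and graph signals are independent evidence, and combining them is what pushes a borderline pair toward a confident merge without a data steward in the loop.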
The types of matches that we support using just Cassandra are matching by attributes, both fuzzy and exact, geo matching (as in this example) and graph matching. For incremental matching we use just Cassandra, but if you want to do very fast bulk matching across a whole tenant, then we use a combination of Cassandra and Spark.
One more interesting use case of Cassandra. Cassandra doesn't have a very powerful query language; it has some integrations, but they don't work perfectly well for us, because we need to transform our complex metadata into certain structures. For our use case we found that Elasticsearch is a better fit, and we do the transformation ourselves, so search via Elasticsearch works perfectly fine. Then we started to explore how we could use our cluster more efficiently.
What if we exclude document contents from Elasticsearch and use it only for indexing and returning the IDs of objects, and then use Cassandra to retrieve the whole document? This is what we call hybrid search, and we got some very interesting results. The first result was predictable: the size of the Elasticsearch index was cut in half.
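A minimal sketch of this split, with plain dicts standing in for Elasticsearch and Cassandra (all names here are illustrative): the index maps terms to object IDs only, and the full documents are fetched from the primary store by ID:

```python
index = {}   # stand-in for a search index that holds only IDs, no contents
store = {}   # stand-in for the primary store (Cassandra), keyed by object ID

def ingest(doc_id, doc):
    store[doc_id] = doc                             # full document -> store
    for token in doc["name"].lower().split():
        index.setdefault(token, set()).add(doc_id)  # only IDs -> index

def hybrid_search(term):
    ids = index.get(term.lower(), set())        # step 1: index returns IDs
    return [store[i] for i in sorted(ids)]      # step 2: fetch docs by ID

ingest("e1", {"name": "John Smith", "addresses": ["CA", "NY"]})
ingest("e2", {"name": "Jane Smith", "addresses": ["TX"]})
print(hybrid_search("smith"))  # both full documents, fetched from the store
```

Since the documents live only in the primary store, the index stays small and write-cheap, which is where the halved index size and faster indexing come from.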
We already have this information in Cassandra; that's why we removed it from the index, and it cut the cost of our cluster with zero loss of functionality. The second result is that Elasticsearch indexing performance doubled, which means that we can use the same Elasticsearch cluster longer without the need to scale up.
So, as I told you, we have a lot of interesting scenarios and use cases. We continue collecting data from multiple sources, blending it together, cleansing it with data providers, with cleanse mechanisms, with enrichment from social media. We manage relationships, entities, graphs; we do analytics on top of this; and we deliver insights and recommended actions. For all of this we are using Cassandra as our primary data store. Apart from Cassandra, we are heavily using Elasticsearch for indexing, and we use Spark for analytics such as segmentation, clustering and ranking.
We use Spark for machine learning and for bulk operations, and with Spark we are able to bring that data back to Cassandra. For the SQL interfaces that are very often used in the enterprise world, we use Amazon Redshift. Reltio allows you to simplify the architecture a lot, removing a lot of complex moving parts.
Questions? [Question about graph technologies.] So, we built our own graph technology because we needed that kind of control: to maintain the merging of graphs, to understand what information came from what sources. We started building the graph foundation for our solution four years ago, which is pretty much the same period when Titan appeared. Titan is a good tool for managing graphs, a good database.
So this is what we are using for tracking information about where all the wells are located. We track information from each well, from various pieces of equipment, like the Internet of Things, about what is happening within each well. Also, on a map, you can try to find possible locations for wells, and you can try to predict, from all the data that you have, where there could be potential breaks in the wells' equipment.
[Question:] For example, if they used Oracle Data Integrator to connect to Oracle or to SQL Server, how did they connect those databases?
We didn't have a lot of cases where Elasticsearch went down, because it is also a distributed, highly available component. If it does happen, then it is considered downtime on our side, and we have all the mechanisms to repair data from the moment when something happened, so we can repair all the information in Elasticsearch when it is up again.
Yes, we can do incremental reindexing. Now, another question was about the comparison with DSE, with the search that comes out of DSE. The problem is the complexity of our data structure: if you take all the information that I showed you in the column families and put it directly into Solr, it just doesn't make sense.
You need to combine it, you need to do a transformation on it, and we also bring data from multiple column families into one search document, which is not supported by DSE. This is just a very specific use case on our side; if all the data that you want available in the search index lives inside one column family, DSE is a perfect solution for that.