Apache Cassandra Cassandra Summit 2015, 14 Mar 2016

From YouTube: Wellaware: Modeling the IoT with TitanDB and Cassandra

Description

Speaker: Ted Wilmes, Senior Data Warehouse Engineer

The graph database, TitanDB, with Cassandra as its backing store, provides a powerful platform for modeling and extracting insights from the connected world of today's internet of things. This talk will briefly cover graph database basics and then dive into IoT specific use cases with a focus on data modeling and performance considerations.

A

B

Great. So I'm Ted Wilmes, and I work for a company called WellAware. We provide a SaaS solution for monitoring oil and gas wells, so if you have an oil or gas well, talk to me afterwards. If not, we're going to talk about IoT in general today, so this isn't specifically about what I do. We're going to first cover a little bit about the property graph model and a little bit about the Titan database, not too much. Then I'm going to really focus on modeling the Internet of Things in a graph database.

B

My main goal for that piece is to get you guys excited to go try this out on your own. I'm not saying this is the way to do it, but I want to get that imagination going. It's easy to install Titan, set it up, create your own little application, and try it out. Lastly, because time series data is a big part of IoT applications, why not try to store your time series data in the Titan database? Might as well

B

try it out. So I'm going to use that to discuss how Titan actually interfaces with Cassandra, because it is the Cassandra conference, and then also discuss one approach for storing time series data in Cassandra in a performant manner. Okay, so first of all, the property graph model. You may have heard of RDF graphs; there are also property graphs. First we have a vertex. The vertex has a label, which is basically just the typing information.

B

You can have an arbitrary number of key-value pair properties on there, the key simply being a string and the value being any type that your graph database can support. We add another vertex, and then we add in the other main component, which is an edge. As you probably already guessed, the edge has a direction: it goes from Ted to George. There's a label on there typing that edge, and again you can assign an arbitrary number of properties to that edge. So how do we query this?
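A minimal Python sketch of the property graph model just described: labeled vertices, directed labeled edges, and arbitrary key-value properties on both. The class names `Vertex` and `Edge` and the `since` property are illustrative assumptions, not TinkerPop or Titan types.

```python
from dataclasses import dataclass, field
from typing import Any, Dict


@dataclass
class Vertex:
    label: str                                    # typing information, e.g. "person"
    properties: Dict[str, Any] = field(default_factory=dict)


@dataclass
class Edge:
    label: str                                    # typing information for the edge
    out_vertex: Vertex                            # the edge starts here ...
    in_vertex: Vertex                             # ... and points here
    properties: Dict[str, Any] = field(default_factory=dict)


ted = Vertex("person", {"name": "Ted"})
george = Vertex("person", {"name": "George"})
# "since" is a made-up property, only here to show that edges carry properties too.
knows = Edge("knows", out_vertex=ted, in_vertex=george, properties={"since": 2010})

print(knows.out_vertex.properties["name"], "-", knows.label, "->",
      knows.in_vertex.properties["name"])
```

The direction lives in the edge itself (`out_vertex` to `in_vertex`), which is what lets a query walk it from either end.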

B

There is an Apache project called Apache TinkerPop that has defined a standard for folks and vendors to follow so that they can implement this standard graph query language called Gremlin. I won't go into too much detail here, but just to give you guys a basic idea of what it looks like, we'll do

B

a very simple query against that exact same model I just showed you. I'm going to say: take my graph, and out of all the vertices I'm particularly interested in the ones that are of type person, and I want to filter on the name Ted. As you'd imagine, that's going to give you your first vertex, Ted. Think of that as your first step: you're diving into the graph, and from that point you most likely want to use the edges to

B

traverse out and do interesting things. So we say out: I want to go out on the outbound edges that have a knows label, and then I want to return the name of all those vertices that are outbound from that vertex Ted.
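The traversal just described would look roughly like `g.V().hasLabel('person').has('name', 'Ted').out('knows').values('name')` in Gremlin. The Python sketch below mimics those steps over a toy in-memory graph; the data layout and the function name `knows_names` are assumptions made for illustration, not how a TinkerPop implementation evaluates a traversal.

```python
# Toy in-memory graph: vertices keyed by id, edges as (out_id, label, in_id) tuples.
vertices = {
    1: {"label": "person", "name": "Ted"},
    2: {"label": "person", "name": "George"},
}
edges = [(1, "knows", 2)]


def knows_names(start_name):
    """Filter by label and name, walk outbound 'knows' edges, return names."""
    # Step 1: out of all vertices, keep the persons with the requested name.
    starts = [vid for vid, v in vertices.items()
              if v["label"] == "person" and v["name"] == start_name]
    # Step 2: follow outbound edges labeled "knows" from those vertices.
    neighbours = [in_id for out_id, label, in_id in edges
                  if out_id in starts and label == "knows"]
    # Step 3: return the "name" property of every vertex reached.
    return [vertices[vid]["name"] for vid in neighbours]


print(knows_names("Ted"))
```

Each list comprehension corresponds to one step of the traversal, which is the mental model the talk suggests: find a starting point, then walk edges.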

B

So, real simple: there's just one, George. Not too exciting, but it's really straightforward to use and really straightforward to put together some very interesting sorts of queries. So, TitanDB. TitanDB is an open source graph database that has support for pluggable storage layers. Of course, we're going to talk about Cassandra

B

today. It was designed from the ground up for an OLTP-type workload, but now there are also really good options for doing more OLAP-type global graph computations with it. As I mentioned, it implements the TinkerPop 3 API in the latest Titan 1.0 version. And again, because it's the Cassandra conference: Cassandra is an excellent option for the storage layer. We'll get a bit into how that data is stored, and I'll explain that some more in a little bit. So, the Internet of Things.

B

Well, it's kind of a nebulous topic, but let's put some loose bounds around it and then discuss an example. First, of course, we have things. We probably have some component of time: maybe we're tracking these things through time, or we're getting sensor readings. But people come into play too. So we may have users that we're tracking in the system, organizations, places, all sorts of things.

B

So even though it's the Internet of Things, to really get meaning out of a lot of this data there are connections to these other things that we sometimes forget about. The graph database is great for capturing those relationships. As I mentioned, I'm in oil and gas, but you may or may not have interest in that.

B

So I decided to come up with a simple example: the Internet of Things in space. Let's say we're setting up an application to monitor a wide variety of things floating around in space: spaceships, rockets, satellites, all your usual space things. As we dive in, we have a rocket here, so let's look at what we could do in the graph. We start out with a rocket in the center there. The rocket is not an island.

B

It actually has all sorts of other pieces of information that you can hook up to it to gain context and draw meaning out of it. First, over here we have the world's smallest David Bowie, Major Tom. He pilots the rocket. As I mentioned, there's probably some organizational component, like who actually owns or operates this rocket, so Starfleet. Acme Rockets builds a particular type of rocket called the Delta booster, so from that rocket you can see there's an outbound edge. You'll notice a lot of times

B

you label those edges with a verb, so that rocket is model Delta booster. And then we have an engineer that maintains this rocket, Joyce. So you notice that with all these connections, as you build out your application, you can make use of them to do interesting analytics and grab other contextual information about these pieces of data, these things that are in the graph. OK, so what

B

if we look at the rocket itself? A lot of times a thing is really a system of systems, or maybe other things connected to each other, and maybe you're just thinking of it at a higher level. On the left, we have a really generic exploded diagram of a rocket. On the right, we have a simple subgraph showing what this rocket is actually made up of. So, depending on my role in this organization,

B

I may only care about this thing in the context of being a rocket. But if I'm actually Joyce the engineer and I want to do something with this rocket, I care about the lowest-level, nitty-gritty pieces. The graph database allows you to very easily put together these sorts of hierarchies. And also, most importantly: right here

B

everything is shown as a hierarchy, but a lot of times there are interconnections and dependencies between these different things that you can capture in the graph and exploit. A simple example would be if you have an alarming system and you're doing some sort of root cause analysis. If I'm an outsider, new to this system, and it has many parts to it, maybe I don't understand how everything is put together. You can have this model of that system in your graph database and run queries against it to better understand:

B

OK, what is the actual root cause of this problem, and what are the things that came into play? If we zoom in a little bit further, this is one of those components in the rocket: the guidance electronics. Notice we have our JVM up on the rocket, and unfortunately it looks like there's a garbage collection issue here. So we have the JVM, and we're moving down to an even lower level. Now we're finally down at that, say, sensor-type level, where we're actually gathering some sort of metrics.

B

So we're going to zoom in there and discuss an alarm scenario. This is just one idea of how you could possibly model alarms. Say you're interested in heap usage and you want to be notified when something goes wrong. We create an alarm vertex, and that points to the heap usage. That alarm could go off and notify any number of people, so we just have that Joyce node, or maybe some other engineer vertices, hook those up, and then add some arbitrary number of alarm conditions.

B

So it's a very flexible way to model these sorts of things. Now we bring in the organizational and personnel component, and we can tie this back. What if Joyce doesn't answer her space pager, or something like that? We can just fail over and say: I'll use the organizational structure that I've stored in this system to figure out that I need to report this up the chain.
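A minimal sketch of that failover idea, assuming the graph holds "alarm notifies person" and "person reports to person" edges. The names `Scotty` and `Kirk`, the dictionaries, and the `escalate` function are all hypothetical stand-ins for what would be graph traversals in a real application.

```python
# Hypothetical reporting chain: each person maps to their manager (None at the top).
reports_to = {"Joyce": "Scotty", "Scotty": "Kirk", "Kirk": None}

# Hypothetical "alarm notifies person" edges for the heap usage alarm.
alarm_notifies = {"heap_usage_alarm": ["Joyce"]}


def escalate(alarm, answered):
    """Return the first person who answers, walking up the chain on no answer."""
    for person in alarm_notifies[alarm]:
        current = person
        while current is not None:
            if answered(current):
                return current
            current = reports_to.get(current)    # fail over to the manager
    return None


# Joyce does not pick up, so the alarm is reported one level up.
print(escalate("heap_usage_alarm", answered=lambda person: person != "Joyce"))
```

Because the organizational structure is already in the same graph as the alarm, the escalation path needs no separate lookup table in a real deployment; the dictionaries here only stand in for those edges.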

B

So, in summary: it depends on your application, but in many cases, and at least specifically in our oil and gas use case, that thing can really be broken down into other parts, or the parts that you already have can be brought together with some sort of structure and relationships that tie them together. Depending on what you want to do, the higher the fidelity of the model that you're actually storing in your system, the more flexibility you'll have.

B

Of course, you could store the same sort of structure in a relational database, but the graph database, and specifically the Gremlin language, makes it that much easier to work with.

B

Okay, so we can't talk about the Internet of Things without some time series stuff, so now we'll get into the time series and performance information. We'll go back and build upon this example, specifically looking at how we would store this time series data in a highly performant manner in Titan. So, some really basic, nebulous time series requirements, just to kick us off: say we want to support a large volume of low-latency writes, and then we want to retrieve primarily the most recent data.

B

That's pretty straightforward. Here's a laundry list of things you would want to look at if you're doing performance tuning or optimization on your Titan system. Some of these are probably pretty obvious: Titan deployment topology and configuration, and all your usual Cassandra tuning tips and tricks. If you think about it, Titan is running on top of Cassandra, so optimizing your Cassandra setup is probably a good thing.

B

Titan JVM tuning: Titan runs in the JVM, so the usual JVM tuning matters there. Data modeling choices, and then indexing: Titan has its own built-in indexing and can use third-party indexers, so that makes a difference. Titan also has a number of different layers of caching that are important. Specifically, we're going to talk about deployment topology and data modeling. Okay, so deployment options. Just to beat the space thing into the ground, we have our AWS Mars North 1 region.

B

How could we deploy our Titan and Cassandra? Here we're looking at just an individual instance. On the same instance, we could do a local deployment: we have Cassandra running in one JVM and Titan running in another JVM, and they're communicating over a socket connection. That's one way to do it. We could run it embedded, so Titan could be running in the same JVM as Cassandra. Third option:

B

we could run Titan remote. You could have Titan running in a Docker container, say, and then you have your Cassandra cluster over here, and they're communicating that way.

B

So if you dive in and actually look at Titan itself: OK, graph database, how do I actually communicate with this thing? How am I going to make an application with it? The first option is that the Apache TinkerPop project has a Gremlin Server, which allows for remote access into your graph database. That could be with a JVM-based language, a REST-type interface, or another sort of driver for a different language. So that's one way to do it. Another option is to actually embed Titan within your application.

B

For example, we use Dropwizard, but you could do a Spring Boot setup, a Spring setup, whatever sort of framework you're comfortable with. In that case, your application and Titan are going to be running in the same JVM. Okay, zoom out a little bit here. What does that actually look like? Here's an option: you have two Dropwizard containers, you have your Cassandra cluster, and then, this is greatly simplified, some magic happens and your clients are hitting your APIs.

B

So that's just to give you a rough idea of a possible deployment strategy. Okay, so time series. Let's jump over real quick and look at this canonical time series example with CQL. I took this off the DataStax Academy website, and it's a very straightforward and kind of standard way that one could model time series, say storing sensor readings, using CQL. I bring this up to compare and contrast with how we're going to do that in Titan.

B

So now, if we look at how Titan actually stores data in Cassandra: Titan makes a number of different column families, but we'll be specifically looking at the edge store. Each partition is going to be a vertex, so you're going to have a vertex ID, and, I should say that Titan still uses Thrift, then you're going to have a series of properties that are associated with that vertex, each one in a separate column, and then your edges.

B

This is stored in an adjacency list format, so your edges will be stored with that vertex, and they're going to be stored on both sides. If you think about it, you have vertex A, you have vertex B, and you have an edge between them; that edge is actually going to be duplicated to both of those vertices in most cases. So why is Cassandra great for this? Well, one thing that Titan is also good at is doing edge queries, so filtering by edges.

B

Remember I said you can put properties on edges. Because of Cassandra's support for these wide rows and the ability to slice that row, you can actually do very efficient edge queries and narrow down to what you're looking for. I should also say, down on the bottom you can see how the edge is actually stored, but suffice it to say that properties on that edge are stored with the edge.
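The layout described above (one partition per vertex, property columns first, then edge columns, with each edge duplicated onto both endpoints) can be sketched in Python as follows. This is a toy model under assumptions: the dictionary layout, the use of a timestamp as the edge sort key, and the function names are illustrative and do not reflect Titan's actual serialization.

```python
import bisect
from collections import defaultdict

# One "partition" per vertex id. Each holds property columns plus a list of edge
# columns kept sorted by an assumed sort key (a timestamp on the edge).
partitions = defaultdict(lambda: {"properties": {}, "edges": []})


def add_vertex(vid, **props):
    partitions[vid]["properties"].update(props)


def add_edge(out_id, label, in_id, timestamp):
    # The edge is written twice: once on each endpoint's partition.
    bisect.insort(partitions[out_id]["edges"], (timestamp, "OUT", label, in_id))
    bisect.insort(partitions[in_id]["edges"], (timestamp, "IN", label, out_id))


def slice_edges(vid, start_ts, end_ts):
    """Return edges on one partition with start_ts <= timestamp < end_ts."""
    cols = partitions[vid]["edges"]
    lo = bisect.bisect_left(cols, (start_ts,))
    hi = bisect.bisect_left(cols, (end_ts,))
    return cols[lo:hi]


add_vertex("A", name="Ted")
add_vertex("B", name="George")
add_edge("A", "knows", "B", timestamp=10)
add_edge("A", "knows", "B", timestamp=20)
print(slice_edges("A", 0, 15))
print(slice_edges("B", 0, 15))
```

Because the edge columns are sorted, a range of edges is one contiguous slice of a single partition, which is the property of wide rows the talk is pointing at.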

B

That's the important thing to remember, along with the fact that the edges, using the Titan schema system, can be set up so that you're ordering by one or more of those properties. Okay, so how could we actually store time series data in here? Well, one thing that we could do is, say, on the top we have our sensor, or maybe our metric that we're pulling data back from, so we have heap usage. Then I'm going to break this data up. We know that Cassandra can handle wide rows, but not infinitely wide rows.

B

So we need to break this up somehow, and I've created this notion of chunks. Think of them as buckets; it's the same idea. We have some time range that's going to go in each chunk. Then you could have these observation vertices that you hang off the chunks. Those observation vertices will have the timestamp and the value, or whatever other information you'd like to store with each of those observations.
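A minimal sketch of the chunking (bucketing) step: mapping a timestamp to the chunk whose time range contains it. The one-day chunk width and the function names are assumptions for illustration; the talk does not say what width was used.

```python
CHUNK_SECONDS = 24 * 60 * 60    # assumed chunk width: one day


def chunk_start(timestamp, width=CHUNK_SECONDS):
    """Return the start of the chunk (bucket) that a Unix timestamp falls into."""
    return timestamp - (timestamp % width)


def chunk_key(series_id, timestamp, width=CHUNK_SECONDS):
    """Identify a chunk by its series and its start time."""
    return (series_id, chunk_start(timestamp, width))


print(chunk_key("heap_usage", 90000))
```

Choosing the width is the real design decision: it bounds how wide any one partition can get, at the cost of more chunks to create and look up.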

B

So that looks pretty good. You could imagine maybe doing a year/month/day hierarchy, storing roll-ups in there, and things like that. You could do something along those lines. So that's good, but each one of those vertices, as I mentioned, is a separate partition. So say you're pulling back a thousand different observations.

B

Well, in that case, if the data is down on the observation vertex, you're actually going to have to retrieve a thousand different observation vertices, each in its own partition, so you can imagine that quickly gets you into trouble and doesn't scale very well. So what could we do about that? One thing that we can do is move all those observation properties up to the edge. Maybe we still leave them on the observation, but we also copy them up to the edge.

B

Then what we can do is have Titan just perform an edge query and retrieve those edges in a similar manner to performing that query with, say, CQL, where it's just looking at a single partition or maybe a few partitions, two or three depending on how much data you're retrieving. It can do that slice query and pull them back.
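The difference between the two layouts can be put as rough arithmetic on partitions touched per read. This is a simplified cost model under stated assumptions (the requested range starts on a chunk boundary and chunks are filled evenly); the function names are made up for illustration.

```python
def partitions_read_vertex_model(num_observations):
    # One chunk partition to find the edges, then one partition per observation vertex.
    return 1 + num_observations


def partitions_read_edge_model(num_observations, observations_per_chunk):
    # Only chunk partitions are read; values are sliced straight off the edges.
    # Ceiling division: a partially used chunk still costs one partition read.
    return -(-num_observations // observations_per_chunk)


print(partitions_read_vertex_model(1000))
print(partitions_read_edge_model(1000, 500))
```

With a thousand observations spread over two chunks, the edge layout touches two partitions where the vertex layout touches on the order of a thousand.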

B

The other thing to throw on top of it is that you could even skip hanging that observation off the end of that edge, and instead point that edge back to the chunk, because potentially you really just care about the data that's stored on the edge itself. That's good in some cases, but in other cases you may want to exploit the fact that you can tie things together in the graph. Maybe you want to go back and actually associate something else with that observation,

B

do some sort of tagging, something along those lines. So it's going to depend on exactly what your use case is. If you just need to store that raw sensor reading and you're not going to do anything with it afterwards other than read it back as it is, then maybe you can just store it on the edge. What that actually looks like, if you look at the partition, is we have our vertex ID.

B

Like I mentioned, you have the properties that are associated with that vertex, and then the columns are simply your individual observations. So this looks fairly similar to how it would be stored in the CQL case. There are differences, but it's exploiting the same fact that you're doing, maybe not always, sequential I/O

B

to retrieve that on your slice query. You may have to go across multiple SSTables, but it's better than hopping around and getting all those disparate vertices. Really simple Gremlin examples here. Gremlin also works really well for querying this sort of data. In our simple example, let's just say we have one chunk we're dealing with. I'm going to go out the out edges, outE instead of out, so I'm just interested in the edges.

B

I want the specific one with that timestamp, or I could do a between-type query. Or if I want the most recent observation before now, I can go outE, has timestamp before, say, now, then specify an order, and then limit by one. So Titan, and behind the scenes Cassandra, is just going to go and get that single record back. Now that works really well.
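The three query shapes just mentioned (exact timestamp, a between range, and most recent before now) can be sketched in Python over a list of observation edges kept sorted by timestamp. The data and function names are illustrative assumptions; in Gremlin these would be `outE` steps with `has`, `order`, and `limit`, evaluated by Titan as slices of the chunk's partition.

```python
import bisect

# Observation edges on one chunk, kept sorted by timestamp: (timestamp, value).
observations = [(100, 0.41), (200, 0.47), (300, 0.52), (400, 0.90)]
timestamps = [ts for ts, _ in observations]


def at(ts):
    """Exact match on a single timestamp."""
    i = bisect.bisect_left(timestamps, ts)
    if i < len(timestamps) and timestamps[i] == ts:
        return observations[i]
    return None


def between(start, end):
    """Range query, start inclusive and end exclusive."""
    lo = bisect.bisect_left(timestamps, start)
    hi = bisect.bisect_left(timestamps, end)
    return observations[lo:hi]


def most_recent_before(now):
    """Latest observation strictly before `now` (order descending, limit 1)."""
    i = bisect.bisect_left(timestamps, now)
    return observations[i - 1] if i > 0 else None


print(at(200))
print(between(200, 400))
print(most_recent_before(350))
```

All three are binary searches into sorted data followed by a short contiguous read, which is why the sort order on the edges matters so much for this model.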

B

But of course you could wrap this up in your own time series API, and that's how we use it: as something that specifically looks nice for dealing with time series. Okay, so pros and cons. Pros: and this can probably get into religious debates, but this allows for a single, unified view of your IoT data. You actually have everything from that first part of the talk sitting in there with your time series data. Admittedly, that could be a good or a bad thing, so I'm not going to pass judgment on that.

B

Gremlin works well for processing streams of time series data. Cons: of course, the storage format is potentially not going to be quite as compact, because Titan is putting some other metadata in there with your information. There are some extra properties on the partitions, and things like that. So it may not be quite as compact, although they do some very efficient serialization of that data themselves, so it may not be a huge difference. But that could potentially affect performance.

B

Lastly, and this is the one that actually requires some managing, some overhead on the developer side, at least when you're making this time series type of library: there's this extra overhead of managing these chunks. It's not like in CQL, where you could just say insert this new point.

B

If that happens to need to go in a new partition or something, it just happens for you. In this case, especially if you want this to operate in an environment where you have a ton of different threads potentially writing to the same time series, you need something to actually go in and create that chunk vertex ahead of time.

B

Okay, so now I'll briefly talk about how Titan actually communicates with Cassandra and what some things are that we can tweak to get high performance out of it, specifically for this time series model, but it's also applicable to other cases.

B

So let's look at a really simple query here. I'm saying g.V(4); you can look up a vertex by an ID, so I just magically knew that ID, 4, for one of those chunk vertices. I'm going to say out the hasChunk edge, and I just want to get the start times for that chunk. So you can just see them there. So, a pretty simple query. First of all, what happens when you say get vertex by ID? And this is true whether Titan is running embedded, local, or remote.

B

The difference is going to be the magnitude of that latency, the communication time between Titan and Cassandra. So, first of all, Titan says: does this vertex exist? It looks it up in Cassandra, and Cassandra says: yep, that's there. So now it has the vertex loaded into Titan's transaction cache. Titan maintains its own transaction cache.

B

So, a brief aside: if you are curious and want to nerd out on this stuff like I enjoy doing, one good way to very easily look at this is to use a profiler that has socket tracing. It makes it really trivial to see what the communication patterns look like, and you can even dive in, get a nice stack trace, and see where I/O is being started from in your application.

B

OK, so now we want to retrieve the properties for this vertex, sensor type and units. So it goes out. Remember, when it retrieved the vertex before, it didn't actually get any properties at the same time. So it retrieves the properties and gets back two properties. Now those properties are loaded in the transaction cache, so if you use that same vertex somewhere else in that same transaction, Titan is going to hit its transaction cache. It won't go out again and re-retrieve it.

B

So that's good, but you can get an idea here of how, depending on the number of vertices and properties that you're dealing with, and also depending on your deployment model, you could run into trouble. These add up to two round trips going back and forth to retrieve this information. So what if we add in an outbound query? We're going to go out that hasChunk edge; this is like that first query. Here I've collapsed each round trip into just one line, but we say: does this vertex exist?

B

It gets the edges. Then it gets the first chunk's properties, so that's the first vertex, and it gets the second chunk's properties, so that's the second vertex. So what can we do?

B

Here's the first thing that you can do, the configurable thing. With Titan 1.0 they've added this option, query.batch = true. If you go into the Titan configuration and turn this on, it's going to batch up Titan's requests to Cassandra. In my trivial little query, where I'm only starting at one vertex, you're not really seeing the benefit of the batching of the edge queries.

B

But what you are seeing is that the last retrieval, of the second vertex's properties, got collapsed into the previous request. It was able to go out, grab that, and pull it back, so that makes a difference right there. Now, what else can we do? Well, if you want to get extreme, and I'll put a warning on this just because you want to test and make sure that this works right for your situation, say you're very certain that the vertex with ID number 4 exists.

B

You'd better be very certain about it. It doesn't matter so much on the read side if it doesn't really exist. But if you set storage.batch-loading = true, that's going to turn off Titan's internal check to see whether, when you give it a vertex ID, that vertex actually exists in Cassandra. What happens then is we're down to basically the minimum number of queries back and forth with Cassandra: we get the edges, and then we get the chunk properties. So, two round trips.
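The round-trip counts walked through above can be captured in a toy cost model. This is a sketch of the behaviour as described in the talk, not Titan's actual request logic; the function and parameter names are made up, and the assumption that each chunk's properties cost one request without batching is taken from the two-chunk example.

```python
def read_round_trips(num_chunks, query_batch=False, batch_loading=False):
    """Count round trips for 'vertex by id -> out edges -> chunk properties'."""
    trips = 0
    if not batch_loading:
        trips += 1                     # "does this vertex exist?" check
    trips += 1                         # fetch the outbound edges
    if num_chunks > 0:
        # Property lookups: one request per chunk, or a single batched request.
        trips += 1 if query_batch else num_chunks
    return trips


print(read_round_trips(2))
print(read_round_trips(2, query_batch=True))
print(read_round_trips(2, query_batch=True, batch_loading=True))
```

The gap between the unbatched and batched cases grows with the number of chunks, and each round trip costs more the further Titan sits from Cassandra, which is why the deployment topology matters alongside these settings.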

B

Okay, so that's on the read side. Let's look at optimizing on the write side. One thing that we ran into when we were putting together this time series model was some issues with insert performance. Here's a really simple insert. Remember, I'm doing this sort of odd thing where I really just care about that edge. So I'm saying chunk, add edge, hasObservation (that's the label), and I'm pointing it back to itself, for better or for worse, adding in the timestamp property and then just a double value.

B

Okay, so what does this look like? You could probably guess: with these settings off, it checks whether the vertex exists, then it writes a new edge. Notice what happened here: we wanted to do a write, but we also introduced a read. You can imagine what happens as you scale this up, potentially, depending on your load. As I mentioned, there's a transaction cache, so if you're writing to the same chunk over and over again in the same transaction, say batching up your commits,

B

it doesn't make a huge difference. But in most cases you're probably going to be writing across many different things, many different sensors or data collection devices. Okay, so what can we do? If we set storage.batch-loading = true, then we get rid of that vertex-exists query and we end up with just this new write of the edge. So your actual write is just writing to Cassandra.

B

I felt like there needed to be some requisite flame graph in the presentation, so here's a flame graph with storage.batch-loading = false. So that's with that vertex check in there. This is just a quick and dirty test

B

I did over the past week just to get some numbers and examples. That big chunk, I know you can't really read it, but that big chunk on the right is actually all those calls to get vertices. So it's spending a significant amount of the time of its insert actually doing that read to Cassandra. If we set storage.batch-loading = true, that goes away. On the left, you can see that chunk is where the commits are happening. All that stuff on the right isn't I/O anymore;

B

that's just other things that are going on. So again, quick and dirty performance numbers. This is not supposed to represent any sort of definitive insert rate you're going to get out of the system, but for the sake of giving you guys some rough order of magnitude on an untuned system: I set up a cluster on AWS with nine m3.2xlarge nodes, Cassandra 2.2, replication factor 3, writing at QUORUM, and didn't do anything else with them. I set up one other node that was the client

B

writing time series data, a one hundred percent write workload, into this cluster: 10 write threads, committing, I think, 50 of these observations at a time across a hundred thousand different series. You could think of that as a hundred thousand different sensors that you're writing data to. What happens at the beginning is, as I mentioned, the one downside is that you need to manage these chunks that you're writing to. And by manage,

B

in this case, what I mean is: when you get a new point in, you need to figure out what chunk to put it under, or, in CQL terms, what partition it's going to go into. In this quick test, we do some caching. We retrieve that chunk once, and then we have a very simple lookup on our application side to say, give me the chunk back, so I don't have to go to Cassandra and look it up again

B

every time I do a write. That's why it ramps up, and then we level out at a little bit under 90,000 writes per second. But again, take this with a grain of salt. I just figured I'd put something up there.

B

Okay, so in summary: Titan is a graph database on its own, but it's of course important to understand the implications of how it uses Cassandra to get the most out of Titan. One thing that we learned on the write path is that, depending on your needs, it could be good to remove some of those reads from your write path. Again, I put a warning up there. These are things that you want to try out on your own setups and see what happens. It's not just a blanket thou shalt go

B

do this. One other thing I didn't mention: if you're writing large numbers of vertices, Titan has its own scheme for coming up with vertex IDs, and sometimes allocating these vertex IDs can be a more expensive operation. A node gets an allotment of vertex IDs, and when it runs out it needs to get another set, which can take some time. So here are two more parameters that you probably want to look at: ids.block-size and ids.renew-percentage.

B

Those are some things that you can tune to ensure that the wait time when that happens isn't too high. On the read side, I can definitely say query.batch = true is probably going to be helpful in a lot of cases. It sort of depends on what sort of traversals you're running, but it's probably worthwhile enabling that. And then lastly, of course, global and vertex-centric indices.

B

We didn't really talk about the vertex-centric indices too much, but on that time series data, the vertex-centric index is what allows Titan to order that time series data in either timestamp ascending or descending order. So you have the power to do that.
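The settings named in this talk would normally live in a Titan properties file. The sketch below collects them in Python and renders them as key=value lines. The key names are as I understand them from the Titan 1.0 configuration reference; every value, including the backend name and host, is an illustrative assumption rather than a recommendation, and `storage.batch-loading` in particular should be tested before use, as the speaker warns.

```python
# Settings mentioned in the talk. Values are illustrative assumptions only.
titan_settings = {
    "storage.backend": "cassandrathrift",    # assumed backend name for a Thrift connection
    "storage.hostname": "127.0.0.1",         # placeholder host
    "query.batch": "true",                   # batch requests to the storage backend on reads
    "storage.batch-loading": "true",         # skips the vertex-exists check, among other checks
    "ids.block-size": "100000",              # vertex ids reserved per allocation
    "ids.renew-percentage": "0.3",           # renew when this fraction of a block remains
}


def render_properties(settings):
    """Render a dict as key=value lines, the format of a Java .properties file."""
    return "\n".join(f"{key}={value}" for key, value in settings.items())


print(render_properties(titan_settings))
```

Keeping the settings in one place makes it easier to benchmark them one at a time, which fits the advice above to try each change on your own setup rather than apply it blindly.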

B

Okay, so that's it. I'd like to thank you all for coming. What sort of questions do you guys have?

A

B

Yeah, so say you grab the edge. The edge has the ID of the vertex that it's connecting to, so Titan will grab that ID and then do another query to look that partition up and retrieve it.

C

Anyone else have any questions?

D

E

Did you run into any sort of size limitations in Titan as you explored different relationships and different vertices? Like

B

Different, like a vertex, right?

E

B

Something like that? No, we haven't so far. I know that there are graphs out there running in the tens of billions or more. Ours isn't that size right now.

F

B

Yeah, we haven't run into one as far as edge count.

G

B

Specifically, if you think about a non-time-series example, like if you're modeling a social network and you have somebody who's very popular, what we would call a supernode, with lots of edges going into them,

B

Titan is going to have potentially the same sort of limit as the underlying layer, you know, how wide can that row be in Cassandra. Okay.

E

B

E

If I can follow up with another question: I can have very rich metadata relationships modeled in Titan and then persist its key-value pair properties in Cassandra?

B

So Titan is actually storing its data in Cassandra, so it's.

E

Its own model it is storing. Interesting.

B

Exactly, yeah. Titan doesn't have its own native storage; it's using Cassandra for that. So

E

How would I associate instance data that can extend into hundreds or thousands of rows for that vertex on a

A

E

Say that engine or sensor has a vertex and then it has thousands of rows. How would I put that in Cassandra, under that column family table or column?

B

So you could either follow a model like I demonstrated here with Titan and actually store those as edges.

G

B

Or the other alternative, of course, would be, say, if you wanted to use Cassandra as your time series database. In that case you could store that time series data totally separately, and a lot of people will do this, and then, of course, you maintain a reference between the two, but it's not a hard reference.

G

If you don't mind, can you go over how Titan stores data in Cassandra once again?

A

B

This one? Yeah, okay. So here, this example is a CQL schema for storing the sensor data. And again, I'm kind of falling back on Thrift terms here, but you have your partition in Cassandra, and it's going to be bucketed out by date.
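The date bucketing can be sketched in a few lines of Python. This is a toy in-memory model under stated assumptions, not the schema from the slide: the (sensor id, calendar day) partition key, the UTC day boundary, and the function names are all hypothetical choices made for illustration.

```python
from collections import defaultdict
from datetime import datetime, timezone


def partition_key(sensor_id, ts):
    """Bucket by sensor and calendar day (UTC) so no partition grows unbounded."""
    return (sensor_id, ts.strftime("%Y-%m-%d"))


table = defaultdict(list)  # partition key -> rows of (timestamp, value)


def write(sensor_id, ts, value):
    table[partition_key(sensor_id, ts)].append((ts, value))


def read_day(sensor_id, day):
    # Sort on read to mimic rows clustered by timestamp inside a partition.
    return sorted(table.get((sensor_id, day), []))


write("pump-1", datetime(2015, 9, 22, 23, 59, tzinfo=timezone.utc), 1.0)
write("pump-1", datetime(2015, 9, 23, 0, 1, tzinfo=timezone.utc), 2.0)
write("pump-1", datetime(2015, 9, 23, 0, 0, tzinfo=timezone.utc), 1.5)

day_rows = read_day("pump-1", "2015-09-23")
```

A query for one day touches exactly one partition; a longer time range would fan out over one partition per day.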

B

So now, if we flip over here, what's going to happen is we have our vertex IDs, so think of that as the partition key. And then we have some arbitrary number of columns, cells, whatever you'd like to call them. What Titan's going to do is actually serialize those key-value pairs out, using its own serialization, into the columns that are on that partition. So we're going to start.

B

We have properties there, so if the name is actually on the vertex, that's going to be first. Then, following that, possibly ordered by some key that you've set up in Titan to be a vertex-centric index, you're going to have your edges.
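Here is a small Python sketch of that column ordering. It only models the ordering idea (properties first, then edges sorted by a sort key); Titan's real serialization is a compact binary format, and the tuple-based column_key below is a hypothetical simplification.

```python
def column_key(kind, name, sort_key=0):
    # kind 0 = property, kind 1 = edge; tuple ordering puts properties first.
    return (kind, name, sort_key)


def build_row(properties, edges):
    """Return one vertex's columns in the order a sorted wide row would hold them."""
    columns = [(column_key(0, k), v) for k, v in properties.items()]
    columns += [
        (column_key(1, label, sort_key), target_id)
        for label, sort_key, target_id in edges
    ]
    return sorted(columns, key=lambda col: col[0])


row = build_row(
    properties={"name": "sensor-7"},
    edges=[
        ("reading", 30, 1003),  # (edge label, timestamp sort key, adjacent vertex ID)
        ("reading", 10, 1001),
        ("reading", 20, 1002),
    ],
)
column_order = [key for key, _ in row]
```

Because the edge columns sit next to each other in sort-key order, reading a time window becomes a contiguous slice of the row.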

B

Exactly. And in this right here, they technically could be going in or out, and you could walk that from either direction. In my little query examples I always said out, but you could say in, depending on the direction. And then for the edges themselves, that's the basic format that it's packing that information into.

B

You could go either way, yeah. It's just going to depend on what makes more sense with the sort of performance requirements you have.

B

Yeah, so Titan has some sort of schema definition, but it's not schema definition in the context of setting up, say, a table and saying, you know, table person has these three columns. In Titan, what you do is you can say, for this particular property, take that name property I had on that vertex, I'm going to say name is always a string. So you'll define that in the Titan type system. You won't say a person has a name, a date of birth, you know, whatever it may be.

B

So say you come down the road and you add some new property, say address. You go in and you define what the type of this address property is, but then you can go and add that to whatever vertex you would like, and so what's going to happen, ultimately, is that data will be stored as a property along with that vertex.

B

Yeah, that's a good question. So the other thing that you can do while you're creating that type is actually set up indexing on it, and there are two basic types of indices with the graph database.
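The per-key typing described a moment ago can be modeled with a short Python sketch. ToySchema and ToyVertex are hypothetical classes, not Titan's management API; the point is only that a type is declared once per property key and any vertex, whatever its label, can then carry that key.

```python
class ToySchema:
    """Types are declared per property key, not per vertex label."""

    def __init__(self):
        self.key_types = {}

    def make_property_key(self, key, data_type):
        self.key_types[key] = data_type

    def check(self, key, value):
        expected = self.key_types.get(key)
        if expected is None:
            raise KeyError(f"undefined property key: {key}")
        if not isinstance(value, expected):
            raise TypeError(f"{key} must be {expected.__name__}")


class ToyVertex:
    def __init__(self, schema, label):
        self.schema = schema
        self.label = label
        self.properties = {}

    def set(self, key, value):
        self.schema.check(key, value)
        self.properties[key] = value


schema = ToySchema()
schema.make_property_key("name", str)

person = ToyVertex(schema, "person")
person.set("name", "Ted")

# Later, a new key is defined once and can be attached to any vertex.
schema.make_property_key("address", str)
person.set("address", "123 Main St")
well = ToyVertex(schema, "well")
well.set("address", "County Road 12")
```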

B

One is the global index. Like, if I'm looking at all the vertices across the whole graph, you remember my query, it started out with I want all the people named Ted. So it's from that level that I want to dive into the graph and just grab those. With that you actually have two different options. You can use Titan's internal indexing system on that property. In that case it's quick and efficient, but it only supports equality-type checks.
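A hash-style lookup shows why an index like that is fast but limited to equality. The following Python sketch uses a hypothetical EqualityIndex class; it is a conceptual model and makes no claim about how Titan's internal index is built.

```python
from collections import defaultdict


class EqualityIndex:
    """Toy global index: exact-match lookups on one property key."""

    def __init__(self, key):
        self.key = key
        self._by_value = defaultdict(set)

    def add(self, vertex_id, properties):
        if self.key in properties:
            self._by_value[properties[self.key]].add(vertex_id)

    def lookup(self, value):
        # A hash lookup answers "name == value" but not prefixes or ranges.
        return set(self._by_value.get(value, set()))


name_index = EqualityIndex("name")
name_index.add(1, {"name": "Ted"})
name_index.add(2, {"name": "George"})
name_index.add(3, {"name": "Ted"})

teds = name_index.lookup("Ted")
nobody = name_index.lookup("Te")
```

The prefix "Te" finds nothing, which is the kind of query where a full-text search backend earns its keep.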

B

Now, if you want something that supports a more robust search query language, you could use something like Solr or Elasticsearch, and so when that property gets added in, it'll be indexed either by that Titan indexing system or by the third-party indexing system that you've set up. Yeah, so if you're in a situation like that, like in the time series one, you can kind of think, okay, well, I can put some sort of arbitrary time range boundaries on this, or something like that.

B

So, oh yeah, he was wondering about the situation I mentioned. We were talking about the super node, say, in a social network type situation.

F

I have a question, actually three. Regarding the Titan setup: when you are installing it, does it have to be a standalone machine, or can it be a part of that Cassandra cluster?

B

Sorry, can you say that one more time?

F

Titan, can it be on a distributed machine, or can it only run on a standalone machine?

B

It can actually run on the same instance, it can run...

F

On a different.

B

One. You could have a one-to-one match between Titan and your Cassandra nodes, or you could have fewer Titan nodes. So there's a lot of flexibility there, depending on how you're using it. It doesn't have to be set up directly on the box; it can be off, as I mentioned. Of course, because it has its own communication pattern with Cassandra, there could be implications of increased latency when you move it off the box.

A

B

Having said that, depending on what your application is doing, there could be garbage collection contention type issues if it's running in the same JVM. You may need to tune that more, so there are trade-offs with the different approaches.

F

So what do you suggest, like one machine or multiple?

B

It really totally depends.

A

I've ever heard yeah.

B

C

Yeah, so my question was basically regarding the super node issues, like what kind of modeling technique would you be using in Titan to kind of not reach that situation?

B

So say you reach, I don't know, a million or two million edges or something like that hanging off of a vertex. If you're starting to run into issues, or think you're going to run into issues, you're going to have to split that vertex somehow. I think they may go over this tomorrow; there's a Titan 1.0 presentation tomorrow.

B

But one thing that Titan supports, and I didn't cover it at all here, is vertex and edge partitioning. What that means is Titan can actually partition vertices and/or edges across your cluster, so that may be one option. If you check out the Titan docs, they have a lot of really good documentation. I should say, in general, the TinkerPop docs are really good too, with lots of good examples. But you can read more about that partitioning there.

B

The other option would be if you somehow, in your modeling strategy, come up with some at least semi-logical way to say, well, I'm going to split this out into these two separate vertices, and then handle that in your application logic.
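One way to picture that application-level split is the Python sketch below. Everything in it is an assumption made for illustration: the hash-based sharding rule, the "base#n" naming of sub-vertices, and the shard count of four are hypothetical, and a real model might split on something more meaningful, such as time range or region.

```python
import zlib


def shard_vertex_id(base_id, neighbor_id, shard_count):
    """Pick which sub-vertex of a split super node holds the edge to neighbor_id."""
    # crc32 is stable across runs, unlike Python's built-in hash() for strings.
    shard = zlib.crc32(str(neighbor_id).encode("utf-8")) % shard_count
    return f"{base_id}#{shard}"


def all_shards(base_id, shard_count):
    # Readers must fan out over every sub-vertex to see all edges.
    return [f"{base_id}#{i}" for i in range(shard_count)]


edges = {}  # sub-vertex id -> list of neighbor ids
for follower in range(1000):
    key = shard_vertex_id("celebrity", follower, shard_count=4)
    edges.setdefault(key, []).append(follower)

total = sum(len(edges.get(s, [])) for s in all_shards("celebrity", 4))
```

The trade-off is that writes touch one small sub-vertex, while any read that needs every edge has to visit all of them and merge the results in the application.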

D

So Titan would be taking data from Cassandra and duplicating it?

B

Actually, all of the data stays in Cassandra. So if you think about MySQL, you could choose to use, say, InnoDB or another storage layer, but MySQL would be running the query optimizer, things like that. It's somewhat similar to that, in that the MySQL query engine is not actually storing any data, but InnoDB is. So this is a similar case.

B

Titan is maintaining caches, it's doing query optimization, it's implementing this graph abstraction, and it's taking care of all the serialization and deserialization to Cassandra, but all that data is living inside of whatever storage you chose to put behind Titan. So Titan itself sits on top of that underlying data store.

A

If you can go back a couple more slides, to the different things that connected to it. Yeah, there. Uh huh, okay. So is that the whole list? Like, can you use, like, PHP and umpteen other things?

B

Ways into it? I don't think there's a PHP one. I would not consider this list exhaustive or authoritative. That was me on a wild Google image search thing, so I don't think there's a PHP one. Oh, there is one? Sweet. There is a PHP one, yeah. I didn't mention them here, but there are also object-to-graph mappers, so similar to an ORM, but it's an OGM, and there's a number of those out there for different languages. Java and Python have a few. So that's another option.

B

If you prefer that.

B

C

It's just a question on kind of how Titan is interacting with Cassandra. Are you saying it's using the Thrift protocol to kind of talk to it? Yeah, and my understanding is Thrift is being deprecated.

C

So have you seen any issues with the interaction between Titan and Cassandra? Because it's 1.0, and I'm assuming... is there a version coming out that would not use Thrift by default?

B

I think there had been some work on a CQL adapter, but I'm not sure where that's at. We haven't had issues upgrading Cassandra behind Titan so far, but no, I mean, that's a very good point. I'm not quite sure what's going to happen with that in the future. But thank you.

F

F