From YouTube: Wellaware: Modeling the IoT with TitanDB and Cassandra
Description
Speaker: Ted Wilmes, Senior Data Warehouse Engineer
The graph database, TitanDB, with Cassandra as its backing store, provides a powerful platform for modeling and extracting insights from the connected world of today's internet of things. This talk will briefly cover graph database basics and then dive into IoT specific use cases with a focus on data modeling and performance considerations.
A
B
Great. So I'm Ted Wilmes, and I work for a company called WellAware. We provide a SaaS solution for monitoring oil and gas wells. So if you have an oil and gas well, talk to me afterwards. If not, we're going to talk about IoT in general today, so this isn't specifically about what I do. We're going to first cover a little bit about the property graph model and a little bit about the Titan database, not too much. Then I'm going to really focus on modeling the Internet of Things in a graph database.
B
My main goal for that piece is to get you guys excited to go try this out on your own. I'm not saying this is the way to do it, but I want to get that imagination going. It's easy to install Titan, set it up, create your own little application, and try it out. Lastly, because time series data is a big part of IoT applications, why not try to store your time series data in the Titan database? Might as well
B
try it out. So I'm going to use that to discuss how Titan actually interfaces with Cassandra, because it is the Cassandra conference, and then also discuss one approach for storing time series data in Cassandra in a performant manner. Okay, so first of all, the property graph model. You may have heard of RDF graphs; there are also property graphs. First we have a vertex. The vertex has a label, which is basically just the typing information.
B
You can have an arbitrary number of key-value pair properties on there, the key simply being a string and the value being any type that your graph database can support. We add another vertex, and then we add in the other main component, which is an edge. As you probably already guessed, the edge has a direction: it goes from Ted to George. There's a label on there typing that edge, and then again you can assign an arbitrary number of properties to that edge.
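The property graph model described above can be sketched in a few lines of Python. This is only an illustration of the concepts (labeled vertices, directed labeled edges, key-value properties on both), not Titan's or TinkerPop's API. The edge label `knows` and the `since` property are assumptions, since the slide's actual labels are not in the transcript.

```python
from dataclasses import dataclass, field
from typing import Any, Dict


@dataclass
class Vertex:
    label: str                                    # typing information, e.g. "person"
    properties: Dict[str, Any] = field(default_factory=dict)


@dataclass
class Edge:
    label: str                                    # types the edge
    out_vertex: Vertex                            # where the edge starts
    in_vertex: Vertex                             # where the edge points
    properties: Dict[str, Any] = field(default_factory=dict)


ted = Vertex("person", {"name": "Ted"})
george = Vertex("person", {"name": "George"})

# Hypothetical label and property, chosen only for illustration.
knows = Edge("knows", ted, george, {"since": 2010})
```

The direction lives in the edge itself (`out_vertex` to `in_vertex`), so the same pair of vertices could also be connected by a second edge pointing the other way.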
So how do we query this?
B
There is an Apache project called Apache TinkerPop that has defined a standard for folks and vendors to follow, so that they can implement this standard graph query language called Gremlin. I won't go into too much detail here, but just to give you guys a basic idea of what it looks like, we'll do
B
a very simple query against that exact same model I just showed you. I'm going to say: take my graph, and out of all the vertices I'm particularly interested in the ones that are of type person, and I want to filter on the name Ted. As you'd imagine, that's going to give you your first vertex, Ted. Think of that as your first step: you're diving into the graph. Then from that point you most likely want to use that graph, use the edges.
B
So, real simple: there's just one George. Not too exciting, but it's really straightforward to use and really straightforward to put together some very interesting sorts of queries.
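The two steps of that query (filter vertices by type and name, then follow an edge) can be mimicked in plain Python. This is a sketch of the idea only, not Gremlin itself; the helper names `has` and `out` loosely echo Gremlin step names, and the `knows` edge label is an assumption because the slide is not part of the transcript.

```python
# Hypothetical in-memory graph: vertices are dicts, edges are (out_id, label, in_id).
vertices = {
    1: {"label": "person", "name": "Ted"},
    2: {"label": "person", "name": "George"},
}
edges = [(1, "knows", 2)]  # the "knows" label is an assumption


def has(vertex_ids, key, value):
    """Keep only the vertices whose property `key` equals `value`."""
    return [v for v in vertex_ids if vertices[v].get(key) == value]


def out(vertex_ids, edge_label):
    """Follow outgoing edges with the given label to the neighbouring vertices."""
    return [dst for v in vertex_ids
            for (src, lbl, dst) in edges
            if src == v and lbl == edge_label]


# Step 1: dive into the graph at the person named Ted.
start = has(has(list(vertices), "label", "person"), "name", "Ted")
# Step 2: walk the outgoing edges from there.
neighbours = out(start, "knows")
names = [vertices[v]["name"] for v in neighbours]
print(names)  # ['George']
```

Each step takes a list of vertices and returns a list of vertices, which is what lets steps be chained into longer traversals.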
So, TitanDB. TitanDB is an open source graph database that has support for pluggable storage layers. Of course, we're going to talk about Cassandra
B
today. It was designed from the ground up for an OLTP-type workload, but now there are also really good options for doing more OLAP-type global graph computations with it. As I mentioned, it implements the TinkerPop 3 API in the latest Titan 1.0 version. And again, because it's the Cassandra conference: Cassandra is an excellent option, of course, for the storage layer. We'll get a bit into how that data is stored, and I'll explain that some more in a little bit. So, the Internet of Things.
B
Well, it's kind of a nebulous topic, but let's put some loose bounds around it and then discuss an example. First, of course, we have things. We probably have some component of time: maybe we're tracking these things through time, or we're getting sensor readings. But people come into play too. So we may have users that we're tracking in the system, organizations, places, all sorts of things.
B
So, even though it's an Internet of Things, to really get meaning out of a lot of this data there are connections to these other things that we sometimes forget about. The graph database is great for capturing those relationships. As I mentioned, I'm in oil and gas, but you may or may not have interest in that.
B
So I decided to come up with a simple example: the Internet of Things in space. Let's say we're setting up an application to monitor a wide variety of things floating around in space: spaceships, rockets, satellites, all your usual space things. As we dive in, we have a rocket here, so let's look at what we could do in the graph. Let's start out here: we have a rocket in the center there. The rocket is not an island.
B
It actually has all sorts of other pieces of information that you can hook up to it to gain context and draw meaning out of it. First, over here we have the world's smallest David Bowie, Major Tom. He pilots the rocket right there. As I mentioned, there's probably some organizational component, like who actually owns or operates this rocket, so Starfleet. Acme Rockets builds a particular type of rocket called the Delta booster, so from that rocket you can see there's an outbound edge. You'll notice a lot of times
B
you label those edges with a verb, so that rocket "is model" Delta booster. And then we have an engineer that maintains this rocket, Joyce. So you notice that with all these connections, as you build out your application, you can make use of them to do interesting analytics and grab other contextual information about these pieces of data, these things that are in the graph. Okay, so what
B
if we look at the rocket itself? A lot of times a thing is really a system of systems, or maybe other things connected to each other, and maybe you're just thinking of it at a higher level. On the left, we just have a really generic exploded diagram of a rocket. On the right, we have a simple subgraph showing what this rocket is actually made up of. So, depending on my role in this organization,
B
I may only care about this thing in the context of being a rocket. But if I'm actually Joyce the engineer and I want to do something with this rocket, I care about the lowest-level, nitty-gritty pieces. The graph database allows you to very easily put together these sorts of hierarchies. And also, most importantly: right here
B
everything is shown as a hierarchy, but a lot of times there are interconnections and dependencies between these different things that you can capture in this graph and exploit. A simple example would be if you have an alarming system and you're doing some sort of root cause analysis. If I'm an outsider, new to this system, and it has many parts to it, maybe I don't understand how everything is put together. You can have something like this, this model of that system, in your graph database and run queries against it to better understand:
B
okay, what is the actual root cause of this problem? What are the things that came into play? If we zoom in a little bit further, this is one of those components in the rocket. These are the guidance electronics. Notice we have our JVM up on the rocket, and unfortunately it looks like there's a garbage collection issue here. So we have the JVM, and we're moving down to an even lower level. Now we're finally down at that, say, sensor-type level, where we're actually gathering some sort of metrics.
B
So we're going to zoom in there and discuss an alarm scenario. This is just one idea of how you could possibly model alarms. Say you're interested in heap usage and you want to be notified when something goes wrong. We create an alarm vertex, and that points to the heap usage. That alarm could go off and say, I'm going to notify any number of people. So we just have that Joyce node, or maybe some other engineer vertices, hook those up, and then some arbitrary number of alarm conditions.
B
So it's a very flexible way to model these sorts of things. Now we bring in the organizational and the other personnel components, and we can tie this back. What if Joyce doesn't answer her space pager, or something like that? We can just fail over and say, I'll use the organizational structure that I've stored in this system to figure out who I need to report this to up the chain.
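The failover idea can be sketched as a walk over the same graph: start at the alarm, follow the notification edges, and if nobody answers, follow the organizational edges upward. Everything here apart from Joyce and the heap usage metric is an assumption made for illustration: the edge labels `monitors`, `notifies` and `reports_to`, and the people `Maria` and `Admiral`.

```python
# Hypothetical edges as (out_vertex, label, in_vertex).
edges = [
    ("heap_alarm", "monitors", "heap_usage"),
    ("heap_alarm", "notifies", "Joyce"),
    ("Joyce", "reports_to", "Maria"),
    ("Maria", "reports_to", "Admiral"),
]


def out(vertex, label):
    """Vertices reachable from `vertex` over outgoing edges with `label`."""
    return [dst for (src, lbl, dst) in edges if src == vertex and lbl == label]


def escalation_chain(alarm):
    """People to contact in order: direct contacts first, then up the chain."""
    chain = []
    frontier = out(alarm, "notifies")
    while frontier:
        next_frontier = []
        for person in frontier:
            if person not in chain:          # guard against cycles
                chain.append(person)
                next_frontier.extend(out(person, "reports_to"))
        frontier = next_frontier
    return chain


def first_responder(alarm, answers):
    """Return the first person in the chain for whom `answers(person)` is true."""
    for person in escalation_chain(alarm):
        if answers(person):
            return person
    return None
```

For example, `first_responder("heap_alarm", lambda p: p != "Joyce")` models Joyce not picking up and returns her manager in this made-up organization.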
B
So, in summary, what a thing is can depend on your application. But in many cases, and at least specifically in our oil and gas use case, a lot of times that thing can really be broken down into other parts, or potentially the parts that you already have can be brought together with some sort of structure and relationships that tie them together. So depending on what you want to do, the higher the fidelity of the model that you're actually storing in your system, the more flexibility you'll have.
B
Okay, so you can't talk about the Internet of Things without some time series stuff, so now we get into the time series and performance information. We'll go back and build upon this example, specifically looking at how we would model and store this time series data in a highly performant manner in Titan. Okay, so here are some really basic, nebulous time series requirements, but at least enough to kick us off. Say we want to support a large volume of low-latency writes, and then we want to retrieve primarily the most recent data.
B
That's pretty straightforward. So here's kind of a laundry list of things you would want to look at if you're doing performance tuning or optimization on your Titan system. Some of these are probably pretty obvious: Titan deployment topology and configuration, and all your usual Cassandra tuning tips and tricks. If you think about it, Titan is running on top of Cassandra, so optimizing your Cassandra setup is probably a good thing.
B
Titan JVM tuning: Titan actually runs in the JVM, so the usual JVM tuning matters there. Data modeling choices. And then indexing: Titan has its own built-in indexing and can use third-party indexers, so that makes a difference. Titan also has a number of different layers of caching that are important. Specifically, we're going to talk about deployment topology and data modeling. Okay, so deployment options. Just to beat the space thing into the ground, we have our AWS Mars North 1 region.
B
How could we deploy our Titan and Cassandra? Here we're looking at just an individual instance. On the same instance, we could do a local deployment: we have Cassandra running in one JVM and Titan running in another JVM, and they're communicating over a socket connection. Okay, so that's one way to do it. We could run it embedded, so Titan could be running in the same JVM as Cassandra. Third option:
B
B
So if you dive in and actually look at Titan itself: okay, graph database, how do I actually communicate with this thing? How am I going to make an application with it? The first option is that the Apache TinkerPop project has a Gremlin Server. This allows for remote access into your graph database, which could be with a JVM-based language, a REST-type interface, or another sort of driver for a different language. So that's one way to do it. Another option is to actually embed Titan within your application.
B
So, for example, we use Dropwizard, but you could do a Spring Boot setup, a Spring setup, whatever sort of framework you're comfortable with. In that case, your application and Titan are going to be running in the same JVM. Okay, zoom out a little bit here. What does that actually look like? Here's an option: you have two Dropwizard containers and you have your Cassandra cluster. This is greatly simplified, but then some magic happens, and your clients are hitting your APIs.
B
So that's just to give you a rough idea of a possible deployment strategy. Okay, so time series. Let's jump over real quick and look at this canonical time series example with CQL. I took this off the DataStax Academy website. Here's a very straightforward and kind of standard way that one could model time series, say storing sensor readings, using CQL. I bring this up to compare and contrast with how we're going to do that in Titan.
B
So now, if we look at how Titan actually stores data in Cassandra: Titan makes a number of different column families, but we'll be specifically looking at the edge store. Each partition is going to be a vertex, so you're going to have a vertex ID. And I should say that Titan still uses Thrift. Then you're going to have a series of properties that are associated with that vertex, each one in a separate column, and then your edges.
B
This is stored in an adjacency list format, so your edges will be stored with that vertex, and they're going to be stored on both sides of that walk. If you think about it, you have vertex A, you have vertex B, and you have an edge between them. That edge is actually going to be duplicated to both of those vertices in most cases. So why is Cassandra great for this? Well, one thing that Titan is awesome at is doing edge queries, so filtering by edges.
B
That's the important thing to remember, along with the fact that the edges, using the Titan schema system, can be set up so that you're ordering by a particular one of those properties, or more than one.
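To make the adjacency list idea concrete, here is a toy model in Python: one row per vertex ID, with every edge written under both of its endpoints and each row kept ordered by a chosen edge property. This is only a conceptual sketch. It is not Titan's actual column encoding, and the `connects` label and the sort values are made up.

```python
import bisect
from collections import defaultdict

# One "row" (partition) per vertex ID. Each row holds edge entries kept sorted.
# An entry is (label, sort_value, direction, other_vertex_id).
rows = defaultdict(list)


def add_edge(out_id, label, in_id, sort_value):
    """Store the edge once under each endpoint, as an adjacency list would."""
    bisect.insort(rows[out_id], (label, sort_value, "OUT", in_id))
    bisect.insort(rows[in_id], (label, sort_value, "IN", out_id))


# Hypothetical edges; inserted out of order on purpose.
add_edge("A", "connects", "B", 20)
add_edge("A", "connects", "C", 10)

print(rows["A"])  # [('connects', 10, 'OUT', 'C'), ('connects', 20, 'OUT', 'B')]
print(rows["B"])  # [('connects', 20, 'IN', 'A')]
```

Because the entries for one label sit next to each other in sort order, reading "all `connects` edges with a sort value in some range" touches one contiguous stretch of a single row.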
Okay, so how could we actually store time series data in here? Well, one thing that we could do: on the top we have our sensor, maybe, that we're pulling data back from, our metric, so we have heap usage. Then I'm going to break this data up. We know that Cassandra can handle wide rows, but not infinitely wide rows.
B
So we need to break this up somehow, and I've just created this notion of chunks. Think of them as buckets; it's the same idea. Some time range is going to go in each chunk. Then you could have these observation vertices that you hang off the chunks. Those observation vertices will have the timestamp and the value, or whatever other sort of information you'd like to store with each of those observations.
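The chunk (bucket) idea boils down to rounding each timestamp down to the start of its time range. A minimal sketch follows, assuming epoch-second timestamps and a one-hour chunk width; the talk does not say what width was actually used, so `CHUNK_SECONDS` is an assumption.

```python
CHUNK_SECONDS = 3600  # assumed chunk width: one hour of data per chunk


def chunk_start(timestamp: int, width: int = CHUNK_SECONDS) -> int:
    """Round an epoch-seconds timestamp down to the start of its chunk."""
    return timestamp - (timestamp % width)


def group_by_chunk(observations):
    """Group (timestamp, value) pairs under the start time of their chunk."""
    chunks = {}
    for ts, value in observations:
        chunks.setdefault(chunk_start(ts), []).append((ts, value))
    return chunks


# Made-up heap usage readings.
readings = [(7200, 0.41), (7260, 0.43), (10800, 0.58)]
chunks = group_by_chunk(readings)
print(sorted(chunks))  # [7200, 10800]
```

Picking the width is a trade-off: wider chunks mean fewer chunk vertices to manage, narrower chunks keep each row from growing too wide.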
B
So that looks pretty good. You could imagine maybe doing a year, month, day hierarchy, storing roll-ups in there, and things like that. You could do something along those lines. So that's good, but each one of those vertices, as I mentioned, is a separate partition. So say you're pulling back a thousand different observations.
B
Well, in that case, if the data is down on the observation vertex, you're actually going to have to retrieve a thousand different observation vertices, each from its own partition. You can imagine that quickly gets you into trouble and doesn't scale very well. So what could we do about that? Okay, so one thing that we can do is move all those observation properties up to the edge. Maybe we still leave them on the observation, but we also move them up to the edge.
B
So then what we can do is have Titan just perform an edge query and retrieve those edges, in a similar manner to how you'd perform that query with, say, CQL, where it's just looking at a single partition or maybe a few partitions, two or three depending on how much data you're retrieving. It can do that slice query and pull them back.
B
The other thing to throw on top of it is that you could even skip hanging that observation off the end of that edge, and instead you could point that edge back to the chunk, because potentially you really just care about the data that's stored on the edge itself. Now, that's good in some cases, but in other cases you may want to exploit the fact that you can tie things together in the graph. Maybe you want to go back and actually associate something else with that observation.
B
Do some sort of tagging, something along those lines. So it's going to depend on exactly what your use case is. If you just need to store that raw sensor reading and you're not going to do anything with it afterwards other than read it back as it is, then maybe you can just store it on the edge. What that actually looks like then, if you look at the partition, is we have our vertex ID.
B
Like I mentioned, you have your properties that are associated with that vertex, and then the columns are simply your individual observations. So this looks fairly similar to how it would be stored in the CQL case. There are differences, but it's exploiting the same fact, that you're doing, maybe not always, sequential I/O
B
to retrieve that on your slice query. You may have to go across multiple SSTables, but it's better than hopping around and getting all those disparate vertices. Really simple Gremlin examples here. Gremlin also actually works really well for querying this sort of data. In our simple example, let's just say we have one chunk we're dealing with. I'm going to go out the out edge, outE instead of out, because I'm just interested in the edges.
B
I want the specific one with that timestamp, or I could do a between-type query. Or if I want the most recent observation before now, I can go outE, has timestamp up to, say, now, then specify an order, and then limit by one. Titan, and behind the scenes Cassandra, is just going to go and get that single record back. Now, that works really well.
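The three query shapes mentioned here (exact timestamp, between, and most recent before now) all reduce to lookups in a list that is already sorted by timestamp, which is why they are cheap when the edges of a chunk are stored in timestamp order. The sketch below uses plain Python and `bisect`; it is not Gremlin and not how Titan executes the query, and the sample readings are made up.

```python
import bisect

# Edge entries of one chunk, kept sorted by timestamp: (timestamp, value).
chunk_edges = [(100, 0.40), (160, 0.42), (220, 0.47), (280, 0.51)]
timestamps = [ts for ts, _ in chunk_edges]


def exact(ts):
    """The observation at exactly `ts`, or None."""
    i = bisect.bisect_left(timestamps, ts)
    if i < len(timestamps) and timestamps[i] == ts:
        return chunk_edges[i]
    return None


def between(start, end):
    """Observations with start <= timestamp < end, as one contiguous slice."""
    lo = bisect.bisect_left(timestamps, start)
    hi = bisect.bisect_left(timestamps, end)
    return chunk_edges[lo:hi]


def latest_before(now):
    """Most recent observation with timestamp < now (order descending, limit 1)."""
    i = bisect.bisect_left(timestamps, now)
    return chunk_edges[i - 1] if i > 0 else None
```

Wrapping functions like these behind a small time series API is one way to hide the chunk and edge details from callers.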
B
But of course you could wrap this up in your own time series API, and that's kind of how we use it: something that specifically looks nice for dealing with time series. Okay, so pros and cons. Pros: and this can probably get into religious debates, but this allows for a single, unified view of your IoT data. You actually have all of that from the first part of the talk sitting in there with your time series data. Admittedly, that could be a good or a bad thing, so I'm not going to pass judgment on that.
B
Gremlin works well for processing streams of time series data. Cons: of course, the storage format is potentially not going to be quite as compact, because Titan is putting some other metadata in there with your information. There are some extra properties on the partitions and things like that. So it may not be quite as compact, although they do some very efficient serialization of that data themselves, so it may not be a huge difference. But that could potentially affect performance.
B
Lastly, and this is the one that actually requires some overhead on the developer side, at least when you're making this time series type library: there's this extra overhead of managing these chunks. It's not like in CQL, where you could just say, insert this new point. Okay.
B
So let's look at a really simple query here. I'm saying g.V(4): you can look up a vertex by an ID, and I just magically knew that ID, 4, for one of those chunk vertices. I'm going to say out the "has chunk" edge, and I just want to get the start times for that chunk. You can see them there. So, a pretty simple query. First of all, what happens when you say get vertex by ID? And this is true whether Titan is running embedded, local, or remote.
B
The difference is going to be the magnitude of that latency, the communication time between Titan and Cassandra. First of all, Titan says: does this vertex exist? It looks it up in Cassandra, and Cassandra says: yep, that's there. So now it has the vertex loaded into Titan's transaction cache. Titan maintains its own transaction cache.
B
Okay, so now we want to retrieve the properties for this vertex, sensor type and units. So it goes out. Remember, I said it just retrieved the vertex before; it didn't actually get any properties at the same time. So it retrieves the properties and gets back two properties. Now those properties are loaded into the transaction cache. If you use that same vertex somewhere else in that same transaction, Titan is going to hit its transaction cache; it won't go out again and re-retrieve it.
B
So that's good, but you can get an idea here of how, depending on the number of vertices and properties that you're dealing with, and also depending on your deployment model, you could run into trouble. These add up to basically two round trips going back and forth to retrieve this information.
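The round-trip accounting above can be reproduced with a small simulation: a fake store that counts every call, and a transaction object that remembers what it has already fetched. This is a conceptual sketch only and does not reflect Titan's internals. The class names and the property names `sensorType` and `units` are assumptions.

```python
class FakeStore:
    """Stands in for the storage backend and counts round trips."""

    def __init__(self, data):
        self.data = data              # vertex_id -> properties dict
        self.round_trips = 0

    def exists(self, vertex_id):
        self.round_trips += 1
        return vertex_id in self.data

    def properties(self, vertex_id):
        self.round_trips += 1
        return dict(self.data[vertex_id])


class Transaction:
    """Caches vertices and their properties for the life of one transaction."""

    def __init__(self, store):
        self.store = store
        self.known = set()            # vertex IDs already confirmed to exist
        self.props = {}               # vertex_id -> cached properties

    def vertex(self, vertex_id):
        if vertex_id not in self.known:
            if not self.store.exists(vertex_id):     # round trip 1
                raise KeyError(vertex_id)
            self.known.add(vertex_id)
        return vertex_id

    def values(self, vertex_id):
        self.vertex(vertex_id)
        if vertex_id not in self.props:
            self.props[vertex_id] = self.store.properties(vertex_id)  # round trip 2
        return self.props[vertex_id]


store = FakeStore({4: {"sensorType": "heap", "units": "MB"}})
tx = Transaction(store)

tx.values(4)
first = store.round_trips    # existence check plus property fetch
tx.values(4)
second = store.round_trips   # unchanged: served from the transaction cache
print(first, second)         # 2 2
```

Multiply those two round trips by the number of distinct vertices a traversal touches and by the network latency of the chosen deployment, and it becomes clear why the topology matters.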
So what if we add in an outbound query? We're going to go out that "has chunk" edge. This is like that first query. Here I've collapsed each round trip into just one line, but we say: does this vertex exist?
B
B
But what you are seeing here, you'll notice, is that the last retrieval of that second vertex's properties got collapsed into the previous request. It was able to just go out, grab that, and pull it back, so that makes a difference right there. Now, what else can we do? Well, if you want to get extreme here, and I'll put a warning on this, because you want to test it out and make sure that it works right for your situation: say you're very certain that the vertex with ID number 4 exists.
B
You'd better be very certain about it. It doesn't matter so much on the read side if it doesn't really exist. But if you set storage.batch-loading to true, that's going to turn off Titan's internal check of whether, when you give it a vertex ID, that vertex actually exists in Cassandra. What happens then, as you can see, is that we're down to basically the minimum number of queries back and forth with Cassandra: we get the edges, and then we get the chunk properties. So, two round trips.
B
Okay, so that's on the read side. Let's look at optimizing on the write side. One thing that we ran into when we were putting together this time series model was some issues with insert performance. Here's just a really simple insert. Remember, I'm doing this sort of odd thing where I really just care about that edge. So I'm saying chunk.addEdge with "has observation" as the label, pointing it back to itself, for better or for worse, adding in the timestamp property and then just a double value.
B
Okay, so what does this look like? You could probably guess. With these settings off: does the vertex exist? Then it writes a new edge. Notice what happened here: we wanted to do a write, but we also introduced a read. You can imagine that, as you scale this up, that's potentially a problem, depending on your load. As I mentioned, there's a transaction cache, so if you're writing to the same chunk over and over again in the same transaction, say batching up your commits,
B
it doesn't make a huge difference. But in most cases you're probably going to be writing across many different things, many different sensors or data collection devices. Okay, so what can we do? Well, if we set storage.batch-loading to true, then we get rid of that vertex-exists query and we end up with just this new write-edge call. What happens there is that your actual write is just writing to Cassandra.
B
B
I did over the past week, just to get some numbers and examples. That big chunk, and I know you can't really read that, but that big chunk on the right is actually all those calls to get vertices, so it's spending a significant amount of its insert time actually doing that read to Cassandra. If we set storage.batch-loading to true, that goes away. On the left, you can see that chunk is actually where the commits are happening. So all that stuff on the right isn't actually I/O anymore.
B
That's just other things that are going on. So again, quick and dirty performance numbers. This is not supposed to represent any sort of definitive statement of what insert rate you're going to get out of the system, but for the sake of giving you guys some sort of rough order of magnitude on an untuned system: I set up a cluster on AWS with nine m3.2xlarge nodes, Cassandra 2.2, replication factor 3, writing at quorum. I didn't do anything else with them. I set up one other node that was the client
B
that was writing time series data, a one hundred percent write workload, into this cluster: 10 write threads, committing, I think, 50 of these observations at a time across a hundred thousand different series. You could think of that as a hundred thousand different sensors or something that you're writing data for. What happens at the beginning is, as I mentioned, the one downside: you need to manage these chunks that you're writing to. And by manage,
B
in this case, what I mean is: when you get a new point in, you need to figure out what chunk you're going to put it under, or in CQL terms, what partition it's going to go into. In this quick test, we do some caching. We retrieve that chunk once, and then we have a very simple lookup that we do on our application side to say, give me the chunk back, so I don't actually have to go to Cassandra and look it up again.
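An application-side chunk lookup of that kind could look roughly like the following. This is a guess at the shape of such a cache, not the code used in the test described in the talk. The `ChunkCache` class, the one-hour chunk width, and the `slow_lookup` function (standing in for a query against the graph) are all hypothetical.

```python
CHUNK_SECONDS = 3600  # assumed chunk width


class ChunkCache:
    """Remembers which chunk a (series, time range) maps to."""

    def __init__(self, lookup):
        self.lookup = lookup      # callable(series_id, chunk_start) -> chunk id
        self.cache = {}

    def chunk_for(self, series_id, timestamp):
        start = timestamp - (timestamp % CHUNK_SECONDS)
        key = (series_id, start)
        if key not in self.cache:
            # Only the first point in a time range pays for the backend lookup.
            self.cache[key] = self.lookup(series_id, start)
        return self.cache[key]


calls = []


def slow_lookup(series_id, start):
    """Pretend backend query; records each call so the savings are visible."""
    calls.append((series_id, start))
    return f"chunk:{series_id}:{start}"


cache = ChunkCache(slow_lookup)
a = cache.chunk_for("sensor-1", 7205)
b = cache.chunk_for("sensor-1", 7300)    # same hour: cache hit
c = cache.chunk_for("sensor-1", 10801)   # next hour: one more lookup
print(len(calls))  # 2
```

A real implementation would also need to bound the cache size and create the chunk when the lookup finds nothing, which this sketch leaves out.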
B
B
Okay, so in summary: Titan is a graph database on its own, but it's of course important to understand the implications of how it uses Cassandra to help get the most out of Titan. One thing that we learned on the write path is that, depending on your needs, it could be good to remove some of those reads from your write path. Again, I put a warning up there. These are things that you want to try out on your own setups and see what happens. So it's not just a blanket thou shalt go
B
do this. One other thing I didn't mention: if you're writing large numbers of vertices, Titan has its own scheme for coming up with vertex IDs, and sometimes allocating these vertex IDs can be a more expensive operation. A node gets an allotment of vertex IDs, and when it runs out, it needs to get another set. That can take some time. So here are two more parameters that you probably want to look at: ids.block-size and ids.renew-percentage.
B
Those are some things that you can tune to ensure that the wait time, when that happens, isn't too high.
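The general idea behind a block size and a renew threshold can be illustrated with a toy allocator: IDs are handed out from a locally held block, and the next block is requested once the remaining share of the current block drops below a threshold, so that a writer rarely has to wait at the moment the block runs dry. This is a simplified, synchronous sketch of the concept, not Titan's implementation; the class names and the `renew_fraction` parameter are assumptions.

```python
class BlockAuthority:
    """Stands in for the shared, comparatively expensive ID-block allocation."""

    def __init__(self, block_size):
        self.block_size = block_size
        self.next_start = 0
        self.requests = 0

    def next_block(self):
        self.requests += 1
        start = self.next_start
        self.next_start += self.block_size
        return start, start + self.block_size   # half-open range [start, end)


class IdPool:
    """Hands out IDs locally and fetches the next block ahead of time."""

    def __init__(self, authority, renew_fraction):
        self.authority = authority
        self.renew_fraction = renew_fraction
        self.current, self.end = authority.next_block()
        self.size = self.end - self.current
        self.standby = None

    def next_id(self):
        if self.current == self.end:                 # current block exhausted
            if self.standby is None:                 # would be a stall in practice
                self.standby = self.authority.next_block()
            self.current, self.end = self.standby
            self.standby = None
        remaining = self.end - self.current
        if self.standby is None and remaining <= self.size * self.renew_fraction:
            self.standby = self.authority.next_block()   # renew early
        value = self.current
        self.current += 1
        return value


authority = BlockAuthority(block_size=10)
pool = IdPool(authority, renew_fraction=0.5)
ids = [pool.next_id() for _ in range(25)]
print(authority.requests)  # 3
```

A larger block means fewer trips to the shared allocator; an earlier renewal leaves more time for the next block to arrive before the current one is used up.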
On the read side, I can definitely say that query.batch = true is probably going to be helpful in a lot of cases. It sort of depends on what sort of traversals you're running, but it's probably worthwhile enabling that. And then lastly, of course, global and vertex-centric indices.
B
A
B
D
B
F
G
E
B
E
B
B
G
B
A
B
So now, if we flip over here, what's going to happen is we have our vertex IDs; think of that as the partition key. Then we have some arbitrary number of columns, cells, whatever you'd like to call them. What Titan is going to do is actually serialize those key-value pairs out, using its own serialization, into the columns that are on that partition. So we're going to start
B
B
Exactly. And right here, they technically could be going in or out, and you could walk that from either direction. In my little query examples I always said out, but you could say in, depending on the direction. And then the edges themselves: that's the basic format that it's packing that information into for the edges.
B
B
Yeah, so Titan has some sort of schema definition, but it's not schema definition in the sense of setting up, say, a table and saying table person has these three columns. In Titan, what you do is say, for a particular property, take that name property I had on that vertex: I'm going to say name is always a string. So you'll define that in the Titan type system. You won't say a person has a name, a date of birth.
B
B
One is the global index. If I'm looking at all the vertices across the whole graph, and you remember my query started out with, I want all the people named Ted, then it's from that level that I want to dive into the graph and just grab those. With that you actually have two different options. You can use Titan's internal indexing system on that property. In that case, it's quick and efficient, but it only supports equality-type checks.
B
Now, if you want something that supports a more robust search-type query language, you could use something like Solr or Elasticsearch. When that property gets added, it'll be indexed either by the Titan indexing system or by the third-party indexing system that you've set up. Yeah. So if you're in a situation like that, like in the time series one, you can kind of think, okay, well, I can put some sort of arbitrary time range boundaries on this, or something like that.
B
F
B
One, you could have a one-to-one match between Titan and your Cassandra nodes, or you could have fewer Titan nodes. So there's a lot of flexibility there, depending on how you're using it. It can be set up directly on the box or off it. As I mentioned, of course, because it has its own communication pattern with Cassandra, there could be implications of increased latency when you move it off the box. Having
A
B
C
B
So say you reach, I don't know, a million or two million edges or something like that hanging off of a vertex. If you're starting to run into issues, or think you're going to run into issues, you're going to have to split that vertex somehow. One thing that I think they may go over tomorrow, there's a Titan 1.0 presentation tomorrow, but one thing that Titan supports, and I didn't cover it at
B
all here, is vertex and edge partitioning. What that means is Titan can actually partition vertices and/or edges across your cluster, so that may be one option. If you check out the Titan docs, they have a lot of really good documentation. I should say, in general the TinkerPop docs are really good too, with lots of good examples. But you can read more about that partitioning there.
B
B
Actually, all of the data stays in Cassandra. If you think about MySQL, you could choose to use, say, InnoDB or another storage layer, but MySQL would be running the query optimizer and things like that. It's somewhat similar to that, in that the MySQL query engine is not actually storing any data, but InnoDB is. This is a similar case: Titan is maintaining caches and it's doing query optimization.
A
If you can go back a couple more slides, the different things that connected to it. Yeah, there. Okay, so is that the whole list? Can you use, like, PHP and umpteen other ways
B
into it. I don't think there's a PHP one. I would not consider this list exhaustive or authoritative; that was me on a wild Google image search thing. So I don't think there's a PHP one. No? There is one? Sweet, there is a PHP one. Yeah, there are also, I didn't mention them here, but there are also object-to-graph mappers, so similar to an ORM, but it's an OGM. There are a number of those out there for different languages. Java and Python have a few. So that's another option.