Description
Speaker: Brian O'Neill (Lead Architect, Health Market Science)
Cassandra provides facilities to integrate with Hadoop. This is sufficient for distributed batch processing, but doesn't address CEP distributed processing. This webinar will demonstrate the use of Cassandra in Storm. Storm provides a data flow and processing layer that can be used to integrate Cassandra with other external persistence mechanisms (e.g. Elasticsearch) or calculate dimensional counts for reporting and dashboards. We'll dive into a sample Storm topology that reads and writes from Cassandra using storm-cassandra bolts.
A
Welcome everyone to this week's edition of our college credit webinar series. Today we are discussing CEP distributed processing on Cassandra with Storm. This is a technology mix that we see more and more of in the community, and I am very excited to say that today we have two speakers with us. We have Brian O'Neill, who is the lead architect at Health Market Science and an MVP for Apache Cassandra, very well known in the Cassandra community. Also joining him today we have Taylor Goetz. Taylor is an expert in Storm, has been in the Storm community since it was open sourced, and has been leading the charge around Cassandra and Storm integration. A couple of housekeeping items: we will take Q&A at the end of this session, so if you have any questions for Taylor and Brian, please use the Q&A tab inside of WebEx and put your questions in there, and I will ask them at the end. Also, we are recording today's session.
B
No problem. Alright, so I'm very excited to do this presentation. The last webinar we did was "Create your first Java application"; with this one we're taking it up a notch. If you're familiar with Cassandra and you're familiar with the CRUD operations on it, this takes it to the next level: once you've got your data in Cassandra, what can you do with it, and what kind of analytics can you perform?
B
How do you integrate it with other systems in your enterprise? So without further ado, we're going to get started. The quick agenda: I'm going to go through the use case, talk about what complex event processing is and what you would use it for, and sort of get motivated to deploy Storm. Then I'm going to hand it over to Taylor, and he's going to do a little bit of background on Storm.
B
He'll cover how to deploy it and what the cluster looks like, go through some code, and do a demo. Then, if we have time, we're going to come back and talk about new things out in the Storm community like Trident, which is a higher-level API that we've recently adopted; we're migrating all of our topologies to Trident. So first, our use case.
B
So when we take a look at it, like I was saying, there are thousands of feeds and those schemas can change over time. It's quite a bit of data, but to us the major motivating factor for selecting Cassandra was the variety of data. So what we do is take those thousands of feeds, dump them into Cassandra, run all sorts of analytics and things, and then produce a master file that we deliver to our clients, or provide web services integrations for them.
B
So if we have Cassandra in place, what didn't that cover? Well, you want to be able to search that data.
B
So here's the agenda again: it's just the use case, then I hand it over to Taylor, and then we come back for a little Trident if we have time. And here's the doctor data that we've got. The industry that we're in is called master data management, for all those feeds, and like I said, we deployed Cassandra to handle the variety of the data and a bit of the volume, but that didn't cover all of our requirements.
B
So now that we're here, we need to supplement Cassandra with a few other things. Specifically, we need to be able to search unstructured data, so fuzzy matching on addresses, for example, and geospatial kinds of queries. We also need real-time analytics and reporting: the data that was in Cassandra was great, but we needed dimensional counts and aggregate counts. How many doctors of this specialty are in this area that have sanctions against them, for example? And then also transactional processing, since we have a web front-end.
B
About two years ago we looked at the different search-capable engines that were out there, and I think everybody knows the two that are most prominently deployed are Solr and Elasticsearch. Solr is great; we actually chose that first, because about two years ago we didn't think Elasticsearch was mature enough. We've since changed that opinion, so Elasticsearch is one of our integration points now. It tends to scale like Cassandra does: it handles replication and distribution underneath the hood, rather than Solr, where you had to do it manually.
B
Then web services: we chose Dropwizard for that one. And then we also wanted to do reports, and for most of the reporting mechanisms — although DataStax has got some good integration with JasperSoft — most reporting people still go off of relational databases. So if we take this as our use case, let's walk through a couple of things. This is what we wanted to get to, but we're going to go through a couple of things that we did wrong before we got here, and the first was Hadoop.
B
It's not a knock on Hadoop — we love Hadoop — but its focus is a lot around batch processing, and when we looked at our problem, we wanted to be able to reflect changes as they happen, transactionally, and we couldn't do that well: just spinning up a Hadoop job took longer than we wanted the change to take to be reflected in the user experience, so that wasn't going to work. And then, in order to effectively kick off those Hadoop jobs, you also needed to track what changed in the system.
B
So, what we did wrong, part 2: we moved to an AOP triggers approach. We liked that it took all the burden off of the clients. We actually used an AOP extension that we open sourced, called cassandra-triggers, that would watch data as it changed in Cassandra and update some wide rows. That was great — that worked really well — but then we realized, as we integrated more and more systems and wanted more and more wide rows...
B
...that became a huge burden on the writes in Cassandra. In addition to that, guaranteeing the execution of those triggers — making sure you had guaranteed processing of those triggers — was extra overhead. So we turned towards complex event processing. Just to give a bit of background and throw a definition on the slide: complex event processing is a matter of treating the events in your system like streams, and then discovering different things from those streams of information.
B
So if we take our use case and frame it as a complex event processing problem, the events in our system are the CRUD operations, as they either happen in the system or are about to happen in the system. If you take that frame of mind, complex event processing can become an ETL tool and/or an analytics tool, and what that means is that you can take a complex event processing engine and apply it as a data processing pipeline.
B
So if we go back to the original picture — here's where we want to get, where our users are happy — and we take a complex event processing engine: you have the CRUD operation coming in, and you can take that piece of data and transform it before you write to your system of record — in our case our system of record is Cassandra — and then continue to pass that event down the line.
B
Down the line you could do dimensional counts — aggregate the different dimensions on the data that's coming through — you can enrich that data by touching other systems and pulling in extra metadata, and then write it to a fuzzy index. This turns out to be really powerful, and we really like what's going on: it pulled all of that complex data flow that we were embedding in different applications and clients out into something that was manageable and tangible that we could reason about.
B
So that's one of the powerful pieces of Storm: these topologies — and I'll let Taylor go into the details — the topologies actually articulate your data flow in the system. That's great and allowed us to do some really cool stuff. So I will at this point hand it over to Taylor, and he'll take you through the details. Okay, thanks.
C
Thank you, Brian. So, just to start out with a quick overview of Storm: Storm is a distributed real-time computation system, so it does complex event processing type work. It was open sourced by Twitter in September of 2011. That worked out well for us at Health Market Science, because about that time is when we started moving away from batch processing into a more transactional, real-time processing model.
C
Okay: Storm is fault tolerant, it's distributed among multiple nodes, and I'll get into more of the architecture a little later.
C
Storm supports a model of guaranteed processing, so that when you're processing data streams, if an event within your stream fails to process for some reason, it can be replayed, so that processing is guaranteed. And in terms of CEP, Storm operates on one or more streams of data, so you can add any number of inputs into your distributed computation.
C
The anatomy of a Storm cluster: the diagram you see is a typical development cluster that we use here at HMS. If you're familiar with Hadoop, then this layout will look fairly familiar. There's a master node — in Storm language that's called Nimbus — and then there are slave nodes, which are essentially your worker nodes. In our development clusters we also colocate Cassandra nodes on our slaves. There's also an additional node, and for our clusters...
C
...that's where we run ZooKeeper, and what ZooKeeper does is maintain the state of the cluster. Supervisors are Storm daemons that actually run the tasks that make up your data processing — I'll get into more of how that works in subsequent slides — but Nimbus's job is to assign tasks to the supervisor nodes, and there can be multiple workers and multiple tasks. The master also, through ZooKeeper, keeps track of the health and state of those daemons.
C
So if one daemon were to go down — say you lost a node — the master would take the workers and reshuffle them among the remaining slave nodes. Storm components: the source of data entering into your computation comes from spouts. Spouts are essentially stream sources — sources of data — and I'll expand on each one of these later. Bolts in a Storm topology are your units of computation, so they do operations on, or react to, the data...
C
...that's passing through the system. And then a topology is a combination of any number of spouts and any number of bolts, and it defines the overall computation, or computation network. To expand on spouts a little bit, as I mentioned earlier: spouts represent a stream of data. Examples of that could be queues — like a JMS queue, Kafka, Kestrel, et cetera — or something like the Twitter firehose, or sensor data.
C
I know The Weather Channel uses Storm to process weather information. So a Storm spout connects to some sort of source stream or queue and emits tuples, which represent the events in CEP language. Tuples are the primary data structure in Storm, and they're basically just a set of named key-value pairs.
C
Storm bolts are responsible for receiving tuples from spouts or other bolts, and they operate on, or react to, the data. Bolts are typically something you provide, or that is provided for you, and they can perform functions like filtering, joining, aggregation — that kind of thing. They can also do database writes and lookups, which is where Cassandra will come in a little later. And then bolts can optionally emit additional tuples.
C
So if you have a spout that's emitting tuples, the spout will send those to one or more bolts, and then the bolts can either just react to that data or they can emit additional tuples — I'll get into more of that later on. Storm topologies represent the data flow between spouts and bolts, and the routing of tuples between spouts and bolts. The routing is basically a simple subscription model: when you define your topology, you define groupings, and what groupings do is determine how tuples get routed between bolts and spouts within your topology.
C
But when we do stuff like filtering and aggregations, it's important that sometimes the same field content goes to the same bolt, so a fields grouping helps accomplish that. Basically, the way that works — those of you who are familiar with Cassandra will understand the concept of a distributed hash — is that Storm hashes the values of the fields that you specify, and that determines which bolt the tuple gets passed to.
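The hashing behavior behind a fields grouping can be sketched in a few lines of plain Python (this is an illustration of the mechanism, not Storm's actual implementation or API):

```python
# Illustrative sketch: a fields grouping routes a tuple to a task by hashing
# the values of the chosen fields, so tuples with the same field values
# always land on the same bolt instance.

def fields_grouping(tuple_, fields, num_tasks):
    """Pick a task index from a hash of the selected field values."""
    key = tuple(tuple_[f] for f in fields)
    return hash(key) % num_tasks

# Two tuples with the same "word" value route to the same task,
# regardless of their other fields.
t1 = {"word": "nathan", "count": 1}
t2 = {"word": "nathan", "count": 7}
assert fields_grouping(t1, ["word"], 4) == fields_grouping(t2, ["word"], 4)
```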
C
So, Storm and Cassandra: some of the use cases that we looked at were being able to write Storm tuple data to Cassandra — examples of that would be computation results or pre-computed indices — and also to read data from Cassandra and emit Storm tuples. That kind of thing would enable us to do dynamic lookups: you have a tuple come in and, based on the data in that tuple, do a lookup or a fetch against Cassandra, and then emit additional tuples based on those results.
C
So in storm-cassandra we have the two bolt types that I mentioned earlier. There's the basic Cassandra bolt, and what that does is take tuples passing through a Storm topology and persist the data to Cassandra. And then we have the Cassandra lookup bolt, which does the opposite: it pulls data out of Cassandra and emits tuples based on that data.
C
The storm-cassandra project, which is open source on GitHub, provides generic bolts for reading and writing Storm tuples to and from Cassandra. The way we did that was to come up with the concept of mappers: the storm-cassandra project is essentially a framework that defines generic bolts, and then you provide a tuple mapper or a columns mapper that is specific to your use case or your data model.
C
So the tuple mapper interface tells the Cassandra bolt how to write an arbitrary tuple. Given a Storm tuple, you map to a column family — basically you tell it which column family you want to write to — you map to a row key, which allows you to determine what your row key is based on the content of the tuple, and then you map to columns, which is mapping how the data in the Storm tuple gets stored in Cassandra columns. Then the columns mapper works in the opposite direction.
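The mapper idea can be sketched as follows. This is a hypothetical illustration in Python — the method names and the `write_tuple` helper are made up for the sketch, not the storm-cassandra Java interface — but the division of responsibility is the same: the mapper answers "which column family, which row key, which columns", and the generic bolt does the writing.

```python
# Hypothetical mapper sketch: given a tuple, decide the column family,
# the row key, and the columns to persist.

class WordCountMapper:
    def map_to_column_family(self, tup):
        return "word_counts"

    def map_to_row_key(self, tup):
        # Row key derived from the tuple's content.
        return tup["word"]

    def map_to_columns(self, tup):
        return {"count": str(tup["count"])}

def write_tuple(store, mapper, tup):
    """What a generic Cassandra bolt would do with the mapper's answers.
    `store` stands in for Cassandra: {column_family: {row_key: {col: val}}}."""
    cf = store.setdefault(mapper.map_to_column_family(tup), {})
    cf.setdefault(mapper.map_to_row_key(tup), {}).update(mapper.map_to_columns(tup))

store = {}
write_tuple(store, WordCountMapper(), {"word": "nathan", "count": 56000})
# store == {"word_counts": {"nathan": {"count": "56000"}}}
```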
C
The current state of the storm-cassandra project: right now we're working hard on version 0.4.0, which is a work in progress and is currently using the Astyanax client. We have a couple of out-of-the-box mapper implementations: a basic key-value column mapper — basically for a hashmap-type data model — and one for valueless columns, which is a common data modeling pattern in Cassandra, and I'll demo that.
C
So the first demo I'm going to do is a word count, and this is sort of one of the canonical examples used for both Storm and Hadoop. The idea is that you have a spout that's emitting random words, and from there it goes through a fields grouping to a count bolt. The count bolt keeps a count of each word and how many times that word has been seen. The importance of using the fields grouping there is: if the count bolt is parallelized across multiple nodes in the cluster...
C
...you want to make sure that the same word goes to the same bolt; otherwise your counts will get out of whack. Then the count bolt emits the counts of each word in real time as they are incremented, and shuffles them to the Cassandra bolt, which is responsible for persisting the count for each word. So now let's get into a demo of that.
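The flow just described — spout emits words, count bolt keeps running totals, Cassandra bolt persists the latest count — can be simulated in-process with plain Python (a sketch of the data flow only, not Storm code):

```python
# Minimal in-process simulation of the word-count topology: a "spout" emits
# words, a "count bolt" keeps running totals, and a "Cassandra bolt" stand-in
# persists the latest count per word.

from collections import Counter

def run_wordcount(words):
    counts = Counter()   # state held by the count bolt
    persisted = {}       # stand-in for the Cassandra column family
    for word in words:                   # each word is one tuple from the spout
        counts[word] += 1                # count bolt updates its total...
        persisted[word] = counts[word]   # ...and emits to the persistence bolt
    return persisted

result = run_wordcount(["nathan", "mike", "nathan", "nathan"])
# result == {"nathan": 3, "mike": 1}
```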
C
Okay, now the topology is running, so it's actively spitting out random words — actually, I think in this demo they're names. If I list that column family again, we'll see the words and a corresponding count. So for the word "Nathan", you can see that as of that query the word had been emitted 56 thousand times. Then there's "Mike Jackson" and a couple of other names, and if I run it again, we'll see that the count incremented much more; now we're up to 172,000 instances of that word.
C
So in our topology — as I showed you before I pulled it up — we have a word spout which sends words to a count bolt; the count bolt keeps track of those, and then, when it updates a count, it sends a message to the Cassandra bolt to persist that count. Basically, in this line we're setting up the Cassandra bolt and using a default tuple...
C
...mapper that's using strings for persistence — we do support different types of serialization, but for these demos we're mostly just sticking to strings. And as I mentioned before, Storm supports guaranteed processing, and part of that is: when a tuple gets emitted from a spout, if you've defined your topology to use guaranteed processing, then the whole tuple tree must be acked. The Cassandra bolt supports different ack strategies; in this case I've used ack-on-write, and what that means is...
C
...as soon as the data gets written to Cassandra successfully, the tuple will be acknowledged, and once it's acknowledged, the spout won't be triggered to re-emit that tuple. The next lines down here are where we build our topology: we're creating a new topology builder, we're setting the spout — the word spout — and this last number over here is the parallelism.
C
The next demo is for distributed RPC calls in Storm. There are cases where you don't have open-ended streams of data like the one I just showed. DRPC in Storm allows you to create a request/response type interaction and essentially distribute that processing out to your cluster. The way that works in Storm is: you have a DRPC client that passes arguments to a DRPC server, and the server...
C
...sends out tuples via a DRPC spout — a special kind of spout — and from there it goes into the topology that you define. It keeps track of that by request ID, which is ultimately how the result will get returned to your client. So the spout emits into your topology, the result from your topology goes back to the DRPC bolt, the bolt notifies the DRPC server when the computation is completed, and then the result gets returned to the DRPC client.
C
For example: if I have three followers and I tweet a URL, that URL's reach would be three. But then let's say someone else tweets the same URL: depending on the number of their followers, the reach would be the distinct union of my followers and that other user's followers. To do that at Twitter, where there are massive numbers of users, would take a lot of database lookups and a lot of performance if you're hitting just a single database.
C
So essentially our input to that is a URL, and that goes to the tweeters bolt, which takes a URL and does a lookup of how many users tweeted that URL. All of the records of that tweet then get shuffle-routed to another bolt — the followers bolt — and that takes the user ID of each follower and emits it out as a tuple. From there it goes to a partial uniquer, which uniques those, and finally it sends the count back.
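The reach computation those bolts perform can be sketched in plain Python (the lookup tables and sample data here are made up for illustration; in the real topology these lookups are distributed across bolts):

```python
# Sketch of the reach computation: look up who tweeted the URL, expand each
# tweeter to their followers, unique the union, and return its size.

tweeters_of = {"http://example.com": ["alice", "bob"]}          # hypothetical data
followers_of = {"alice": ["carol", "dan"], "bob": ["dan", "erin"]}

def reach(url):
    seen = set()
    for tweeter in tweeters_of.get(url, []):          # tweeters bolt
        for follower in followers_of.get(tweeter, []):  # followers bolt
            seen.add(follower)                        # partial uniquer
    return len(seen)                                  # final count

# "dan" follows both tweeters but is counted once, so reach is 3, not 4.
assert reach("http://example.com") == 3
```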
B
Alright, so that was some pretty cool Storm detail, and both of us are out there on GitHub, so if anybody has any questions — all of this is out there, including the examples. If you clone storm-cassandra, you'll see the examples in there that have all this stuff. If you have any questions, just hit us up and we can answer those. So next, partner, we're going to talk a bit about Trident.
B
So there were a bunch of patterns, I think, that were cropping up when people started using Storm — underneath the hood we said we had both a batching and a non-batching version of the Cassandra bolt. Things are easy when you're only taking one event into account, happily processing it and moving along to the next; when you start batching stuff and you start considering fault tolerance, things get a little complicated. So Nathan — Nathan Marz at Twitter — saw all those patterns and decided, hey...
B
...we can all benefit from creating a higher-level abstraction here and hiding some of the under-the-hood stuff that's required to implement exactly-once semantics and transactional integrity. So Trident, in a nutshell: he took state management — which, like I said, gets complicated when you consider fault tolerance and batching — and provided an API on top of it. He created additional primitives: where the primitives in Storm are the bolts, the spouts, and the topologies, all operating on tuples, in Trident...
B
...you have functions and state objects that are operating on Trident tuples. We were early adopters of transactional topologies, which have since been deprecated, because I think what Nathan has shown is that all the power we had in transactional topologies — if anybody's been using them — is available in Trident, but from a much higher level of abstraction. So all the classes that were in storm-cassandra that dealt with transactional topologies are now gone, in favor of Trident versions of those same classes.
B
Trident is not something different from Storm — it sits on top of Storm. When you deploy a Trident topology, it actually compiles down into a Storm topology and runs. Trident has a couple of different primitives, and one of those is the operations that you can perform. You can think of what Trident gives you as this: it takes a stream of tuples and partitions that stream into a set of batches, and then it provides operations on the tuples within those batches. I've just called out a couple of the kinds of operations that you can perform.
B
You can see how it's a higher-level abstraction. One of the use cases of a bolt that Taylor called out is filtering. So instead of subclassing from a bolt and implementing my filtering as a concrete class, I can just use a function that implements a filter: I implement a quick function that says "should I keep this tuple?", and my responsibilities in the filter are to either emit the tuple or drop it. Otherwise, there's the generic function...
B
...where I take in the tuple and I emit it again with additional fields in it, or emit additional tuples. And then there are aggregation functions: I can do a combine, which is pairwise combining of data, or I can do a reduce, which is an iterative accumulation — you basically keep a count, and then you're passed tuples and adjust that count. Or, more generically, I can implement the aggregator interface, which is sort of "bring your own aggregation" to the table.
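The three kinds of operations just described — filter, function, and iterative reduce — can be illustrated over a single batch of tuples in plain Python (Trident's real API is Java; this is only an analogue of the semantics, with made-up sample data):

```python
# Rough analogue of Trident operations over one batch of tuples.

from functools import reduce

batch = [{"word": "the"}, {"word": "storm"}, {"word": "a"}, {"word": "trident"}]

# Filter: "should I keep this tuple?"
kept = [t for t in batch if len(t["word"]) > 3]

# Function: emit the tuple again with an additional field.
enriched = [{**t, "length": len(t["word"])} for t in kept]

# Reducer: iterative accumulation across the batch,
# e.g. total characters of the kept words.
total = reduce(lambda acc, t: acc + t["length"], enriched, 0)
# kept has "storm" and "trident", so total == 12
```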
B
So this is a sample topology in Trident. It looks very similar to a Storm topology, but you can see there are functions here, like count and split, that I just plug in, and then I basically say what my input is and what my output is, right in the topology. So very, very similar — and really the benefit of Trident is this: on the Trident state side, Nathan figured out a couple of cool things about how to write state so that you can handle losses of nodes and batches, and the way that he did it...
B
We should give a shout-out to the community member on GitHub who's got an implementation of the Trident persistent map for Cassandra. We've been using it internally, and that's been working out well — good performance and flexibility — so feel free to reach out to him to coordinate and collaborate.
B
So with respect to Trident state and how it applies to transactional integrity: there are two different kinds of spouts that you can have in Trident. One is called the transactional spout, and the guarantee there — your obligation as a spout implementer for a transactional Trident spout — is that the batch contents never change. We use Kafka here as our main spout for all of our topologies.
B
So there's a transactional Trident Kafka spout that is guaranteed to emit the same batch — the same tuples in the same batch — each time. For reasons I won't go into now, that's not so great if you lose a node: if you have multiple partitions emitting, you could lose a partition, and then you wouldn't be able to have the same tuples from that partition in the same batch. But save that for another time. Alternatively, you can have an opaque spout, where the batch contents can change.
B
In that scenario you have to be very careful with your state, and there are additional obligations on the state object if you want to maintain transactional integrity, so I'll go through that now. These are the contracts established by Trident for spouts: in one case the batch contents can never change, and in the other they can change, with implications for the state. So, matching the spouts, there is transactional and opaque state management. In transactional state, the transaction ID is stored with the value in your...
B
...state object when you persist it to the database. Each batch has a unique transaction ID, and like I said, that trick works because every time I update the state, I write my current transaction ID along with the value, and I skip the update if that transaction ID has already been written. If the batch contents never change, that's great — that works. It doesn't work, however, if the batch contents can change.
B
So in that case, your obligation when you create a state object is to write the previous value, the last transaction ID, and then the current value. That means that if a batch gets replayed, I can replace the value that's in the database with what I've calculated from the new batch, starting from the previous value. So with that, you can get exactly-once semantics from Storm with combinations of the spouts and states.
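The opaque-state bookkeeping just described can be sketched as follows. This is an illustrative model in Python, not Trident's API: the store keeps a (previous value, last transaction ID, current value) triple, and a replayed batch — even one whose contents changed — is rebased on the previous value, which is what preserves exactly-once totals.

```python
# Sketch of opaque state: store (prev_value, txid, value) per key.

def opaque_update(state, key, txid, batch_sum):
    prev, last_txid, curr = state.get(key, (0, None, 0))
    if txid == last_txid:
        # Replayed batch: rebase on the value from before this transaction,
        # discarding whatever the failed attempt had written.
        state[key] = (prev, txid, prev + batch_sum)
    else:
        # New transaction: the current value becomes the new "previous".
        state[key] = (curr, txid, curr + batch_sum)

state = {}
opaque_update(state, "count", txid=1, batch_sum=10)  # (0, 1, 10)
opaque_update(state, "count", txid=2, batch_sum=5)   # (10, 2, 15)
# Batch 2 is replayed with different contents; the total stays correct.
opaque_update(state, "count", txid=2, batch_sum=7)   # (10, 2, 17)
assert state["count"] == (10, 2, 17)
```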
B
We should probably include a few more slides, since it can get a little trickier, and give a couple of demos of that — maybe we can come back and do another session once we've ironed it all out. Okay, so in order to give some time for questions, here's the short shout-outs slide. Here at HMS — I didn't go through the full architecture topology that we have in place — we have a couple of bolts that are out there now: the storm-cassandra bolt, which we've been talking a lot about...
B
...and we also have storm-elasticsearch. Combining those two, you can have a data processing pipeline that writes to Cassandra and also enables fuzzy search on the other end, if you tack the Elasticsearch bolt onto the end of your topology. And then internally — we haven't released it to the public yet — we also have a storm-jdbc bolt, which allows you to tie in a relational database. So if you go back to that picture from the beginning slides, you can tie in your favorite relational database and enrich the data...
B
...that's coming through your pipeline with any kind of metadata that might be stored in a relational database, and/or you can write to the relational database that supports your reporting tools. And then Taylor's got two other ones out there: storm-jms and storm-signals. Taylor, on storm-signals...
C
Sure — storm-signals basically allows you to communicate with your spouts and bolts out of band. So, for example, let's say you have a topology running and you want to send it some sort of signal to change its behavior, or something like that; storm-signals just gives you a very simple way to do that. And everything that we demoed today is all part of the storm-cassandra project; you should be able to get up and running with those examples in about five minutes. We're around if...
B
...you can't — just let us know. Yeah, and just a couple of closing remarks. What we found — and I've always believed this — is that Cassandra provides great primitives that allow you to do distributed storage. The data model that's provided by Cassandra, and the way that it automatically partitions the key space across hosts, is perfect, absolutely perfect, and I...
B
...think Storm is a natural fit to that, because what it provides is primitives for distributed processing that you can do a lot with and combine in all kinds of creative ways. So those two together create a pretty good basis on which you can build.
A
So, you know, this is one in a series of webinars; we always archive them. We want to help the community by putting out some great content, and Taylor and Brian have put out some great content today. Next up is Christos Kalantzis from Netflix, who is going to talk about his transition from being a DBA of relational databases to now leading the charge of database engineering by moving to Cassandra — he'll address other NoSQL databases as well. And then make sure you mark your calendars for Valentine's Day.
A
Well, the V is not for Valentine's — it's for vnodes. That should be a very hot topic; that's by Patrick McFadin. And then on February 28th, Aaron Morton will take a look at an introduction to Cassandra, and we'll be focusing on the recently available 1.2. Some resources: three weeks ago we released Planet Cassandra, which is the community site. The community is doing a good job centralizing stuff fast, so that's sort of your one-stop shop for everything Cassandra related.
A
This webinar will be available there. If you're interested in more deep-dive training around Cassandra, there's the training page. And then Brian and Taylor are going to be talking at this one — NYC* on March 20th. If you're on the East Coast, definitely try and make it down for this technical day around Cassandra.
B
You could if you wanted to — we just preferred the abstraction that Storm provides. But I should say that there's been a lot of activity — you can ask on the discussion list — around other DSLs, domain-specific languages, that would allow you to articulate processing flows and would compile down to, or eventually become, Storm topologies. There's been a lot of discussion about making an Esper-like language that would compile down into Storm.
B
We've also seen some people working on Drools, so that would leverage Storm on the back end. I can imagine what that is: bundling the Drools engine so that it deploys inside bolts and can read the rules language that Drools has. So that's kind of a hard question to answer, but I think a lot of people are looking at Storm — it certainly has the most momentum.
B
I forget — it's got something like 6,000 followers and 500 forks on GitHub. So it certainly has the most momentum. And I think everybody knows Nathan's got a Cascading background — Cascading is another one, but Cascading compiles down to Hadoop. So he took the best of Cascading and made it real time. Okay.
A
Great, that's good to know. And yeah, just to reiterate, we're seeing this combination of Storm and Cassandra more and more out in the Cassandra community. Going back to your — I think it references that sort of architectural slide up front — in a multi-datacenter scenario, will ZooKeeper only look after the local slaves, or will it be aware of all the slaves in all the data centers?
C
That depends on the setup — and actually, ZooKeeper doesn't look after the nodes per se. ZooKeeper essentially just houses the state of the system, and it's Nimbus that communicates with the nodes through ZooKeeper. How that's set up is basically however you want to configure it, so Nimbus finds out about additional nodes through ZooKeeper.
A
Yeah, thank you very much. This one: can you talk about this data pipeline's use in production — your volumes, hardware, and the unique value Cassandra provides versus other NoSQL options? So, I'm assuming you can integrate Storm with other databases — can you focus on the unique value of Cassandra? Sure.
B
Yeah, so — and this goes for Elasticsearch too — for our architecture, what we want out of it is linear scalability. We've got a legacy system here with upwards of probably a hundred — we've got hundreds of machines here. So linear scalability is really important to us, and Cassandra nailed that, absolutely nailed it. So for every other piece of our architecture...
B
...we want it to scale linearly like Cassandra does — and I think Storm does that, and Elasticsearch does that. When you look at Cassandra versus others: you're going to start to get an impedance mismatch if your processing framework scales linearly but your database under the hood can't support that.
A
Okay, great, thank you very much. We're right at the top of the hour. I really appreciate you taking the time today, guys — a great topic — and this has been a great session to educate the community. Stay tuned, everybody, for more upcoming webinars, and we will see you next time. Thank you very much, David.