Description
Speaker: Sebastian Estevez, Solutions Architect
The startup program has had over 600 applicants, from over 20 market verticals, leveraging a wide range of DSE features for their use cases. This talk is my best effort to synthesize key learnings to benefit future DSE-powered startups (with major constraints on time and money). The talk ranges from design and development to operations and production.
1) Program overview and startup breakdown
2) Development
3) Operations
4) Q&A
For the past year or so, I've had the honor and privilege of helping out a few hundred startups using DataStax Enterprise, from getting started and getting to know the technology all the way to production and even live issues in production. It's been a really special opportunity for me.
There have definitely been a lot of learnings that I've tried to encapsulate in this talk. There are a couple of ways I could have done this, but I decided to go pretty technical, to try to capture the common issues that I saw, whether they're beginner issues or more advanced tuning issues. I just went through the things that I saw most frequently from startups asking for help.
I want to thank all the startups that gave me the opportunity to help them out. They're all doing really transformative things; they're changing the world in their own way with our technology, and so for DataStax it's a great opportunity, and for me personally, to see those things in action and to try to help make them a reality.
All right, the deck for this presentation is online; you can click the link at my Twitter handle right here. It's a little small, but if you can't quite see it, just pull it up on your laptops. There are some command-line details, some technical stuff in there, that will probably be useful later for any of you that are looking to deploy DSE in production, so take the opportunity.
Okay, so I want to talk a little bit about the program at a high level, very quickly (there's supposedly a way to make this full screen... there we go): how you apply for the program and qualify, and a couple of statistics on the program for the last year. Then I'm going to talk about some getting-started problems and how to solve them, how to get up and running with DSE, and common things that I saw folks fall over with in the beginning. And then I want to get into deeper Cassandra and DSE tuning problems.
A
Basically,
our
enterprise
grade
software
for
free
to
any
company
that
makes
under
30
million
sorry
that
makes
under
3
million
in
revenue
per
year
or
has
under
30
million
in
funding
capital
raised,
and
basically
there
are
no
limits
to
the
extent
to
which
use
the
product
you
can
use
as
many
no's
as
you
want.
You
can
use
all
the
features
and
yeah
I
mean
that's
about
it.
You
get
some
perks.
The
perks
include
discount
on
our
supports
and
services.
If
you
need
them,
this
comes
with
some
of
our
partners.
They include marketing opportunities as well. There are a few folks in the room that I see here today that we've done some co-marketing with; those are scenarios where both companies benefit when you guys are happy. And then there's also some free tech help. That's new as of the last year: we used to just give folks the software and say good luck, and for the past twelve months or so I've provided technical help for all the startups around the world that are in the program.
A
We've
had
I
think
over
600
startup
sign
up
from
more
than
20
market
verticals
and
more
than
50
countries.
So
it's
a
pretty
wide
group
of
a
group
of
folks
again
I'll
do
a
transformative
things
with
our
technology
and
it's
been
a
privilege
to
participate
in
that
and
to
be
a
part
of
it
I'm
going
to
dive
right
in
so
one
of
the
first
things
that
I
see
a
lot
actually
so
before
I
get
into
this
for
support.
A
If
your
startup,
you
post
up
there
and
myself
for
a
couple
other
technical
folks
from
the
company,
we're
always
actually
looking
the
upsetter,
chemo
or
other
team
members
helping
out
on
Stack
Overflow
as
well-
and
this
is
something
that
I
see
quite
often
up
there,
hey
I
started
up
Cassandra
or
a
start-up
DSC
and
I
can't
connect
right,
so
I
want
to
go
through
and
some
of
us
believe
the
getting
started
stuff.
What
might
the
reason
be
and
and
how
to
troubleshoot
when
you
can't
connect
the
DSC.
Some of the first things to check: is the process up and running? In a lot of the cases, the process didn't come up for one reason or another, and if that's the case, then you want to check your system log or output log to find out why it didn't come up. All of these bullets are scenarios where the process didn't come up for a particular reason. So one reason may be that there's another process running on a Cassandra port; say 7199, the JMX port, is being used by another process.
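As a first check (a minimal sketch; exact tooling varies by distro), you can see whether something else is already bound to the port:

```shell
# Is another process already listening on the JMX port?
sudo lsof -iTCP:7199 -sTCP:LISTEN
```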
So Cassandra won't come up, and you'll see why in the logs. You may have an error, a problem with your yaml that can't be parsed, so you'll see an error in your system log. You may have a problem with your Cassandra environment shell: for example, if you upgrade from one version to another and there's a different version of the JVM expected by the package for Cassandra and DSE, you may have to just update your environment shell. These are all things that you would see in your system log.
You may have a bad upgrade: sometimes with Ubuntu and apt-get, you do an upgrade and some of the libraries didn't get overwritten properly, so I've seen some corrupted upgrades, and sometimes apt-get purge is the solution for that. There can also be permissions issues when folks are using tarballs manually or using the .run installer: sometimes the commit log directories and the data directories aren't available to the user that's running Cassandra, and that will stop you from starting up.
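The fix is usually just ownership; a sketch assuming the default locations and the cassandra user:

```shell
# Give the user running Cassandra its commit log, data, and log directories
sudo chown -R cassandra:cassandra /var/lib/cassandra /var/log/cassandra
```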
You may have old SSTables; I've seen this a couple of times during upgrade procedures with some customers. We were going through maybe a 50-node upgrade with one of our customers, and the first 23 upgraded nodes restarted perfectly fine, but the 24th one wouldn't come back up, and we were scratching our heads trying to figure out what was different. Well, it turns out that node had a couple of SSTables that were super old, more than one version old, and Cassandra will actually not support SSTables that are too old.
So what you want to do is make sure you upgrade your SSTables when you go from one version to another, and that way you don't run into this issue. But again, all of this stuff you'll see in your system log, with reasonably clear clues to what the problem may be.
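That step is one nodetool command per node, run after the binaries are upgraded:

```shell
# Rewrite any old-format SSTables into the current format
nodetool upgradesstables
```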
When you're using DSE Search, you may also see a Solr core load timeout, which is basically the maximum amount of time allowed for a search core to be loaded. And there's a JMX flag that you can turn on to actually replace an existing node in the cluster, so that's something else. These are all examples, and you'll see, based on your system log, why your node isn't coming up. And yes, this is one of the most frequent things that you'll see on Stack Overflow; it's just folks trying to get up and started.
There's actually one scenario where your process won't start running and you won't see anything in your logs, and I actually see this more often than I would have expected: your disk might be full. Obviously it can't write to the logs then. I've been on a couple of calls about it where I'm doing a screen share with somebody, I'll say, run df for me, and it turns out their disks are full.
A
This
happens
a
lot
with
folks
upgrade,
for
example,
from
20
to
2.1,
because
the
commit
log
and
cassandra
which
used
to
default,
the
one
gigabyte
is
now
eight
gigabytes
in
2.1
in
folks.
Head
may
be
mounted
it
on
the
root
partition
and
that
root
partition
was
small
and
so
you're
running
out
a
disc
after
a
little
while
and
not
able
to
come
back
up
and
also
you
won't
see
anything
in
your
logs
right.
So
that's
a
good
one.
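The relevant knobs live in cassandra.yaml; a sketch with illustrative values:

```yaml
# cassandra.yaml
commitlog_directory: /var/lib/cassandra/commitlog  # keep this off a tiny root partition
commitlog_total_space_in_mb: 8192                  # the 2.1-era default; it used to be much smaller
```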
Let's see. There are a couple of scenarios where DSE is up but you can't connect. So let's talk a little bit about all the different addresses that you configure in Cassandra. One of them is the listen address; that's for node-to-node communication, so it's possible that the nodes can't talk to each other, and that's something to check at the network level. The RPC address is for client communications to the nodes. And then you actually have broadcast addresses: there used to be just a regular broadcast address, which corresponds to the listen address.
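Roughly how those look in cassandra.yaml; the IPs are placeholders, and the broadcast values mostly matter when nodes sit behind NAT (say, EC2 public versus private addresses):

```yaml
# cassandra.yaml
listen_address: 10.0.0.5             # node-to-node traffic
rpc_address: 10.0.0.5                # client/driver traffic
broadcast_address: 203.0.113.5       # what other nodes are told to connect to
broadcast_rpc_address: 203.0.113.5   # what clients are told to connect to
```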
Now we also have a broadcast RPC address. Those are the addresses that Cassandra actually advertises to the nodes and the clients that are connecting. So you have to check your firewall and your security groups to make sure that the nodes and clients can actually talk to each other on those ports. Pro tip: Matt Kennedy did a ticket on Cassandra that actually created something called rpc_interface and listen_interface.
A
So,
instead
of
actually
specifying
the
IP
in
the
in
your
yamel,
you
can
specify
an
interface
at
the
OS
level
like
eth0,
and
so
that
way
you
can
actually
configure
all
your
notes
the
same
way
without
specifying
a
hard
coding
IP
which
can
be
nice
for
registration.
So
that's
a
good
tip
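A sketch of the interface variant; this pair of settings replaces the hard-coded IPs above:

```yaml
# cassandra.yaml -- identical on every node
listen_interface: eth0
rpc_interface: eth0
```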
Something else that can happen if you're running security: your auth keyspace might not be replicated. That actually happens quite often, and folks are pretty confused by it.
A
So
just
make
sure
if
you're
running
security
and
the
notes
not
connecting
and
talking
and
working
out,
right,
you're
off
key
space
might
not
be
replicated
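The usual fix is a sketch like this, assuming NetworkTopologyStrategy and a datacenter named DC1 (substitute your own topology and replication factor), followed by a repair of that keyspace on each node:

```sql
-- Replicate the auth keyspace across the cluster
ALTER KEYSPACE system_auth
  WITH replication = {'class': 'NetworkTopologyStrategy', 'DC1': 3};
```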
There's also SSL, encryption for node-to-node or client-to-node with Cassandra or with DSE; I've got a gist that actually helps set that up, and it's clickable if you pull down the deck. All right, something else: you can actually get some cryptic errors that are confusing if your OS-level config is not set up properly.
A
So
these
are
some
of
the
examples
of
things
that
you
have
to
configure
in
order
to
get
Cassandra
to
work
correctly
in
Linux,
but
we
do
ship
a
pre-flight
check
inside
of
DSC.
That
will
check
some
of
these
things
for
you.
So
that's
something
else
to
take
a
look
at
if
you're,
you
know
not
using
an
ami
but
kind
of
rolling
your
own
boxes.
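For reference, the OS-level settings in question are things like the resource limits; a sketch for /etc/security/limits.conf, assuming DSE runs as the cassandra user (check the documented values for your version):

```shell
# /etc/security/limits.conf
cassandra - memlock unlimited
cassandra - nofile  100000
cassandra - nproc   32768
cassandra - as      unlimited
```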
A
Your
rights
so
I
believe
recently,
at
least
in
DSC
we're
shipping
that,
as
a
default
22,
so
you're
only
going
to
see
22
compaction
running
at
the
same
time,
this
is
configurable,
so
you're
falling
behind
on
Compassion's
and
you've
got
lots
of
CPU.
Something
you
can
do
is
crank
up
the
number
compactors.
This
is
configurable
at
the
Amal,
but
a
lot
of
the
time,
you're
doing
something:
you're
troubleshooting
a
live
system,
and
you
don't
want
to
have
to
do
a
rolling
restart,
especially
if
you're
running
lots
of
nodes
to
change
your
configurable
like
this.
A
So
you
can
actually
change
it
by
via
jmx,
and
you
can
do
it
programmatically
with
this
cql
shell
or
start
a
jmx
shell
snip
it.
So
you
guys
can
pull
the
power
point
down
and
use
the
snippet
to
actually
change
your
number
of
concurrent
compactors
on
the
fly
in
a
box
and
if
you're,
using
some
orchestration
tools
like
knife,
you
could
actually
that
run
your
cluster
as
well
pretty
easily
all
right.
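The original snippet isn't reproduced here, but the shape of it with the jmxterm CLI is roughly this; the MBean attribute name is my assumption from the CompactionManager MBeans of that era, so verify it against your version:

```shell
# Sketch: raise concurrent compactors live, no restart
echo "set -b org.apache.cassandra.db:type=CompactionManager CoreCompactorThreads 8" \
  | java -jar jmxterm.jar -l localhost:7199 -n
```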
The second lever is compaction throughput. We actually throttle compaction to 16 megabytes per second by default in Cassandra. So again, if what's happening is that you're falling behind on some of your compactions, you can increase that throttle, especially if you're running on SSDs. A lot of folks that run SSDs basically just unthrottle it completely and leave it at zero.
I seem to find that some throttling maximum is better than running completely unthrottled, so I tend to set this to 100 or 150 megabytes per second in the scenarios where compactions are falling behind. It's all about finding that happy medium where you're not falling behind on compaction.
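This one is adjustable live with nodetool, for example:

```shell
# Check and raise the compaction throughput cap (MB/s); 0 removes the throttle
nodetool getcompactionthroughput
nodetool setcompactionthroughput 100
```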
A
So
you
can
still
have
your
fast
reads
and
not
hit
a
lot
of
SS
tables
for
read,
but
also,
but
also
not
have
that
compaction
process
use
too
many
of
your
system
resources
that
region
rights
get
affected
it,
so
you're
you're
you're
right
through,
but
in
Layton,
sees,
might
actually
get
affected
yeah.
So
the
other,
the
other
piece
about
compaction,
says
to
make
sure
that
you're
using
the
right
strategy,
so
I'll
talk
a
little
bit
about
size,
tier
level
and
a
tiered,
which
are
the
three
available
compaction
strategies
today
in
Cassandra.
A
A
Size-tiered is the default compaction strategy, and it's very good for write-based workloads, but there are no guarantees that at each tier of sizes your row will only appear one time. So you may actually have different instances of your row across the different tiers, and that's not necessarily optimal for situations where your read latencies are really crucial.
With leveled, as you're writing data and compacting, Cassandra is working a little harder to make sure that you only have one instance of a row per level. So you pay a little bit up front, but then your reads should be faster in the grand scheme of things. Fast disk is important for something like leveled.
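For reference, the strategy is a per-table setting; a sketch with a hypothetical table (160 MB is the usual leveled target SSTable size):

```sql
-- Example only: move a read-latency-sensitive table to leveled compaction
ALTER TABLE my_ks.my_table
  WITH compaction = {'class': 'LeveledCompactionStrategy', 'sstable_size_in_mb': '160'};
```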
And then, finally, there's date-tiered compaction. Jonathan talked about it yesterday; we're seeing it as a potential solution for getting denser nodes in Cassandra, but it requires that, one, you actually know what the levers are to tune it, and two, that you have the right workload. Actually, it's probably the other way around: you need to make sure you have the right workload first. One of the main things about date-tiered is that you potentially need a write-only workload, so you're not doing any updates or deletes. You can have TTLs, though, and there are actually some benefits to running date-tiered with TTLs.
Basically, the point of date-tiered compaction is that Cassandra is aware that you're doing a time-series workload that only does inserts, not updates and deletes, and so it's able to optimize the way that it compacts with that assumption. There are some levers, including a maximum amount of time after which we just stop compacting and assume that there will be no more updates beyond that point.
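A sketch of that lever in CQL, with an illustrative age cutoff:

```sql
-- Example only: date-tiered compaction for an insert-only time series
ALTER TABLE my_ks.events
  WITH compaction = {'class': 'DateTieredCompactionStrategy', 'max_sstable_age_days': '10'};
```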
A
So
today,
you
kind
of
you
have
to
be
careful
with
things
like
read,
repairs
and
date,
your
compassion,
because
if
you
have,
for
example,
we
prepare
enabled-
and
you
end
up
re
preparing
data-
that's
pretty
old.
Well,
then
you'll
have
tiny
SS
tables
there
beyond
the
maximum.
Compare
date
right,
so
things
like
that
are
gotchas
in
day.
To
do
you
really
have
to
watch
out
for
those
things,
but
there
are.
But there are a bunch of customers that are using it very successfully, both in and outside of the startup program, who are achieving denser nodes and also just overall lower CPU and disk utilization for compaction. So it's a very important, powerful tool if you have the right use case and if you do your homework on the levers and on the gotchas, in particular around read repairs, things like that. Okay, so: tombstones.
When we talk about tombstones in Cassandra: we're not able to just delete data immediately when a delete comes in. The reason for that is, if there's a network partition, and there's a node that's down and that node doesn't get the delete, when it comes back up it would still have the old data. So the way Cassandra deals with that is, well, there's a parameter called gc_grace_seconds, which is the minimum that Cassandra will wait before it actually expires, garbage collects, and deletes any tombstones.
So we'll mark the data as deleted, but it won't actually get removed from disk until gc_grace_seconds has passed and a few more conditions are met. And you may need to get rid of some tombstones, because you actually have to go through them on your reads and they can become expensive; you'll see, for example, tombstone warnings in your logs: hey, we had to go through this many tombstones in order to accomplish this read.
You'll see that in the logs. One way to clear that out: first make sure that your cluster is consistent by running a repair. Once that has happened, you can decrease gc_grace_seconds, and you can tune the compaction sub-properties for tombstones, which actually control whether or not the tombstones in an SSTable get compacted away and deleted. Those levers are tombstone_compaction_interval and tombstone_threshold.
A
Interval
is
a
minimum
number
of
days
after
which
the
tombstone
becomes
available
for
compaction,
and
the
twosome
threshold
is
a
percentage
of
data
within
the
SS
table
that
isn't
this
composed
of.
In
comprised
of
tombstones,
so
there's
actually
a
third
lever
called
uncheck
to
some
compaction
that
basically
just
ignores
the
other
two
and
and
bases
the
ability
to
compact
tombstones
only
and
simply
on
JC,
based
on
simply
on
GC
great
seconds
so
yeah.
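Put together, a sketch of the tombstone levers on one table (the values are illustrative; repair before you lower gc_grace_seconds):

```sql
-- Example only: more aggressive tombstone cleanup
ALTER TABLE my_ks.my_table
  WITH gc_grace_seconds = 86400
  AND compaction = {'class': 'SizeTieredCompactionStrategy',
                    'tombstone_threshold': '0.2',
                    'tombstone_compaction_interval': '86400',
                    'unchecked_tombstone_compaction': 'true'};
```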
So if you need to get rid of data and reclaim disk, this is the way to do it, and those are the levers. In order to actually do some introspection and find out whether you have a tombstone problem, and whether your tombstone problem is resolved, you can use nodetool cfstats, which gives tombstone details, as well as sstablemetadata, a utility that ships with Cassandra inside the tools directory; you basically run it against an SSTable.
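A sketch of both checks (keyspace, table, and paths are examples):

```shell
# Tombstone details per table
nodetool cfstats my_ks.my_table | grep -i tombstone

# Inspect one SSTable, including estimated tombstone drop times
tools/bin/sstablemetadata /var/lib/cassandra/data/my_ks/my_table/my_ks-my_table-ka-1-Data.db
```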
You give it the data file of the SSTable by name, and you're able to see which tombstones are there and at what point they actually expire. So you can actually see, for example, the number of tombstones that expired on April 1st or whatever; it gives you the epoch dates from 1970. There are links in the slides out to the docs and to other places, so when you actually hit this problem you can come back to this slide. I thought I might skip the next slide, but there was a scenario that we hit... no, I'm not going to skip it.
We had a customer with a hot row: the nodes that owned that primary key, the replicas for those rows, were basically redlining, and they were having very long delays on both reads and writes. So what we did first was identify which SSTable had the problem, and we did that by using a Brendan Gregg tool called opensnoop, which actually allows you to check how frequently a file is being read in Linux. The snippet will basically give you the most-opened files for a time period on your box.
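A sketch of that kind of hunt with opensnoop from the perf-tools collection; the aggregation pipeline here is my own illustration:

```shell
# Trace file opens for 60s, then rank the most frequently opened files
sudo ./opensnoop -d 60 | awk '{print $NF}' | sort | uniq -c | sort -rn | head
```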
When you add a new node to Cassandra, you need to expand your cluster because, well, one, you need to start storing more data, or two, you need to go faster: more writes per second or reads per second. So this happens all the time in Cassandra, and a lot of the time I'll get calls or emails from folks that say: hey,
you know, my bootstrap is stuck, it's hanging, and it's been hanging for a day, or for a week, or for however long. So there's actually a configurable in the cassandra.yaml called streaming_socket_timeout_in_ms, and there's a JIRA, CASSANDRA-8611, to set a default for that value; Rob Coli from the mailing lists was actually the guy that pushed for that fix, I believe, if I recall correctly. But up until now there has been no default for this value, and basically this value controls
A
How
long
kisan
will
wait
for
a
streaming
job
to
complete
before
actually
retrying
and
it
doesn't
have
a
default,
so
cuz
I
will
actually
wait
indefinitely
for
a
stream
to
complete.
This
works.
Great
if,
like
networking
is
perfect
with
in
your
data
center,
but
that's
not
usually
the
case,
and
it's
certainly
not
the
case
in
the
cloud
where
a
lot
of
the
stars
that
I
work
with
are
running
right.
So
there's
lots
of
network
gremlins
screaming
sessions
will
fail
and
then
Cassandra
will
hang
indefinitely.
A
So
before
you
do
any
bootstrapping
or
I
mean
actually
go
home
and
check
your
clusters
and
make
sure
that
this
has
a
value
right
set
it
to
an
hour.
If
you
want,
if
you
want
to
be
very
conservative,
that's
3.6
million,
milliseconds
or
south
of
10
minutes
was
probably
more
sane,
but
make
sure
you
have
a
value.
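In cassandra.yaml that's one line (ten minutes shown):

```yaml
# cassandra.yaml -- retry hung streams instead of waiting forever
streaming_socket_timeout_in_ms: 600000
```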
Then there's the streaming throttle: in the same way that we can get and set the compaction throughput, we can get and set the streaming throughput, and that can actually accelerate or decelerate your bootstraps.
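Again a live nodetool pair:

```shell
# Streaming throughput cap (in megabits/s; 0 removes the throttle)
nodetool getstreamthroughput
nodetool setstreamthroughput 200
```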
There's also a contingency strategy. Say you didn't have streaming_socket_timeout_in_ms set, and your bootstrap got most of the way: you got ninety-five percent of the data, but then your streaming sessions hung. You don't have to start over. I mean, one thing you could do is clear your commit log and your data directory and bootstrap again, but you don't have to start over: just restart your node with auto_bootstrap set to false in the yaml, and repair the rest of the way.
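A sketch of that salvage path:

```shell
# 1) In cassandra.yaml on the stuck node: auto_bootstrap: false
# 2) Restart the node; it joins with the data it already streamed
# 3) Close the gap on that node:
nodetool repair
```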
If you almost made it, that can save you, depending on your data volume, a bunch of time. We have to do this sometimes. What else do I have here... okay, so something else that was introduced in 2.1, in these two JIRAs, CASSANDRA-2434 and CASSANDRA-7069, is something called consistent range movements.
A
So
there's
an
edge
case
in
Cassandra
in
old-school
Cassandra
we're
under
very
pretty
convoluted
circumstances
during
a
bootstrap
you
might
actually
have
one
still
reads,
and
even
potentially
data
loss
so
that
got
that
get
resolved
in
20
20
24
34.
But
then
the
stale
of
reeds
are
still
a
possibility.
So
after
a
bootstrap
because
you're
actually
streaming
data
from
us
for
somebody
to
own
the
data
before
to
the
current
to
the
current
node.
A
If
you
do
that
to
multiple
nodes
at
the
same
time,
you
may
actually
not
not
have
the
most
consistent
version
of
your
data
when
you're
reading
and
so
70
70
69
doesn't
actually
actually
doesn't.
Let
you
bootstrap
two
nodes
at
the
same
time
that
have
partially
the
part
of
the
same
range
part
of
the
same
ranges
within
them
right
to
avoid
this.
This
edge
case
now
a
lot
of
the
cases
you
actually
you
need
to
grow
your
cluster
and
you're.
Okay,
with
potentially
having
a
stale
read
right.
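If I recall the mechanics correctly (verify against your version's docs), that behavior is governed by a JVM property you can relax when you accept the stale-read trade-off:

```shell
# cassandra-env.sh -- allow simultaneous range movements, accepting the stale-read risk
JVM_OPTS="$JVM_OPTS -Dcassandra.consistent.rangemovement=false"
```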
So those are the Cassandra features that I wanted to go through, and now I'll start talking about DSE Search a little bit. When it comes to DSE Search and performance, we're interested in either indexing performance or query latency performance. Indexing performance is the amount of time it takes, and the throughput at which, data goes from getting into Cassandra to also being available as part of the search indexes; query-time performance is just the amount of time between your client issuing a query and getting the response.
A
So
those
are
the
two
main
kind
of
things
that
you
need
to
worry
about
when
you're
running
search
when
we
vse
search
actually
co,
locates
solar
and
cassandra
the
same
jvm
in
the
same
in
the
same
machines,
and
so
when
data
gets
written
into
cassandra,
it
automatically
gets
index
into
solar
there
and
that's
that's
the
process.
I
would
get
the
data.
A
We
have
something
called
indexing,
pool
stats,
you're,
actually,
I
want
to
click.
This
I've
got
a
screenshot
here,
so
this
is
what
it
will
look
like.
Basically,
we
have.
We
have
jmx
beans
and
beans
for
for
the
thread,
pool
stats
and
we
do
we
paralyze
the
number
of
threads
that
are
used
for
taking
the
cassandra
data,
that's
coming
in
and
building
indexes
with
it
right.
So
in
this
case,
there's
a
box,
that's
running
with
four
different
thread
queues,
and
you
can
see
the
process
tasks
at
the
top.
A
You
can
see
the
queue
depth
which
is
at
zero,
which
is
healthy,
and
you
can
see
I
believe
in
the
amount
of
time,
the
time,
the
processing
time
in
microseconds
that
it
took
the
data
to
actually
get
in
to
get
in
the
get
in
the
index.
So
this
jmx
Fabian
actually
allows
you
to
monitor
real-time,
as
the
date
is
coming
in
how
long
it's
taking
three
indexing
and
how
it's
being
spread
out
through
the
q's
you'll,
see
in
DSC
4.8
that
we
have
an
improvement
here.
A
So
we
used
to
split
out
the
data
that
goes
through
each
of
those
cues
based
on
the
Cassandra
partition.
And
so,
if
you
had
a
super
wide
row
within
Cassandra,
it
would
actually
only
heat
up
one
of
these
queues
and
you
would
see,
for
example,
q0
here
with
a
very
with
a
very
big
number
in
cluster
depth.
A
But
but
all
the
others
were
kind
of
relaxing
right.
And
so,
if
you
had
wide
rows
and
Cassandra
hurt,
but
then
they
actually
heard
more
if
you're,
using
search
as
well
so
yeah,
so
that's
actually
improved
in
4.8.
That's
one
of
the
big
performance
improvements
that
came
through
it
came
through
with
that
release
and
then
let's
talk
about
come
in
and
update
stats,
so
these
are
actually
going
to
give
you
details
on
your
timing.
A
So
what
are
there's
different
phases
in
the
indexing
process
and
you
can
actually
pull
up
the
details
of
how
much
time
was
spent
in
each
of
those
phases.
So
those
are
additional
details
that
you
can
see
for
for
for
monitoring.
So
if
I
want
to
improve
my
index
and
performance,
something
that
I
want
to
see
so
so
what
are
the?
What
are
the
levers
that
I
have
available
alright?
A
So
the
first
is
something
called
soft
autocommit
threshold
and
that's
the
minimum
at
a
time
that'll
go
by
before
you
actually
start
indexing
data,
that's
coming
through
Cassandra,
so
I
believe
that
by
default
we
said
this
to
one
second
in
real
time
indexing,
but
you
can
actually
turn
it
down
and
get
like
faster
indexing
time.
So
this
is
the
main
lever.
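In a Solr core that lever lives in solrconfig.xml; a sketch with the one-second value:

```xml
<!-- solrconfig.xml: how long a write can sit before it becomes searchable -->
<autoSoftCommit>
  <maxTime>1000</maxTime>
</autoSoftCommit>
```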
If you give DSE more time as a soft autocommit, it's easier on the resources; but if you actually need indexing time to be really fast, if you need your data available for search right after you insert it, then you can decrease this lever, and as long as you have enough CPU cores, indexing will actually work well. The second thing is the number of cores per core, and that's kind of confusing.
A
What
is
course
performing
that
actually
means
the
number
of
cpus
cpu
cores
that
will
be
available
to
index
a
solar
core.
A
solar
core
is
like
a
solar
shard
right,
so
you
want
to
set
that
to
the
number
of
CPUs
available
in
your
box,
and
so
that's
why
that's
why
we
always
talk
about
search
indexing
scaling
with
a
number
of
cores
available
in
your
machine
writing.
You
can
actually
configure
it
here.
There's
something
called
back
pressure
that
happens
when
we're
looking
at
the
Q's
before.
A
If
your
queue
depth
reaches
a
certain
level,
then
we're
going
to
turn
out
something
called
back
pressure.
That's
to
avoid
DSC
search
from
actually
oh
I
mean
right,
so
you
can
tune
the
back
pressure.
Thresh
will
write
the
amount
of
depth
queue,
depth
at
which
will
start
to
slow
down
and
actually
load
shed
in
extreme
circumstances
to
avoid
oos
and
finally,
there's
something
called
a
ram
buffer,
which
is,
let's
part
of
the
process.
A
So
those
are
the
levers
for
indexing
and
that
scales
with
course,
and
now
we're
going
to
talk
about
query
performance.
So
the
main
thing
about
query
performance
is
that
it,
actually
you
actually
want
to
search
performance,
will
perform
very
well
if
you
can
fit
your
your
leucine
indexes
in
page
cache
right,
an
operating
system,
page
cache.
So
if
you
have
enough
RAM
or
your
indexes
are
smaller,
then
you're
going
to
have
pretty
good
performance
with
the
AC
search.
So
there's
a
couple
things
that
you
can
do
to
actually
shrink
your
indexes.
A
You
can
turn
things
off
like
term
vectors
and
turn
positions
in
terms
of
offsets.
If
you
don't
care
about
things
like
highlighting,
which
is
a
pretty
nifty
solar
feature,
but
you
might
not
be
using
it
and
things
like
Amit
Amit
norms.
You
can
actually
turn
that
on
if
you're
not
using
something
like
a
like,
boosts
and
solar
right,
so
you
can
kind
of
disable
some
solar
features
and
shrink
the
size.
A
Your
index,
if
you
don't
have
the
hardware
the
RAM
to
throw
at
it,
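Those switches are per-field attributes in the Solr schema; a sketch with a hypothetical field:

```xml
<!-- schema.xml: slimmer index for a field we search but never highlight or boost -->
<field name="body" type="TextField" indexed="true" stored="false"
       termVectors="false" termPositions="false" termOffsets="false"
       omitNorms="true"/>
```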
You can also just index only the fields that you intend to search on. So if you've got a table with 20 fields and you only really care about searching five of them, don't index the whole thing. All right, so what about the JVM? We talked about how both Cassandra and Solr are going to be running in the same JVM with DSE Search.
What you want to do is actually increase the heap size for Cassandra. So if traditionally you run an eight-gig heap with Cassandra, which is what we've historically recommended, you want to run a larger, 14-gig heap if you're running Solr. And if you're using live indexing with DSE Search, which just dropped in 4.7, you actually want to have at least a 20-gig heap, because live indexing is going to put additional pressure on your heap, because you're indexing so quickly. So the recommendation for live indexing is to run G1 GC
with a 20-gig heap. And there are a couple of tricks to G1 GC. Don't set a new gen size; I don't want to get into the rabbit hole of JVM tuning right now, but G1 GC actually sets your new generation size dynamically, so you don't want to hard-code something there. And there's one lever in G1 GC that's really important, called max pause time in milliseconds; you probably want to set it to one or two seconds, and even though that sounds high, you get some pretty nice throughput.
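As a sketch, the combination in cassandra-env.sh looks roughly like this (sizing per the talk's live-indexing recommendation; depending on your script version you may need to remove the default CMS flags and the check that insists on HEAP_NEWSIZE):

```shell
# cassandra-env.sh -- G1 for a live-indexing DSE Search node
MAX_HEAP_SIZE="20G"                  # no -Xmn/HEAP_NEWSIZE: G1 sizes the new gen itself
JVM_OPTS="$JVM_OPTS -XX:+UseG1GC"
JVM_OPTS="$JVM_OPTS -XX:MaxGCPauseMillis=2000"
```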
A
If
you
want
to
know
of
your
abuse
and
collections,
just
pull
up
Luke,
which
ships
with
DSC
and
it'll,
actually
if
it
takes
forever
to
load
that
page,
you
know
that
you're
abusing
collections
and
as
an
alternative
to
using
maps
in
dynamic
fields
with
solar,
actually
just
use
two
tables
into
a
solar
side
join
and
that
way
that
will
solve
your
kind
of
OM
bad
news
with
them
at
with
dynamic
field
scenario
and
there's
a
link
there
of
an
example
of
how
to
do
that.
Side
join
okay.
A
So,
let's
move
into
analytics
low
on
time,
so
I
want
to
speed
up,
but
with
analytics,
there's
three
really
important
things
right:
redresses
blog
posts
who
read
russ's
blog
post
and
read
russ's
blog
post
right
so
rest
is
Russell.
Spitzer
is
one
of
our
guys.
They
actually
build
a
lot
of
the
cassandra
connector.
This
guy
is
excellent.
A
At
the
end
of
this
talk
in
the
slides,
I'm
linking
to
his
blog
posts,
but
most
of
the
content
from
there
has
been
either
like
directly
out
of
his
blog
post
or
things
that
he's
helped
me
with
when
the
stars
were
in
trouble
with
particular
analytics
jobs,
so
read
russell
blog
posts
and
watch
this
video's
too,
he's
really
good.
So
a
couple
a
couple
of
levers
just
for
getting
started
with
spark
right
a
lot
of
the
time.
Something
that
will
happen
is
your
spark
long
run
because
I
don't
have
enough
resources
available.
Okay, then for reads: it's very easy for Spark to overload, to overwhelm, your Cassandra cluster by doing lots of parallel queries. So what you can do is actually control the size of the jobs and the batch sizes, in order to control the speed at which you're reading from Cassandra and not overwhelm it.
All right, writes; this slide is dense, isn't it? For writes, similar to reads, you don't want to overwhelm Cassandra, and there are a couple of tunables that you can use: the number of concurrent writes, and also the throughput in megabytes per second. With those you can actually throttle Spark.
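A sketch of those knobs on the Spark Cassandra connector side; the property names below are from the 1.x-era connector, so check the docs for your version:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Illustrative values: throttle how hard Spark leans on Cassandra
val conf = new SparkConf()
  .set("spark.cassandra.output.concurrent.writes", "5")      // parallel write batches per task
  .set("spark.cassandra.output.throughput_mb_per_sec", "5")  // per-core write throttle
  .set("spark.cassandra.input.split.size", "100000")         // C* partitions per Spark partition
val sc = new SparkContext(conf)
```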
Okay, so when you're actually building Spark jobs, now we're going to get into the performance of Spark itself and how your job is written. What you want to do is make sure that for each Spark operation you follow a couple of guidelines.
A
So,
instead
of
doing
a
group
by
use
something
called
span
by
and
so
you'll
avoid
shuffling
the
data
around
your
cluster
that
will
give
you
performance,
there's
a
link
there
to
the
documentation
that
talks
a
little
bit
more
in
detail
there
or,
if
you're
doing,
if
you're
looking
at
trying
to
do
predicate,
push
down
with
lots
and
lots
of
primary
keys
and
partition
keys,
there's
a
method
called
join
with
cassandra.
Cable
that'll.
Give
you
that'll,
give
you
much
better
performance
that
actually
like
pulling
up
the
whole
table
and
then
filtering
it
in
spark.
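A sketch of both calls against a hypothetical sensor-readings table, given a SparkContext sc (keyspace, table, and columns are made up):

```scala
import com.datastax.spark.connector._

// spanBy groups rows that are already adjacent within a Cassandra partition,
// avoiding the cluster-wide shuffle a groupBy would trigger
val bySensor = sc.cassandraTable("my_ks", "readings")
  .spanBy(row => row.getString("sensor_id"))

// joinWithCassandraTable pushes a big set of keys down to Cassandra
// instead of scanning the whole table and filtering in Spark
val wanted = sc.parallelize(Seq(Tuple1("sensor-1"), Tuple1("sensor-2")))
val hits = wanted.joinWithCassandraTable("my_ks", "readings")
```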
So that's a useful method for running lots of small selects against Cassandra. Also, if you use collect and then do a bunch of work, realize that all of that work is going to happen at your driver and not at your executors, so it's not going to be in parallel, it's going to be single-threaded; and usually your driver doesn't have a lot of resources allocated to it anyway. So don't do any work after a collect. Those are some tips for building Spark jobs
that'll give you performance boosts. Looking forward a little: today DSE 4.8 dropped, and it ships with Spark 1.4. There's a lot in there, but these are the things that I'm really excited about; I think we're about out of time. There are new monitoring capabilities; there are DataFrames, which give you kind of the same interactions between Python and Scala and Java; and there's something called the Job Server, and Ilya, who might be here, had a quote about it in an earlier talk this week.