Description
This session covers diagnosing and solving common problems encountered in production, using performance profiling tools. We'll also give a crash course in basic JVM garbage collection tuning. Viewers will leave with a better understanding of what to look for when they encounter problems with their in-production Cassandra cluster.
A: Hello everybody, and welcome to this week's webinar, Diagnosing Apache Cassandra Problems in Production. I am delighted to have with us this morning Jon Haddad. Jon is a Technical Evangelist here at DataStax, and prior to coming to DataStax he ran Cassandra in production at his previous company. He has a lot of experience with operations, and we're going to learn a lot from him this morning. Before I hand over to Jon, just a couple of housekeeping items.
If you would like to ask Jon a question, please use the Q&A tab inside of WebEx. Type your question there, and at the end of this morning's webinar we'll save some time and get through as many questions as we can in the remaining time. So that's enough from me. Welcome, Jon, and thank you very much for agreeing to present today and lay the education on us.
B: All right, very excited to be here. So, as Christian said, today we're going to be talking about diagnosing problems in production. I actually gave a similar talk at the summit, but unfortunately, due to time restrictions, I really wasn't able to go into the details I would have liked on certain topics. One of the things that's interesting about diagnosing problems in production is that there can be a problem anywhere throughout your entire stack, and being able to diagnose just Cassandra isn't really enough.
You have to be able to narrow down the problem: figure out if it's your application, figure out if it's the JVM or Cassandra itself, whether it's configuration that you need to change, or the hardware. So we're going to walk through all the things that you need to know if you're running Cassandra in production: how to figure out what's broken and what steps to take to try and fix it.
The first thing we're going to talk about is preparation. If you haven't put Cassandra into production yet, these are the things you're going to want to know about. Take care of these things, take some notes, and make sure that you're ready for once you get into production. You want to understand what's happening on your system, so the first thing that's really important to put into place is OpsCenter. OpsCenter comes from DataStax, and it's basically a custom-built operations tool.
It's meant specifically to tell you what's going on with Cassandra, and it will help you with about 90% of the problems that you encounter, because it's so tailored towards the Cassandra use case. So if you do end up with an issue in production, it should be the first place that you go, and there are two versions.
There's the community version, which is free, and I definitely recommend that anyone running open-source Cassandra download it and set it up, because, as you can see from these pictures here, it tells you a lot of information about your cluster. The enterprise version has some extra features that are pretty useful; I won't really go into detail on them now, but they're both very, very useful and I strongly recommend that you use them with any cluster that you have. The next thing I'm going to talk about is just general stuff.
A lot of this comes with OpsCenter, but throughout the rest of your application you're going to want to use these tools. You're going to want to have server monitoring and alerts in place, and you can do it yourself through open-source software or you can use a third party. It really doesn't matter in the end which one you use; it just matters that you pick one. Monit is a really useful tool if you want to monitor processes and disk usage and get alerts when things go wrong.
So if all of a sudden you've got a disk at ninety percent full, you should get an email about that. This is a pretty big deal and it's really easy to set up. These things do not take a long time to learn; put a few days' work into it and you can have your whole system monitored and have a really good understanding of what's going on. This is nice because it can prevent a lot of problems before they even begin.
You can see in this example from Nagios that you've got a critical alert on this disk. So instead of hitting a real performance problem when you run out of disk space, you definitely want to know about this kind of thing ahead of time. This is stuff that's pretty basic; it should be on every system.
We have Nagios, an excellent tool, and a fork of it called Icinga, which has been picking up a lot of the development. And then, as I said before, there are a lot of third-party services; I'm familiar with Datadog and Server Density, and they can provide a lot of the same tools. The thing that's nice about those is that you don't have to host them yourself. So whatever you want to do, just pick something and roll with it. The nice thing about this kind of thing is that you can swap out services without interrupting your application.
The next thing that you're going to want to make sure you have put into place is your application metrics, and this is where developers and operations really need to work together in order to make sure that the system is functioning right. A lot of times things are put on operations people, and they don't really have the insight that they need in order to diagnose what's going wrong, as it may not be an operations problem.
What you want to be able to do is track a few things: you want to be able to track events in the system, and you want to have micro-timers around small blocks of code. The easiest way to do this is with StatsD and Graphite. StatsD actually also works with a few other applications; Librato is one.
You can output StatsD information into Librato, which is a hosted service, and then you can use either Graphite or Grafana. I have the two on the right over here: Graphite is on the top, Grafana is on the bottom. Grafana looks a little bit nicer and it's gaining in popularity, but both tools will just allow you to graph things that are not system metrics: user signups, error rates, people from different countries.
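As a point of reference, here is a minimal sketch of pushing custom metrics into StatsD from a shell (the metric names are hypothetical; 8125 is StatsD's default UDP port):

```bash
# increment a "signups" counter in StatsD over UDP (default port 8125)
echo "myapp.signups:1|c" | nc -u -w1 127.0.0.1 8125

# record a 320 ms micro-timer around a block of code
echo "myapp.checkout_latency:320|ms" | nc -u -w1 127.0.0.1 8125
```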
You can get a lot of really good information out of this once you start correlating it with your system metrics. If you see that all of a sudden user signups just drop, well, there may be a serious problem. It's nice because you may not have, let's say, super high load on your Cassandra servers, but you can tell that the number of user signups or logins or whatever has just dropped to zero. Maybe there's a DNS problem.
The other thing that's really cool about these tools is that since Cassandra 2.0 there has been an integration of the Metrics library that came out of Yammer, and the Metrics library essentially allows you to pipe Cassandra metrics into a whole variety of different places. So you can spit your metrics from Cassandra into Graphite and start to correlate metrics.
You can see on the bottom graph here there are a few graphs with multiple lines, so you can start to correlate things like JMX information from Cassandra with user signups. All of a sudden you understand that there's a correlation between events, and it just helps you get a little bit more information out of your system. And if you're running other Java utilities and you have some JMX stuff that you want to plot on this,
there's a whole bunch of different libraries that you can use, but a really useful one is jmxtrans, and that will allow you to kick any JMX metric out to StatsD, and it will go into Graphite. So, really useful stuff. Just make sure you have all of your application metrics and can get to them quickly, coordinate between developers and operations, and have some really good dashboards, because it will help you out a lot.
Log aggregation is the next thing that's really important, and again this requires developers and people in operations to work together. It's the same deal: you can either go for a hosted service, such as Splunk or Loggly, or you can go open-source and use Logstash and Kibana, which I've used a lot. Graylog is also really popular, and then there's a whole bunch more; I'm not going to list them all. It doesn't need to be an exhaustive list.
The thing that's important is that you have really good logs and you have them aggregated somewhere, and this shouldn't just be application logs. This can be your Cassandra logs; if you're running Elasticsearch, if you're running nginx or Python apps, you can always put your logs in this tool. Take the time to make sure that the logs are parsed correctly, and finding errors when they happen is actually going to be really easy. Also make sure the logs are meaningful.
If you've got, let's say, user data, make sure you put your user ID in the log message. Then, if a user calls you up and there's a problem, you can just search on the user ID, and all of a sudden you have a really nice, easy way to understand what happened for a particular user. You can solve the problem quickly, and people will really appreciate you for it, because you don't sound like an idiot saying, "oh, I have no idea."
So if you have servers that have different times, one ten seconds ahead, one ten seconds behind, things can get screwed up. If you take a look at the example I have on the right, the time on the server on the left is ahead of the real time. The real time, you know, is twelve; I just picked a number that was relatively low and easy to talk about. Obviously we don't have time twelve right now, but for the sake of the discussion I think it'll work.
The server happens to be eight seconds ahead, so it's at time twenty, and the server on the right is behind, five seconds behind. So let's say at time twelve we do an insert. That insert is going to carry with it a timestamp, and unfortunately the timestamp is going to say twenty. Basically the server thinks it's in the future; it goes ahead and writes this data with twenty. Now, a few seconds later, someone goes to read, and then they go:
"You know what, I need to delete this data; it's no longer valid." The timestamp should be fifteen, three seconds later. But the problem here is that this server actually has a time in the past, so it's going to say, "you know what, let's do the write," and the timestamp on it is actually going to be ten. Well, when we look at the delete versus the insert, the insert looks like it was done further in the future, and so the delete actually doesn't occur.
So you end up with this problem where you're trying to delete data and it just doesn't disappear, and that's kind of confusing. Because Cassandra is eventually consistent, people will go, "well, I did a delete at consistency level ONE and it didn't take; what happened?" Then they do another delete later and it works, and it's really weird because your servers are behaving inconsistently. Well, it's because you have a really weird race condition.
So the thing to take away from this is to make sure your server times are right, and the easiest way to do this on Linux is to install ntpd. This will just make sure that your servers are constantly checking against the correct time and drifting back to the correct time, basically. And if your servers are way off, you actually want to run ntpdate, and ntpdate will basically jam the server time back to the correct time.
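As a rough sketch, checking and correcting clock sync might look like this (the service name and NTP pool host vary by distribution):

```bash
# check that ntpd is syncing; the peer marked with * is the selected source
ntpq -p

# if the clock has drifted far, stop ntpd, jam the time back, then restart
sudo service ntp stop
sudo ntpdate pool.ntp.org
sudo service ntp start
```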
So the next thing that we're going to talk about, and this actually has a little bit to do with the last slides, is tombstones. Tombstones are basically a marker that says this data no longer exists. When you do a delete, Cassandra doesn't just delete things off the servers. It's a distributed system, and deletes don't really work that way; you could have data come back.
You have to put a marker in place that says there's nothing here. We've got our tombstone on the right, and it's actually a value with a timestamp that's stored in Cassandra: it has the key and it has the timestamp. So in that scenario that we just talked about, where we have these servers and we have this delete,
basically we have to have that delete with the right timestamp, and that's what exists here. It's a really, really useful tool and it prevents a whole bunch of errors from coming up. However, there is a problem that you can run into that a lot of people call tombstone hell, and basically what happens is that you've got a really, really big partition.
You can see here we've got 99,999 tombstones, and you actually have to read through all of them in order to get the data that's at the front of the queue. This is why people say do not use Cassandra as a queue. Every library that comes out that tries to use Cassandra as a queue will hit a limit and get a performance bottleneck, and you will end up doing a lot of I/O and a lot of CPU just to do something that's really simple. Use a queue; use something like Kafka.
But the problem is, if you don't use a snitch, then you're not fully taking advantage of Cassandra's high availability. There are two purposes of a snitch. On a read, a snitch will keep track of the fastest replicas for reads, so it effectively lets you get the best performance out of your system. On a write, what the snitch does is actually ensure that your data is spread out across different racks or availability zones, however you want to arrange your data.
The snitch will just help pick which servers everything goes to, and that's pretty awesome. What you want, ideally, if you're using multiple racks, is to not have multiple copies of a single piece of data in one rack. If you have three replicas, you don't want them all to be in the same rack: the rack goes down, you lose all your data. This just makes sure that that doesn't happen, and you can see there's a whole bunch of snitches listed on the left.
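As a hedged sketch, a rack- and DC-aware setup typically touches two files (the paths and the dc/rack names below are only examples):

```bash
# cassandra.yaml: pick a topology-aware snitch instead of SimpleSnitch
grep endpoint_snitch /etc/cassandra/cassandra.yaml
# endpoint_snitch: GossipingPropertyFileSnitch

# cassandra-rackdc.properties: declare this node's datacenter and rack
cat /etc/cassandra/cassandra-rackdc.properties
# dc=us-east
# rack=rack1
```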
This next one is actually becoming less of a problem, and it's version mismatch issues. I've actually run into this one. What can happen is, say you've got a 1.1 cluster and you're trying to upgrade to 1.2, or you're trying to go to 2.0.
What I tried to do, which is totally incorrect, is adding a new node of a different version into an existing cluster. You just want to make sure that all the nodes in your cluster have the same version. If you add in a new node, there's a process called streaming, and streaming happens whenever you bootstrap a new node, whenever you decommission a node, or whenever you run a repair. When that happens across versions, the file formats are not going to be the same, and the node won't be able to read the data.
So it'll basically just hang, and you end up with this weird system where it says that it's joining, and it'll say that it's joining for a while, like days, and it's really hard to figure out what's going on. Basically, what you want to do is make sure that you avoid introducing new nodes into existing clusters that use different versions, even minor versions. Just stick with the same version; it's much safer and it decreases the number of things that are different.
If you've got a big cluster, let's say you've got 10 racks or 20 racks or something, you can upgrade one rack at a time, as long as you're using the right snitch. You can shut down ten servers or whatever, upgrade them all, and bring them all back up, and that will actually be fine, because hinted handoff, which is the process that happens when a node comes back up, works just fine between versions. So the gist of this is: just be safe.
Basically, the gist of this one is: if I add a new node into my cluster, it gets data from the existing nodes; it has to get it from somewhere. But the nodes that it gets the data from actually won't delete the data that they streamed off. So if you've added a new node in because you're running low on disk space, then unless you actually run nodetool cleanup, you won't have solved your problem. You'll still be getting these alerts, and you'll be wondering
why you're running out of disk space. If you change your replication factor, it's the same thing: Cassandra won't just delete data. You have to run cleanup, and you can reclaim a ton of space. The other thing that can happen is, if you're running incremental backups, you can run into this problem as well. So you're going to want to make sure your backup strategy isn't resulting in you piling up a ton of SSTables that you don't need.
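A minimal sketch of reclaiming that space (the keyspace name is hypothetical; run it on each node that gave up ranges):

```bash
# drop data this node no longer owns after adding nodes or changing RF
nodetool cleanup my_keyspace

# compare per-node disk usage before and after
nodetool status
```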
Shared storage has a whole slew of issues associated with it. One is that you're using a distributed database: Cassandra is meant to handle failure, and you've effectively added a single point of failure into the design. It's just not a good idea. Your latency is way higher than it would be if you were using local disks; going out to the network is always going to be slower than using your SSDs, if you have SSDs, which is a good idea.
A SAN is really expensive, and it doesn't work well with Cassandra. I haven't heard of anybody running Cassandra on a SAN and getting either the performance or the value that they want. And remember, your Cassandra performance, or your performance with any database, is about latency. If you've got fast disks, then your server is going to be fast. It's not about IOPS; IOPS does not measure latency.
Basically, you can have IOPS through the roof and still have 15 to 30 millisecond latency, and that just doesn't work. It'd be like having great throughput to Mars: you may be able to get a billion IOPS to Mars, but you've got huge, huge latency, so who cares about your throughput? It doesn't really help. So stay away from shared storage, and that includes things like EBS.
OpsCenter actually gives you some insight into what's going on throughout your cluster: how much compaction is going on, and you can look at it at a per-server level. Really, really useful. Like I said on the original slide, OpsCenter is ninety percent of what you need. If you take a look at OpsCenter and you see a ton of compactions occurring and your disk usage is out of control, well, there's a good chance that the two are correlated, and using the nodetool command
you can actually adjust the throttle on compaction. You can set the compaction throughput through nodetool on a running machine. You can say, "I want you to slow down," or, "I don't want to throttle you at all," and that's great. If you've got a solid-state drive, you can turn that throttle way up, or, I suppose, down.
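For example, a sketch of adjusting the throttle at runtime (the MB/s values are arbitrary):

```bash
# see what's compacting right now
nodetool compactionstats

# remove the throttle entirely (plausible on SSDs); 0 disables the limit
nodetool setcompactionthroughput 0

# or throttle compaction down to 16 MB/s on spinning disks
nodetool setcompactionthroughput 16
```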
There are a few different compaction options, and these are pretty important. It depends on your workload, and you just want to read up a little bit on the differences between leveled and size-tiered compaction. The gist of it is that leveled compaction is really, really good on solid-state drives with read-heavy workloads, and also update-heavy workloads, but it does a ton of I/O.
It tries to keep a partition in as few SSTables as possible, and, like I said, it does a lot of I/O to figure out which data goes where. You probably don't want to be doing this on a traditional spinning drive; it's going to be really slow and you're going to introduce some performance problems.
So if you do have a spinning drive, you're probably going to want to stick to size-tiered compaction. That's kind of the old-school default, and it's really, really good for write-heavy time-series workloads. There's actually a new compaction strategy called DateTiered compaction that's been introduced in Cassandra 2.1 and backported into Cassandra 2.0.11, and it's kind of an experimental compaction option that's even better for time-series workloads, especially if they have TTLs.
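To make that concrete, here is a sketch of switching a table's compaction strategy from a cqlsh session (the keyspace and table names are hypothetical; the class names are the actual Cassandra strategies):

```
cqlsh> ALTER TABLE metrics.events
   ... WITH compaction = {'class': 'DateTieredCompactionStrategy'};

cqlsh> ALTER TABLE metrics.users
   ... WITH compaction = {'class': 'LeveledCompactionStrategy'};
```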
So the next thing that we're going to talk about is diagnostic tools, and these are basically the tools that, if you're sitting on a server and you've got a command line open, you're going to want to know in order to understand what is going on with a machine at this exact moment in time. This is real-time monitoring stuff, for when you're digging into one machine.
I've found these tools all extremely useful, and they start out really basic and get more complex. This one's a no-brainer; I was actually using it right before our webinar here: htop. Really simple, it's just a process overview. You can do all the stuff that you can do with top; it just looks a lot better, and it's a good first tool to fire up. If you're having problems with a machine, you throw up htop and it's going to rank everything by CPU.
So it's really useful to see if there's a performance problem; you're going to know quickly: okay, I've got low memory, I'm swapping, or CPU is really high. It tells you really, really quickly. The next thing that we're going to look at, and this is, like I said, a little more advanced but shouldn't be too bad, is iostat. iostat basically gives you disk stats, and the idea here is to understand what is going on on each device. What is my read rate?
What is my write rate? Do I have a queue? There's a column, avgqu-sz, the average queue size, so you can understand: is my disk queued up? And await: how much time am I waiting? These things are really nice because, if you're using a RAID, you can quickly identify if one disk in the RAID is slow. So this is a super useful tool.
If a disk looks like it's not behaving right, especially in the RAID case, you can identify the problem really, really quickly, and maybe you need to swap out that drive; maybe it doesn't work right anymore. There's a percent-utilization column that is not always accurate, so I would definitely ignore that.
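A typical invocation might look like this (the flags are standard sysstat options; 5 is the sampling interval in seconds):

```bash
# extended per-device stats (-x), in megabytes (-m), every 5 seconds
iostat -dmx 5
# watch avgqu-sz (queue depth) and await (ms per request) per device
```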
vmstat is a really useful tool; it gives you an idea of what's going on with virtual memory statistics. Are you swapping? There are swap columns in the middle here: si, that's swap in, and so, that's swap out. You can also see bi and bo, that's blocks in and blocks out, and what it does is give you an idea of whether you've actually hit the memory limit. If you still have swap turned on, then this would be the place to figure out if you're actually hitting a swap issue.
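A minimal sketch (5 is the sampling interval in seconds):

```bash
# memory, swap, and block I/O stats every 5 seconds
vmstat 5
# non-zero si/so columns mean the box is actively swapping
```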
Normally, on production servers, I personally turn swap off. I'd rather have a server just crash, and with Cassandra it doesn't really matter, because if you have multiple replicas, the other ones will continue to work. In this particular case, I would hope that if you reach your memory limit, you would actually be getting alerts through Nagios, Monit, Server Density, or whatever tool you put in place. You should find out about this before the problem happens.
So hopefully you don't have to use this tool too much, because you were told about it ahead of time. That's where prevention is much better than having a problem: if you hit a problem with swap, there's a good chance that all of your servers are close to hitting the same problem. You definitely don't want that to happen, because it's going to be harder to introduce new servers when other servers are crashing.
Actually, I usually go to dstat before I go to iostat, and basically it is what it looks like: a lot of information. Generally, on a system, if I was trying to figure out what's wrong, I'd have like four different tools open, and dstat just took all that away; I don't need to worry about them anymore.
I used to run iostat and htop and vmstat, all these tools, and now you can just run dstat; it does like 95% of the same thing. If you need to dig into disk statistics, you can run iostat to get a little bit more information, but for the most part dstat is going to do it.
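For instance, one plausible invocation:

```bash
# time, cpu, disk, network, and memory stats, sampled every 5 seconds
dstat -tcdnm 5
```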
The next one, strace, is really, really useful if you've ever run into a misbehaving system; this is part of my general toolkit. It shows you every system call, so I like it a lot, and, like I said, it's helped me debug some really weird things. You can optionally filter: if you do -e trace=network, you can just see the network calls. So, like I said, super useful if you want to understand exactly what your application is doing.
strace is going to print out a ton of data, so it might be good to output it to a file, but yeah, it's great.
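A hedged sketch of attaching strace to a running Cassandra process (the pgrep pattern is an assumption about the JVM's command line):

```bash
# attach to the Cassandra JVM, follow its threads (-f), show only
# network syscalls, and write the very verbose output to a file
strace -f -p "$(pgrep -f CassandraDaemon)" -e trace=network -o strace.out
```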
Kind of in that same vein of generally useful utilities, but a little bit more intense, is tcpdump. What's nice about tcpdump is that you can get a really great idea of the traffic that's going across the wire, and you can look at a particular port. So if you run this, let's say, on your Cassandra port, like I did over here,
I can see that I've been doing queries. This is actually from a project I have called meatbot; meatbot is just a chat bot, and you can see the queries that are actually being sent over here. It's pretty cool. It lets you trace really anything, so I've used this to watch Redis, Elasticsearch, and Cassandra. It's great. You can also see what's going on if you're on an application server; you can see what requests are coming in. There's a whole ton of flags and a lot of options; it's a really, really flexible tool.
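For instance, a sketch of watching native-protocol traffic (the interface name is an assumption; 9042 is the CQL native protocol port):

```bash
# print packet payloads as ASCII (-A) on the CQL native protocol port
sudo tcpdump -i eth0 -A port 9042
```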
I strongly recommend getting familiar with it. So those are good system-level utilities, and they're great for getting a general idea of what's going on with a machine. Cassandra actually includes something called nodetool, which is really useful for understanding very, very specific things about Cassandra. The first thing that we have in nodetool is called tpstats, and it gives you a really good high-level overview of things that are blocked on your system. You can see the columns over here: Pool Name,
then Active, Pending, Completed, Blocked. Blocked is the one that you want to take a look at. For example, there's a memtable flush writer; if that's blocked, then generally you've got some disk problems. Blocked flush writers can also lead to garbage collection issues, because it means memtables are sitting around in memory, that memory can't get freed, and we need to do more garbage collection.
The other thing that's really nice down here: if you take a look all the way at the bottom, it shows dropped counts per message type, and one of them is mutations. If you've got dropped mutations, then you need to run a repair. If you've got data consistency problems, it's a good idea to take a look at tpstats, see if you've got dropped mutations, and do a repair if that's the case.
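For reference:

```bash
# thread-pool overview: watch the Blocked column, plus the dropped
# message counts at the bottom of the output
nodetool tpstats

# if mutations were dropped, repair this node's data
nodetool repair
```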
Another tool that's really nice is histograms, and there are two of them.
The first one is proxyhistograms. It gives you high-level read and write times under your cluster, in microseconds, so you can get an idea of how fast queries are being serviced, both reads and writes. Once you've done that, if you determine that there's a problem, you can use cfhistograms, which is on the right, to get statistics for a single table on a single node. This is really nice because it can help you narrow performance problems down to the table level.
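A quick sketch of both commands (the keyspace and table names are hypothetical):

```bash
# cluster-level read/write latency distribution, in microseconds
nodetool proxyhistograms

# drill into one table on this node
nodetool cfhistograms my_keyspace my_table
```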
Now, if you've identified problems at the table level, the thing that you want to ask yourself is: which queries am I executing against that table? Once you have those queries, you can use query tracing to determine the query path and what's happening. So this is an example here, my tombstone problem: I've got a hundred thousand rows in this partition, and I
do a select, and, well, guess what: they're all tombstones, so nothing comes back, except Cassandra still needs to do a ton of work to figure that out. This doesn't really look like that big of a problem at first, but the actual time here is pretty bad, and this is on solid state. So this goes back to my tombstone issue: you don't want to have a lot of tombstones.
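A minimal sketch of query tracing in a cqlsh session (the keyspace, table, and key are hypothetical):

```
cqlsh> TRACING ON;
cqlsh> SELECT * FROM my_keyspace.queue WHERE id = 0 LIMIT 1;
-- the trace that follows lists each step of the read, including
-- how many tombstone cells were scanned to answer the query
```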
The last topic we're going to be talking about is the JVM and how garbage collection works. So what is garbage collection? Basically, it's an alternative to managing memory yourself. The JVM will keep track of which objects point to which other objects, and when objects aren't being used anymore, it will get rid of them and reclaim that memory to be used again. The way this works with Cassandra is that we're using ParNew and CMS.
When garbage collection happens in the new gen, we have what's called a minor GC. This is a stop-the-world operation, and it occurs when the new gen fills up. Dead objects are removed, and then live objects are promoted into the survivor area. You can see, down at the bottom, objects are promoted out of Eden into survivor, and they're promoted back and forth between the survivor generations as well. After a certain amount of swapping back and forth, an object will actually be promoted into the old gen.
The important thing to take away from this is that removing objects is actually pretty fast, and promoting objects is really slow. There's some accounting that needs to happen in the background, there's a memory copy, and it's a lot slower than removing; removing is super fast. So there are a couple of patterns that we're going to see as a result of this, but before we talk about that, we're going to talk a little bit about the old gen. After an object has been promoted from the new gen to the old gen, we've got this
huge lump of memory lying around, and over time, actually constantly, we're going to have what's called a major GC. A major GC is mostly concurrent, so most of the work is actually going to happen in the background, with two short pauses. What you don't want is a ton of major GCs happening one after the other.
You don't want your old gen to get totally full and your new gen to get totally full, because then you will hit what's called a full GC. A full GC is what happens when the old gen fills up, and it is basically stop-the-world. If you've got a 20-gig heap, it's going to have to walk all 20 gigs of memory.
It's going to have to trace its object graph and do a ton of work, and I've heard of people having full GCs that have gone on for hours. That's what happens with a really big heap; you don't want to use really big heaps, it doesn't work that well. Basically, your system will be completely unresponsive; that node will be totally unresponsive during the full GC.
So, as you can see in my notes, and hopefully you've inferred by now, these are really bad. There are two problems that we can have. The first one is early promotion. If we've got a bunch of really short-lived objects and our new gen size is too small, and we're creating these short-lived objects really, really quickly, what happens is that your new gen fills up, and then objects get promoted to your old gen. Your old gen is supposed to have
long-lived objects; they're supposed to be things that are going to stick around for at least a little while. But if you put these short-lived objects there (by short-lived I mean around 100 milliseconds, which is a pretty short lifetime), your old gen is going to be filled with objects that don't need to be there.
The other problem that you can run into is a long minor GC, and the problem here is a new gen that's way too big. If you've got a write-heavy workload, you've got all these memtables sitting in memory, and then we're going to copy them over from the new gen to the old gen. Well, if we've got to copy over like three gigs of data, it's going to take a long time.
So, as a result, you've got a ton of data being promoted, and it's really slow. A really useful tool to understand what's going on with garbage collection, and to do some profiling on it, is jstat. jstat has a flag called -gcutil; you can pass it a process ID and an interval and count, and it will actually show you what's going on the whole way. This is also available via JVisualVM, but I personally prefer the command line.
You can take a look at the survivor spaces on the left, then Eden, old gen, and perm gen, and then there are counts and times for what's going on with garbage collection. Really, really useful. Other things that are really useful: OpsCenter will actually show you garbage collection stats, and you can see correlations between GC spikes and read/write latency. You can also turn on garbage collection logging in Cassandra.
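For reference, a sketch of watching GC on the Cassandra JVM (the pgrep pattern is an assumption):

```bash
# per-generation heap utilization and GC counts/times,
# sampled every 1000 ms, 10 samples
jstat -gcutil "$(pgrep -f CassandraDaemon)" 1000 10
# S0/S1 = survivor spaces, E = Eden, O = old gen, P = perm gen;
# YGC/YGCT and FGC/FGCT are young/full GC counts and total times
```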
Rather than change something that will maybe make the problem go away for a little while, you want to understand what's broken, and that's why I recommend jstat. If you're hitting a problem on a Cassandra box and you're seeing not too much I/O, and your CPU usage isn't that bad, and your memory is not full, it can be really confusing; you're not going to understand where your bottleneck is. You may want to check garbage collection.
So basically, what you want to look out for is long, multi-second pauses. Those are caused by full GCs, and it basically means your old gen is filling up really fast, which means data is being promoted out of the new gen too soon. The other option is long minor GCs; that's when you've got a lot of objects being promoted into the old gen, and generally it means your new gen is too big. And it matters; it's a big, big deal.
This was our cluster at my last company, when we were trying to understand what was wrong. We did a bunch of JVM tuning, and this is what we got out of it; sometimes it's absolutely necessary. So what do you do when something's broken, right? You get that call. Well, this is where all this work that you've done along the way, getting familiar with these tools, pays off. The first thing you need to understand is: is your problem even Cassandra?
You need to check your metrics. You should have all of these in place already; I've given you the tools, so there's no reason not to have them. You could have nodes going up and down; for that, OpsCenter is going to be really useful. Look at your system metrics, and if you've got slow queries, you're going to find the bottleneck using histograms and figure it out.
If you've got disk issues, it might be compaction. So you just want to use all these tools that I've talked about in order to figure out what exactly is wrong with the system. If you put all these things in place ahead of time, this should be really easy, and basically you look awesome. You really don't have to deal with that many problems, and they won't take a long time to figure out. So that's actually it, and we're right on time.
A: We will be talking about Spark and Spark Streaming with Apache Kafka and Apache Cassandra, and then, especially if you are in Europe, make sure that you register for the Summit 2014 on December 3rd and 4th in London. There are a few free tickets left for the main conference day, and we are also having training the day before. Jon, I think you're heading over the pond, is that right?

B: Yes, I am.
B: OK, so if you've got a ton of tombstones: if you were to create 100 gigs of data and then delete all of that data, you would actually have the same amount of information, just as tombstones. Tombstones will be removed; there's a setting, gc_grace_seconds, and effectively what it says is that tombstones will exist for a certain amount of time. You can play with that to determine when tombstones become eligible for removal, but they only get removed through compaction.
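As a sketch, that setting is tunable per table from cqlsh (the table name and the one-day value are examples; the default is ten days):

```
cqlsh> ALTER TABLE my_keyspace.events WITH gc_grace_seconds = 86400;
-- caution: every node must be repaired more often than gc_grace_seconds,
-- or deleted data can come back to life
```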
B: I would say that that's pretty slow. Two milliseconds of latency for a SAN with SSDs is not very good; you're going to get sub-millisecond latency with local SSDs. We're talking 10 or 100 times faster than that, so, worst-case scenario, local disk is 10 times faster, and a SAN is pretty bad compared to that. And remember, a SAN is really expensive; you're not just paying for the drives in it.
A: It's funny, this is a topic we feel very passionately about; basically, it's an anti-pattern for Cassandra. So if you need more information on that, please reach out to community@datastax.com and we will reach out to help you. Next question: does Cassandra treat double zero values as null, Jon?
B: Some of them, like OpsCenter, are per cluster; something like Nagios is per cluster too. The command-line tools that I was looking at are per server, so iostat, vmstat, dstat, tcpdump: those are per server.
A: Great. Here's a question I can answer: we post the recordings and the slides to PlanetCassandra.org, usually within 24 hours of the webinar. We've had several requests for the slides and the recording, so they'll be on there. And apologies for the mix-up for those who did not receive the correct password; you missed a few minutes at the beginning. Sorry about that. Okay, Jon: does proxyhistograms report client request times or local disk time?
B: It's the full time for the request. It's not necessarily the client time: if there's, say, 100 milliseconds of latency between the client and the coordinator, it wouldn't report that. But it reports the total time from the start of the request until the end, so if you've requested something at consistency level QUORUM or ALL, it will actually take that time into account. If you want client-level times, I didn't have time to include this in the webinar, but I would include
something like a query timer, and if you exceed a certain amount of time, I would send that into Logstash or into whatever log aggregation tool you want. That's one of the uses for those logging tools I talked about: you can log individual slow queries to figure out what's going on with them.
A: So this question asks which of these diagnostic tools can be used to diagnose after the fact.
B: That basically means that you have to have been recording everything along the way. That's why I like OpsCenter; that's one of the reasons why I like these tools, because if you're recording all these metrics, ideally you should be able to do that. The command-line tools are definitely all about looking at a system at a given moment, trying to figure out what's going on right now. There's another tool, which I didn't cover today, that's really good for recording system metrics. But ideally you should be recording all of your JMX metrics using that jmxtrans tool, and you should be recording all of your system metrics, and if you're doing log aggregation, then you should be able to piece together all this stuff to figure out what happened after the fact. So if you had a problem with a machine at 3:00 in the morning and you're going to take a look at it the next day, ideally you want that information already there. It's a combination of all the tools together.
B: It depends on your application, honestly. I personally have seen like 20 milliseconds in there, and I kind of got nervous about that. The problem is that during those minor GCs, if you have a query that started before the pause, and you hit a 20-millisecond pause, you can end up doing, let's say, 2 milliseconds' worth of work, one millisecond before the GC and one after, and now your query just took 22 milliseconds. Is that a problem in your application?
B: I don't know anyone using it in production yet, so I would personally probably not push it out; I would wait a few more bug-fix releases before I rolled it out. I would just try it on a local dev cluster to see if it's even applicable for you, and the nice thing is that you can change a table later: you can change the compaction strategy, and it's totally fine.
A: So we have a lot more questions left and we're not going to get through them, because we've got one minute here. But on PlanetCassandra.org there's a way to book office hours, which are meetings with the evangelist team, Jon and others. So if you have a burning question that we haven't got to this morning, please feel free to reach out.