From YouTube: Pythian: Monitor Everything!
Description
Speaker: Chris Lohfink, Engineer at Pythian
This session walks through the key metrics critical to operating a Cassandra cluster effectively. Without context, the metrics are just pretty graphs. With context, we have a powerful tool to determine problems before they happen and to debug production issues more quickly.
I'm a senior engineer at Pythian, where I lead the Cassandra practice. Like a lot of people at Pythian, I work remotely; I'm based out of Minnesota.
I like doing software development. Like a lot of people here, I do Java, Clojure, and Python in particular, but the language isn't so much what's important; it's just going out there and playing with it. I like big data.
I'd say I'm one of those guys who enjoys having large data sets, and the algorithms and data structures involved in doing analytics and statistics over them. And I like to set my house on fire and electrocute myself; you know, hobbyist electronics. So, Pythian is a data outsourcing and consulting firm, and very ops-focused.
So here, let's talk about Cassandra from an operations perspective. One of the features that is most loved about it is the fault tolerance. This matters when you get that phone call at 3am, particularly if you don't have appropriate escalations or multiple people to handle it. It's really nice that if my phone dies and no one's there to take care of it, the system is going to keep running, even though it failed at 3am, until I wake up in the morning and see all the red alarms.
So that's really great, but it's then really easy to forget about Cassandra, because you won't necessarily notice when things start going wrong; little hiccups and such can be easily glazed over. Maybe one of the nodes will go down for a minute or two, but it'll come back up as things get queued up, and it'll just keep running, which is great, and it gives you a nice buffer.
What we really want to do is utilize this buffer, this time where things can start going wrong before anything actually breaks. We do this in two different ways: we can be both proactive and reactive. Proactive being your daily and weekly checkups, and this is something people should really be doing: you just go look at the metrics and see how things are going, get a standard health check. This helps with predicting capacity issues.
So if, for example, there's a CQL collection or something in your data model that's slowly growing over time, and you're not capping it but are still doing reads on it, you could eventually end up having really bad memory issues and garbage collection. If we find any data modeling issues like that, we can address them before they become a problem, as opposed to waiting for the actual crash. But no matter how hard you try, things are going to go down. You're going to have hardware failures, sometimes catastrophic ones.
Some person drives their car into a transformer at the Amazon region and takes the whole thing down: it happens. Bugs in Cassandra: it happens. And sometimes you have users who may use you in ways that your data model doesn't actually support, either on purpose or not so much. This is ultimately where you have your alarms, your metrics, and PagerDuty. I saw a couple people from PagerDuty walk in, so you guys are awesome. Thank you.
Having appropriate escalations is really important, but for both of these, really, what you need is data. You need metrics. You need to be able to form the alerts, to be able to see trending, to be able to debug problems after they've happened. There's a quote from Coda Hale: you need to bridge the gap between how you think the application is running and how it's actually running. Metrics are really the window into the application.
This is how you see how things work. And there's probably at least one person who came to this talk expecting that picture, so I threw it in somewhere.

There are a lot of metrics; I'm kind of breaking them down here between the Cassandra metrics and the environmental metrics, so that way everyone's happy. But you really need to understand what they mean, so as I go through this talk, I'm going to try to provide some context and explain them. I can't go into too much detail just because of the time limits, so I'm going to give a really high-level overview of a lot of the subsystems. So, JMX: a lot of people here are probably already pretty familiar with it.
It's pretty standard with Java, and it's pretty complex; there's a lot to it, but at this stage we're going to consider it objects with attributes and operations. A lot of Java applications, Cassandra included, use it pretty extensively for monitoring and user input. It's pretty annoying, and it's very slow. It requires Java to access; you can use things like Jolokia to access it through other languages, but you still need that Java wrapper.
It's had memory leaks in some versions of the JVM, and it's been pretty frustrating for an operations team in general, primarily because of this mechanism JMX uses where, when you make a connection to the JMX port, it actually replies back with a different hostname and port that you then reconnect to. And of course, initially that second port was random, which makes it virtually impossible to set up a firewall and still have this work. In the more recent versions of the JVM,
there is an option you can use to set the port that the second connection uses, and it can actually reuse the port that the initial JMX connection used. This is configured by default in Cassandra after 2.0.8. That's a newer version, so some people haven't gotten to it yet, but you can set that attribute yourself if you're using a newer Java 7 JVM. There are a lot of ways to access JMX. Visually, you have JConsole and VisualVM.
These come with the JDK, so a lot of people will already have them on their computer, and provided the firewall issues aren't causing a problem, you can just connect right from your system. But ultimately, I think it's really important to become familiar with the command-line tools. jmxterm is a great one; that way you can just SSH to one of your systems and quickly poke at something that isn't exposed elsewhere. There's also MX4J and Jolokia, which are pretty great
if you don't want to use Java at all, because they provide SOAP and REST wrappers for your JMX interface. So JMX looks kind of like this: there are beans, objects which have a domain and then a series of key-value attributes. Here we have an example: we have com.pythian and then just a set of attributes to narrow down what that bean is for. So it looks hierarchical, but only this first level, the domain, really is. Cassandra originally had four different domains: db, internal, net, and request. These are still there, but they're effectively deprecated now.
The first attribute is a type, and there are a lot of them; I'm not going to walk through them all here. After type, the majority of these beans are going to have a scope and a name, though the scope may or may not be there. Three special cases are the thread pools, which have a path; column family, which has the keyspace; and keyspace, which has the keyspace but no scope.
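As a concrete illustration of reading one of these beans programmatically, here's a minimal sketch that connects to a node's JMX port and pulls a couple of attributes off a ClientRequest latency bean from the newer org.apache.cassandra.metrics domain. The host, port, and attribute names are assumptions based on Cassandra's default JMX port (7199) and the attribute names the Metrics JMX reporter typically exposes:

```java
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class JmxPeek {
    public static void main(String[] args) throws Exception {
        // Cassandra's default JMX port is 7199; localhost is an assumption here
        JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://localhost:7199/jmxrmi");
        try (JMXConnector connector = JMXConnectorFactory.connect(url)) {
            MBeanServerConnection mbs = connector.getMBeanServerConnection();
            // A coordinator-level read latency bean: domain, type, scope, name
            ObjectName bean = new ObjectName(
                    "org.apache.cassandra.metrics:type=ClientRequest,scope=Read,name=Latency");
            System.out.println("read count: " + mbs.getAttribute(bean, "Count"));
            System.out.println("p99:        " + mbs.getAttribute(bean, "99thPercentile"));
        }
    }
}
```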
All these metrics come from Metrics, which is a toolkit made by Coda Hale at Yammer, and it's pretty great, actually; I'm a fan of it. It's really easy to use. There was a project I worked on where we had a total of about fifty metrics in the entire application, but then we installed this and started adding things, some dynamic and such, and within three months we had
thousands and thousands of metrics that we were able to collect, and then it just becomes interesting trying to store and keep all of those. If you're familiar with Java, it's on GitHub and it's pretty easy to look at, so you can just open it up and it gives you a good understanding of how it works. Just open it up and look at the source code; I'd highly recommend it. It's really popular and used in a lot of projects.
So in Metrics there's a bunch of different types, and Cassandra uses pretty much all of them. The first and simplest is a gauge, which is just a value: it can be a string, an array of strings, an integer, whatever. A counter is something that's incremented or decremented; pretty self-explanatory. A meter is just the rate of things, so you have the number of requests per second or the number of requests per minute.
It all depends on your units, and it keeps one-, five-, and fifteen-minute moving averages. There's a histogram, which gets a little more interesting, because this is where we have the statistical distribution of your data. It keeps a bunch of percentiles, and then the mean, standard deviation, and all those. What this is for is more like if you have something like the payload of a request and you want to keep track of how big those payloads are. If you just keep the max, min, and average, you can end up with something where the max is two megs, which can throw off the average a lot, when in reality the 99th percentile is 100 bytes and you just have one huge outlier thrown in. So having the statistical distribution is really helpful to understand outliers and how the data actually looks. And then there's a timer, which is one of the really common ones, and that's just a combination of a meter of the events of whatever is happening and a histogram of the duration each one took.
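To make those five types concrete, here's a minimal sketch against the Dropwizard Metrics 3.x API. This is the library Coda Hale's toolkit grew into; the Cassandra-era versions lived under com.yammer.metrics with slightly different class names, so treat this as illustrative of the concepts rather than exactly what Cassandra does internally:

```java
import com.codahale.metrics.Counter;
import com.codahale.metrics.Gauge;
import com.codahale.metrics.Histogram;
import com.codahale.metrics.Meter;
import com.codahale.metrics.MetricRegistry;
import com.codahale.metrics.Timer;
import java.util.concurrent.TimeUnit;

public class MetricTypes {
    public static void main(String[] args) throws Exception {
        MetricRegistry registry = new MetricRegistry();

        // Gauge: just a value, read on demand
        registry.register("queue.depth", (Gauge<Integer>) () -> 42);

        // Counter: incremented or decremented
        Counter pending = registry.counter("pending-tasks");
        pending.inc();

        // Meter: rate of events, with 1/5/15-minute moving averages
        Meter requests = registry.meter("requests");
        requests.mark();

        // Histogram: statistical distribution (percentiles, mean, stddev)
        Histogram payloadSizes = registry.histogram("payload-sizes");
        payloadSizes.update(100);

        // Timer: a meter of events plus a histogram of their durations
        Timer writes = registry.timer("writes");
        try (Timer.Context ctx = writes.time()) {
            TimeUnit.MILLISECONDS.sleep(5); // simulated work being timed
        }

        System.out.printf("p99 payload: %.1f bytes%n",
                payloadSizes.getSnapshot().get99thPercentile());
    }
}
```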
Here's an example of how one would look inside of JConsole. We have a histogram of the write requests from the coordinator's perspective, and one of the nice things is that it includes the units inside of it. So I'm able to read this and say that, well, the 75th percentile is 683 and the latency unit is microseconds, so I know that 75% of writes took 683 microseconds or less. The same goes for the meter side, where I can say there have been thirteen thousand calls per
second; I'm able to just read that right off there, including all the units. So this can be a little overwhelming: there are a lot of attributes, a lot of operations, a lot there, and there isn't really any documentation to explain them. You ultimately end up having to go to the source code to figure it out, and even then, between versions, they move, they change, they get renamed.
So it's really hard to follow, and this is where there's this really great tool, nodetool, which I'm sure everyone in this room has used, and all it is is a command-line wrapper around JMX. Similar to JMX, it has a lot of options, and I can't go through them all here, but I'm going to go through some of the ones that I think are the most important from a monitoring standpoint. tpstats: this is the thread pool statistics.
What the thread pools are in Cassandra: Cassandra is based off of a staged event-driven architecture. This means that it takes a bunch of common tasks and breaks them into thread pools, and then it just throws a queue in front of each one, so each one can take a task and pass it on to the next. This is kind of a simplification of the process, but I think it's a decent one. So let's say we have a read request come in on node 1.
So that's a simplification, but it's how things work. Then, optionally, that request/response stage may randomly (it's ten percent by default) kick off a read repair. When it does, it'll create another task and push it onto its stage. Now, that's interesting because you're outside of the feedback loop of the actual request being made, so those read repairs can potentially end up taking longer than the requests themselves are taking.
The pending is how many are in that queue in front of the thread pool, and the completed is how many tasks it has completed. Blocked is where it gets interesting: you shouldn't see many of these get blocked. In particular, the FlushWriter and the ReplicateOnWrite might in 1.2 and 2.0 and below, but for most of the others you shouldn't see them blocked.
Blocking happens when there's a limit to how deep that queue can go, and when that limit gets reached, it will actually block requests from putting anything more on the queue, so it blocks the caller. That's a really bad thing and you don't want it to happen. And even if you're polling and you still missed the moment when that blocking happened, you'd be able to see the all-time blocked counter increment, so you wouldn't miss any.
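As a rough sketch of what one of these stages looks like, here's a version built from plain java.util.concurrent primitives rather than Cassandra's actual internal classes: a bounded queue in front of a small pool, with a rejection handler that blocks the caller when the queue fills, which is roughly the condition tpstats surfaces as "blocked":

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class StageSketch {
    public static void main(String[] args) throws InterruptedException {
        // A "stage": a small fixed pool with a bounded queue in front of it.
        // The rejection handler makes the *caller* wait for room in the
        // queue, so a full queue stalls whoever is submitting work.
        ThreadPoolExecutor mutationStage = new ThreadPoolExecutor(
                4, 4, 60, TimeUnit.SECONDS,
                new ArrayBlockingQueue<>(1024),
                (task, executor) -> {
                    try {
                        executor.getQueue().put(task); // block until there is room
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                });

        mutationStage.execute(() -> System.out.println("apply mutation"));

        // Rough analogues of the tpstats columns:
        System.out.println("active:  " + mutationStage.getActiveCount());
        System.out.println("pending: " + mutationStage.getQueue().size());
        mutationStage.shutdown();
        mutationStage.awaitTermination(5, TimeUnit.SECONDS);
        System.out.println("completed: " + mutationStage.getCompletedTaskCount());
    }
}
```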
There's another section underneath that: the dropped messages.
When a task is originally created, it takes a timestamp, and there's a timeout associated with it. So when a read happens, if by the time the task gets to one of the stages the read timeout has already passed since it was created, it's not going to process it; it's just going to throw it away, because we would have already returned to the client saying we had a timeout exception, so we don't bother wasting any CPU on it.
There's a lot more to this, and I don't have time to go through all the different thread pools, their limits, and what they can mean, but there's a blog post that walks through them all. These slides should become available, so if you're interested, you can read it there.
Okay, so nodetool cfhistograms, or column family histograms. Within a column family there are going to be a lot of statistics, and some of those are histograms; this is going to print those out. Column family is just the old name for a table; they kind of renamed them recently. You do need to specify which keyspace and table you're looking at.
Hopefully you've had some exposure to the read and write path, but just in case you haven't, I'm going to give a very high-level overview of it here. When a write comes in, it's going to write to an in-memory table; when reads happen, they check that in-memory table and any SSTables on disk. When the writes stack up enough and the memtable gets large enough, it flushes to disk and creates another SSTable.
So then reads have to access all those SSTables continually, which can become a problem, because that can mean a lot of disk seeks. You can avoid reading the ones that don't have the data you're interested in by putting a Bloom filter in front of each of them, which will basically say, when you're doing a read, "the data you're looking for isn't here, so don't bother checking." But it still gets bad, so there's a periodic task that comes through, called a compaction, that merges the SSTables together.
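The Bloom filter idea is easy to play with outside Cassandra. Here's a minimal conceptual sketch using Guava's BloomFilter, which is an assumption for illustration and not the implementation Cassandra uses internally: a "no" answer is definitive, so the read can skip that SSTable entirely, while a "yes" may still be a false positive:

```java
import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;
import java.nio.charset.StandardCharsets;

public class BloomSketch {
    public static void main(String[] args) {
        // One filter per SSTable; 1% false positives is in the ballpark of
        // Cassandra's size-tiered default (bloom_filter_fp_chance = 0.01)
        BloomFilter<String> sstableFilter = BloomFilter.create(
                Funnels.stringFunnel(StandardCharsets.UTF_8), 1_000_000, 0.01);

        sstableFilter.put("partition-key-1");

        // Ask the filter before seeking into the SSTable:
        if (sstableFilter.mightContain("partition-key-2")) {
            System.out.println("might be there: do the disk read");
        } else {
            System.out.println("definitely not there: skip this SSTable");
        }
    }
}
```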
So when you're looking at cfhistograms, what you're looking at at the top, SSTables per read, is how many of those SSTables are touched in a read. That's really important, because especially when you're on spindles, that's going to be pretty expensive. Usually your read latency is going to look pretty similar to your SSTables per read, and then the write latency is how long it takes to basically write to that memtable, which shouldn't take long at all.
There are cases where it can take longer, particularly during a flush. The interesting thing is how this looks: here I'm saying that ninety-eight thousand reads went to one SSTable, or that four thousand reads went to two SSTables. Now, this is a little different from how they used to look, for people who are familiar with the old style of cfhistograms, which provides a lot of information but isn't easy to read. In that case
A
We're
saying
that
from
the
SS
tables
perspective
that
three
thousand
reads
looked
at
one
ss
table
and
or
that
even
hard
look
reading
it
now
so
or
ten
rights
took
60
microseconds.
So
it's
it's
it's
pretty
hard
to
read,
which
is
why
a
lot
of
times
it's
a
good
idea
to
then
do
something
like
take
a
use,
Python
or
something
and
Matt
plot
lib
and
just
read
that
ring
and
write
them
out
as
a
as
a
bar
chart.
That way you can read them a little easier, and here you can see it makes a little more sense: at, you know, 24 microseconds, there were seventy thousand requests. It's a little easier to read that in a graph format, so I would recommend doing that. I included a little link to a gist at the bottom there; once again, you can get these slides later and hopefully grab that. It's just a modified version of something someone else wrote, but I kind of made it work.
So this is a lot more convenient to look at. Something of interest here: the min and max are of all time, but the rest of the percentiles are actually of the last five minutes. In particular, it's a forward-decaying priority reservoir, which doesn't necessarily mean it's exactly the last five minutes, but it means that it's exponentially weighted toward data that has entered in the last five minutes.
So when you're looking at this, you're essentially looking at the last five minutes. This is drastically different from the old style, still current in 2.0 and 1.2 and previous, where every time you called cfhistograms it reset. That's great for benchmarks, but it's really inconvenient for operations teams and people debugging, because one person logs in and looks at it, it resets, and everything's back to zero. So then the next person goes and looks at it, and it looks like, oh, everything's awesome.
So for this style, just from being able to debug and diagnose, it's a little better. But there are a lot more statistics for each table, and that's where cfstats comes in, the column family statistics. You can give an option specifying a keyspace, or a keyspace and a table, or you can do the -i to exclude one, and in particular it's nice to do that
with the system keyspace: you do a -i system and it will remove it, because you never really care about the system statistics; usually, well, usually you don't. At the first indentation level there are going to be a couple of keyspace-scoped metrics. These are really just the sum and the average of the individual table metrics, and it's really not that useful to look at averages or sums, so usually just skip by those.
Each table is going to be named first, and then it goes into the different metrics for it. So in this case we have three SSTables total. Usually when you see this, it's going to pretty much correlate to your SSTables per read: you're usually going to see three at the highest, so that's going to be like your upper bound, particularly with size-tiered compaction. But with leveled compaction
it's going to look a little different. You're going to get an extra line there that tells you how many SSTables exist at each level. Your reads should only touch one SSTable at each level if everything is working correctly, but that's not always the case. This is kind of how it would look
in that scenario: L0, the first level, actually uses size-tiered compaction, and the /4 is basically saying that four is the threshold before it would do a compaction. So it should be four at the most, but since it's 14, that means compaction isn't keeping up, which is bad. Moving on, you'll then see a batch of information about the size.
That usually isn't particularly interesting, but you know, it's good to know and good to see. The memtable section, where those writes are going to the in-memory table, shows how many columns, how many cells, exist in it. The data size is an estimate, but it's an estimate of how much space in memory the memtable is taking up, including the JVM overhead. And the switch count is how often this memtable has been flushed to disk as an SSTable.
The local read and write latencies and counts are the amount of time and how many times reads and writes are taken in the local case. This does not include going to the other replicas and such; it's just how long it takes to get the data off disk and look at it, so these should usually be a lot smaller than what your application sees. Pending tasks is what I like to consider the most useless metric here.
It's actually the number of mutations that are backed up on the switch lock during a flush, basically, and you don't really need to know what that means. It doesn't mean much, because in 2.1 it doesn't even exist, so I would just never pay attention to that one. There's a bunch of statistics on the bloom filters; in particular, the one I think is the most useful to look at is the amount of space the bloom filter takes up, because sometimes you can sacrifice a bit on the read side, accepting a higher false-positive chance, to keep it smaller.
It all depends on what you can accept, but ultimately, if that gets too large: even though the bloom filters are now kept off-heap, they still need to be read. When you're doing a read, these bloom filters need to be in memory, so if the OS is paging them off to disk, then during a read they can be pretty expensive. You want that to be a reasonable number based on how much memory your system has.
You should keep that below the thousands if possible, but there are scenarios where it's acceptable to be larger. Oops, wrong direction. The tombstones count is how many tombstones get scanned during a read. This in particular will go high if you are, for example, using Cassandra as a queue: you're going to end up deleting things on a partition as you're adding things, and this will get very, very large over the period of gc_grace_seconds. You don't want this in the thousands; if this is in the thousands, you're doing something wrong.
I've been mentioning a lot about the column family read and write latencies, and that is useful, but really, what you care about most of the time is how long it takes for your application's reads and writes to complete, and that's where proxyhistograms comes in: instead of just the local time it takes to insert the mutation into the memtable or read the data off disk, it's the latency of the full request from the coordinator's perspective.
If you want to get at these metrics but you don't want to just be polling nodetool or polling JMX, the Metrics library has a nice interface provided for you where it can actually push the metrics out to whatever you're using. By default, Metrics comes with JMX, console, CSV, and SLF4J reporters; the Metrics library also has Ganglia and Graphite reporters, but they're not compiled in.
So if you want those, you have to include the JAR in your classpath, but they are maintained along with the rest of the Metrics library. There are a lot of community reporters as well; in fact, there are probably more than I could list, and they're also really easy to create. So if you want, you can just build your own and have it push to something. If you have New Relic or something and you want your Cassandra metrics to go to it, it's pretty easy to set that up.
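For example, wiring up the Graphite reporter by hand looks roughly like this with the Metrics 3.x API; the host, port, and prefix here are placeholder assumptions to point at your own Graphite carbon listener:

```java
import com.codahale.metrics.MetricRegistry;
import com.codahale.metrics.graphite.Graphite;
import com.codahale.metrics.graphite.GraphiteReporter;
import java.net.InetSocketAddress;
import java.util.concurrent.TimeUnit;

public class GraphitePush {
    public static void main(String[] args) {
        MetricRegistry registry = new MetricRegistry();

        // Placeholder host/port: point these at your Graphite carbon listener
        Graphite graphite = new Graphite(
                new InetSocketAddress("graphite.example.com", 2003));

        GraphiteReporter reporter = GraphiteReporter.forRegistry(registry)
                .prefixedWith("cassandra.node1")           // namespace this node's metrics
                .convertRatesTo(TimeUnit.SECONDS)
                .convertDurationsTo(TimeUnit.MICROSECONDS)
                .build(graphite);

        reporter.start(1, TimeUnit.MINUTES);               // push once a minute
    }
}
```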
It's easier with console, CSV, Ganglia, or Graphite, because there's just a YAML file in which you can configure the metrics reporting. Unfortunately, if you're not using one of those reporting interfaces, you actually have to create a Java agent that runs as a premain and sets up the reporter. This is how things were done previous to 1.2, and you still have to do it if you want to use one of those specialty reporters,
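as shown in the sketch below. This is a minimal sketch of what such an agent might look like, assuming the yammer 2.x library that Cassandra 1.2-era versions shipped (where the registry is a static singleton, so a reporter enabled in premain picks up Cassandra's metrics as they are registered); it would be packaged in a JAR with a Premain-Class manifest entry and loaded with -javaagent:

```java
import com.yammer.metrics.Metrics;
import com.yammer.metrics.reporting.ConsoleReporter;
import java.lang.instrument.Instrumentation;
import java.util.concurrent.TimeUnit;

public class ReporterAgent {
    // Invoked before Cassandra's own main() when the JVM is started with
    // -javaagent:reporter-agent.jar (plus a Premain-Class manifest entry).
    public static void premain(String agentArgs, Instrumentation inst) {
        // Cassandra 1.2-era metrics land in the yammer 2.x default registry,
        // so a reporter enabled here sees them as they are created.
        // Swap in whichever reporter you actually want to push to.
        ConsoleReporter.enable(Metrics.defaultRegistry(), 60, TimeUnit.SECONDS);
    }
}
```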
Something that I think anyone who's worked with Java has experienced is garbage collection. I in particular love it; I think it's really fun. I like tuning it, I love the metrics, I love how complex it is; I think it's interesting. Unfortunately, it's a whole other talk in itself, but if you're interested, come talk to me sometime and I'd love to rant. So something you should do with Cassandra,
that everyone should do with Cassandra, is enable the GC logging. There's virtually no overhead to it, and it provides a lot of information; there's no reason not to do it. In the cassandra-env.sh file, you just have to uncomment a bunch of lines at the end that set all these lovely garbage collection flags, things like -XX:+PrintGCDetails, -XX:+PrintGCDateStamps, and the -Xloggc: log destination. There's one exception to that, I would say: the FLS statistics line (-XX:PrintFLSStatistics=1) I would leave commented out,
because most tools that parse GC logs can't handle that output: if you take GC logs from Cassandra with it enabled and try to load them in, they're just going to crash and die. It's useful if you're looking at the logs by hand or building your own custom parser, but from the perspective of having nice tooling to look at them, it's not worth it, so I would actually recommend not including it. As I mentioned, there's a lot to garbage collection, and I
definitely don't have the time left here to talk about it, but I would recommend grabbing one of those tools like GCViewer and being able to just open up the logs periodically and look at them. Ultimately, though, I think if you want to get really serious about it, you should use Python or R or whatever statistics package you like, and analyze the logs yourself. So, there are a couple of logs in Cassandra, and I think it's really important to do log rotation.
It's also kind of bad because there are some exceptions that will not be logged in the system log. In particular, if an uncaught exception gets propagated to the top of a thread pool, it's just going to dump the stack trace out to standard error, and it's not going to be included in the system log. So in those scenarios it's still good to have the output log captured, or at least, maybe before you restart, create a backup, so that you're still able to look at it.
There are a lot of system logs you should also be monitoring, just from a standard Linux perspective: syslog, dmesg device messages, and such. There's a lot to monitoring the kernel, which is itself another talk, and in fact I'll make a recommendation here: Brendan Gregg has a great set of talks and explanations on what you should use to monitor which aspects of the kernel.
The JVM has a lot to monitor as well; in particular, the two things you should be watching are the heap and the threads. For the heap, you're already capturing the garbage collection logs, hopefully, from the previous slide, but there's also the possibility to do things like take heap dumps in case you're having high memory pressure and want to analyze it on a deeper level. JMX provides a way to just trigger a dump, or you can use jmap to do it as well.
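For instance, triggering a heap dump through JMX might look like this: the HotSpot diagnostic MBean is reachable remotely under com.sun.management:type=HotSpotDiagnostic over Cassandra's JMX port, and the dump path here is a placeholder. Note the .hprof file is written on the node's own filesystem, not where this client runs:

```java
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class TriggerHeapDump {
    public static void main(String[] args) throws Exception {
        // Connect to the node's JMX port (7199 is Cassandra's default)
        JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://localhost:7199/jmxrmi");
        try (JMXConnector connector = JMXConnectorFactory.connect(url)) {
            MBeanServerConnection mbs = connector.getMBeanServerConnection();
            // Invoke dumpHeap on the HotSpot diagnostic bean;
            // "true" means live objects only (forces a full GC first)
            mbs.invoke(new ObjectName("com.sun.management:type=HotSpotDiagnostic"),
                    "dumpHeap",
                    new Object[] { "/tmp/cassandra-heap.hprof", true },
                    new String[] { "java.lang.String", "boolean" });
        }
    }
}
```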
There can be a lot of buildup of threads, so it's kind of useful to look at the thread count in JMX, just to make sure you're not exceeding any limits on your kernel or anything. But it's also nice to look at what the threads are doing, and you can do that by calling kill -3 on the process, which dumps all the thread stack traces to standard out.