From YouTube: PagerDuty: One Year of Cassandra Failures
Description
Speaker: Donny Nadolny, Scala Developer
Despite being a highly available system, we have had three outages caused by problems with our production Cassandra clusters over the past year. We'll take a look at each of these outages: what we saw from the inside, the actions we took to recover, and most importantly the procedures and monitoring that will help prevent it from happening to you.
So, distributed systems can be really complicated. There's a whole bunch of problems that can happen, and you can get the unintuitive result that what should be a highly available system actually has lower availability in practice than a single-machine server, just because of all the complicated things that can go on. Now, one of the things that I think can help improve that is talking through failures like these.
What I'm going to talk about today is three different outages that PagerDuty has had with our production Cassandra cluster. If you don't know what PagerDuty does: our customers are mainly tech companies. They have a lot of different monitoring systems, and when those detect a problem, like a server going down or high latency, they'll send an event to us. We do a whole bunch of processing, and then we call or text or email you to let you know what the problem was, so that you can fix it quickly.
Now, this part in the middle is where we care about high availability a lot, because if your stuff is going down, you want to be able to rely on us to stay up. So in the middle here we have a whole bunch of different services, mainly using Cassandra and ZooKeeper, running in multiple data centers. And actually, one of the fairly uncommon parts about our Cassandra setup
is that we're running in multiple data centers and we're doing quorum reads and writes, where our quorum actually goes across the wide area network, with significant latency. A co-worker of mine is giving a talk tomorrow about some of the cool things that happen when you have that kind of setup, so check it out if you want.
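To make that setup concrete, here is a minimal sketch of the quorum arithmetic involved; the replication factor and replica placement below are illustrative assumptions, not PagerDuty's actual configuration:

```python
# Cassandra's QUORUM consistency level needs floor(RF / 2) + 1 replicas
# to acknowledge each read or write.
def quorum(replication_factor: int) -> int:
    return replication_factor // 2 + 1

# Assumed example: RF = 5 with only 2 replicas in the local data center.
rf = 5
local_replicas = 2
needed = quorum(rf)                         # 3
wan_acks = max(0, needed - local_replicas)  # 1
print(f"quorum = {needed}; at least {wan_acks} ack(s) must cross the WAN")
```

With a placement like that, every quorum operation waits on at least one round trip over the wide area network, which is where the significant latency comes from.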
So, the first outage I want to talk about is the one I have honestly nicknamed "the backlog".
First, a bit of background about our cluster. We have a shared cluster: all those arrows from the different services that were pointing at one cluster, that is actually just one cluster, spread across multiple data centers but still just one. And it's a fairly small cluster; we're talking low tens of gigs of data. Even more than that, with our usage pattern we tend to write a bit of stuff into Cassandra, and then pretty shortly after that we'll process it, maybe a few seconds or a few minutes later, and then we're done with it.
So in terms of the active data set, it's even smaller than that; it's more like tens or a few hundreds of megabytes. The first outage actually begins the day before, when we had some warning signs: a few small degradations going on. What triggered this was that cron kicked off a repair process on our Cassandra cluster, and it caused a fair bit of load. We were seeing high latency from the application, and what we ended up doing to resolve
B
This
was
disabling
thrift
on
some
of
the
nodes
and
turning
some
of
them
off
for
us
we're
still
on
the
old
thrift
interface.
For
so
disabling
thrift
is
forcing
our
application
to
use
a
different
note
for
the
coordinator,
and
by
doing
that,
we
can
kind
of
push
the
load
around
a
little
bit
kind
of
shuffle
it,
and
with
that
we
were
able
to
recover
somewhat.
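For reference, toggling thrift on a node is a one-line nodetool call; here is a small sketch of the kind of shuffling being described (the host name is a made-up placeholder):

```python
import subprocess

def set_thrift(host: str, enabled: bool) -> None:
    """Enable or disable the thrift RPC server on one node via nodetool."""
    action = "enablethrift" if enabled else "disablethrift"
    subprocess.run(["nodetool", "-h", host, action], check=True)

# Push coordinator traffic off a struggling node, then restore it later.
set_thrift("cass-node-3.example.com", enabled=False)  # hypothetical host
# ...watch the latency graphs; once the node looks healthy again...
set_thrift("cass-node-3.example.com", enabled=True)
```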
B
So
what
it
looked
like
was
we
have
a
read,
latency,
very,
very
low,
what
it
should
be
and
then
it
spikes
way
up,
and
then
we
use
these
techniques
of
kind
of
disabling
thrift,
disabling
nodes,
shuffling
around
the
load
and
then
it
kind
of
improved.
But
we
can't
leave
it
like
this
because
we
have
notes
turned
off
and
that
will
affect
our
availability.
B
B
At the end of this, we did have everything up, everything recovered, and we were pretty much okay. But the problem was that we had killed the repair process that had started this whole thing, and you need to do repairs fairly often on your cluster, otherwise you can have data coming back from the dead. So we had to actually do that repair again.
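The repair in question is the standard nodetool anti-entropy repair; a minimal sketch of what a scheduled run looks like (the host and keyspace names are assumptions):

```python
import subprocess

def repair(host: str, keyspace: str) -> None:
    """Run an anti-entropy repair for one keyspace on one node."""
    subprocess.run(["nodetool", "-h", host, "repair", keyspace], check=True)

# Each node should be repaired within gc_grace_seconds (10 days by
# default), or tombstones can expire before being propagated and
# deleted data can "come back from the dead".
repair("cass-node-1.example.com", "events")  # hypothetical host/keyspace
```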
So our plan was to get a bunch of people around to keep an eye on the cluster, trigger that repair, and hopefully be able to react really quickly if anything went wrong. And as I mentioned before, this was a shared cluster that all of our services were using, so we were able to turn off a couple of non-critical services to decrease the load a little bit, and our plan was to use the same strategies we had used the day before if things went wrong. Well, the problem was that when we manually triggered this repair, what we didn't realize was that we had cron set up on that machine to trigger a repair on a different keyspace.
You can see almost a little bit of recovery in the middle; I'm not sure if that's actually real or just a blip, but it gave us a bit of false hope. And if you want to see what an unhealthy cluster looks like, by the way, we have lots of graphs of that here. We have the read stage: when you're doing a read in Cassandra, it has a bunch of internal queues that things go through, and we can see here a growing backlog and a kind of very unhealthy queue,
pegged at a level where we're not really doing any good. This went on for around two hours, and unlike the last outage, we didn't really have any recovery periods here; we just had more and more of a backlog. So we were struggling, trying to figure out what to do; what we were doing before wasn't working. Eventually, what we concluded was this: all the data that we have in our cluster is basically in-flight data.
B
So
it's
not
a
real
user
data,
it's
just
kind
of
things
that
we
put
in
there
that
should
be
handled
quickly,
so
we
had
an
option
which
was
to
delete
all
of
our
Cassandra
data
and
bring
up
the
cluster
to
recover
it,
and
so
we
really
very
quickly
tried
this
out
in
our
low
test
environment
to
make
sure
that
we
actually
could
do
it
and,
in
the
meantime,
try
to
fix
it.
We
couldn't
fix
it.
So
that's
what
we
ended
up
doing.
B
B
After this, we did a ton of investigation into what went wrong. Now, in kind of a very selfish way, I'm just a tiny bit happy this happened, because I got to spend a lot of time learning about Cassandra, figuring out what went on, learning about the internals. So the benefit from this is that a lot of people at the company learned a lot about Cassandra. What we kept on trying to find during this was: what were the leading indicators?
B
What
would
have
told
us
that
this
that
things
were
going
wrong
because,
while
the
average
was
going
to
on
it
was
really
clear
that
things
were
bad
but
leading
up
to
it?
A
lot
of
the
metrics
that
we
were
looking
at
didn't
really
show
that.
So
what
we
found
was
that
we
needed
to
put
more
effort
into
looking
at
the
Cassandra
specific
metrics.
We
were
looking
at
mainly
host
level
metrics,
like
CPU
network.
That
kind
of
thing-
and
those
were
fine,
almost
all
the
time
or
right
up
until
you
actually
have
an
outage.
B
So
we
didn't
really
tell
you
anything.
The
Cassandra
metrics,
though
I'll
talk
a
little
bit
more
about
what
we
found
there
after
what
we
ended
up
concluding
from
this
was
basically
that
our
cluster
was
just
underpowered,
may
mean
CPU
during
normal
usage,
when
we
had
a
regular
request,
things
were
fine,
but
when
you
have
repairs
and
compaction
is
going
on
on
a
somewhat
growing
data
set,
it
was
just
too
much
and
we
were
on
to
core
machines
and
in
the
shared
clusters
it
was
just.
B
It
ended
up
kind
of
overwhelming
the
cluster
another
one
kind
of
a
specific
one
to
our
use
case.
But
we
did
a
lot
of
operations
that,
at
the
beginning,
you
kind
of
needed
a
whole
bunch
of
operations
in
a
row
to
succeed
before
you
could
do
other
work.
And
so
if
your
cluster
is
unhealthy
and
even
a
small
percentage
of
operations
are
failing,
that
whole
sequence
fails
and
then
you
can
do
basically
nothing
rather
than
doing
a
little
bit
of
work.
B
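A back-of-the-envelope sketch of why those chained operations hurt so much; the failure rate and chain length are arbitrary illustrative numbers:

```python
# If a task needs n Cassandra operations in a row to all succeed,
# a small per-operation failure rate compounds fast.
def chain_success(per_op_failure: float, n_ops: int) -> float:
    return (1 - per_op_failure) ** n_ops

print(chain_success(0.05, 20))  # ~0.36: at 5% failures, most
                                # 20-operation chains fail outright
```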
Some of the lessons that we learned from this first one: the big one was capacity planning. This was just an oversight on our part. We had what we thought was a fairly small amount of data, and we thought the cluster could handle it, so we weren't really paying attention to capacity planning, but we should have been. The other one is that we added a lot of Cassandra-specific monitoring: Cassandra exposes a lot of metrics through JMX, and we were actually collecting them, but we didn't have them on easy-to-use dashboards.
B
So
we
put
a
lot
of
work
into
that.
The
other
really
big
thing
that
we
were
in
from
this
was
more
isolation
as
good.
We
had
all
these
different
services
hitting
one
cluster
and
it
made
it
really
hard
during
the
outage
and
after
the
outage
to
figure
out
what
was
going
on
was
it
you
know
which
service
was
causing
the
load?
Was
it
in
Cassandra?
It
just
makes
it
really
tough
to
figure
out,
and
it
also
means
that
when
it
fails,
everything
fails
rather
than
having
just
some
things
fail.
B
Some
of
the
metrics
that
we
found
I
won't
go
over
all
them.
I'll
just
mention
probably
the
most
important
one
is
the
drops
messages
one.
This
one
is
Cassandra's
sign
that
it's
overloaded,
so
when
it
gets
a
request
and
it
can't
handle
it
in
time,
it
will
drop
that
message
and
record
it
in
a
metric
and
we
weren't
paying
attention
to
that.
But
that
would
shown
in
advance
that
our
cluster
was
becoming
overloaded.
B
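One way to watch this is the dropped-message table at the end of `nodetool tpstats` output; below is a rough sketch of scraping it (the two-column parse is a naive assumption, and the exact output format varies across Cassandra versions):

```python
import subprocess

def dropped_messages(host: str) -> dict:
    """Scrape per-type dropped message counts (MUTATION, READ, ...)."""
    out = subprocess.run(["nodetool", "-h", host, "tpstats"],
                         capture_output=True, text=True, check=True).stdout
    dropped = {}
    for line in out.splitlines():
        parts = line.split()
        # The dropped-message section prints "<TYPE> <count>" rows.
        if len(parts) == 2 and parts[1].isdigit():
            dropped[parts[0]] = int(parts[1])
    return dropped

# Nonzero, growing counts are the leading indicator described above.
print(dropped_messages("cass-node-1.example.com"))  # hypothetical host
```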
All right, now a completely different outage, unrelated to the first one. We had made some improvements since the last outage. One of the big ones was the isolation I mentioned before: we did a live migration of all of our services onto their own independent clusters, and we actually did this not just for Cassandra but for ZooKeeper as well. I have a ton of stories I can tell about that, but the short version is that isolation pays off there as well: even though ZooKeeper should be highly available,
B
There
are
lots
of
problems
where
it
might
not
be.
We
also
added
a
new
surface,
which
is
what
this
outage
is
about,
and
last
one
we
bumped
up
our
cassandra
version.
It's
a
little
bit
embarrassing
how
old
of
a
version
we
were
on
so
I'll.
Just
tell
you
the
version
that
we
went
to,
which
is
we
upgraded
to
version
1.2,
so
we
were
a
bit
old
at
the
time.
Actually
so
now
so
what
happened
with
this
new
service?
B
Was
we
needed
to
add
a
bit
of
data
and
have
it
wired
through
a
few
different
column
families?
So
we
did
our
our
schema
migration
on
our
cluster
and
then
we
did
a
deploy
to
actually
use
that
and
we
started
getting
a
few
errors
recorded
in
the
application.
This
invalid
request
exception
telling
us
also
what
what
key
space
in
kollam
family
had
problems.
B
So
we
immediately
checked
our
cassandra
danger,
metrics,
dashboard,
that
this
is
actually
a
real
name
for
it
too,
since
the
last
edge,
which
we
made
a
dashboard,
which
is
lots
of
different
metrics,
which
can
be
a
sign
that
Cassandra
either
is
overloaded
or
is
having
problems.
But
in
this
case
it
was
clear,
but
from
the
exception
before,
meaning
that
it
was
something
schemer
related.
So
we
ended
up
running
described
cluster
through
the
CLI
and
we
saw
this
output,
which
shows
us
the
problem.
B
Every
node
here
is
on
one
version
of
the
schema,
except
for
one,
which
is
on
some
other
version.
So
we
know
we
need
to
take
some
action
against
that.
No,
but
we're
trying
to
figure
out
what
to
do
and
the
solution
ends
up
being
turn
the
note
on
and
off
and
then
once
you
do,
that
everything
is
fine.
This
is
what
the
output
should
look
like
everything
on
one
line,
meaning
that
all
of
the
hosts
are
on
the
same
Cassandra
version.
So
at
this
point
we
haven't
really
had
an
outage.
B
We had a few errors, but with retries, all the tasks we were doing were successful, and it was fine for a couple of hours. And then we get this. This is the graph of outgoing notifications for PagerDuty, where everything is flowing along for a while, and then it drops off to nearly nothing. Oh, sorry: actually, what we saw first in the application was really high latency to Cassandra. It should be low, but it was spiking up to much, much higher than it should be.
B
Checking
the
Cassandra
danger
metrics
page
this
time
we
do
find
something
which
is
the
mutation
stage.
This
is
similar
to
the
reed
stage
I
showed
earlier,
but
this
is
for
right
operations
that
are
going
on
in
cluster.
It
should
be
basically
not
how
many
are
queued
up,
and
it
should
be
basically
nothing,
but
instead
one
host
goes
off
on
its
own
and
has
a
ton
of
operations
that
are
backing
up.
B
So
we
know
we
need
to
do
something
in
this
case,
we
immediately
disable
thrift
on
that
note
to
prevent
the
application
from
using
it
a
little
while
later
we
notice
that
we
have
a
repair
process
running,
and
so
we
kill
that
ensuring
we,
after
that,
we
just
kill
the
node
completely
and
that
ends
up
working
for
us.
This
is
how
many
operations
we
were
able
to
perform
on
our
cluster.
B
Now
the
cool
thing
about
what
we
did
before
was
that
there
ended
up
being
a
bit
of
a
gap
in
between
each
operation
that
we
did
so
zooming.
In
on
this
graph,
we
can
see
what
the
effect
of
each
individual
operation
was.
When
we
do
that,
we
see
that
first,
we
disable
thrift
and
then
immediately
everything
recovers.
So
the
other
actions
that
we
took
didn't
actually
fix
it.
It
was
only
disabling
thrift
that
fixed
it
now.
Disabling
thrift
is
just
forcing
our
application
to
not
use
that
note
as
coordinator.
B
B
So, and keep in mind that we had a fairly small cluster: even though only some of the operations were going to a bad node, it doesn't take a very large percentage for you to end up with a really high multiplier on how long your average request takes.
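A quick sketch of that multiplier effect; the node count and latencies are invented for illustration:

```python
# With coordinators chosen uniformly, 1/n of requests land on the bad
# node. One very slow node out of five dominates the average latency.
def avg_latency_ms(n_nodes: int, normal_ms: float, bad_ms: float) -> float:
    p_bad = 1.0 / n_nodes
    return p_bad * bad_ms + (1 - p_bad) * normal_ms

print(avg_latency_ms(5, 5.0, 500.0))  # 104.0 ms: a ~20x jump from one bad node
```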
Now, there's also the question of what happened to Cassandra. Given that Cassandra was acting up, we know why the other things went wrong, but as for the question of what happened to Cassandra itself, we don't really know. We do have some theories (some people who recognize the picture are laughing), but we don't have anything actually reproducible that we've found, at least not yet. Now, some of the lessons that we learned: one of the big ones is that we got some payoff from this isolation.
B
We
had
problems
with
our
cluster,
but
it
only
affected
one
service
now,
because
I
services
kind
of
form
a
pipeline
where
you
need
all
the
ones
in
the
chain
to
work.
It
did
still
cause
notifications
to
be
delayed,
but
it
meant
that
all
the
requests
from
our
clients
coming
in
they
could
just
be
queued
up,
and
it
would
just
happen
a
little
bit
later,
rather
than
being
dropped
on
the
floor.
B
We
also
learned
how
we
should
be
doing
schema
changes,
which
is
you
do
describe
cluster
make
sure
everything
looks
good,
run
the
schema
change
for
one
column,
family,
and
then
you
describe
cluster
at
the
end
to
verify
that
everything
actually
went.
Okay,
I'm
even
added
a
bit
of
monitoring
for
this
schema
disagreement
to
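A minimal sketch of such a check; this uses `nodetool describecluster` (the newer equivalent of the old cassandra-cli "describe cluster"), and the line-counting parse is a naive assumption:

```python
import subprocess

def schema_version_count(host: str) -> int:
    """Count distinct schema versions reported by the cluster."""
    out = subprocess.run(["nodetool", "-h", host, "describecluster"],
                         capture_output=True, text=True, check=True).stdout
    # Schema versions print as "<uuid>: [ip1, ip2, ...]" lines.
    return sum(1 for line in out.splitlines() if ": [" in line)

if schema_version_count("cass-node-1.example.com") > 1:  # hypothetical host
    print("ALERT: schema disagreement; some node may need a restart")
```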
All right: the third outage. This one is particularly painful. It had the lowest impact on our customers, but it was pretty directly caused by me.
B
So
what
happened
with
this
one
was
we
were
scaling
out
our
one
of
our
Cassandra
clusters.
We've
been
adding
nodes
after
they
were
added,
you
run
repair
on
them
and
then
they're
good.
We
had
added
a
new
node,
we
ran
repair
and
then,
after
a
couple
of
hours,
we
noticed
that
nothing
was
happening
on
the
node
Cassandra
wasn't
logging
anything
out,
it
wasn't
having
any
network
traffic,
so
we
restarted
that
node
and
just
to
be
safe.
We
did
a
slow,
rolling,
restart
across
all
the
nodes
in
her
cluster
and
partway.
B
B
It was trying to replay its hints to another node, and it did do some of them successfully, but partway through, it failed. Hinted handoff, by the way, is when you know that you want to write something to another node, but you've either tried and failed, or you think the node is down so you don't even bother trying: you write a hint locally, and then you replay it later. So with this replaying, it replayed some of the hints, but not all of them. Now, how do we get the cluster back into a healthy state?
What we ended up doing was picking the host which seemed to be having problems and killing it, and then everything recovered. We waited for a little while, and after things were good, we brought that node back up again, and then everything was bad again, so we shut it down. And actually, we were curious to see whether it was just a coincidence that this had happened after we brought it back up, so we brought it up another time, and then things started going bad again, so we killed it and said: okay, clearly something is wrong
with this node. So we started investigating, and we found something strange in the commit log directory of Cassandra: we saw a couple of files that were owned by root, and these were actually from about a month before the outage; they were really old files. And so, after a bit of digging, we found a couple of commands that were run around the time those files were created, by me. I had run this sstable2json command as root on our machines, and it ended up creating these files. Now, a quick detour into sstable2json.
B
If
you
don't
know
what
this
is,
it's
a
really
cool
command,
where
you
can
look
at
kind
of
the
low-level
details
of
what
cassandra
has
stored
on
disk
in
the
SS
table,
and
it's
really
cool
and
I
was
trying
to
run
it
in
our
low
test
environment,
even
even
before
the
month
before
the
outage,
and
if
you've
never
done
this
before
what
happens
when
you
first
run
it
is
you
get
an
exception?
Some
partitioner
doesn't
match
this
other
thing,
and
so
you
look
for
a
while.
You
try
to
find.
Is
there
another
argument?
B
I'm
supposed
to
pass
to
this,
to
tell
it
what
partitioner
to
use
after
giving
up
on
that
and
just
searching.
You
find
a
bunch
of
unhelpful
advice,
telling
you
to
delete
files
and
other
stuff,
and
eventually
you
find
what
works,
which
is
you
need
to
set
an
environment
variable
to
tell
the
tool
where
to
find
your
config
file
for
the
partitioner
that
it
needs
to
use.
So
after
you've
done
that
you
get
another
exception,
and
this
one
you're.
Just
thinking
like
come
on.
B
I
just
want
this
thing
to
work:
I,
don't
care
about
from
there's
just
a
run,
so
you
run
it
as
root
and
then
it
works,
and
this
is
a
low
test
by
the
way
notnot
production.
But
the
problem
here
is
that
this
SS
table
to
Jason
command
has
a
bug
in
it.
It
was
reported
after
the
outage
where
it
will
write
commit
log
segments.
B
This happens because the sstable2json command actually lives in the same jar as the rest of the Cassandra code, and so it manages to accidentally call the commit log code and write out a commit log segment. And if you run it as root, you get a commit log segment owned by root. So: we have a commit log segment written as root.
What's the problem here? Well, the problem is that Cassandra will try to recycle commit log segments. That is, rather than creating a new one and then deleting the old one, it will rename the old one to the new file name that it wants and then truncate the file. So if you try to do that with a file owned by root, you get a permission problem. But the other part here is that this was a very delayed effect.
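A simplified sketch of the recycling idea, to show where the root-owned file bites; this is illustrative Python, not Cassandra's actual Java implementation:

```python
import os

SEGMENT_SIZE = 32 * 1024 * 1024  # made-up fixed segment size

def recycle_segment(old_path: str, new_path: str) -> None:
    """Reuse an old commit log segment instead of allocating a new one."""
    os.rename(old_path, new_path)     # move it to the next segment name
    with open(new_path, "r+b") as f:  # a root-owned file fails here
        f.truncate(SEGMENT_SIZE)      # with a permission error
```

If the daemon user doesn't own the leftover file, the reuse dies with an IO error the first time Cassandra tries to recycle that segment, which is exactly the delayed failure described next.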
I had run this command a month before, and it wasn't until we restarted the process that this happened. The reason is that Cassandra will only reuse the commit log segments from files that it knows about. Because this one was written by another tool, Cassandra didn't know about it until you restart the process; then it reads in all the files from the file system and tries to reuse them later on, after its current segment has filled up.
So that's also why some of the hints went through, but then, after the commit log segment fills up, nothing else can go through. And so this is the whole chain that breaks: you have a mutation stage, which is just a write in Cassandra; to do that, it has to add something to the commit log; to add it to the commit log, if it runs out of space on the current segment, it has to fetch a new segment; and the code that populates
these segments had failed a while ago, with an IO exception trying to rename this file which was owned by root. So with that, what we have are the lessons from this outage, which is: you need to be careful about what habits you develop. I think this one was particularly hard because there's a delayed effect: you run something in load test and it works there, and you kind of build up this habit of doing it
a certain way: just run it as root, and everything is fine. But if I had been running it in production for the first time, I wouldn't have just done that; I would have, you know, looked more carefully, probably switched to the Cassandra user. But I had built up this habit before. The other part, from some discussion that went on in the ticket for the sstable2json command, is that they should have made the command more isolated, so that it wasn't able to do that.
You can also have these delayed effects, where you do an action and it's not until much later that you feel the consequences of it, like some kind of failure. We've had other problems like this too, where we change a config file but haven't restarted the process that read that config file, and it's not until you do that that you find out that whatever changes were made to the config file broke something.
So one of the things that I'm happy about with these outages, at least the trend here, is that they seem to be getting less severe. The first one was really terrible (and these are in chronological order, by the way): we had a really terrible outage; the second one was bad, but not quite so bad; and with the third one, we were even faster at fixing things. I think that the main reason why that has happened is because we put so much time into investigating these problems when they happen.

Q: [Audience question, not fully captured: which of those metrics make good leading indicators?]
A: The dropped messages metric certainly is a good leading indicator. Blocked flush writers, at least in the case where we saw it, was a good indicator; it can also go up for a few other reasons, but I still think it's generally pretty good. GC behavior is a bit hard to read, so it's kind of hard to tell. And the lagging ones that I gave are still quite valid, but they are lagging indicators: they just tell you when something is already wrong.
B
We
use
pagerduty
to
a
little
self.
We
actually
do
so
for
most
of
the
problems
that
we
have
the
vast
majority
of
them.
We're
actually
still
up.
We
just
notice
kind
of
weird
problems,
and
so
we
can
still
use
our
own
application
to
alert
us.
We
do
have
a
couple
of
backup
ones
that
we
use
just
in
case
if
we
have
something
that
actually
causes
all
or
it's
not
to
be
able
to
go
out
or
the
account
that
we're
using
for
a
self
not
to
be
able
to
go
out.
C
A
A
B
Q: [Audience question, not fully captured: do you try to reproduce these failures in a test environment?]

A: We often do try to reproduce these problems in our load test environment. Usually we don't have to bring data over. We did that when we were expanding our cluster, to make sure that nothing went wrong when we were adding nodes; we had heard about problems with that happening in the past. Most of these weren't really data-related, though. The permission problem we've been able to reproduce pretty easily, and we've also been able to reproduce a lot of the schema disagreement problems that we had, just by adding packet loss.
Q: We're back here, hi, yes, this way. First of all, thank you for sharing your experiences with the Cassandra failures over the last year; those were very good, thank you. We have gone through similar things, but I wanted to understand more about the danger metrics that you guys are using: which tool are you using, may we know?

[The answer was not captured in the transcript.]
Q: Yeah, so you mentioned that process startup can be a very rare event. In the spirit of "if it hurts, do it more often": have you guys ever considered running restarts on your clusters continually? And if so, did that help reduce those types of errors and issues, or have you thought about doing that?
A: Continual restarts, right. What I would worry about with that is that it would be masking other issues. I think the ideal solution is: any time you have one-time process startup code, you should also write some other code that monitors for whatever condition it established, and do that almost continually. In the case of config files, rather than reading it once and then using it: sure, maybe you only read it once, but you should also have monitoring so that if it changes, you get alerted, and then you can take some action based on that.
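A minimal sketch of that kind of monitor; the path, interval, and alert hook are all made-up placeholders:

```python
import hashlib
import pathlib
import time

CONFIG = pathlib.Path("/etc/cassandra/cassandra.yaml")  # hypothetical path

def digest(path: pathlib.Path) -> str:
    return hashlib.sha256(path.read_bytes()).hexdigest()

# Snapshot what the running process loaded at startup, then alert on drift.
loaded_at_startup = digest(CONFIG)
while True:
    time.sleep(60)
    if digest(CONFIG) != loaded_at_startup:
        print("ALERT: config changed on disk; the next restart will "
              "pick up untested changes")  # stand-in for a real alert
        break
```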
Q: How does the schema disagreement occur? Bugs, I believe; it should never actually occur.

A: We are using 1.2, and I've heard that it's better in 2.0, but at least in 1.2 you can quite easily cause a schema disagreement by adding some packet loss to a node and then doing a few schema changes. After a little while, you'll see that the cluster disagrees; it's not…
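For reference, a sketch of that reproduction, injecting loss with Linux tc/netem; the interface name and loss rate are assumptions, and this must run as root on the node under test:

```python
import subprocess

def set_packet_loss(interface: str, percent: float) -> None:
    subprocess.run(["tc", "qdisc", "add", "dev", interface,
                    "root", "netem", "loss", f"{percent}%"], check=True)

def clear_packet_loss(interface: str) -> None:
    subprocess.run(["tc", "qdisc", "del", "dev", interface, "root"],
                   check=True)

set_packet_loss("eth0", 5.0)
# ...run a few schema changes, then check "describe cluster" for
# more than one reported schema version...
clear_packet_loss("eth0")
```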