From YouTube: Netflix: A State of Xen - Chaos Monkey & Cassandra
Description
Speaker: Christos Kalantzis, Director of Engineering
This talk will cover how Netflix monitors its Cassandra fleet and the steps we take to make sure we can survive even the worst unplanned outages.
Christos: This time last year, Cassandra Summit was early; September, I believe it was September 11th. And something happened a week or so after that talk. We're going to talk a little bit about that incident, and then I'm going to hand it off to two of my senior software engineers, who will tell you a little bit about our automation stack and how we handled that event. But first, my name is Christos. I lead Cloud Database Engineering at Netflix, and today I think you're all very lucky to get John Sebastian and Nir instead; they're both funnier than I am. Sorry.
Anyone who gets really bad news goes through five stages; they're called the five stages of grief. And just like in the five stages of grief, when we got very bad news, we had very similar reactions. On Wednesday they told us: starting Thursday, and for the next four days, we're going to be rebooting EC2 instances. Well, our Cassandra fleet runs on EC2 instances, so our first reaction was denial.
No, they can't be doing this. This is ridiculous. This is AWS; this doesn't happen. Well, it turns out it was going to happen whether we liked it or not, and the reason was a security flaw in the Xen virtualization software they use to carve out EC2 instances on their hardware. So obviously the next stage was anger. For lack of a better word, we were pissed. Are you kidding me? You're going to do this?
We actually had a party set up in Los Angeles to celebrate our 50 million users, and now we were all going to miss that party. It wasn't a good end to a Wednesday to be a Netflix CDE employee. So then we went through the bargaining phase. We called up our TAM and said: hey, look, please, please, please don't do this, we've got a trip planned.
I basically told the guys: hey, we might want to update our resumes, and maybe go see the people at the Apple booth for a job, because if this all goes down and goes to hell, well, there might not be a Netflix come next week. But the final stage of grief is acceptance, and it wasn't a capitulation of "we can't do anything about it." It was more an acceptance of: wait a minute, we test for this. So maybe, maybe we're going to be alright.
Well, how do we test for this? A lot of you have probably heard of the Simian Army: Chaos Monkey, which made a cameo yesterday during the keynote; Chaos Gorilla; and Chaos Kong, where we evacuate all traffic from one AWS region to another to ensure continuity during events like this weekend's DynamoDB outage. So about a year and a half ago, we started pointing Chaos Monkey not only at our stateless systems but at our stateful systems as well. We turned on Chaos Monkey on our Cassandra fleet.
So Friday came along, sorry, Thursday came along, and they started doing the first set of reboots, and things looked pretty good. Then they did another zone, and another zone, and we started building our confidence that hey, maybe we can go to that party over the weekend. Friday night we were all still on standby, and we decided: no, let's do it, it's working, everything's fine. So we all hopped on planes and went to LA and had a good time.
We hung out with Snoop Dogg and R2-D2, and it was lots of fun. So Monday morning comes along, all the reboots were done, and the total was 218 Cassandra nodes rebooted; 22 didn't come back. Our automation detected it, found out why they didn't come back, and initiated auto-remediation: the instances were restarted, data streamed to the new nodes, and Netflix suffered zero downtime.
That's one heck of a feat. So I'm now going to give the stage to John Sebastian, who is going to talk a little bit about what the stack looked like back last September, and then he'll be followed by Nir, who will tell you what the stack looks like today and how we leverage a new product called StackStorm; they have a booth here.
John Sebastian: So I'm going to start by talking about what our stack looked like at the time, our automation platform. On the left you see a Cassandra cluster, a classic external cluster. The only difference from what you might have on your clusters is the Priam process. Priam takes care of maintenance operations on the cluster, for example repair and compaction, and it also takes care of token management: when a new node comes in, it takes care of assigning a token to it and lets it join the cluster.
And on the right, you're probably wondering: is there something missing here? We don't have any system there doing the monitoring. You probably know about OpsCenter; we don't use OpsCenter at Netflix. At the time we started using Cassandra, OpsCenter was non-existent, so we use Priam and our own monitoring system, which we just call Atlas. It's open source now; you can look it up. This is a standard dashboard for our Cassandra clusters.
Every few quarters we reassess OpsCenter to see if we're going to use it in the next version, but so far we've been using Atlas, and one of the reasons for that is what you can see in the top right corner: our Atlas cluster. Priam is responsible for sending the metrics to Atlas: it sends a heartbeat, and it sends a lot of metrics about coordinator latency and all the other metrics we use, such as dropped messages, Cassandra exceptions, and so on.
And on the left, here's an example graph. You'll see that we can put client-side and server-side metrics on the same graph, which gives us a good idea of how the system reacts on the server side. All the blue lines are client-side latency, the 99th percentile for reads, and we have multiple lines because we check the latency at the column family and keyspace level. The red one is the server-side latency.
So now that I've explained the monitoring system that we have, there's another way we monitor our fleet of Cassandra: our health check script. The health check script runs through Jenkins. It's more of a pull approach: it contacts all the clusters, all the nodes, and runs a deep diagnostic to see if there's anything going wrong.
I'm not going to go through this whole slide, but focus on green and red: green good, red bad, that's pretty simple. Green means nobody got paged, and the health check either didn't see any problem or saw a problem and fixed it automatically. Red means we got a problem we couldn't fix automatically, so we page an engineer.
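As a rough illustration of that green/red logic, here is a minimal Python sketch. All the helpers are hypothetical stand-ins, not the actual Netflix health check code:

```python
# Minimal sketch of the health check's green/red decision. The helpers
# below are illustrative stubs, not the real Netflix diagnostics.

def run_diagnostics(node):
    return []  # stand-in: return a list of detected problems

def try_auto_remediate(node, problem):
    return False  # stand-in: True if the problem was fixed automatically

def page_engineer(node, problem):
    print(f"PAGE: {node}: {problem}")

def check_node(node):
    problems = run_diagnostics(node)      # deep diagnostic on the node
    if not problems:
        return "green"                    # nothing wrong, nobody paged
    for problem in problems:
        if not try_auto_remediate(node, problem):
            page_engineer(node, problem)  # red: a human has to look
            return "red"
    return "green"                        # problems found but auto-fixed

print(check_node("cass-node-001"))
```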
The two cases that we had automated at the time were an instance that disappeared, for example when Chaos Monkey terminated one of our instances, or it got rebooted, or something like that; and the other case, which I won't go deep into, checks whether there are hardware issues on the disk. If the ephemeral drive just fails, there's nothing we can do, so we terminate the instance. For all the other cases, at the time, we just paged an engineer. So what about the reboot?
During the reboot it was kind of a mix of the Chaos Monkey exercise and single instance reboots, because, like Christos said, some of our instances never came back. They got rebooted, but the EC2 health check, Amazon's health check, actually contacts the nodes, and if a node doesn't respond, it terminates it. Some of our instances were so old that they got terminated because of that.
So what does that look like in our workflow? The instance disappeared because it never started back up, so we detected it and launched a new instance. For those of you familiar with AWS, the process of launching a new instance goes through auto scaling groups, which normally control a fleet of servers that can dynamically scale; but for Cassandra we keep it stable. We just tell it: we need six nodes in that zone. So the ASG is configured with six.
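For those less familiar with AWS, pinning an auto scaling group to a fixed size looks roughly like the snippet below, using boto3. The group name and region are made up; this is a sketch of the idea, not Netflix's actual tooling:

```python
import boto3

# Pin the group to exactly six nodes: AWS replaces a terminated instance
# automatically, but never scales the Cassandra ring up or down on its own.
# The group name and region here are made-up examples.
autoscaling = boto3.client("autoscaling", region_name="us-east-1")

autoscaling.update_auto_scaling_group(
    AutoScalingGroupName="cassandra-example-us-east-1a",
    MinSize=6,
    MaxSize=6,
    DesiredCapacity=6,
)
```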
The other case is that the instance gets rebooted but actually comes back up. The first check looks at it and says: oh, there's an instance missing. But it's not missing from the ASG; the ASG is full, there are six instances. Yet when we run a ring command, nodetool ring, we see there's a node that is down. So what do we do? We start looking at why it's down. We go through all these steps, and we find that the node is down because the Cassandra process is not running: the timing of the health check caught it while it was rebooting.
So how do we make sure we don't page an engineer for that case? Because we actually don't need to page; you just need to wait a little. That's actually what we already do: there are always transient failures in the cloud. Sometimes a node goes down, but if you wait 10 seconds it's just going to come back up from any network issue or disconnect.
So this process actually has a stage which asks: is it the first time it failed? If it's the first time this workflow is failing for that node, it's going to sleep for X minutes; at the time it was 10 minutes. The only thing we changed before going to the party was to switch a configuration value to say: let's give it 20 minutes, because we knew there were going to be a lot of reboots and it might take longer to reboot them.
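A minimal sketch of that first-failure grace period, with hypothetical helpers standing in for the real process and ring checks:

```python
import time

GRACE_MINUTES = 20   # was 10; bumped to 20 for the reboot weekend
failure_counts = {}  # per-node count of consecutive workflow failures

def node_is_healthy(node):
    return True  # stand-in for the real process/ring check

def page_engineer(node):
    print(f"PAGE: {node} is still down")

def handle_down_node(node):
    # Most cloud failures are transient: on the first failure, wait a
    # while and re-check instead of paging someone immediately.
    if failure_counts.get(node, 0) == 0:
        failure_counts[node] = 1
        time.sleep(GRACE_MINUTES * 60)
        if node_is_healthy(node):
            failure_counts[node] = 0
            return "recovered"
    page_engineer(node)
    return "paged"
```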
So that's our stack at the time. We identified a couple of gaps when we went through the reboot, and even before that we were already planning the next system. One of the gaps that we had was the fact that our automation system was a big monolith. It was a big bunch of Python scripts and Python libraries that we cooked up: a hundred thousand lines of Python.
This is actually a picture of our Netflix data center, the power cable, where we're hoping nobody will trip on it. No, just kidding. What I want to illustrate by this is a single point of failure. The Jenkins master was our single point of failure: if the master goes down, our scripts that were currently running will page the engineer. We wanted to reduce these false positives. Yes, there was a way to have a multi-master Jenkins, and there are always plugins that can do it.
But we wanted to go back to the basics; we wanted more of a clustered solution. We work on databases, so we know about high availability, and we wanted our monitoring system and automation system to have high availability as well. So, with all of these gaps, I'm going to hand it off to Nir, who's going to talk about how we actually fixed those gaps and what the new system looks like. Thank you very much.
Nir: Can you hear me? Oh, great, thanks. Yes, so we decided to embark on a new journey to figure out how we were going to fill in those gaps, and with that in mind, we didn't want to lose all the principles and lessons learned we had so far. That health check script you just saw: you just saw a simplified diagram of it, but it actually handles lots of edge cases, and throwing it all away would be a shame, definitely not what we wanted to do.
So we started by going out there and talking to other companies to see what they're doing. We went to Facebook and spoke with their engineers about FBAR. FBAR is a system that they designed that detects machines experiencing all kinds of issues in production and automatically takes them out of the fleet in order to heal them. We also went to LinkedIn, met with their engineers, and spoke with them about Nurse. Nurse is a platform that they also built in-house.
In addition to that, we also started looking outside at all kinds of open source projects that could fill in our requirements, checking different things and running proofs of concept, and then we had to come to a decision: do we build our own in-house solution, or adopt something that already exists, and of course contribute to it and enjoy it? We decided to go with StackStorm. StackStorm is an event-driven automation platform.
It has lots of capabilities. We mainly use it for auto-remediation, but it also has additional capabilities: providing an audit trail, integration with ChatOps, and many other things. As Christos mentioned, they have a booth right here. Go ask them questions, they're great guys. They have a great community; join in and contribute, you'll have fun.
So let's see how StackStorm integrated into our stack. This is the setup we saw before, with Jenkins remotely running scripts on our Cassandra fleet on the right side. Our Cassandra fleet reports metrics through Priam: Priam gathers metrics through JMX from Cassandra and sends them up to Atlas, our telemetry system. In addition, the whole Netflix environment, all the services, also reports metrics to Atlas, and if Atlas detects any kind of weird behavior, or that something's wrong...
...it hits the PagerDuty API in order to page our on-call. So now, instead of Jenkins, we have a cluster of StackStorm. Actually, to be honest, we didn't retire Jenkins totally; we still have some scheduled tasks, like repairs and compactions, that are still running there. But now, instead of hitting PagerDuty whenever an issue is detected, Atlas sends an event to StackStorm. StackStorm has a really fancy rule engine that can figure out which rule it should fire for that event.
By choosing that rule, the rule triggers a single action or a workflow, which is a chaining of actions where each action depends on the result of the previous action. So you have a full workflow of if/else, depending on the previous result. That way it's much easier to auto-remediate things. I want to provide you an example; a small code sketch of the idea follows it.
Take a disk usage alert that we used to get. Previously, it would just page the on-call. Now StackStorm receives the disk usage alert event, and it goes and gathers additional context. By additional context I mean things like: is there any offline process running? Are compactions running? When compactions are running, as most of us know, data size may creep up, maybe even up to double. So if compactions are running, maybe we shouldn't page the on-call.
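Here is a toy Python model of that workflow: the event comes in, context is gathered, and each step's result selects the next step. This only models the shape of the idea; it is not StackStorm's actual rule or action API:

```python
# Toy model of event -> rule -> chained actions. Each action returns the
# name of the next action (or None to stop), mimicking an if/else workflow.
# All helpers are illustrative stubs, not StackStorm's real interface.

def gather_context(event):
    event["compacting"] = True  # stand-in: ask the node what's running
    return "decide"

def decide(event):
    # During compaction, on-disk data size can temporarily grow, possibly
    # close to double, so high usage may be expected and no page is needed.
    return "email" if event["compacting"] else "page"

def email(event):
    print(f"email: {event['node']}: disk high, compactions running")

def page(event):
    print(f"PAGE: {event['node']}: disk high, no compaction to explain it")

ACTIONS = {"gather_context": gather_context, "decide": decide,
           "email": email, "page": page}

def run_workflow(event, step="gather_context"):
    while step is not None:
        step = ACTIONS[step](event)  # the result chooses the next action

run_workflow({"type": "disk_usage_alert", "node": "cass-node-001"})
```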
We can just send an email instead; a clean win for operations, since they're getting fewer pages. So that's already one good thing, but how does it fill all the gaps that we discussed before? Remember the monolith part. Now, instead of having monolithic scripts, we had to break them down into smaller, chewable actions that can be reused in different workflows. So breaking down the monolith is already a win. Another win is the chaining of actions.
It's not natural to chain Jenkins jobs, but it's much easier to chain actions using a workflow. And as for the single point of failure, we run StackStorm in a cluster. When Atlas triggers an event, one of the StackStorm instances will take that event and handle it. So even if a machine goes down, another machine will pick it up; as long as we have one machine that is functional in that cluster, the event will get handled.
So that's another win for resiliency. But if we can't auto-remediate, we can still use PagerDuty to page the on-call, and here there is still another big win. Why is that? Because StackStorm, by collecting all that additional context and information, can check stuff like the ring output, can check stuff like offline maintenance running, and can provide links to relevant dashboards, and things like that.
Now for a part that is personally important to me in this presentation. Remember I was talking about all the principles and lessons learned? I want to go over a few of them with you. Hopefully you'll be able to take something with you and make the processes in your systems more resilient.
The first principle is making your processes idempotent. An idempotent process means that you can run it multiple times, but it will only be applied as if it were run once. Now, the good thing about that: imagine you're upgrading a fleet of 200 nodes, and after 150 nodes the process crashes.
What do we do now? Start manually and remotely checking which versions are on which node, and which ones we need to upgrade? Not so fun from an operational perspective. By making your process idempotent, you can simply run it again. And here the implementation of idempotency is actually something really simple: before you start upgrading, just ask a simple question. What is the version running on that node? Run nodetool version: is it the desired version?
If it is, just keep going and move on to the next node. By doing that, we save a lot of time and a lot of resources, and we gain resiliency. So that's a huge win. One more thing to keep in mind: making your process idempotent can be even better if you break it into steps and make each and every step idempotent. AJ put it nicely when we spoke a couple of days ago; he said that idempotency means making a stateless system feel stateful. Think about it.
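A minimal sketch of that ask-before-acting pattern for the upgrade example; the version string and helpers are hypothetical:

```python
# Idempotent upgrade loop: check the installed version before acting, so
# rerunning the whole job after a crash is harmless. Helpers are stubs.
DESIRED_VERSION = "2.1.11"  # made-up target version

def installed_version(node):
    return DESIRED_VERSION  # stand-in for running `nodetool version`

def upgrade(node):
    print(f"upgrading {node} to {DESIRED_VERSION}")

def upgrade_fleet(nodes):
    for node in nodes:
        if installed_version(node) == DESIRED_VERSION:
            continue  # already at the desired version: skip it
        upgrade(node)

# If the job crashed at node 150 of 200, rerunning simply skips the
# first 150 nodes and picks up where it left off.
upgrade_fleet([f"node-{i}" for i in range(200)])
```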
It's a nice sentence. The second principle is simplicity. That one might be the hardest one to implement, because taking a complex system and making it simple requires some kind of genius, so it's not always feasible. But it's a good principle to keep in mind whenever you design a new system: if you can avoid complexity or do anything more simply, it's always a win, mostly from an operational point of view, but also for maintaining the code and so on. The combination of simplicity and idempotency is really powerful.
An example that I wanted to bring you here is what we call the Netflix resumable repairs. Same as the upgrades that we discussed before: running a repair on a 300-node cluster can take a while, say seven days. If, after five days, the Jenkins job crashes, no problem: repair is idempotent, right? We can run repair on a node multiple times, so we just run it again. This time, after another couple of days, the Jenkins job crashed again.
So now we're seven, eight days into the process. Say that one of our column families has a GC grace period of ten days. What does that mean? Does anyone here have an idea? I need your help. [Audience: the GC grace period is about to expire, some of the nodes weren't repaired, there are tombstones there, data resurrection.] Right, we're in danger of data resurrection. Thank you very much.
So we came up with a simple yet brilliant idea: before running a repair on a node, just check which Jenkins build number was last run on that node. This way, if the process crashes, we can simply run it again with the resumable repair ID, and by checking that ID and comparing, we don't need to rerun it on the same node. All the nodes that were already repaired get skipped quickly, and we start exactly from where we stopped, so it won't take another seven days to repair the whole cluster.
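A minimal sketch of the resumable-repair idea, with an in-memory dict standing in for wherever the last-run marker is actually stored:

```python
# Resumable repairs: tag each node with the ID of the repair run that last
# completed on it, and skip nodes already tagged with the current run ID.
last_repair_id = {}  # stand-in for a durable per-node marker

def repair(node):
    print(f"repairing {node} ...")  # the long, expensive part

def run_repair(nodes, run_id):
    for node in nodes:
        if last_repair_id.get(node) == run_id:
            continue                   # already repaired in this run
        repair(node)
        last_repair_id[node] = run_id  # mark only after success

# If the job crashes partway, rerun it with the SAME run ID and it
# resumes from the first unrepaired node instead of starting over.
run_repair([f"node-{i}" for i in range(300)], run_id="build-42")
```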
Another lesson learned is that we moved from using remote synchronous SSH connections to async HTTP connections. Again, take repairs: that's a long-running process, and there are a lot of long-running processes that we run. With a synchronous connection, your process becomes more vulnerable to transient network issues; it can easily fail. By using an async HTTP request with either callbacks or polling, you make your process more resilient. Also, moving from SSH to HTTP, there are many libraries out there that support it and it's easier to test, so we increased developer velocity by doing that.
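The submit-then-poll shape looks roughly like this; the management endpoint and its routes are made up for illustration:

```python
import time
import requests

# Instead of holding an SSH session open for a days-long repair, ask a
# (hypothetical) management endpoint on the node to start the task, then
# poll its status. A dropped connection now only delays the next poll.
BASE = "http://cass-node-001:8080"  # made-up endpoint

def run_long_task(task_type):
    task_id = requests.post(f"{BASE}/tasks",
                            json={"type": task_type}).json()["id"]
    while True:
        status = requests.get(f"{BASE}/tasks/{task_id}").json()["status"]
        if status in ("done", "failed"):
            return status
        time.sleep(30)  # tolerate transient network blips between polls
```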
Related to that: if all the clients keep hitting the servers with retries, without sleeping in between, the server will be overwhelmed and might fall over.
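The usual cure, hinted at here, is to sleep between retries, commonly with exponential backoff plus jitter so clients don't retry in lockstep. A generic sketch, not a specific Netflix library:

```python
import random
import time

def call_with_backoff(fn, max_attempts=5, base_delay=0.5):
    """Retry fn with exponential backoff and jitter between attempts."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error
            delay = base_delay * (2 ** attempt)  # 0.5s, 1s, 2s, 4s ...
            time.sleep(delay * random.uniform(0.5, 1.5))  # spread clients out
```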
And the last principle I wanted to go over is using fallbacks. Some of you might have already used fallbacks without even realizing it: you try to get some kind of configuration or something from a remote service, the request fails, and you default to a hard-coded value. That's called serving a fallback.
In the Netflix context, most of you probably know that there is a personalized recommendation engine, so when you log into your Netflix account, you see personalized recommendations. If that service starts to become latent, or even goes down, you wouldn't want to get a 404, which would be a horrible customer experience. Instead of that, we default to some kind of general, global recommendation service. This way, most of you would probably not even notice the difference; you keep getting served with a nice page.
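A minimal sketch of serving that fallback; both services are illustrative stand-ins:

```python
# Fallback pattern: try the personalized service; if it is latent or down,
# serve a generic global list instead of an error page. Stubs throughout.
GLOBAL_TOP_TITLES = ["Popular Title A", "Popular Title B"]  # made-up data

def personalized_recommendations(user_id):
    raise TimeoutError("recommendation service is latent")  # simulate trouble

def recommendations(user_id):
    try:
        return personalized_recommendations(user_id)
    except Exception:
        # Degrade gracefully: the member still gets a usable page.
        return GLOBAL_TOP_TITLES

print(recommendations(user_id=42))
```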
You can hit the play button on a movie, watch it, and enjoy Netflix. I'd like to wrap up with our plans for the near future. We keep brainstorming all the time, thinking about what else we can do better: how can we improve our platform and our automation services and make them even more resilient?
One thing we came up with is to start gathering data, collecting metadata about the clusters, all kinds of statistics. This way we'll be able to try to predict things before they even happen. By being proactive instead of reactive, we can make our platform more resilient. Take the disk usage example that we just discussed: if our prediction platform can detect that we're going to run out of disk, it can trigger an event in StackStorm to scale up the cluster.
Audience: [question inaudible]

John Sebastian: What do we do, though, if StackStorm fails? Actually, right now, we get paged. Eventually we'll probably try to have a way to make it auto-heal, but right now it's configured with auto scaling in mind, so if an instance gets killed or anything, it'll bootstrap and rejoin the cluster. So far we haven't had any issues, but we'll see.
Audience: [question inaudible]

Nir: If it goes down mid-run? Oh, I'll take that. So, as we said, in a workflow each action's result might trigger a different action in the workflow, and you can define for each workflow an on-success and an on-failure scenario. The on-failure scenario could even mean that your script crashed. So even if something goes really bad, the on-failure will still get run, and you can leverage it to either page your on-call or send an email. That's what we currently do. Thank you.
Christos: And just one extra clarification, since the guys talked about operations: we don't have an operations team. Everyone on the team, whether they're developers or working on this stack, takes a turn on call. So when we say operations, that's actually everyone, and that's how all teams are set up at Netflix.
Audience: [question inaudible]
Christos: Yeah, so, applications: when there's a new application, we spin up a cluster for them, and most applications have their own dedicated cluster. We come up with an initial schema; applications evolve and new features get added, so yes, we reevaluate the schema on a regular basis. There are members on our team who are really great at schema design, NoSQL schema design, specifically Cassandra schema design, and part of what Cloud Database Engineering does is offer consultation and best practices.
So the question is: when we upgrade our Cassandra fleet, do we go to the latest and greatest, are we running 3.0, or do we take a step back? Well, DataStax Enterprise, which is what we run out of the box, is a version behind. So we run the latest DataStax Enterprise, which in and of itself is about a version behind.
Audience: [question inaudible]

John Sebastian: It was because, at the time, our Cassandra instances were long-running instances, and some of them were a year and a half, even two and a half years old. That was very old hardware, and sometimes when they rebooted, their boot sequence was too long to actually come up properly. EC2 has a timeout on this, and if they don't boot up in under 20 minutes, they just terminate them. So they got terminated by Amazon before they came up.
Audience: [partly inaudible] My question is: generally, I see the industry moving towards the cloud. You talked about the incident with AWS, where you had no control over when they were going to reboot. Well, if I have my own data center today, I have control over that. Where do you see the industry going, and how do you think things will play out?
Audience: I think a lot of companies do that today, but the question I have is: here you are, a company as well-known as Netflix, and you had no leverage against your provider, when all your customers could have been impacted. From a customer service perspective, I think that's control some companies would not want to lose.
Christos: So actually, in hindsight, Amazon did the right thing. It was a very bad security bug in the Xen virtualization software, where, if they hadn't done this and there was a zero-day flaw, everyone on AWS could have been hacked. So in a way, yes, it was a little annoying, it was kind of inconvenient, but in hindsight it was the right thing to do.