From YouTube: OCB: Geographically Distributed CockroachDB with OpenShift - Keith McClellan (Cockroach Labs)
Description
As your OpenShift practice matures, it is likely that you will be asked to support stateful workloads. Multicluster deployment of stateful workloads can become complex, especially when considering disaster recovery strategies.
In this briefing, Raffaele Spazzoli (Red Hat) and Keith McClellan (Cockroach Labs) will discuss how to deploy CockroachDB on OpenShift across three AWS regions to achieve a zero downtime, zero data loss disaster recovery strategy.
Keith: My name is Keith McClellan. I'm on the solutions engineering team at Cockroach Labs, so I help our customers implement solutions like the one we're going to demonstrate here today in their own environments. I'm excited; Raffaele and I have been working hard on this demo for quite a while and we've used it a number of times, so I'm glad we get to show it off to a broader audience. Thank you for inviting me.
Raffaele: So I started about two years ago thinking about how we can manage disasters for stateful workloads, and my line of thought was: okay, as a community we have figured out stateless for OpenShift and Kubernetes in general; it's time to think about stateful workloads. Obviously, stateful workloads bring more problems. In particular, they bring state, and they need to sync state across instances, so obviously there is storage involved.
Raffaele: But today we are narrowly focused on disaster recovery, and one of the things I was trying to do when thinking about this was to define a new approach to disaster recovery, which I call cloud-native disaster recovery. Here is how I define it and how it differs from traditional disaster recovery. In traditional disaster recovery, there is usually a human who decides when a disaster has occurred, so a human triggers some disaster recovery procedure.
Raffaele: It's not that the situation is not detected by the system. But in cloud native we cannot wait for a human; we need faster reaction times, so the trigger has to be autonomous: the system has to identify the situation and react.
Raffaele: When you have a human reacting to a disaster, what typically happens is that a long time passes before you realize it.
Raffaele: One or two hours, that's what I see at my customers, and that is just to start the recovery procedure. Then the recovery procedure itself, in traditional disaster recovery, is usually a mix of automation and human actions. If you're good, you probably have it all automated; if you're not very good, you probably have a lot of human actions. In cloud native, it has to be all automated.
Raffaele: [The recovery point objective, RPO, measures] the length of time of transactions that you have missed. So the first one, RTO, essentially measures availability; the second one, RPO, essentially measures the consistency of your data. In traditional disaster recovery you can have a fast RTO, minutes, but it can go up to hours, and we have seen why: one of the reasons is the human component, in the detection and in the recovery procedure. In cloud native, we want near-zero RTO.
D
So
so
it
could
be
theoretically
close
to
zero,
but
there
are
some
things
like
load,
balancers
and
l
checks
that
need
to
react
to
the
new
situation
and
start
diverting
traffic.
So
we
have
a
near
zero
in
in
the
order
of
magnitude
of
six
seconds
outage
and
then
for
recovery
point
objective
we
have
in
in
traditional
disaster
recovery.
It
could
be
between
zero
and
hours,
depending
on
how
you
sync
the
state,
but
in
cloud
native
disaster
recovery
we
want
it
to
be
exactly
zero
and
then,
when
it
comes
to
ownership
of
the
process.
Raffaele: What I usually see is that ownership formally sits with the application team, which is supposed to design a disaster recovery process. But what the application team usually does is turn to the storage team and ask: what SLA can you give me? And that becomes their SLA for disaster recovery, so basically they rely completely on the storage team. In cloud native, it's going to be all on the application team to find the right kind of middleware or software that can deal with a disaster.
Raffaele: In cloud native there is really no single storage team anymore, especially if you have a hybrid cloud. There is AWS storage, Google storage, maybe your internal storage, but there is no single team you can go to and ask for an SLA. And then, from a technical capability standpoint, there is another interesting difference.
Raffaele: In traditional disaster recovery, we usually build these recovery procedures using capabilities that come from storage and storage products: backups, volume sync, that kind of capability. For cloud native instead, and this was an interesting finding for me, the capabilities that we need come from the networking space. In particular, we need the ability to communicate east-west between these geographies, so that all the instances of our workload can find each other, and we need a good global load balancer.
Keith: You know, if you don't mind me jumping in here, I'd like to talk a little bit more about why all of this is important from a cloud-native perspective. I've been dealing with these types of problems for a long time, and the reality is that we are becoming more and more abstracted away from the infrastructure.
Keith: As you mentioned, with hybrid workloads we're potentially running across on-premise data centers and cloud data centers, or even moving towards full cloud deployments. In a lot of cases, what we're seeing is that silent disasters happen a lot more frequently: we can't rely on our own processes to guarantee that we don't have a data center outage, an availability zone outage, or a network partition. And because some of these scenarios become more likely as we move to a cloud-native ecosystem, we need to start treating them like any other issue that comes up on any given day. Fundamentally, I think that's something we're going to show as part of the demo later today.
Raffaele: Okay. So we have prepared a demo for you to see this in action. Let me talk a little bit about the infrastructure that we set up for this demo. We have three OpenShift clusters in three AWS regions: two in the eastern United States, in North America I should say, and one in the west. On these OpenShift clusters we have...
Raffaele: So we used that tool to bring up these three clusters, and then what we did was deploy a tool called Submariner. Submariner helps you establish a tunnel between the OpenShift SDNs; the SDN is the software-defined network that is established inside an OpenShift cluster for the pods to run on. With this tunnel between the SDNs, we are now able to open a connection from one pod running in one cluster to another pod running in another cluster, without having to egress and ingress.
Raffaele: Another thing that we did was deploy Vault for the secrets and certificate distribution we needed.
Raffaele: This is all preparation that we needed in order to deploy CockroachDB. The last piece is a global load balancer. Since we are on AWS, we are using Route 53, which is a DNS, but a very powerful DNS, and we have a global load balancer operator, here on the right, running on the administration cluster; it observes the other clusters from the control cluster and automatically programs Route 53.
Keith: So I have a couple of questions for you, Raffaele, specifically on the infrastructure setup; I get the opportunity to pick your brain a little bit, and I think that's fun. Obviously I'm a fan of Submariner, but how exactly is Submariner different from some of the other ways we could peer the networks between these different OpenShift clusters?
Raffaele: It creates a tunnel between the SDNs, like I said, and it's a very efficient tunnel, because IPsec is an established technology and it encapsulates layer 3 on top of UDP, which is layer 4.
Raffaele: We have seen other solutions that use a higher level of encapsulation, so slightly less efficient. And I should also say that this is one of those problems where we need to keep the latency as short as possible to enable this distributed workload to be efficient, and so we need to solve the problem as close as possible to the network.
Raffaele: As close as possible to the physical network space, and Submariner does a good job there. There is also an upcoming way of running Submariner that will make it even more efficient, which is going to use WireGuard, as opposed to IPsec, to establish the tunnel; WireGuard is a more lightweight protocol.
Raffaele: No, I think that's it, that and the fact that it's going to be deployed by your administrator and it's going to serve the entire cluster, not just individual namespaces. It's a piece of the infrastructure: once it's there, it almost disappears, and it just works.
Keith: So I personally haven't used Advanced Cluster Manager before. Can you talk me through why you chose to use it to orchestrate these Kubernetes clusters? And I'm curious to know whether that administrative cluster is running all the time, or whether it's something more ephemeral in nature.
Raffaele: So on this page here I can manage my clusters; it's a single pane of glass for my whole fleet of clusters. Customers are starting to have tons and tons of clusters at this point, so this makes for an easy entry point to manage all of them. There are some administrative capabilities you can use from here: for example, I can upgrade all of them at once.
Raffaele: I can set up monitoring capabilities where all the metrics are collected in a single spot, and I can even deploy applications through RHACM and spread them across multiple clusters, or enforce policies. In this particular demo we just used RHACM to spin up the clusters where our workloads are going to run.

Keith: Got it, makes sense.
Keith: So it's safe to say, then, that while we have this administrative cluster and we use it for administering the distributed multi-cluster configuration here, it's not a single point of failure: if the administrative cluster were to go down because of a failure, the infrastructure it has already provisioned is independent of it.
Keith: So, as you mentioned, CockroachDB does mTLS between our pods; we're going to talk about that in the next few minutes. But to get a single certificate authority across all three clusters, you chose Vault, which is great; it's the same thing that I would recommend to customers going into production. What specifically did you do to make Vault work as a single CA across all three of these clusters?
Raffaele: Right. So first of all, I think it's important to discuss why I decided to deploy Vault this way, because you could say: well, I need a CA, a common CA, across these three clusters, but the CA could be running anywhere. Why run it in the clusters, and across the clusters? Here is the reasoning: with these three clusters we are trying to build the most available infrastructure in our data centers, distributed across multiple geographies.
Raffaele: So in our idea, I should say, it's going to be the most available thing that we have. And a CA, a PKI, a cert and secret management tool, is one of those pieces of infrastructure that sits in the critical path for applications. It used to be something that needed to be available maybe only at boot time.
Raffaele: But now, if it's not available, things stop working, so it needs to be available all the time, and it should therefore also benefit from the most available infrastructure that we're building. So I was looking for a way to have a PKI slash cert and secret management that never goes down, and Vault can do that, because Vault supports Raft as a storage protocol, which is the same thing that Cockroach has, in terms of syncing the state and managing availability.
Raffaele: [In answer to a question about running this outside AWS:] In case you have data centers across multiple on-premise locations, across multiple geographies, I recommend you build a global load balancer with a DNS. You could use something like an F5 BIG-IP as your DNS, which has kind of the same capabilities as Route 53; you still need the health checks, which is the thing that we need here.
Raffaele: Now, on to CockroachDB. Keith, would you like to describe it, or should I?
Keith: No, absolutely, I will. So CockroachDB is a distributed SQL database that is cloud native. The vast majority of our installs run in Kubernetes or OpenShift, and we run our database-as-a-service product on Kubernetes. Fundamentally, we function a lot like the other technologies we've already been talking about: Raffaele mentioned Vault and how it uses Raft to do consensus-based replication across sites, and that's the same way that etcd in the Kubernetes ecosystem replicates the state that Kubernetes is supposed to maintain across different...
Keith: ...pod hosts and whatnot. CockroachDB also implements the Raft protocol for doing consensus-based replication of our data layer. Under the covers we use a KV store; it used to be RocksDB, if you've heard of that, and we've since re-implemented a KV store under the covers that we call Pebble, which was more purpose-built for what we were trying to do. It's a single-binary deployment, written almost completely in Go. On the front end of that...
Keith: What we're creating is a mesh where every single node has the authority to act on some portion of the data in the database, is a follower for some other portion, and then potentially is not involved in some third portion of the data. So every node is active as the leader for some portion of the data in the database, and we create a global logical cluster that allows you to talk to any given node; we will route your queries to wherever the authority lives. There's a lot of great stuff here.
Keith: But one of the prerequisites, as Raffaele mentioned, is that those nodes talk to each other over mTLS; those are encrypted communications. So CockroachDB is going to communicate with, in this case, cert-manager to get certificates to enable that encrypted communication, and the same goes for our back-end communications.
Keith: We require that all the nodes be able to route to all the other nodes. This allows us to do things like deal with losing a pod or a site. To do that, all the nodes have to be able to talk to all the other nodes, so we're using Submariner here to allow all the pods to talk to each other across the sites, so that they can act as a single global database cluster.
Keith: I've been at Cockroach Labs for about two years now, and it is easily the easiest database, particularly OLTP database, that I've ever had the privilege of supporting. One of the great things about designing the database to be cloud native from the very beginning is that a lot of the operational challenges that you would have with a traditional OLTP system, particularly if you were trying to run it in Kubernetes, we simply don't have. We could talk about how we manage data replication...
Keith: We talked about query performance; there are a lot of great topics we could go into, but I'll pause there as the high-level description of the database.
Raffaele: So you said it's a SQL database. As a developer, let's say I already have an application running on a SQL database and I want to start using CockroachDB. I can probably reuse my SQL skills, because it should feel the same, but is there anything that changes, or anything you want to highlight?
Keith: Yeah. So we've implemented the Postgres wire protocol, so you can connect to us using Postgres drivers, and in a lot of cases you can use your existing Postgres tooling to interact with us; there are some CockroachDB variants of the ORMs that are out there. The one thing that you need to know, and this is true for any distributed system, is that the data has a location attached to it.
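Editor's sketch: because CockroachDB speaks the Postgres wire protocol, application code connects to it exactly as it would to Postgres. A minimal illustration follows; the host name, database, user, and certificate path are hypothetical (in this demo the host would be the Route 53 name in front of the three clusters, with the CA certificate distributed through Vault).

```python
def build_dsn(host, port, dbname, user, sslrootcert):
    """Assemble a libpq-style connection URL; any Postgres driver
    (psycopg2, JDBC, pgx, ...) accepts this format for CockroachDB."""
    return (f"postgresql://{user}@{host}:{port}/{dbname}"
            f"?sslmode=verify-full&sslrootcert={sslrootcert}")

# Hypothetical values: 26257 is CockroachDB's default SQL port.
dsn = build_dsn("cockroachdb.example.com", 26257, "tpcc",
                "app_user", "/certs/ca.crt")
print(dsn)
# With a Postgres driver installed, connecting is simply e.g.:
#   conn = psycopg2.connect(dsn)
```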
Keith: It may have an intrinsic location, like an address does, or it may not, but then we need to consider where it's going to be accessed from and what that access pattern looks like. So the one piece that you have to add to your DBA bag of tricks, when you're moving towards a distributed SQL environment, is thinking about how we want to distribute this data across the cluster and, inversely, how we're going to get it back out.
Keith: If you ever take a data modeling class in college, they'll talk about the physical data model as opposed to the logical data model. When you move towards a distributed system, you have to think a lot more about the physical data model. In CockroachDB we make this super easy: we have a couple of what are called DDL extensions, so basically, when we define the table, we define how we want to distribute the data across that table, or that set of tables. By default, we're going to...
Keith: ...do something called follow-the-workload, which is where we basically move the authority to act for any particular segment of the data to wherever it's most likely to be used from. But we also have the concepts of regional tables and global tables.
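Editor's sketch of what those DDL extensions look like in recent multi-region CockroachDB releases. The database, table, and region names below are invented, and exact syntax may vary by version; each statement is plain SQL that any Postgres driver could execute.

```python
# Hypothetical multi-region DDL for CockroachDB (names invented).
statements = [
    # Declare the regions the database spans.
    'ALTER DATABASE tpcc SET PRIMARY REGION "us-east-1"',
    'ALTER DATABASE tpcc ADD REGION "us-east-2"',
    'ALTER DATABASE tpcc ADD REGION "us-west-2"',
    # Regional-by-row: each row is homed in the region it is used from.
    'ALTER TABLE warehouse SET LOCALITY REGIONAL BY ROW',
    # Global: read-mostly reference data, fast reads from every region.
    'ALTER TABLE item SET LOCALITY GLOBAL',
]
for stmt in statements:
    print(stmt + ";")
```

Regional tables trade slower cross-region writes for fast local access; global tables trade slower writes for fast reads everywhere, which is the tradeoff discussed next.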
Keith: All of these things have different tradeoffs on read and write performance, and they also affect what types of scenarios we're going to survive without impact to users. One of the big philosophical things that we talk about is designing to survive, as opposed to designing to fail. Traditional DR means designing a system that can pick up when your primary system fails; that's why you have two-site solutions.
Keith: You have failovers, and you have backups, and all of that. When we're designing to survive, we're going to have three or more sites, because if we lose an entire site, as we're about to demonstrate, we want the system to continue to operate and function as if it were any other day.
Keith: So the data center, if you will, is the new rack. If you've ever set up a distributed system in a physical data center and you wanted to make sure that it survived, say, a PDU failure or a top-of-rack switch failure, something along those lines, you didn't want your application to go down.
Keith: In that scenario, we're now treating the data center as the new abstraction layer that needs to be survivable without any noticeable outage. I hope that makes sense.
Raffaele: Yeah, it does, and I think this is a perfect segue to my next question. I see customers now that are considering migrating their SQL farms, it could be any product, any database, but they want to migrate their SQL farms to OpenShift. They may have maybe 1,000 instances of databases running on VMs, which are essentially treated like pets, with a team of DBAs that tend to and care for these pets. And what I feel is...
Raffaele: I feel there is a risk that we're going to migrate these databases inside OpenShift, and they can certainly run there, but we're still going to treat them like pets. Instead, the philosophy of Kubernetes and OpenShift is to treat everything as cattle: things that can die and will respawn somewhere else.
Raffaele: So here is the question for you, and sorry, just to conclude the thought: this can be difficult for stateful workloads, obviously much more difficult than for stateless ones. How does Cockroach help in that space?
Keith: So fundamentally, state is what makes things special from an IT perspective. If you remove the state from almost any system, you can probably genericize it pretty easily. We have fundamentally taken the same approach to this problem as Vault has and as etcd does, in that we make sure that all of our data lives in more than one place; that's the guarantee.
Keith: So then, rather than having a single point of failure, we have effectively configurable availability while guaranteeing existence. One of the things that I didn't talk about yet is the replication factor: by default, everything that gets written to CockroachDB gets written to at least three places, and that goes...
Keith: ...up from there. There are scenarios where there might be five, there might be seven; you can even theoretically go higher than that, although mathematically, if you've lost 51 percent of your replicas and you had seven of them, you probably have bigger problems than the database. The intent is that you say: hey...
Keith: ...these are the everyday occurrences that I want to survive. In AWS it might be an availability zone outage, which happens in AWS a couple of times a year. Maybe it's a region failure, which happens once every three years or so; you want to make sure you're surviving those. In some cases it might be a full cloud outage.
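Editor's sketch: the replica arithmetic behind those choices is Raft's majority rule. A range keeps serving as long as a strict majority of its replicas survive, so n replicas tolerate floor((n-1)/2) failures:

```python
def failures_tolerated(replicas: int) -> int:
    """Consensus replication needs a strict majority of replicas to make
    progress, so n replicas tolerate floor((n-1)/2) failures."""
    return (replicas - 1) // 2

for n in (3, 5, 7):
    print(f"{n} replicas tolerate {failures_tolerated(n)} failures")

# Losing a majority (the "51 percent" above) halts writes regardless of n,
# which is why higher replica counts buy availability only up to a point.
assert failures_tolerated(3) == 1
assert failures_tolerated(5) == 2
assert failures_tolerated(7) == 3
```

This is why the default of three replicas survives one site loss, and why surviving a full region outage means spreading replicas across at least three sites.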
Keith: We've had scenarios where cloud providers have had cascading problems, or problems caused by user error, because almost all disasters are caused by user error at the end of the day, where we've lost multiple regions, in AWS and in Azure and GCP. So you may actually want to spread your workload across multiple clouds. This is very much...
Keith: ...in the neighborhood of the hybrid workload: I've got two physical data centers, and my third data center is in Google, or it's in AWS. Or maybe I only want to maintain one data center now, and I want my other two data centers to be in two different cloud providers, because I don't want to put all my eggs in one basket. So fundamentally, we use Raft to do this, and we have some enhancements to Raft that make it function for SQL databases.
Keith: If you ever go and read our "life of a distributed transaction" documentation, or any of the blog posts about how we guarantee serializable isolation across transactions, when you may have two transactions that come into two completely different nodes in two completely different data centers, we have a lot of very interesting writing on that topic that I won't go into today.
Audience member: Hey Keith, can I ask a couple of quick questions? You mentioned that most of the time we design for failure and not for survival. Can you explain what exactly you mean by that? And can you also help us understand what CockroachDB really is? How does it differ from, say, Redis or Infinispan?
Keith: Yeah, so I'll answer the second question first, because it's pretty short. Those other databases are NoSQL databases. If you look at the CAP theorem, as soon as you have to manage for network partitions, you can either guarantee consistency or you can guarantee availability. Generally speaking, NoSQL databases lean towards availability, and so they don't guarantee consistency.
Keith: In all cases... I'm not going to go into the specific databases, because the nuances there get really specific. We're a CP database that can increase its availability by increasing the replica count, because we're using consensus-based replication. Fundamentally, that makes us more valuable for system-of-record-like workloads.
Keith: So things like inventory management and financial transactions; we're used by a number of large financial institutions in the United States and Europe, for example, because we can run in a cloud-native environment like this and have extremely high availability as well as guaranteed consistency for transactions.
Keith: Things like Redis and Cassandra and MongoDB are much better at workloads that are kind of write-once-read-many. That's a broad generalization; I know there are people on the call who could give me specific examples where something like Cassandra or MongoDB would be a better fit for a problem than CockroachDB, and as always: use the right tool for the job. We are specifically very focused on transactional workloads that require guaranteed consistency.
Keith: If you look at ACID, we're a fully ACID-compliant database, and all of our transactions are serializably isolated. Which brings me to your earlier question, designing to survive versus designing to fail: in my mind, this is the difference between high availability and fast recovery. When people put together a disaster recovery plan, they're expecting that things are bad enough that they're willing to accept that things aren't going to operate as they normally would.
Keith: The challenge is that we've moved to these newer cloud-native technologies. Running in the cloud is just us running on other people's computers; we have less control, so it's more likely that something outside of anything we've done could cause one of these failure events to happen. So what we want to do is be able to treat them as high-availability events.
Keith: Philosophically, it's basically like coming at it from a glass-half-full versus a glass-half-empty perspective. By going at it saying: hey, I need to be able to continue to operate if I lose an entire region of AWS, and I'm going to design a system to solve for that; then, if a region in AWS fails, I shouldn't need to get a page in the middle of the night and get up to fix my systems.
C
I
should
be
able
to
kind
of
kind
of
deal
with
it
in
the
normal
order
of
of
things
rather
than
treating
it
like
a
disaster
and-
and
if
you
soon,
as
you
start
to
look
at
it
as
I
want
to
be
able
to
continue
to
operate
as
normal
during
these
scenarios
versus,
I
need
to
be
able
to
get
back
up
and
running
at
some
point
in
the
future.
If
something
like
this
happens,
that's
all
of
a
sudden
you're
designing
to
survive
as
opposed
to
designing
to
fail.
Hopefully,
that
answered
your
question.
Raffaele: Can I add a consideration on eventual consistency, just building on this conversation?
Raffaele: When I started looking at these architectures, I could have chosen an eventually consistent database, a database that chooses to be available rather than consistent in the event of a network partition. If that had been the case, we would probably see only two OpenShift clusters in this picture, because in that case you just need two to continue working.
Raffaele: Eventual consistency means that when the network partition goes away and all the instances can talk again, they will converge to a state, but there is no guarantee that that state is the state that is logically correct for your business problem, and so it's very hard. I didn't like that situation: as a developer, I don't want to think about that situation and how my code would have to handle it.
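Editor's sketch of that concern, in purely invented Python (not how any particular database resolves conflicts): under eventual consistency with last-writer-wins, replicas that accepted conflicting writes during a partition do converge, but the converged value can be logically wrong for the business.

```python
# Toy last-writer-wins register: state is a (timestamp, value) pair.
def merge(a, b):
    """Converge two replica states by keeping the later write."""
    return a if a[0] >= b[0] else b

# During a partition, both sides accept a withdrawal against balance 100.
replica_east = (2, 100 - 70)   # t=2: withdraw 70 -> balance 30
replica_west = (3, 100 - 80)   # t=3: withdraw 80 -> balance 20

# After the partition heals, both replicas converge to the same state...
converged = merge(replica_east, replica_west)
assert merge(replica_west, replica_east) == converged

# ...but the t=2 withdrawal is silently lost: correct business logic
# would have refused the second withdrawal rather than keep balance 20.
print(converged)  # (3, 20)
```

A consistency-first (CP) system avoids this by refusing writes on the minority side of the partition, which is exactly the tradeoff discussed above.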
Keith: So you have to think about all of those potential outcomes that you just mentioned, Raffaele, in your application, and you need to handle them at the app layer. There are good, valid reasons why you might need to do that in certain scenarios, but in a lot of auditable, audit-type situations, particularly the ones you mentioned, financial management and inventory tracking, where correctness is of the utmost importance...
Keith: ...that risk is unacceptable, at least in my opinion, which is why I'm at Cockroach Labs and not currently working at a NoSQL database vendor.
Keith: Sorry, yes, it is application data. And this is a great transition to the next slide, actually. So there's an industry-standard OLTP benchmark called TPC-C; it's been around since the 90s. It simulates literal warehouses and how packages, or things, might flow into or out of those warehouses, as well as a kind of point-of-sale system where those products are getting manufactured and then shipped out to customers. It's very much a transactional use case.
Keith: It's a good generic benchmark because it's one of the benchmarks that's most available for SQL databases; there are published results and guidance on how to run this benchmark on pretty much every SQL database I've ever seen, going back to about 1996.
Keith: So it gives you a good, wide swath of what that looks like. What we're going to show here today is what happens when one of these sites goes away while we're running TPC-C against CockroachDB.
Raffaele: Right. And on the infrastructure, I should say one thing: what you said before is very correct. Today many customers, many enterprises, are considering building a multi-cloud solution, where they deploy on different clouds. There is nothing in this demo that cannot be deployed across multiple clouds; it's just that the account I have is only on AWS, and so we are using AWS only for that reason. But you could do this across multiple clouds.
Raffaele: So here we have the CockroachDB console. We can see on this nice map where the data centers are and where the CockroachDB nodes are. We have nine nodes, three, three, and three of course, and we have some ranges. These are the data spaces that Cockroach manages, and sorry if I'm not using the right word here, but essentially these are the partitions and the replicas that are being managed by Cockroach.
Raffaele: So we have this TPC-C workload, and this TPC-C workload, as Keith said, is generating a bunch of OLTP transactions, so they're typically fast inserts, fast updates, or...
Keith: Yeah, exactly. The majority of the workload is going to be individual item updates: as a particular widget moves around a warehouse, or a set of warehouses, that record is going to get updated. And then a portion, I think it's six percent, although don't quote me on that, maybe I shouldn't have said that on a broadcast forum, are aggregate queries that look at the current state of the inventory for that warehouse.
Raffaele: Okay. And as you may have seen, one of the databases was orange; that's what happens when a node goes down. I don't know, it must have been just a little glitch, but everything is up now. I want to show you that we are generating load: you see, these little processes are generating load. They are pods running inside the cluster, so they are near the database.
Raffaele: This is simulating traffic coming from different sources, and we direct a portion of the traffic to the database that is close to the source, so the traffic stays local; they are generating traffic on the local cluster. Obviously, Cockroach will dynamically spread the data where it needs to. I'm just redirecting the output here, and we can see all the transactions that are being generated; and if we go to the metrics, we should see that we have some queries.
Raffaele: So now, what we're going to do to simulate a disaster is take down one of the regions. I am going to take down the west region, and the way I'm going to do it is by completely isolating the VPC in which OpenShift is running, so nothing can go out and nothing can come in. This is the perfect disaster simulation, because it's a network partition: you don't know, you're sending a packet, but nothing answers.
D
So you don't know if the packet has been received or not. It's way more difficult to manage than sending a packet and receiving a response that says there is an error, right. So that's exactly what happens when there is a disaster.
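Why isolating one of the three regions doesn't take the database down comes down to quorum: with three replicas per range, one per region, Raft only needs a majority (two of three) reachable to keep serving. A minimal sketch of that rule, with invented region names:

```python
# Sketch of Raft-style majority availability under a region outage.
# Illustrative only; region names are invented.

def range_available(replica_regions, down_regions):
    """A range stays available if a majority of its replicas are reachable."""
    up = [r for r in replica_regions if r not in down_regions]
    return len(up) > len(replica_regions) // 2

replicas = ["east", "central", "west"]  # one replica per region
```

Losing one region leaves 2 of 3 replicas up, so every range stays available; losing two regions drops below majority, which is why the demo only partitions one.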
D
Oh wow, like I said, there can be some glitches when this happens. Remember, this is our CockroachDB console, so the traffic that comes from my browser is load-balanced by the global load balancer that I was describing before, so it could go to any of the regions. So maybe we took down the pod that was serving this.
D
This console, in case you may want to describe it. But as you can see, after a few seconds we were able to connect again, and the console is already aware: you see that three nodes of the nine are suspected of having a problem.
C
Yeah, so what happened there was: because each node has all of the services of every node in the cluster, the load balancer was originally routing you to one of the pods that we just segregated from the network. So as soon as the load balancer realized that was happening, it routed around to the pods that were still available. That's what we would expect in this type of scenario.
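The load balancer behavior Keith describes can be sketched very simply: route to any backend whose health check passes, and skip the ones that went dark. This is a toy model, not the actual global load balancer configuration from the demo; backend names are invented.

```python
# Toy sketch of health-check-based routing: return the first backend
# currently marked healthy. Names are invented for illustration.

def pick_backend(backends, healthy):
    """Pick the first healthy backend, or None if everything is down."""
    for b in backends:
        if healthy.get(b, False):
            return b
    return None

backends = ["west-pod", "east-pod", "central-pod"]
```

When the west region is partitioned away, its health checks fail and traffic simply lands on the remaining regions, which is the few-second "glitch" seen in the browser.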
C
Right, there's, you know, a couple-of-seconds service glitch for certain operations, but queries that had come into nodes that weren't impacted by this will continue to operate, and we'll be able to continue to process queries against the database.
D
And in fact, I want to show you that we are still processing: see, the metrics did not go down, and our client number one is still working, although you see it had some glitches. So this client didn't have a connection problem, but CockroachDB was adjusting itself, and there were two errors, which this particular client manages with retries, and that's a best practice that developers should also follow in their code.
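The retry best practice mentioned here can be sketched as a small loop: CockroachDB signals retryable transaction conflicts with SQLSTATE 40001, and a client should retry those with backoff rather than surface them as failures. This is a minimal illustration; `run_txn` stands in for real database code, and the error class is a placeholder for whatever your driver raises.

```python
import time

# Sketch of client-side retries for transient/retryable transaction errors.
# RetryableError is a stand-in for a driver exception carrying SQLSTATE 40001.

class RetryableError(Exception):
    sqlstate = "40001"

def with_retries(run_txn, max_attempts=5, base_delay=0.01):
    """Run a transaction, retrying retryable errors with exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return run_txn()
        except RetryableError:
            if attempt == max_attempts:
                raise  # give up after max_attempts
            time.sleep(base_delay * (2 ** (attempt - 1)))
```

A client written this way rides out the brief window where the cluster is rebalancing after a region loss, which is exactly what clients one and two did in the demo.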
D
But you see, the client didn't break and continued to work. Same thing for the second client; it only got one error in this case. And obviously the third client died, because, well, we severed the connection, even to the tail of the log here, right. Okay, so what we have done so far: we have simulated the disaster, and we have demonstrated that we didn't have to do anything; the system reacted by itself and continued to work.
D
Now we are going to restore connectivity, and we are going to show that, again, we don't have to do anything, and the system resumes working with all of the capacity that is available. Because another problem with disaster recovery failures is that, you know, usually you have a disaster recovery procedure to recover from a disaster, but when the system that was down comes back up, it's usually just as painful to restore the workload to where it usually was; it's the same kind of process.
C
I was just going to say: right now those three nodes are still listed as suspect. We don't evict them from the cluster until, I think it's five minutes; then we assume that they're dead. The only difference in recovery, if we were to wait for five minutes, is just which path we take for re-replicating the data to the nodes as they come back. Under five minutes, we assume that the nodes aren't that far behind and we can get them caught up using the Raft logs. After five minutes,
C
We assume that they're too far behind and we're going to re-replicate the ranges there, which is a slightly more expensive operation. But still, both of those paths are invisible to the application, aside from a slightly different performance impact after we bring those instances back online.
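The two recovery paths Keith describes reduce to a single threshold decision: a node back within the roughly five-minute suspect window is caught up from the Raft log, while a node out longer is assumed too far behind and its ranges are re-replicated via snapshot, the more expensive path. A sketch of that decision, with the window length taken from Keith's "I think it's five minutes":

```python
# Sketch of the recovery-path choice for a rejoining node. The exact
# window is configurable in a real cluster; 5 minutes follows the talk.

SUSPECT_WINDOW_SECONDS = 5 * 60

def recovery_path(downtime_seconds):
    """Cheap Raft-log catch-up inside the window, snapshot rebuild after it."""
    if downtime_seconds <= SUSPECT_WINDOW_SECONDS:
        return "raft-log-catch-up"
    return "snapshot-re-replication"
```

Either way the choice is internal to the database; the application only sees a difference in background replication load, not in availability.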
A
Okay, so what is the pitch? While Raffaele goes and saves the world, maybe I can ask another question. So what is the pitch, like: my customers are using Oracle Database, and they're moving applications to OpenShift, and the applications connect to this Oracle database, which is outside OpenShift. So are we saying, instead of that, use CockroachDB now?
C
Well, this will allow you to move the database into OpenShift as well. So, one of the things, as a recovering operator, right, like I used to run systems like this in production: one of the things that is really frustrating is when you have to treat something as special. So right now, and that's what you're talking about, the infrastructure for Oracle is special, the tooling for Oracle is special.
C
If you have an Oracle disaster, you have a completely different runbook for resolving that disaster than if your application fails. By using CockroachDB and moving that into OpenShift, all of a sudden you're handling a database failure just like you were handling an HAProxy failure or an app tier failure, right.
C
It drastically reduces the scope of the types of disasters that you might have to manage, the types of availability events you might have to manage. On top of that, getting all the great self-healing capabilities that Raffaele is showing here today, just by reducing the administrative burden of having to understand multiple different ways that multiple different applications in your stack are running, can drastically reduce how difficult it is to do that work, right. So there's a ton of other things.
C
We do use StatefulSets in Kubernetes, so those StatefulSets are presenting a file system to the database. For sure, we use a KV engine to act as our storage layer, so you have a decent amount of flexibility there. You generally use something like a persistent volume claim to get a persistent volume from whatever storage layer happens to be available to you in your various OpenShift clusters.
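Concretely, a StatefulSet requests its storage through a `volumeClaimTemplate`, so each database pod gets its own persistent volume from whatever storage class the cluster offers. Here is a sketch of such a template, expressed as a Python dict for illustration; the claim name, size, and omitted storage class are assumptions, not values taken from this demo.

```python
# Sketch of a StatefulSet volumeClaimTemplate for a database pod,
# as a Python dict. Field names follow the Kubernetes API; the claim
# name and size are invented for illustration.

volume_claim_template = {
    "metadata": {"name": "datadir"},
    "spec": {
        # ReadWriteOnce: each pod mounts its own dedicated volume.
        "accessModes": ["ReadWriteOnce"],
        "resources": {"requests": {"storage": "100Gi"}},
        # storageClassName omitted: the cluster default is used
        # (an EBS-backed class when running on AWS, as in this demo).
    },
}
```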
C
Here, because we're in Amazon, we're using EBS volumes to access the backing store.
A
Yeah, before Karina kicks me out, just one last question. So when this data center came back up, you know, it had to replicate all the ranges which had issues. So while that is happening, if at that point in time a request comes in to this OpenShift cluster that just came back up, what happens to that request? Is the CockroachDB database locked at that point in time, so it cannot handle it?
C
Every node in CockroachDB is a gateway to the entire cluster. So as soon as those nodes had connectivity to the rest of the cluster again, they could act as what we call a query coordinator. They aren't necessarily the query responders; they're not the ones doing the work on the data, but they can still act as a client gateway immediately. So you don't get a scenario where your database is locked up while we're re-replicating, or any of that kind of stuff.
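The gateway/coordinator split can be sketched in a few lines: any node accepts the query, then forwards the actual work to whichever node holds the lease for the data, so a freshly rejoined node is immediately useful as a gateway even before its own replicas are caught up. Node and key names are invented for illustration.

```python
# Toy sketch of the gateway/coordinator idea: the node a client connects
# to coordinates the query; the leaseholder for the key does the work.
# All names are invented.

def execute(gateway, query_key, leaseholders):
    """Return which node coordinated and which node executed the query."""
    worker = leaseholders[query_key]
    return {"gateway": gateway, "executed_on": worker}

leaseholders = {"k1": "east-node2", "k2": "west-node1"}
```

So a request hitting a just-recovered west node for data whose lease currently lives in the east still succeeds: the west node coordinates, the east node answers.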
D
The first thing that needed to heal was the network tunnel, right: Submariner needed to heal and reestablish all of those connections, and then CockroachDB, see, it's healing right now: it's re-replicating the ranges, and then every node will be back at full capacity and serving traffic.
D
Here we go, now it did it. So again, I think the point to take home here, besides the inner workings of CockroachDB, is that as an administrator I didn't have to do anything, right. It managed the disaster, it reacted to the disaster, and also, when we fixed the disaster, it recovered to full capacity all by itself.
D
Okay, I just want to add, before we close: this demo is completely scripted, and anyone should be able to reproduce it if you are interested, and everything is here.
C
We also have an awesome two-part blog post that Raffaele and I co-authored, walking through exactly all the underlying steps we did here; we can share that link as well.
B
So, let's not do that just yet! Please, Raffaele, could you put the links to the blog posts in the references? And then we will post the link to this. All right, post it out.
B
We're a bit over time, so I want to make sure that, as we wrap up, you can ping Raffaele offline too, right.