Description
In this OpenShift Commons Briefing, Raffaele Spazzoli (Red Hat) introduces a new way of thinking about disaster recovery in a cloud native setting. He introduces the concept of Cloud Native Disaster Recovery, the characteristics it should have, and the problems that need to be addressed when designing a disaster recovery strategy for stateful applications in a cloud native setting.
A: I'm always excited about these briefings. It's so important that we work upstream with the CNCF, and today we have members from TAG Storage talking about a topic that is very important to pretty much all of us, especially enterprise installations, and anybody who is concerned with how to get your data or your applications back if something happens. Today we have Alex Chircop from StorageOS. He is the CEO, but he is also the CNCF co-chair of TAG Storage, and TAG stands for Technical Advisory Group; it has been changed from SIG. We also have Raffaele, who is a tech lead for TAG Storage. I'm very excited to have you both; thank you so much for joining us. Raffaele and Alex, if you'd like to tell us a little bit more about what you do, especially with TAG Storage, and then dive right in.
B: Thank you so much, Karina, and I'm so glad to be here with the OpenShift community. The CNCF created SIGs a couple of years back, and SIG Storage was in fact one of the first SIGs created. The purpose was to help the CNCF evaluate projects and to create content and educational material for the end users and the community.

B: Since then the SIG has been renamed to TAG, because the SIGs were getting confused with the Kubernetes SIGs, and, as Karina mentioned, TAG stands for Technical Advisory Group. I'm the co-chair of the group together with Xing Yang and Quinton Hoole, and we also have a number of tech leads, of which Raffaele is one of our newest members. So I'd like to pass the baton to Raffaele to introduce himself.
C: Thank you, Alex. Yes, I work at Red Hat as an architect in consulting, so I help customers build their cloud native solutions, and I recently joined TAG Storage to work together on the topic that you see on the screen today, which is cloud native disaster recovery.

C: This is a collaboration that started about nine months ago; I think it was fall of last year when we started talking about whether this was a good idea to explore in this particular TAG, or whether it should belong somewhere else. Then we started sharing some ideas, and we are now creating a white paper that I think we're going to publish soon. Today we're going to present some of the results of that white paper.
B: Indeed. This is particularly exciting for the TAG, because we've had a lot of demand and, in fact, a lot of feedback in the different TAG calls, which Raffaele has been very patient with, and it has meant that we've been able to iterate on the documents. Just for everybody's benefit: when we talk about storage and persistence in the CNCF, we're not just talking about a volume or a file system. We cover any type of persistence layer, including databases, object stores, and key-value stores, for example, as well as traditional volumes and file systems.

B: I think the key thing here, and keep me honest, Raffaele, is that what we're trying to do is make sure that developers, DevOps teams, SREs, and so on have the information, the tools, and examples of how to adopt some of these different technologies in a cloud native way.

B: Because for the first time in a long time, it's no longer infrastructure teams that are making these decisions; it's developers and their deployments. Understanding the storage subsystems and the different technologies that are available with cloud native technologies is extremely exciting and enables so many new use cases, which Raffaele will cover shortly.
C: Yeah, that's exactly right; I totally agree. What we would like to do today is present an approach to disaster recovery that we call cloud native. Obviously it focuses on stateful workloads, and that means all of the storage workloads that Alex was mentioning. It turns out that the hard problems you have to solve when you distribute a stateful workload are always the same, regardless of the kind of interface that you expose.

C: So it doesn't really matter if you are exposing block storage through a block storage interface, or message queuing, or a SQL database: the internal synchronization is really the hard problem that you need to solve, and that's what we explore in this white paper, but from a user perspective.
C: The concept that we would like to make people aware of is this new concept of cloud native disaster recovery. I just said "new concept", but the approach is something that you could have done even in the past. Our point is really that with cloud native approaches it now becomes less complex, easier, and probably less expensive to create these architectures and deployments.

C: The way we define cloud native disaster recovery is by contrasting it with traditional disaster recovery. We had an internal discussion at the time about whether calling this "traditional" disaster recovery was correct or not. By traditional disaster recovery we mean what you would normally find in many enterprise customers: not the big web scalers, not the newer startups, but the enterprise customers that many of us work with.
C: So let's go down these columns and rows one by one. I'm going to try to keep it brief, because we don't have a lot of time, and create this contrast and talk about what cloud native DR is. First, the type of deployment: are we deploying active-active or active-passive? In most traditional DR scenarios, what we see is an active-passive deployment, especially for the stateful workload. You may have stateless workloads that are active-active, but they all point to a single active site, and then there is a passive site for the stateful part. That's often what you see. For cloud native DR, we are proposing that it should be active-active. Then, obviously, we're talking about disasters, so there is a disaster situation: how do we detect it?
C: With traditional DR, in most cases it's a human decision: somebody says, okay, this is really a disaster, we need to trigger the disaster recovery procedure. In cloud native DR we want it to be autonomous. We want the system to realize something is going wrong and react to it, and that includes the recovery procedure itself: whatever the stateful workload needs to do to reset or reorganize itself, its replicas, and its partitions to continue servicing.

C: And then there are the two metrics of DR: RTO, the recovery time objective, which is how long the system is down, how soon we can get the service up and running again; and RPO, the recovery point objective, which is a measure of how much data is lost because of the disaster. It can also be a measure of how much inconsistency is created if I have multiple copies of the data.
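The two metrics can be made concrete with a small sketch (hypothetical helper and timestamps, not from the talk): RTO runs from the moment of the disaster to the moment service is restored, RPO from the last successfully replicated write to the disaster.

```python
from datetime import datetime, timedelta

def recovery_metrics(last_replicated_write, disaster_time, service_restored_time):
    # RTO: elapsed time from the disaster until service is restored.
    rto = service_restored_time - disaster_time
    # RPO: writes made after the last successful replication are lost,
    # so the loss window runs from that replication to the disaster.
    rpo = disaster_time - last_replicated_write
    return rto, rpo

# A traditional-DR scenario: hourly replication, manual failover.
rto, rpo = recovery_metrics(
    last_replicated_write=datetime(2021, 6, 1, 11, 0),
    disaster_time=datetime(2021, 6, 1, 12, 0),
    service_restored_time=datetime(2021, 6, 1, 16, 0),
)
# rto == timedelta(hours=4): four hours of downtime
# rpo == timedelta(hours=1): up to an hour of lost writes
```

Cloud native DR aims to drive both values toward zero by making detection and failover automatic rather than manual.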
C: For RTO with traditional DR, you could have good enterprises that are close to zero, but normally we see hours to recover from a disaster. In cloud native DR we want this to be close to zero: essentially, a couple of health checks have to fail, then we realize the situation is a disaster, and the system starts bringing the traffic to the healthy locations.

C: For RPO, traditionally you could again have anywhere from exactly zero to hours of loss; minutes to hours of data loss is what I usually see. In cloud native DR, you have two options.
C: You can do strongly consistent deployments, which will have exactly zero data loss, and you can do eventually consistent deployments, which may have close to zero data loss in normal circumstances. But with eventual consistency you're not guaranteed that the data, once it's reconciled, is exactly correct: it's going to be consistent, but it may not be correct for your application. That's a caveat of eventually consistent systems. And then there is the organizational perspective.
C: What we notice is that in traditional enterprises, application teams are normally formally responsible for the business continuity plan of their service, but what they do is turn around to the storage team and ask: what are the RTO and RPO that you can guarantee for the storage that we use? And then they adopt that as their standard. So, de facto, the storage team is the driver of the enterprise DR strategy.

C: Instead, we are arguing here that with cloud native DR it's going to be the application team that will have to choose the correct piece of middleware, the correct product, to handle their state, and then they will use that product to organize the DR procedure around it.

C: One other observation that came up while working with this new technology is that for traditional enterprises the DR capabilities come mostly from the storage layer. For cloud native DR, we notice that these capabilities come more from networking: we see the need for east-west communication between your failure domains, which could be your regions or your data centers, so that the instances of this new generation of middleware can find each other and cluster up.
B: I was just going to say that one of the key things here, what we're talking about, and again keep me honest, Raffaele, is that we're in a cloud native world where applications are effectively composable and the infrastructure is declarative.

B: What we're saying is that cloud native gives you a lot of the tools to automate and manage disaster recovery, just like any other healing process that would take place in a standard cloud native environment. We acknowledge the fact that this isn't necessarily straightforward, and some of these technologies are fairly advanced. But the point is that these new cloud native architectures, and Raffaele will talk about a reference architecture with some of the options available, actually enable organizations to have automated failover and automated disaster recovery processes, with better metrics in terms of RTO and RPO than you'd get with the manual failovers and manual tasks that you have in a traditional system.

B: I think that is extremely exciting, because what we're effectively saying is that we've made applications composable, developers can declare what their applications need from an environment, and now we're taking a step further and saying that that also applies to the disaster recovery process.
C: Right, exactly, and it is exciting; I find it very exciting. Maybe some of you doubt that it's even possible, or may tell the story that I've been told many times: I can get RTO and RPO as close to zero as I want, but the cost increases exponentially.

C: It's actually a matter of composing the architecture in the right way. The cost does not increase exponentially, and you can actually reach these numbers with a relatively inexpensive deployment; I use the word "relatively", but it's certainly not something that grows exponentially. So, talking about how we can build these architectures: here we are showing a strongly consistent deployment, where we have a stateful workload that is capable of handling the horizontal state synchronization in the proper way, guaranteeing that there are correct replicas in the correct regions or data centers.
C: We need three failure domains, in this case data centers, because otherwise we wouldn't reach quorum. As I said, in front of the stateful workload we normally have a front end, probably a stateless front end, and in front of that we will have a global load balancer. This is a very generic blueprint that you can then reuse in several situations.

C: I failed to mention: the stateful workload has storage that comes from the local data centers, but, as you can see, we don't rely on storage capabilities to replicate across data centers. All of the synchronization is handled by the middleware, by the stateful workload.

C: So how can we build these stateful workloads? Because obviously we are now relying more on the middleware side. I have a couple of slides to talk about this from a conceptual standpoint.
C: A failure domain is an area of our system that can go down, or can fail, due to a single event. Nodes can be failure domains; racks, clusters, network zones, availability zones, regions, and data centers are all failure domains, and if you look at them, they contain each other.

C: For this discussion, our failure domain of reference is the data center. By disaster we mean that we lose an entire failure domain, in particular an entire data center if we don't specify otherwise, and disaster recovery is what happens when that happens: what is my strategy?

C: High availability is a slightly different concept. High availability is about what happens when something breaks within a failure domain: maybe I have one fault inside the failure domain; does the service continue or not?

C: And then we have the concept of consistency, which is the property of a distributed stateful workload that all the instances observe the same state: the state is consistent across the instances.
B: And I think, and it's fine if you go to the next slide on that point, that getting strong consistency is probably the single biggest architectural challenge. Trying to ensure you have strong consistency, which is one of the key attributes in any storage layer or database layer, is effectively a balancing act between reducing latency and preserving availability. There's a very convenient theorem here, the CAP theorem: you have three things, consistency, availability, and partition tolerance, and you basically need to pick two. But I don't want to steal Raffaele's thunder, so I'll let him explain some of the details and how it applies to some of the different systems we're talking about.
C: Yeah, thank you, Alex, and you stated the theorem correctly: you usually pick two of these three, consistency, availability, and partition tolerance. I should improve the picture here. I like to state the theorem slightly differently, because I think it helps you understand the kind of choices that are made in today's software: network partitioning is not something that you control; it's a fault that happens.

C: So a way of reasoning about the CAP theorem is: assuming a network partition, what do you want your software to do? Do you want it to be consistent, or do you want it to be available? You can only pick one of those two, because the network partitioning is not something that you pick; it just happens. In this little table I'm showing some common examples.
C: In today's software the choice is usually very clear. If you choose consistency, it means you're building a strongly consistent system; if you choose availability, it means you're building an eventually consistent system. Both have pros and cons, and they both have their usages, but that's a very clear design choice that you have to make when you build software. In reality, some of these systems can even be tuned, and depending on how you tune them they can change behavior from being available to being consistent, but they are all built around this CAP theorem.

C: There is a corollary to be kept in mind about the CAP theorem, which is the PACELC corollary, I hope I pronounced that correctly. It essentially says: in the absence of a network partition, so when the network partition is not happening, you still have to choose between latency and consistency.
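The partition-time choice can be sketched in a few lines (a toy illustration, not any particular product): under a partition, a consistency-first (CP) system refuses writes it cannot commit to a majority of replicas, while an availability-first (AP) system accepts them locally and reconciles later.

```python
def write(reachable_replicas, total_replicas, mode):
    # A strict majority of replicas is needed for a CP system to commit.
    majority = total_replicas // 2 + 1
    if mode == "CP":
        # Consistency over availability: refuse writes without quorum.
        return "committed" if reachable_replicas >= majority else "unavailable"
    # Availability over consistency: accept locally, reconcile later.
    return "accepted-locally"

# A partition splits a 3-replica system 1/2:
assert write(1, 3, "CP") == "unavailable"       # CP minority side refuses writes
assert write(2, 3, "CP") == "committed"         # CP majority side still commits
assert write(1, 3, "AP") == "accepted-locally"  # AP side keeps serving, may diverge
```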
C: I was recently doing some experiments where it was very clear what the PACELC corollary means, in particular with Kafka. Kafka is one of those systems that is tunable: if you set Kafka to be consistent and then you spread the cluster across regions that have high latency between them, you get a very high latency on the response for a single communication, whether you're reading a message or producing a message.

C: Reading is always easier; that's just the nature of how this software works. It doesn't mean that with Kafka you can't still have a significant amount of throughput, but each individual transaction will have a significantly higher latency, because you have told Kafka you want it to be consistent.
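For illustration, the tuning described here maps onto standard Apache Kafka configuration keys; the concrete values below are an assumed typical setup, not the demo's actual configuration.

```python
# Consistency-leaning setup: a produce is acknowledged only after the
# in-sync replicas have the write, so each call pays the cross-region trip.
consistent_producer = {
    "acks": "all",               # wait for the in-sync replica set
    "enable.idempotence": True,  # safe retries without duplicates
}
consistent_topic = {
    "replication.factor": 3,     # one replica per failure domain
    "min.insync.replicas": 2,    # a quorum of replicas must acknowledge
}

# Availability/latency-leaning setup: leader-only acknowledgement.
fast_producer = {
    "acks": "1",  # low latency, but the write is lost if the leader dies first
}
```

With `acks": "all"` the per-message latency tracks inter-region latency, while overall throughput can still be high thanks to batching and many partitions in flight.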
C: Okay, so that's how this works, and it's incredibly convenient to have a theorem to think about these things, because you can take a piece of software that you don't understand and ask the vendor, or whoever the expert is, to talk about that software in light of the theorem and explain the choices that were made by that piece of software. Just that way, you will understand a lot about how that piece of software behaves.
B: I was just going to say, for further information: the TAG has also created a more generic storage landscape white paper, and in that we define all of the different attributes, like availability, scalability, performance, and durability, as well as consistency.

B: It's interesting to me because different systems will have different use cases, and it's probably worth noting that no one system will handle all of these cases: very strong consistency typically has, or might have, scalability or performance implications, and vice versa, for example.
C: Exactly, so ask that of your product. Now, we said we have several instances of this software running and clustering up, creating one logical instance: how do they sync? We need consensus protocols. I'm going to go quickly through this slide, but one kind of consensus protocol, which I call shared state, is used to agree on a state that all of the instances need to reflect. For this kind of protocol you can have an approach based on a leader election.

C: The leader is the one that accepts all the writes, and the other instances are just followers. The algorithms in this space that have been validated by academia are Paxos and Raft, with Raft becoming more and more popular; most of the software that we have mentioned so far uses it.
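The leader-based protocols just mentioned rest on one simple invariant, sketched below (toy code, not Raft itself): a node may lead only with votes from a strict majority of all members, so two leaders can never be elected for the same term.

```python
def has_majority(votes, cluster_size):
    # Strict majority: more than half of ALL members, not just the live ones.
    return votes > cluster_size // 2

def elect(ballots, cluster_size):
    # Count ballots (one candidate name per voter) and return the winner, if any.
    tally = {}
    for candidate in ballots:
        tally[candidate] = tally.get(candidate, 0) + 1
    for candidate, votes in tally.items():
        if has_majority(votes, cluster_size):
            return candidate
    return None  # split vote: no leader this term, a new election is needed

# 5-node cluster with one node unreachable: a leader can still be elected.
assert elect(["n1", "n1", "n1", "n2"], 5) == "n1"
# 5-node cluster partitioned 2/3: the minority side can never elect a leader.
assert elect(["n1", "n1"], 5) is None
```

This same majority rule is why the strongly consistent deployments discussed later need an odd number of failure domains: a partitioned minority simply stops accepting writes.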
C: Another thing to know, as I was saying at the beginning, is that the hardest problem that this stateful software needs to solve is really always the same: I need to sync with the other peers, and then I need to persist that data. I need to have a consensus protocol, a log of the operations that have happened, and then I have to store that information to persist it.
C: There is an interesting chapter in the SRE book from Google where they explain how you could theoretically build one piece of software that can be reused across all of these stateful workload products, because at the core it's always the same. Naturally, if you did it that way you wouldn't get optimized performance.
C: That is a theoretical approach; you still have to make your own optimizations on top of it. But in reality there are several companies, for example, that are using RocksDB, or another storage layer from Apache, with some level of consensus protocol on top to coordinate the instances.
C: So if you put everything together: you need these reliable replicated state machines, and then you can create partitions, where you separate your data so that you can scale horizontally, and replicas, where each partition is replicated in other instances so that you don't lose data when something goes down. If you look at this picture on the right, this is the anatomy of a stateful application, at least a modern one.
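The partition-plus-replica idea can be sketched in a few lines (illustrative only; real systems use their own partitioners and placement rules): keys are hashed to partitions for horizontal scale, and each partition's replicas are spread across distinct failure domains for survival.

```python
import hashlib

def partition_for(key, num_partitions):
    # Stable hash so the same key always lands on the same partition.
    digest = int(hashlib.sha256(key.encode()).hexdigest(), 16)
    return digest % num_partitions

def replica_domains(partition, domains, replication_factor):
    # Place each replica of a partition in a different failure domain.
    return [domains[(partition + i) % len(domains)] for i in range(replication_factor)]

domains = ["dc-east", "dc-central", "dc-west"]
p = partition_for("order-42", 6)
assert 0 <= p < 6
# With replication factor 3, every data center holds one copy of the partition,
# so losing any single data center loses no data.
assert sorted(replica_domains(p, domains, 3)) == sorted(domains)
```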
C: Here you have two replicas, and each replica has several partitions, so we end up with six instances. Each instance has its own storage, and then we have the coordination layer between the instances.

C: So this gives you an idea of the anatomy of these stateful workloads that we can use to build what we call cloud native disaster recovery deployments.
B: Just on that point, Raffaele, I think one of the exciting things here is that what is effectively happening is that we're layering different proven technologies.

B: You might have sharding for performance and you might have Raft protocols for consistency, but you might also have a variety of different layers in that stack, where, for example, you might have a SQL layer that's using a key-value store, that's using a sharding process, that's using a file system, and so on. So, more than ever before, it's important to understand the different layers, because at the end of the day the attributes of your system, your failover capabilities, and your DR capabilities are going to be an amalgamation of all of those different attributes.
C: Yeah, I couldn't agree more, especially on the observations you made about the interface layer, what is called the API layer here. We see more and more products now that offer, for example, a SQL interface and then also a key-value store, and it's clear what's happening: they are just adding a new API layer but reusing everything that is below it, so it's relatively easy for them to do that.
C: I would highly recommend, if you are considering a new stateful workload, asking your vendor or your experts what choices they made in this space, because that already tells you a lot about the software that you're about to purchase. Now, some considerations around strong consistency versus eventual consistency: they both can be approaches for cloud native DR, but they behave differently, for example in terms of RPO.
C: Strong consistency is about consistency, obviously, so once we create a well-done deployment we don't lose any data: the RPO is exactly zero. I've had people who couldn't believe it, but it's exactly zero; I never lose data with this, assuming, obviously, that only one disaster happens at a time. With eventual consistency, you may lose some data, theoretically an unbounded amount of data.
C: In practice, if the system is not overloaded, it's something close to zero, because it's just what was in the local cache that the system didn't have time to replicate. Another thing that you should consider is what happens when you lose one data center, one failure domain, in an eventually consistent system: like I said before, the rest will keep serving, so the two sides of the deployment may diverge in terms of data, and when they come back up they don't necessarily agree on the state.
C: So there is a reconciliation algorithm that will decide who is right, but this reconciliation algorithm may not reason the same way that your application reasons from a business perspective. What I like to say is that eventual consistency does not mean eventual correctness in business logic terms, and eventual consistency may pose additional design considerations to your developers.
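A concrete illustration of "consistent but not correct" (a toy last-writer-wins merge, not any specific product's algorithm): during a partition, both sides sell one item from the same stock of five; reconciliation converges cleanly, but the merged counter records only one of the two sales.

```python
def last_writer_wins(replica_a, replica_b):
    # Keep, per key, the (timestamp, value) pair with the newest timestamp.
    # The result is consistent, but it has no idea what the values mean
    # to the business.
    merged = dict(replica_a)
    for key, (ts, value) in replica_b.items():
        if key not in merged or ts > merged[key][0]:
            merged[key] = (ts, value)
    return merged

# During the partition, both sides decremented the same stock counter.
side_a = {"stock": (10, 4)}  # at t=10, side A recorded 5 -> 4
side_b = {"stock": (11, 4)}  # at t=11, side B recorded 5 -> 4
merged = last_writer_wins(side_a, side_b)
# Reconciled state says 4, but two sales happened: the correct value is 3.
assert merged["stock"] == (11, 4)
```

This is exactly the gap between eventual consistency and eventual business-level correctness that applications must design around.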
C: So personally, if it is at all possible, I prefer to keep things simple for the developers and choose a strongly consistent deployment.
B: I think strongly consistent is the most predictable, and one of the points here about the minimum number of failure domains is that with strongly consistent systems you effectively have an odd number of copies. Remember when we were talking about the CAP theorem and about partitions: if a node is partitioned or unavailable, the remainder of the system still has a majority of the copies of the data and can therefore make an automated decision as to who is up, who is down, and which systems are authoritative. Whereas in eventually consistent environments it can be a little bit more complicated, because effectively some of those decisions can be delegated to the application, or to reconciliation processes, which are not perfect.
C: Right. In terms of RTO, both will react in a few seconds, depending especially on the health checks that you set on the global load balancers, but also on some internal checks that the system has. In terms of latency, strongly consistent workloads have a strong sensitivity to the latency between these failure domains, which could be regions or data centers, and by and large your write latency will always be greater than the worst latency between your regions multiplied by two, because it's always a round trip, back and forth.
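That rule of thumb is simple enough to write down (illustrative arithmetic using the talk's "worst latency times two" round-trip rule; a real quorum write may do slightly better by excluding the slowest peer):

```python
def commit_latency_floor_ms(one_way_latencies_ms):
    # Rough lower bound for a strongly consistent write acknowledged across
    # regions: the worst one-way latency, paid outbound and inbound.
    return 2 * max(one_way_latencies_ms)

# One-way latencies (ms) from the coordinating region to its two peers.
assert commit_latency_floor_ms([12, 33]) == 66
```

So with a 33 ms worst one-way hop, no write can complete in under roughly 66 ms, regardless of how fast the disks are.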
C: That sets your expectation, and to me it says that I cannot use strong consistency for all use cases. I will have use cases that need really fast responses, where I can't use this kind of system and I need to find different solutions. But if the latency I get is acceptable, like I said, a strongly consistent system keeps things predictable and simple for the developers.
C: On the other side, eventually consistent systems are not affected by latency, because essentially the system first writes locally and returns, and then it tries to synchronize with the rest of the instances, so inter-failure-domain latency does not affect the client latency. That is a simplification, but it's a way to explain why they are not really affected by it. Throughput, instead, can scale linearly for both, as long as we have workloads that touch all the partitions.

C: So if the requests are normally distributed and all the partitions are involved in more or less the same way, these systems scale horizontally, essentially linearly with the number of instances. If you want more throughput, just add more instances, increase the number of partitions, and you get the throughput that you want. Then, as Alex said, strong consistency has another constraint that some of our customers find taxing, which is that you need three failure domains.
C: But they don't have a third one, so what can they do? Eventually consistent workloads, by contrast, only require two. There are solutions to get the third data center; one option is to go to the cloud, but this is certainly a constraint of strongly consistent systems.
C: We also wanted to share reference architectures for Kubernetes deployments. The first one that we shared was very generic; you could build it with VMs, but obviously we are looking at Kubernetes with special attention here. It's not very different: we still rely on the stateful workload to do the horizontal sync, we still rely on persistent volumes, provided by Kubernetes in this case, and then we have some ingress and a global load balancer in front of it that decides where the traffic goes.

C: So here, when we lose one site, essentially nothing happens: the global load balancer should realize it and just send the traffic to the other ones.
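The "nothing happens" failover can be sketched as a health-check-driven routing decision (a minimal sketch; real global load balancers add DNS TTLs, connection draining, and flap damping):

```python
def route(endpoints, health):
    # Send traffic only to failure domains whose health checks pass;
    # the failover decision is automatic, not a human one.
    healthy = [e for e in endpoints if health[e]]
    if not healthy:
        raise RuntimeError("no healthy failure domain left")
    return healthy

endpoints = ["us-east", "us-central", "us-west"]
health = {"us-east": False, "us-central": True, "us-west": True}  # us-east lost
assert route(endpoints, health) == ["us-central", "us-west"]
```

A couple of consecutive failed health checks mark a site unhealthy, and traffic flows to the survivors; this is the autonomous detection contrasted earlier with the human decision in traditional DR.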
C: Another thing that you should ask yourself when you build this architecture is: what happens if the clients can connect to my workloads, but there is a network partition between some of the data centers, so one is isolated? For strongly consistent workloads this is essentially equivalent to losing that data center, because all of its instances will become inactive, since they don't have quorum. So the global load balancer should realize that this site is not responding and send all the traffic to the remaining two; the behavior should be exactly the same.
C: We also have a reference architecture for eventual consistency. It's similar, except that you only need two failure domains, or two data centers. Here the conversation is slightly different. If you lose an entire data center, the global load balancer can only send you to the remaining one, and there is no real state divergence, because there are no writes on the data center that you have lost. It's a different story when you lose connectivity between the stateful workloads but the clients can still write.
C: In that case you can have divergence of the state, and that's the conversation we were having before: this is a situation where, at some point, the fault will be corrected, the connectivity will be reestablished, and the reconciliation algorithm will kick in, but the result that you get as the final reconciled state is not necessarily aligned with what your application needs.
C: Okay, we have some reference material here if you want to go a little bit deeper: our white paper and some blog posts about building these architectures in practice.
C: We don't focus on cost, because we try to be product agnostic. My only consideration is that you have a third data center, and that may be a significant cost, depending on how you decide to implement that data center. If you go to the cloud, it's not necessarily a huge cost; it depends on how much you deploy.
C
It's a pay-as-you-go model. But if you actually build a physical third data center, that's a huge investment up front. The other consideration is that some companies may be running on software that is not capable of being deployed this way, and so they may be facing a migration.
A
C
Yeah, no, keep asking questions; in the meantime I'm going to describe this environment. Here I have three clusters. These represent my three regions, or data centers. They are in Google Cloud, and they are in different regions: as you can see, us-east, us-central and us-west.
C
Here, for example, I have one of these clusters. Obviously I'm from Red Hat, so I'm using OpenShift because it's easy for me, but there's nothing involved that implies OpenShift; you can do everything with Kubernetes. So here I have Kafka deployed this way: you see three instances of Kafka in this cluster; this is the second cluster, three more instances; this is the third cluster, three more instances.
C
These are all talking to each other, and I have, for example, a Kafka console here in which I can see that, if I go here, I really have a nine-node Kafka cluster.
C
You can see from the names of the instances that I have cluster-three, cluster-one, cluster-two; so all of the instances that are distributed in the different OpenShift clusters come together to create a single logical Kafka instance.
C
It uses this cluster-set notation, and it also has the name of the cluster in which the pod is, so these are actually the pods that we were seeing before. And I think I have a topic defined here; let's look at this topic.
C
Look at the partitions: it has nine partitions, so it's well balanced on the available nodes of these clusters. I configured Kafka to be strongly consistent, so each partition has three replicas, and each of these replicas is in a different region. So if I lose a region, one of these will become red, but I still have two replicas and I can continue working.
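The placement just described, one replica of every partition per region, can be sketched as follows. This is a hypothetical round-robin assignment for illustration (the broker names and region labels are invented), similar in spirit to Kafka's rack-aware replica assignment:

```python
# Nine brokers, three per region, mirroring the demo topology.
REGIONS = ["us-east", "us-central", "us-west"]
BROKERS = {f"kafka-{r}-{i}": r for r in REGIONS for i in range(3)}

def assign_replicas(partition: int, replication_factor: int = 3) -> list:
    """Pick one broker from each region, rotating so load stays balanced."""
    replicas = []
    for offset, region in enumerate(REGIONS):
        candidates = [b for b, r in BROKERS.items() if r == region]
        replicas.append(candidates[(partition + offset) % len(candidates)])
    return replicas

# Every partition ends up with one replica in each region, so losing a
# whole region still leaves two in-sync replicas out of three.
assignment = {p: assign_replicas(p) for p in range(9)}
```

With this spread, a setting such as `min.insync.replicas=2` (an assumed configuration, consistent with the strong-consistency setup described) keeps the topic writable after the loss of any single region.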
C
So that's how you can set up your workload. In this experiment that I'm running, I also have a CockroachDB deployment.
C
So CockroachDB gives you a distributed SQL database.
C
These are the nodes. Also in this case I'm using nine nodes, three in each region, and each node within a region is in a different AZ, so we're trying to get both local availability and global, geographical availability. And we can see another nice feature of CockroachDB that I like: for example, it calculates the latency between the instances and then uses this information to do some internal optimization.
C
But, as you can see, this is obviously a symmetric matrix, or mostly symmetric, and from west to east is where we have the highest latency, around 60 milliseconds.
C
So that tells us, based on what I said before, that once we have distributed the data across the three regions, the best latency that we can get from a transaction here is going to be around 120 milliseconds, or something above 120: you'll pay 120 milliseconds just for the network, and then there is processing, writing, persisting and all of that. So you can immediately start reasoning about what kind of workloads you can run on this database. It's a trade-off.
C
That's it. I think we don't have time to simulate a disaster and see how the system reacts. We would see that both Kafka and Cockroach can auto-detect failures, and they will start reacting to them: we would see that we lose some nodes, we would see that some of the ranges, which are essentially the partitions, get moved around, and the system, like I said, readjusts to the new situation. Kafka does the same thing.
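The auto-detection mentioned here boils down to some form of heartbeat-and-timeout check. The sketch below is an assumed, generic model, not the actual implementation of either Kafka or CockroachDB:

```python
HEARTBEAT_TIMEOUT_S = 5.0  # assumed threshold for declaring a node dead

def alive(last_heartbeat: float, now: float) -> bool:
    """A node is considered alive if it was heard from within the timeout."""
    return (now - last_heartbeat) <= HEARTBEAT_TIMEOUT_S

# Hypothetical last-heartbeat timestamps (seconds) for three nodes:
last_seen = {"node-1": 100.0, "node-2": 100.0, "node-3": 96.0}
now = 102.0

dead = [n for n, t in last_seen.items() if not alive(t, now)]
# node-3 has been silent for 6 s, beyond the timeout, so its partitions
# (or ranges, in CockroachDB terms) would be reassigned to the survivors.
```

Once a node is declared dead, the rebalancing the speaker describes (moving ranges or partition leadership to surviving nodes) kicks in automatically, with no operator intervention.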
B
I think one of the key takeaways here is that in a cloud native world we now have the capability of implementing disaster recovery with different storage systems and different databases and different tools, and in fact you get an order of magnitude better automation and better flexibility than you do with your traditional systems.
B
This is sort of the next logical step for many enterprises as they adopt cloud native technologies, Kubernetes, OpenShift and cloud native storage solutions, and as they look to migrate more mature and more mission-critical workloads that require disaster recovery. So my key takeaway here is to understand.
B
Understand the different layers in your system; understand the different attributes, like the latency, performance and consistency requirements of your applications; and then absolutely take advantage of the composability and the declarative nature of cloud native disaster recovery and all the advantages that brings to your application. And, you know, touch wood, sleep better at night.
A
Thank you. And we had a question come in from the LinkedIn event that I think we'll have to take in no longer than a minute, and it's a great question: what considerations are required for cloud native disaster recovery in a heterogeneous environment? If either one of you wants to take it in one minute; otherwise we can push that to the LinkedIn chat.
A
C
Yeah, I assume by heterogeneous we mean that we don't have a homogeneous cloud provider or infrastructure underneath. Like I said, this architecture relies on capabilities in the networking space.
C
So as long as we can do that east-west communication, and we can discover the instances of our stateful workload on the remote failure domain, and as long as we can set up some level of global load balancing, we will be able to create these architectures. In fact we are doing it, and in collaborating with some of these vendors, what we're noticing is a question of predictability.
C
In these deployments you would like all of these instances to behave the same. But how do you, for example, provision the same IOPS across different cloud providers? They all give this capability, or this SLA, in a different way. Or how do you provision the same computing power? They are slightly different. So those are the things that you may encounter, but there is actually no blocker to building these architectures across a heterogeneous environment.
B
In fact, and I'll just take one more second here, I would argue that tools like Kubernetes are actually designed to abstract your infrastructure and to give developers the capability of getting the same services from different systems. And I think the glue that holds that together, then, is the east-west networking and the load balancing services and things like that on top of it. That's it.