From YouTube: CNCF SIG Storage 2021-03-10
A: All right, we'll just wait a minute or two, and then we can move into the DR discussion.
B: No, you should be able to set it to anyone with the link, view only, and then after this presentation it's probably best to keep it as anyone with the link, view only.
A: Yes. Raffaele, if it's sort of on your internal side and you can't share it, I can make a copy and share that instead, if you want.
C: I could probably do that too, but yeah, it's basically what Alex is doing. So after today I'm going to use his copy if there are any updates. Cool.
A: All righty then, I think it's five past, so we should probably start. I'll hand over to Raffaele, who has put a few slides together to summarize the documents we've been working on, to see if it makes it easier to get some of the concepts across and maybe gather some feedback. Over to you, Raffaele.
C: Thank you. Yes, as you know, there is a much more detailed document that we are considering, maybe, to publish, where my intention (and hopefully it becomes the group's intention) is to talk about a cloud-native approach to disaster recovery.
C: I created this presentation just to make it easier to communicate, but that document has a ton more detail. The idea is that you can still use all of your traditional DR approaches, but we think there is maybe a new way to do things with cloud native, and so we're going to talk about that. Again, it's not to say that the other approaches cannot be used.
C: Typically those are the active-passive approaches. We're going to try to give a definition of what cloud-native disaster recovery might mean, and this is my slide comparing the two approaches, so I'm going to go through it line by line. As far as disaster detection and the DR procedure go: traditionally there is a human decision.
C: Something goes wrong and somebody triggers the DR procedure. Maybe the DR procedure itself is automated, but very often there is a human decision. What we want to get to with cloud native, what we define as cloud-native disaster detection, is that it's going to be autonomous: the system automatically understands the disaster and triggers whatever reaction needs to be triggered.
C: As for the recovery time objective: it cannot be exactly zero, because there are caches and load balancers that need to switch, but it can be very close to zero, whereas in modern traditional data centers we normally see it being around minutes to hours. The recovery point objective, that is, how much data I lose or how much inconsistency I have between copies of my data, in traditional disaster recovery I see being between zero and hours, depending on how we do the sync or the backup and restore. In cloud native...
D: When you say zero, or almost zero seconds, is that because the assumption is that it's the same cloud, or the same region?
D: When you do DR, not across far regions or across clouds?
C: No, that's not the assumption. The assumption is we have geographically distributed workloads, possibly across different clouds, and we still get near zero.
D: Okay, interesting. Thank you.
C: And going back to ownership: in cloud native it's an application responsibility.
C: The other observation I made is that, in terms of technical capabilities, in traditional disaster recovery we leverage capabilities mostly from the storage side, so backups, volume syncs, and those kinds of capabilities. But to build this cloud-native disaster recovery infrastructure or architecture...
C: ...we need capabilities from networking. In particular, we're going to see that we need the ability to communicate east-west. If I am in two different clouds, these clouds have to be able to communicate horizontally, east-west, and we need a good global load balancer capability; that's where the switch happens.
A: I think we may need to differentiate between what the high-level objectives are and what happens in reality, because a recovery point objective of zero is certainly doable and plausible, but it also implies that every transaction, every database action, every file action, whatever the application is using, is going to happen synchronously across multiple sites, which may or may not be the case.
C: We need to get to zero. The point I'm trying to make is that now you can get to zero and it's not that complicated. The narrative used to be that you can make DR as good as you want, as long as you're willing to spend a lot of money. I think with a cloud-native approach that narrative changes a little bit.
C: These architectures are not that much more expensive than the traditional active-passive ones, so much so that in an article I wrote about this I called it the democratization of zero downtime during a disaster, because I think anyone who can swipe a credit card and start deploying on different clouds can achieve this.
G: I think it's not just about cost, to Alex's point. We had the presentation last week, I think it was called Full FS or something, where they were talking about performance trade-offs and how they were not POSIX compliant. So I think it's not just about cost; it's also about performance, because if you want to have zero RPO you have to write to every single zone and get back acknowledgements. So it's not all about cost.
C: These could be regions, if you're thinking about the cloud, or they could be your three data centers in different geographical localities if you think about on-premise; it doesn't matter, the architecture works either way. There is a stateful workload, imagine a database or a queue, that is distributed across these data centers to form a single logical entity, but obviously there are different instances, and these instances communicate with each other via this horizontal, east-west ability to communicate.
C: We don't need to know, in this reference architecture, how that is implemented, but they need to be able to communicate east-west: find each other, look each other up, discover each other, and communicate. That's how they achieve data sync, or state sync. Each of them will have a volume.
C: So we need storage, of course; storage doesn't go away. But we don't ask that volume, that storage implementation, to have any particular capabilities besides the ability to store data. Then we can imagine that there is a front end (or maybe just direct connections, but probably there's going to be a stateless front end), and then there is a global load balancer. The idea is that one of these regions goes down because of a disaster.
C: First, the stateful workload adjusts itself, because it has some kind of leader election and state-sync protocol (we can analyze those in detail); it adjusts itself almost instantaneously and there is no data loss. Then the global load balancer has some level of health checks, so clients will start going only to the regions that are active. So we reacted to a disaster in a completely autonomous way and the clients keep working.
C: Maybe they get a glitch of a few seconds. I work with a database where the glitch can be up to nine seconds, but then everything continues to function normally.
C: So that's the idea; that's the general model. The trick is to find stateful workloads that can actually work that way, and there are some prerequisites they have to implement in order to do this. I just mentioned queues and databases, but obviously it could also be distributed storage.
F: Sorry, I was wondering: in this reference architecture we're basically saying that for DR, the state sync is always going to be done by the application, right? And the application has to have the ability to operate replicas across different data centers, which might potentially have very high latency. Is there any other model we consider, or is this going to be our only reference architecture for DR?
C: Well, for this model that's how the application needs to work. Like I said, and like the slide says, you can still do your active-passive or master-slave models, the ones that have always worked, but you don't get all the automation that I described.
A: Maybe I'd like to suggest a slight refinement here, mostly to do with terminology. When we put the storage landscape paper together, we talked about different ways of persisting data: that could be some sort of volume, but it could also be app-level things like a database, or key-value stores, or object stores, for example, which are also...
A
You
know
valid
ways
of
of
persisting
data,
and
you
know
whether
it's
distributed
storage,
that's
providing
volumes
or
a
distributed
database
or
a
distributed
key
value
store
right.
I
think
what
we're
kind
of
saying
here
is
the
stateful
workload
needs
to
have
a
distributed
way
of
persisting
the
data
and
that
that
could
be.
You
know
distributed
volumes,
it
could
be.
You
know
like
like
a
distributed
file
system
or
a
distributed
storage
system,
it
could
be
a
distributed
database.
A: Like a CockroachDB or a YugabyteDB or Vitess or something, or it could be a distributed object store, and in that case you have that sort of functionality available to the application. So I think it would be useful to frame the stateful workload as some sort of distributed storage layer, where the volume is ultimately where the data is persisted.
C: Right, I agree. I didn't specify what service this stateful workload offers to the green layer in the diagram.
C: It could be anything, but yes, I think I can improve this slide by adding that piece of information. This volume here, like I said, could be the disk on which the stateful workload is running, or it could be another layer of software-defined storage. It doesn't really matter, because the state sync is managed at this layer, the blue layer.
C: Okay, so as I was saying, the document that I wrote tries to explain why this is technically feasible, because you might say, "I don't believe this can be done; we haven't done it for many years, so why is it possible now?" So I try to explain why, and I just have to remind you of a few concepts.
C: I think everybody knows what high availability and disaster recovery are; I just want to define them in relation to what a failure domain is. A failure domain is an area of our IT system where a single event could make everything running in that area fail. It could be a node, for example, which means all the processes running on the node fail.
C: It could be a rack, which means all the nodes in the rack fail, or a cabinet, a cluster, a network zone, a data center. So a failure domain is sort of a fractal concept that looks similar at different scales. What we need to remember for this discussion is that when we talk about high availability relative to a failure domain, we are really asking the question: what happens when one component fails within this failure domain?
C: What happens to my system when one component fails, assuming an HA of one, a fault tolerance of one? I'm assuming that is what we mean by high availability. When we talk about disaster recovery, we are really asking the question: what happens if everything in this failure domain is lost? The failure domain itself fails; what happens to my system? Obviously I need to have other failure domains somewhere, and normally in this case the failure domain is the data center.
C
But
conventionally,
when
we
talk
about
disaster
recovery,
we
talk
about
an
entire
data
center
going
down.
Okay.
So,
with
this
in
mind,
I'll
continue
because
I'm
sure
everybody
knows
this-
this
concept
this
consistently
here
we
we
mean
we
mean
that
all
instances
are
observed
in
the
same
state
and
are
reporting
on
the
same
state.
A: Yeah, I think we define consistency in the white paper pretty well. I really like this slide. What we should pop out of it is that high availability is about recovery from a single point of failure, or something like that, whereas with disaster recovery we're talking about the failure of an entire failure domain. That's a really useful differentiation to have.
C: Right, and I felt the need to make this differentiation because these two concepts, when you talk to customers these days, are starting to overlap. Rightfully so, they would like to treat a disaster recovery event as if it were an HA event, and in the theoretical model I just described that is exactly what happens. Unfortunately that also brings confusion between the two concepts, so it's important to understand what the difference is. Continuing on.
C: The other thing we need to remember is the CAP theorem; I'm pretty sure you all know what it is. The common way to explain it is that, thinking about consistency, availability, and partition tolerance, you can pick two but not all three.
C: I like to tell it in a slightly different way that I think helps in this discussion, which is that you don't choose partitioning: network partitioning is something that happens.
C: Network errors will happen. So, assuming you need to be partition tolerant, how do you design your workload: do you design it to be available, or do you design it to be consistent? That's really, in my mind, the choice you have, and I have here a table showing some of these choices made by some products.
C
You
should
obviously
every
state
for
workload
that
attempts
to
solve
that
attempts
to
be
distributed
has
to
deal
with
this
theorem
and
that's
to
make
a
choice
here.
A: Hey Raffaele, sorry, just one step back. When you say the CAP choice for those examples, for example that MongoDB is "consistency", is that as in it allows deferred consistency, or is it optimizing for uptime? What did you mean by that?
C: When you choose consistency in the CAP theorem, it means that when the system goes into network partitioning, which is when one piece of the system is working but can no longer establish what the other piece of the system is doing, it puts itself into a not-available state. That could mean read-only, or maybe just rejecting calls, because the objective is to keep the state consistent.
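To make that "choose consistency" behaviour concrete, here is a minimal hypothetical sketch (not from the meeting; the class and values are illustrative): a replica group accepts writes only while it can reach a majority of its peers, and otherwise degrades to read-only and rejects calls.

```python
# Hypothetical sketch: a CP-style replica group that rejects writes when it
# cannot reach a majority of peers, i.e. it chooses consistency over
# availability during a network partition.

class QuorumStore:
    def __init__(self, my_id, peer_ids):
        self.my_id = my_id
        self.peer_ids = peer_ids          # other replicas in the group
        self.reachable = set(peer_ids)    # updated by a failure detector
        self.data = {}

    def _have_majority(self):
        # Count ourselves plus every peer we can currently reach.
        cluster_size = len(self.peer_ids) + 1
        return (len(self.reachable) + 1) > cluster_size // 2

    def write(self, key, value):
        if not self._have_majority():
            # Minority side of a partition: refuse the call rather than
            # accept a write that could later conflict.
            raise RuntimeError("not available: no quorum, rejecting write")
        self.data[key] = value            # a real system also replicates to peers
        return "ok"

    def read(self, key):
        # Reads may still be served (read-only mode) even without quorum.
        return self.data.get(key)


# Example: a 3-replica group that loses contact with both peers.
store = QuorumStore("a", ["b", "c"])
store.write("balance", 100)               # quorum present, write accepted
store.reachable = set()                   # partition: peers unreachable
print(store.read("balance"))              # reads still work: 100
try:
    store.write("balance", 200)
except RuntimeError as err:
    print(err)                            # write rejected to stay consistent
```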
C: Eventual consistency is an appealing approach, and I think it has been explored a lot, but there is now an emerging line of thought, which you may or may not agree with, that eventual consistency is kind of a dangerous path, because eventual consistency does not imply eventual correctness.
C
It
just
implies
that
at
some
point
in
the
future-
and
there
is
really
no
sla
that
you
can
put
on
that
statement,
but
just
at
some
point
in
the
future,
all
the
instances
will
agree
on
the
state.
It
doesn't
mean
that
the
state
is
what
you
would
have
expected
from
a
business
stand
to
business
logic,
state
point
of
view
right,
so
so
the
developers
now
have
to
take
extra
care
to
make
sure
that
they
catch
these
these
incorrect
inc.
C: ...consistency decisions, because there is a conflict-resolution algorithm in the stateful workload that, when there is an inconsistency, decides who is right, maybe with a timestamp or something else. So now the developers have to take care of that, and there are some papers, from Google and others, where they discussed...
C
How
painful
was
to
remedy
those
kind
of
things,
and
so
at
least
this,
this
line
of
thought-
and
I
I
like
that
line
of
thought-
is:
let's
keep
everything
consistent
consistent
with
the
risk
of
taking
an
outage,
but
it's
it's
simpler
for
from
a
developer
point
of
view
to
in
many
in
many
cases,
right
it's
simpler
to
to
operate
that
way.
There
are
situations
where,
in
consensus,
it
doesn't
matter
too
much,
and
so
in
those
cases
it's
fine
to
use
to
use
those
databases,
but
I
work
a
lot
in
financial
institutions.
C
Consistency
is
important,
it's
very
important
there,
so
I'm
going
guide,
but
but
this
this
was
to
explain
why
I
focused
here
on
consistency,
so
the
and
and
that's
that
is
really
what
we
mean
when
we
say
zero,
rpo
right,
it's
there
is
no.
There
is
no
inconsistency.
There
is
no
data
loss,
so
consensus
protocol.
C
I
invented
these
two
definitions,
share
state
and
then
share
state.
This
is
really
my
terminology
and
we
can
change
it,
but
the
concept
is
like
well,
let's,
let's
find
consensus.
Protocol
first
is
the
idea
that
I
have
d.
I
have
distributed
workload
that
is,
needs
to
act
as
a
single
logical
entity,
so
they,
the
various
instances,
need
to
agree
on
actions
to
be
taken
right
and
there
are
two
kind
of
protocols
to
agree
on
actions.
The
one
in
the
first
one
here
share
state
is
when
we
have
to
agree
on
all
of
us.
C: The major algorithms in this area are Paxos and Raft, and Raft is gaining popularity because it's much easier to understand. I couldn't even understand Paxos; it's just magic...
C
If
you
try
to
read
it
and
then
there
is
a
shared
state
consensus
protocol,
where
the
participants
to
these
orchestrations
really
are
can
potentially
do
different
actions,
so
maybe
I'm
writing
to
a
database,
and
then
another
participant
is
sending
a
message
in
a
queue
right.
So
in
this
case
we
have
the
this
historically
well-known,
two-phase,
commit
and
three-phase
commit
algorithm,
but
notice
that
these
algorithms
require
all
of
their
instances
to
be
online.
C: We cannot ask them later. So there are a couple of papers from Google showing that, based on these shared-state consensus algorithms, you can build a reliable replicated state machine, which means there is a generic way of agreeing on a state. Raft then gives a generic way of agreeing not just on a state but on a series of actions to be taken, with the concept of an operation log; that is really the state being shared between the instances, and every instance has to apply the operations written in the operation log. Then, building on top of this concept...
C
There
is
the
concept
of
reliable,
replicated
data
store.
Where
now
the
action
here
is
I
have
I
was.
I
have
a
series
of
operations.
I
have
a
log
of
operation
to
to
do,
but
really
the
operation
is
to
write
something
on
on
on
at
that
store.
So
this
is
a
concept
very
highly
reusable
concept
that
could
be
implemented
generically
and
then,
on
top
of
this
I
could.
I
could
put
an
api
to
serve
some
kind
of
storage
service
right,
so
it
could
be
an
api
to
do
q.
C: ...because if you look at the Apache BookKeeper project, here in the node on the left, that is exactly a reliable replicated data store. The operation they abstract is really what Kafka does: append-only operations to a sort of file-system file. In fact, Apache BookKeeper is being used to implement a highly distributed, geographically distributed queue system, Pulsar.
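As a rough, hypothetical illustration of the replicated-state-machine idea described above (consensus itself is stubbed out here; a real system would replicate the log with Raft or Paxos), every replica applies the same agreed operation log in order and converges on the same state:

```python
# Hypothetical sketch of a replicated state machine built on an agreed,
# append-only operation log: every replica applies the same operations in
# the same order, so all replicas converge on the same state.

class Replica:
    def __init__(self, name):
        self.name = name
        self.state = {}       # the materialized state
        self.applied = 0      # index of the last applied log entry

    def apply_from(self, log):
        # Apply, in order, every log entry this replica has not seen yet.
        for op, key, value in log[self.applied:]:
            if op == "set":
                self.state[key] = value
            elif op == "delete":
                self.state.pop(key, None)
        self.applied = len(log)


# In a real system the log is replicated by a consensus protocol; an entry
# is only appended once a majority of replicas has acknowledged it.
operation_log = [("set", "x", 1), ("set", "y", 2), ("delete", "x", None)]

replicas = [Replica("r1"), Replica("r2"), Replica("r3")]
for r in replicas:
    r.apply_from(operation_log)

# All replicas end up with the same state: {'y': 2}
assert all(r.state == {"y": 2} for r in replicas)
```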
C: So, putting it all together: a stateful workload can have replicas, and we have just seen how we can coordinate those replicas with Paxos or Raft. Then we can have partitions, meaning I partition the data set so that each group of replicas manages a subset of the data set, and I do that to be able to scale horizontally. And between partitions...
C: ...I can use the unshared-state protocol, and that is how I can create a highly scalable, distributed stateful workload. Here I have collected some examples of these workloads, because there are starting to be many.
C: Some of them don't support partitions. Some of them don't have inter-partition operations: they support partitions, but you can only work with a single partition at any given time. In general I thought it was a good exercise, and these are actually the right questions to ask if you are examining a stateful workload and deciding whether you want to use it or not.
C: Okay, the other thing I looked at... sorry, go ahead.
C: Terminology, right. So a client may try to do an operation that needs to touch multiple partitions; so far so good. For example, I think in Elasticsearch each index is a different partition, or something similar, so if I try to add a document to two indexes in a single transaction, I need to do that operation across these two partitions.
G: Maybe I wasn't being very clear. I guess it kind of depends on how you look at it, because replicas can also denote partitions, or at least... but here basically you're meaning, if you're doing...
C: Okay, let's say my data set goes from A to Z. I could say that I want partition A to deal with A to M, say, and partition B to deal with N to Z. So I divide my entire data set into different ranges, and each partition is essentially a standalone stateful workload operating on a shorter interval of that data set, and one partition doesn't have to know anything about the other partitions except...
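A minimal sketch of the key-range sharding just described, using the A-to-M and N-to-Z ranges from the example (the class names here are illustrative, not any particular product's API):

```python
# Hypothetical sketch of range-based sharding: the key space A-Z is split
# into two ranges, and each shard (itself a replicated stateful workload)
# only stores keys that fall inside its own range.

class Shard:
    def __init__(self, low, high):
        self.low, self.high = low, high   # inclusive key range, e.g. 'a'..'m'
        self.data = {}

    def owns(self, key):
        return self.low <= key[0].lower() <= self.high

    def put(self, key, value):
        assert self.owns(key), f"key {key!r} does not belong to this shard"
        self.data[key] = value


class RangeRouter:
    """Routes each key to the single shard that owns its range."""
    def __init__(self, shards):
        self.shards = shards

    def put(self, key, value):
        for shard in self.shards:
            if shard.owns(key):
                shard.put(key, value)
                return
        raise KeyError(f"no shard owns key {key!r}")


router = RangeRouter([Shard("a", "m"), Shard("n", "z")])
router.put("apple", 1)     # lands on the a-m shard
router.put("zebra", 2)     # lands on the n-z shard
# A transaction touching keys in both ranges needs cross-shard coordination
# (e.g. two-phase commit), which is the point raised in the discussion above.
```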
A: Yeah, all right, maybe I see what you're pointing out. I think the use of "partitions" here is possibly unhelpful from a terminology point of view, and maybe it would be easier if we just called them shards, simply because partitioning, as in the verb applied in the CAP theorem, is different from partitions when we're referring to a shard.
A: Yeah, in fact this was one of the things we debated when we were putting the landscape together, and we ended up putting in a table to describe shards, replicated shards, and sharded replicas as well, because different storage systems apply them in different ways. But yeah, I think it would just make it easier for everyone if we called them shards on this slide.
C: I can do that, cool, taking notes. All right. Sorry to the gentleman that was asking the question.
C: Okay, cool. So these are just some databases, some stateful workloads, that I have classified along those parameters.
C: What I have explained so far is really generic and would work anywhere, with any deployment, but I thought we could take a closer look at Kubernetes and how this would work there. It's essentially the same slide as before, except that now there is a Kubernetes cluster in which our workload is running.
C
So
we
can,
we
can
translate
it
to
more
close
more
closely
to
kubernetes
concepts.
So
we
have
a
persistent.
We
will
have
a
persistent
volume,
we
will
have
ingresses
the
global
load.
Balancer
has
to
load
balance
for
you
know
to
these
ingress,
or
you
know,
ingress
is
using
generic
terms.
This
could
be
a
load
balancer
service
or
it
could
be
an
ingress
object,
and
this
is
where
you
see
better
well.
What
I
meant
by
I
need
to
have
this
east-west
cap.
C: ...networking capability, because building that across clusters is not necessarily straightforward today with Kubernetes; it can depend on the CNI implementation you're using or the cloud where you're running.
C: I didn't set up the demo. I mean, I have it set up, but I wasn't planning to run it today, and I don't know how much time we have; I could certainly run it in one of the next meetings. But just to explain what the demo is about: we would have this Cockroach database distributed across clusters in three different regions. Right now my setup is on AWS, but it could be...
C
It
could
be
anything
we
deploy
a
network
channel
in
the
case
of
I'm
running
an
openshift.
So
in
the
case
of
network
in
the
case
of
openshift,
we
need
to
deploy
a
network
channel
to
make
this
cluster
be
able
to
talk
to
each
other
in
a
horizontal
way.
So
without
doing
eagers
and
ingress
like
this,
we
are
essentially
merging
the
sdns
into
a
single,
larger.
C
Network,
so
that
everything
is
routable
and
discoverable
to
do
that,
we
use
a
pro.
We
use
an
operator
and
a
product
called
submariner,
which
was
initially
developed
by
a
rancher,
but
now,
I
think,
is
joining
the
cncf
as
a
product
and
that
basically,
it
establishes
a
ipsec
based
vpn
across
the
across
the
sdns
of
the
clusters
and
then
with
I
deploy
a
global
load
balancer
with
health
checks
on
route
53,
using
using
an
operator
that
talks
to
route
53
and
and
makes
this
configuration.
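As a simplified, hypothetical illustration of what those global-load-balancer health checks accomplish (this is plain Python, not the actual Route 53 operator configuration, and the endpoints are made up): each region's ingress is probed and only healthy regions are handed out to clients.

```python
# Hypothetical sketch of global-load-balancer behaviour: probe the ingress
# endpoint of each region and only return healthy regions to clients.
# A real setup (e.g. Route 53 health checks) does this in DNS; here it is
# simulated with plain HTTP probes.

import urllib.request
import urllib.error

# Illustrative endpoints; in the demo these would be the ingress hostnames
# of the three clusters.
REGION_ENDPOINTS = {
    "us-east-1": "https://db.us-east-1.example.com/health",
    "us-west-2": "https://db.us-west-2.example.com/health",
    "eu-west-1": "https://db.eu-west-1.example.com/health",
}

def is_healthy(url, timeout=2):
    """Return True if the endpoint answers the health probe with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False

def healthy_regions():
    """The set of regions a client should be directed to right now."""
    return [region for region, url in REGION_ENDPOINTS.items() if is_healthy(url)]

# When a region is taken down in the demo, it simply disappears from this
# list and clients keep talking to the remaining regions.
print(healthy_regions())
```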
C: What is it called, local majority: so local leader election and then global leader election. We have nine instances, but they behave like a single cross-regional entity. The way the demo works is that I take down one region and we see that the clients just keep working normally. We set up some clients here that run the TPC-C test, which is a standard SQL test for highly transactional workloads.
A: Hey Raffaele, so is some sort of network handling, like Submariner, mandatory for this sort of architecture, or...?
C: All of these stateful workloads work the same way: each instance needs to discover and establish a peer-to-peer connection with all the other ones. That's necessary for the Raft coordination to work, so discovery and connectivity are needed.
C: How you implement it is up to you. For example, I know that if you use the Google Kubernetes service, you can build the cluster and switch a flag so that, if all the other clusters are in Google regions, they will just be able to talk directly.
C: They give it to you. But other distributions of Kubernetes may not have this capability, so you have to provide it somehow. I can only speak for OpenShift on this particular capability, and that's how we're doing it.
A: Right, and in this example the database has nine nodes total, three in each region. Does that behave like a single logical database? Is that the gist of this?
C: Yes, that's exactly what happens, and it's actually nice to see. If you want to see the demo, I'll be very happy to show it to you next time. But yes, it behaves like a single database from the client's perspective.
E: Could you highlight on this slide, just to follow the previous slides, where you're doing your replicas and where you're doing your shards? I mean, it's pretty obvious that you want your replicas in the regions and then you're sharding within that, but since you have three and three here, it's not clear.
C: So Cockroach, based on how you use the data, can reshard and can decide how to shard. You can hint how to shard tables when you create them, but you don't have to; it knows what to do. I think they use the name "tablets" for shards, so that's yet another name. It creates its own tablets, you don't have to decide, and these are nine replicas.
C
So
all
the
database
is
re
is,
is
fully
replicated
everywhere,
except
we
don't
have
to
have
all
of
these
instances
agree
to
in
order
to
proceed
with
that
transaction,
and
that's
that's
how
they
can
make
it
efficient.
We
I
did.
I
did
this
with
the
cockroach
guys
and
you
really
for
this
question.
You
need
to
talk
to
them,
but
we
run
a
performance
test.
C
So
keep
in
mind
in
in
amazon
between
is
the
u.s
east
and
u.s
west
region.
There
is
about
70
milliseconds
of
latency.
So
that's
that's.
Just
physics,
there
is
nothing
you
can
do
around
that,
but
with
that
kind
of
latency
we
were
still
able
to
run
the
tpcc
test
with
97
efficiency,
which
the
tpcc
1000
sorry,
so
that
emulating
1000
databases
doing
oltp
so
highly
highly
transactional
kind
of
operation.
So
not
it's
not
data
warehousing
or
you
know.
Big
queries
is
more
insert
insert
selections
to
select
these
kind
of
things.
C
So,
with
that
kind
of
traffic
pattern
emulating
one
thousand
instances
we
did
64,
we
sorry
96,
which
is,
which
is
almost
the
same-
that
you
would
get
from
an
analytical
database.
Probably
more
analytical
database
can
do
a
little
bit
more,
but
it's
it's
close
to
the
theoretical
limit
100.
A: I guess from a concept point of view this applies to just about any distributed storage, right? If you have a logical instance that combines sharding and replicas across multiple cluster instances, and you have some sort of network tunneling, then this can potentially apply to distributed file systems, key-value stores, and object stores. So we can probably make this a fairly generic play as well.
C
Right
that
that's
my
objective
here,
I
I
don't
think
it
matters
what
the
stateful
workload
does.
What
what
we
are
finding
a
solution
for
here
is
replicate
state
across
regions
right
and
or
keep
stating
sync
across
the
region
better.
So
I
think
it
can
be
done
with
other
ap.
You
know
interfaces
because
this
is
a
sql
interface
right.
It's
a
sql
service.
C
In
fact,
I
would
like
to
be
able
to
showcase
this
this
same
architecture
with
other
kind
of
workloads,
because
it
proves
the
point
right,
the
the
point
right
now,
one
might
say:
okay,
it
works
with
coco
cb,
but
it's
not
a
general
solution,
but
if
I
can
make
it
work
with
other
products,
then
it
starts
to
be
a
generic
statement.
More.
G: I guess the part that can vary across different distributed databases or file systems is how they consume this topology. For this demo, how did you convey the topology, that there are three different availability zones? How did you make Cockroach aware of it, so that proper sharding happens across AZs as opposed to within the same AZ?
C: Cockroach has some parameters that you need to pass to the process when you run it to make it topology aware. Using the downward API and other approaches, I make the pods aware of where they run, and that's how it decides how to do the sharding, because, like I said, it's a nice property of it that it does all the sharding itself.
C
Right,
yeah,
yeah
cockroach
understands
one
level
of
topology,
I'm
working
with
another
database
now
gigabyte,
which
understand
multiple
layers
of
topology
potentially
so
it
understands
cloud
region
and
and
az
passing
these
parameters.
You
make
it
aware
of
where
each
instance
runs,
and
then
they
can
make
a
decision
on
how
to
distribute
the
data.
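As a toy sketch of how topology awareness can drive placement (a hypothetical illustration, not CockroachDB's or YugabyteDB's actual logic; the instance names and labels are made up), replicas of a shard are spread across distinct regions based on the labels each instance advertises:

```python
# Hypothetical sketch of topology-aware replica placement: each database
# instance advertises the region/zone it runs in (for example via the
# downward API), and replicas of a shard are spread across distinct
# regions, so losing one failure domain never loses all copies.

from itertools import cycle

INSTANCES = [
    {"name": "db-0", "region": "us-east-1", "zone": "a"},
    {"name": "db-1", "region": "us-east-1", "zone": "b"},
    {"name": "db-2", "region": "us-west-2", "zone": "a"},
    {"name": "db-3", "region": "us-west-2", "zone": "b"},
    {"name": "db-4", "region": "eu-west-1", "zone": "a"},
    {"name": "db-5", "region": "eu-west-1", "zone": "b"},
]

def place_replicas(instances, replication_factor=3):
    """Pick one instance per region, round-robin, until the factor is met."""
    by_region = {}
    for inst in instances:
        by_region.setdefault(inst["region"], []).append(inst)
    placement = []
    pools = cycle(list(by_region.values()))
    while len(placement) < replication_factor:
        pool = next(pools)
        if pool:
            placement.append(pool.pop(0)["name"])
    return placement

# With three regions and a replication factor of 3, each replica lands in a
# different region, e.g. ['db-0', 'db-2', 'db-4'].
print(place_replicas(INSTANCES))
```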
A
And,
and
how
do
you
kind
of
like
define
the
topology
somehow
somewhere
because
it
could
just
be?
It
could
just
be
labels,
but
but
just
as
equally
right,
they
could
be.
They
could
be
looking
that
data
up
in
a
discovery
service
as
well.
C: Yeah, discovery and topology are fundamental. In the case of Submariner, it comes with a discovery service, so if I know what to look up, if I know the name of the server (these are StatefulSets, so I know the names of the individual instances), I can look them up from this cluster, just because I have a globally distributed discovery service. But yes...
C
If
you
don't
use
submariner,
you
will
have
will
have
a
way
you
need.
You
need
a
way
to
do
that
right.
For
example,
celium,
if
you
know
psyllium
is,
is
another
cni
that
you
can.
You
can
configure
in
your
kubernetes
cluster
psilium
support
network
tunnel
out
of
the
box,
so
it's
a
switch
that
you
can
turn
on.
I
think
what
is
the
other
famous
one,
the
other
famous
cni
calico.
A: Interesting. Maybe it's worth pinging SIG Network and seeing if they have any information about those product capabilities.
C: Yeah, we can do that. I think it's the Multi-Cluster SIG, but there is some SIG that has defined a standard spec for cross-cluster discovery. They don't define the tunneling, but they define the cross-cluster discovery, and Submariner implements that spec.
A: Very cool. We're actually a minute over, so I think we're going to have to call time, but this was brilliant, Raffaele, and I think we've got something solid to work on.
A: Thanks everyone, and we'll see you all in a couple of weeks. Bye-bye. Thank you, Raffaele.