From YouTube: Lightning Talk Disaster Recovery for OpenShift Workloads - Annette Clewett - OpenShift Commons 2022
Description
Lightning Talk Disaster Recovery for OpenShift Workloads
Red Hat OpenShift Commons 2022 @ KubeCon NA
Detroit, Michigan
October 25, 2022
Speakers:
Annette Clewett (Red Hat)
https://commons.openshift.org/gatherings/kubecon-22-oct-25/
Okay, all right. So we'll start off with a problem statement, look at what the Red Hat solutions are, and then talk about some technical assets. Right after that we'll go quickly to a demo; it's a demo that I did as a video.
So the first thing: disaster recovery for applications is not new. If you've been with a financial institution, healthcare, or any of those other companies, you know that you have to have disaster recovery planning, and in some cases you have to show that you have that plan in some amount of detail even to have the application available.
So it's not new, but when we move it onto containerized platforms, there's not a good solution right now. The CNCF, from my vantage point, doesn't have a good solution yet, and we need it. We need it today.
Another concern is: how do we trust it? It's one thing to test it; it's another to know that it's going to be there when you need it. The third thing is that it's a combination of products. Red Hat has different release cycles, and all these products have to come together. So the benefits, obviously, if we can get it to work: we can get something that's easy, automated, and can make it happen with very little human intervention. I'm going to move on, because it takes a while to do the demo. In disaster recovery, and again this is not new, we have two measurements. Sometimes we come up with these measurements with no idea whether they can be met.
So this is actually bringing the two ideas together: you should be able to test that you can meet both your recovery point objective and your recovery time objective. One of them is a measure of how much data you're willing to lose, at a per-application level, and the other one is how long an application can be unavailable.
In the past this has been measured in hours, or sometimes days. We want to measure it in minutes, maybe single-digit minutes. So again, these are not new, but this is how, for this solution, we're going to be able to measure it. Red Hat, along with IBM, over the last two years has been developing two solutions. One we call Regional Disaster Recovery.
The other is Metro Disaster Recovery. Regional Disaster Recovery is the idea that we're doing asynchronous replication of the persistent data, so it has no requirements about how close or how far apart the sites are. Metro is meant to be a synchronous solution; therefore you could have a recovery point objective, meaning data loss, equal to zero.
The way that we get there on these two solutions is using components from Red Hat Advanced Cluster Management; the upstream is called Open Cluster Management (OCM). Then there's Red Hat OpenShift Data Foundation, the product that I've been involved with for the last five or six years. It's going to bring along all of the disaster recovery operators that I'll go through in a minute. And then at the center of this is Red Hat Ceph Storage, which is the software-defined storage that's going to actually do the replication, store the data, and keep track of it.
Continuing with the components, and again these are brought along with OpenShift Data Foundation: we have three new operators. One is the DR Hub operator, and the DR Hub operator's job means it lives on a hub. Conceptually, and I'll show you in a minute, this is a three-cluster solution, or a three-location solution. It can be a two-location solution, but the Hub operator is really the one that has the custom resources to actually do the DR placement and fail over an application.
This does require, from a subscription point of view, an Advanced entitlement for OpenShift Data Foundation. Architecturally, if we look at it, we have a global traffic manager. That is not part of the solution; you do need to have your own geo load balancing, and plugging your load balancing into this is not any different from any other load balancing. If the application is active on cluster one and we want to fail over to cluster two, then once the application is live on cluster two, the geo load balancing needs to redirect the inbound connections. So you see, there's no distance limitation, and we've got asynchronous replication. I show it here going in one direction, but the way to think of it is that it's per application. So we could have an application on the left-hand side that is being replicated and has a failover cluster on the right-hand side, but we could have another application on cluster two whose failover cluster is cluster one.
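As a sketch of that per-application idea (the cluster, application, and policy names here are illustrative, not from the talk), two DRPlacementControls can simply point in opposite directions, so each cluster is primary for one application and the failover target for the other:

# Illustrative only: two per-app DRPlacementControls with the
# preferred/failover roles reversed, so both clusters stay active.
apiVersion: ramendr.openshift.io/v1alpha1
kind: DRPlacementControl
metadata:
  name: app-a-drpc
  namespace: app-a
spec:
  preferredCluster: bos1        # app-a normally runs here
  failoverCluster: bos2         # and fails over here
  drPolicyRef:
    name: example-dr-policy
  placementRef:
    kind: PlacementRule
    name: app-a-placement
  pvcSelector:
    matchLabels:
      app: app-a
---
apiVersion: ramendr.openshift.io/v1alpha1
kind: DRPlacementControl
metadata:
  name: app-b-drpc
  namespace: app-b
spec:
  preferredCluster: bos2        # app-b runs the other way around
  failoverCluster: bos1
  drPolicyRef:
    name: example-dr-policy
  placementRef:
    kind: PlacementRule
    name: app-b-placement
  pvcSelector:
    matchLabels:
      app: app-b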
So it really allows you, as long as you keep enough headroom, to use both clusters and not have just one sitting there idle. So again, no distance limitation. Contrast that with Metro. Metro still has two OpenShift clusters, and you can have more; all of this is done in pairs. So if I had a hundred clusters whose applications I wanted to protect, I could divide them into basically fifty pairs, each with two clusters. Right now everything has a peer cluster, so you're either on the preferred cluster or you're on the failover cluster.
Really important to this solution is an external Ceph storage cluster that is stretched; it's called stretch mode. It will basically provide the storage so that you have two replicas of the data on one side and two on the other side. So if you lose a site, you have the ability to recover the data synchronously; you get no data loss. You still have to move the application over. Also important here is the idea of a monitor node; a monitor is a Ceph service. Somewhere you need to have a fifth monitor, so that if you need to make quorum, there's one that is not going to go down with data center one or two.
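The talk uses an external Red Hat Ceph Storage cluster for this; purely as an illustration of the same topology, Rook's CephCluster resource can express a stretch layout with five monitors, an arbiter zone, and two data zones (the zone names and image version here are assumptions):

# Illustration only: the talk's Metro-DR setup uses an external RHCS
# cluster, but Rook's stretchCluster settings express the same idea:
# five mons, two per data center plus an arbiter tiebreaker, with data
# replicated across both sites.
apiVersion: ceph.rook.io/v1
kind: CephCluster
metadata:
  name: rook-ceph
  namespace: rook-ceph
spec:
  cephVersion:
    image: quay.io/ceph/ceph:v17   # hypothetical version
  dataDirHostPath: /var/lib/rook
  mon:
    count: 5
    stretchCluster:
      failureDomainLabel: topology.kubernetes.io/zone
      zones:
      - name: arbiter              # fifth monitor, third location
        arbiter: true
      - name: datacenter-1         # two mons plus OSDs per data site
      - name: datacenter-2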
So, some technical assets that we have here, and maybe some of you have seen them: I've done quite a few Red Hat office hours. The three there, part one, part two, and part three, are by my colleague Daniel Parkes. He did a great job this last month explaining in great detail how to set up Ceph in stretch mode, how to connect it into two OpenShift clusters, and all the details of that, so that's a really good set. I've also done multiple videos over time; these are links to a few recent ones. And then, if you want to get into the details, there's the actual documentation, which I personally helped with, so I can vouch for it.
There we go. So we're going to do a little video action here. I would have liked to do it live, but it's three clusters and the chances of everything working out are not good. So, getting started here, what I've done, if I can get this thing to go away, is I've installed Pac-Man using Advanced Cluster Management, and right now it's the Pac-Man application. I'm going to play the game, and the reason I want to play it is so that I can create some persistent storage.
So in this case I want to lose super quick, and I'm able to do that just by putting myself in the right position. As soon as I lose, I get a high score, and I'm going to save that high score to the persistent data that is on what we'll call the preferred cluster. It showed it in the other window; I think it was bos1. So now we've installed an application.
We created some persistent data, and now what we want to do is look at failing over to the failover cluster. So this is your ACM console, if you've seen it; this is actually the multicluster console, and we have a new thing called Create DRPolicy. Again, this DR policy is backed by the operators I showed you and their custom resources. So I'm going to give my DR policy an informative name, because if I had a whole bunch of clusters, I'd need to know what this policy applies to.
After that I'm going to go down, and as soon as I choose the two clusters, it goes out and looks to see: are these two separate OpenShift Data Foundation (ODF) storage clusters, or are they the same cluster? The synchronous option is grayed out, so it has actually figured out that these are two different storage clusters and this is going to be an asynchronous relationship. Now, the default sync interval is five minutes, but I'm going to change mine, which means all my persistent data, the delta data, will be replicated every two minutes.
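Under the hood, what the console builds is roughly this DRPolicy (a sketch; the names are illustrative):

# Sketch of the DRPolicy behind the console form; names illustrative.
apiVersion: ramendr.openshift.io/v1alpha1
kind: DRPolicy
metadata:
  name: pacman-2m-policy
spec:
  drClusters:
  - bos1                    # the demo's preferred cluster
  - bos2                    # the demo's failover cluster
  schedulingInterval: 2m    # delta data replicated every two minutes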
So now I have a data policy, but I don't have any applications using it, so I'm going to apply it to my Pac-Man application. As soon as I do that, it's going to create the disaster recovery resources in the namespace for Pac-Man: there's a DR placement control, the placement rule, and a placement decision.
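The DR placement control that lands in the Pac-Man namespace looks roughly like this (a sketch; the selector labels and the placement rule name are assumptions):

# Sketch: the DRPC created in the application namespace. The pvcSelector
# labels and the placement rule name are assumptions.
apiVersion: ramendr.openshift.io/v1alpha1
kind: DRPlacementControl
metadata:
  name: pacman-drpc
  namespace: pacman
spec:
  drPolicyRef:
    name: pacman-2m-policy    # the policy created above
  placementRef:
    kind: PlacementRule
    name: pacman-placement
  preferredCluster: bos1
  pvcSelector:
    matchLabels:
      app: pacman             # selects the PVCs to protect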
The initiation can actually be done by a developer, because it is namespace-scoped. The actual creation of the data policies, though, would need to be done by a cluster admin right now. So I'm going to go into the DR hub. What's important about this is that I'm doing the failover on the hub cluster, so if I had lost communication with one of my clusters, I would still be able to fail over the application.
The hub cluster currently is on a third OpenShift cluster. In the future we're going to be able to do hub recovery, so we'll be able to do two locations and recover ACM. So we're going to add a few parameters; once you add these, they stick in this DRPC. Again, this is namespace-scoped, so this DRPC, the Disaster Recovery Placement Control, is specific to this namespace, and actually to this volume; we could have multiple DRPCs based on different volumes.
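Triggering the failover amounts to setting two fields on that same DRPC; a sketch of the edited resource, with the added fields marked:

# Sketch: the same DRPC as above with the failover fields set.
apiVersion: ramendr.openshift.io/v1alpha1
kind: DRPlacementControl
metadata:
  name: pacman-drpc
  namespace: pacman
spec:
  drPolicyRef:
    name: pacman-2m-policy
  placementRef:
    kind: PlacementRule
    name: pacman-placement
  preferredCluster: bos1
  failoverCluster: bos2       # added: where the app should run now
  action: Failover            # added: initiates the failover
  pvcSelector:
    matchLabels:
      app: pacman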
A volume would be, you know, an image of the storage that you're replicating. So I'm going to go ahead: as soon as I hit Failover, I've changed the status, and now I can go and look in the events, and we'll see that things are happening, failing over. A VRG is a VolumeReplicationGroup, another custom resource, so we can also watch it there. Look closely in the middle: it says bos1, and shortly thereafter it switched over. And no, I didn't video-magic this.
It actually switched that fast. Again, in my test environment I don't have a lot of latency, but we can see now that it's on bos2, so that's the big thing. Basically what we've seen here is an example. Now what we want to see is: is the application still working? Also, here's my global traffic manager proxy, and you can see on the bottom that it switched over to bos2. The inbound connections now are coming into the second cluster, the failover cluster.