Description
This video demos a scenario of a zonal outage that requires a full relocation and recovery of the Gitaly servers in the affected zone by restoring them from snapshots.
See also https://gitlab.com/gitlab-com/gl-infra/reliability/-/issues/16665 (internal)
Hello, and welcome to a disaster recovery demo for the Gitaly fleet. In this video, we're going to show how it's possible to recover an entire zone's worth of Gitaly servers into another project from each server's most recent disk snapshot. This will demonstrate a few improvements in disaster recovery. One is a helper script which helps us quickly find the most recent snapshots, and another is a new way to launch Gitaly servers in Terraform that allows you to specify source snapshots for a large number of servers.

The first column shows you the date that each snapshot was taken, and it also tells you approximately how old it is. These snapshots were all taken around the same time; they are all around 2 hours and 30 minutes old. This number is going to vary depending on when the outage occurs, because we take snapshots every four hours, so it could be anything from minutes to around four hours.

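The helper script itself isn't shown in the transcript. As a rough sketch, a similar listing of recent Gitaly snapshots could be produced with gcloud; the project ID and the disk name filter below are assumptions for illustration.

    #!/usr/bin/env bash
    # Rough sketch, not the actual helper script: list recent Gitaly disk
    # snapshots with their creation time, newest first. The project ID and
    # the "gitaly" name filter are assumptions.
    PROJECT="example-gitlab-production"

    gcloud compute snapshots list \
      --project="${PROJECT}" \
      --filter="sourceDisk ~ gitaly" \
      --sort-by="~creationTimestamp" \
      --format="table(creationTimestamp, name, sourceDisk.basename())"
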
There is an extra option on the script to output a Terraform config, or actually a Terraform variable, that we can easily copy and paste into Terraform. That will help us provision all of these servers in parallel using these snapshots. You can see this variable right here, listing the snapshot for each node. I'm going to copy this and put it into our Terraform config.

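The variable emitted by the script isn't reproduced here; the following is a hypothetical sketch of the copy-and-paste step, writing a per-node source-snapshot map into a tfvars file. The variable name and the node and snapshot names are made up for illustration.

    # Hypothetical sketch: the variable name, node names, and snapshot names
    # are illustrative, not the ones from the real helper script output.
    cat >> gitaly-recovery.auto.tfvars <<'EOF'
    gitaly_source_snapshots = {
      "gitaly-01" = "gitaly-01-data-20230101t0400"
      "gitaly-02" = "gitaly-02-data-20230101t0400"
      # ... one entry per node, 22 entries in total
    }
    EOF
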
So we have 22 nodes, and I've already set the multi-zone node count to 22. What this means is that we're going to launch 22 new servers, starting at node number 1 and ending with 22, for the servers that are in the us-east1-b zone that failed. This shouldn't take very long, so I'm going to go ahead and apply it. One thing to note is that I am not provisioning these in us-east1; I'm launching them in us-east4, and the reason for that is to avoid any capacity issues.

If we were to launch all 22 of these servers in a single zone in us-east1, we've seen that we can run into capacity issues on GCP's side. So, to avoid that, I'm using an alternate region. I'm going to go ahead and apply this now.

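As a sketch of what applying the change could look like, assuming the node count and alternate zone are ordinary Terraform variables (the variable names here are assumptions, not the real configuration):

    # Sketch only: bump the node count and point the new nodes at the
    # alternate region before applying. Variable names are hypothetical.
    cat >> gitaly-recovery.auto.tfvars <<'EOF'
    multizone_node_count = 22
    gitaly_recovery_zone = "us-east4-a"
    EOF

    terraform plan -out=gitaly-recovery.plan
    terraform apply gitaly-recovery.plan
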
Okay, so you can see it's adding a lot of resources. It's going to spin up all 22 of these servers in parallel, at once, using the snapshots from us-east1-b.

The instances were created using the snapshots we specified. I'm going to record the time to see how long that took, but the instances are not fully configured yet; it's going to take a while for them to boot up and get configured by Chef. So what I'm going to do is pause the video and see how long that takes. I expect it'll take around 15 minutes.

So, to recap: we looked at the disk snapshots. They were created about two hours and 14 minutes before the outage began. In about five minutes from the start of the outage, we were able to provision all 22 nodes and restore all of the disks from snapshots, and after that it took about 20 minutes for all of the nodes to configure. That includes installing all of our supporting packages and installing the Omnibus package to get Gitaly fully configured.

Assuming that we didn't lose the database, it would be much more current than that, so it's possible that we would have some out-of-date information and cache problems in Rails, but overall I think it was a pretty successful recovery. The RTO for just this Gitaly data was about 30 minutes, and the RPO was 2 hours and 14 minutes. That concludes the demo. Thank you.