Description
This video demos a scenario of a zonal outage that requires a full relocation and recovery of the Gitaly servers in the affected zone by restoring them from snapshots.
See also https://gitlab.com/gitlab-com/gl-infra/reliability/-/issues/16665 (internal)
Hello, and welcome to a disaster recovery demo for the Gitaly fleet. In this video, we're going to show how it's possible to recover an entire zone's worth of Gitaly servers into another project from each server's most recent disk snapshot. This will demonstrate a few improvements in disaster recovery. One is a helper script which helps us quickly find the most recent snapshots, and another is a new way to launch Gitaly servers in Terraform that allows you to specify source snapshots for a large number of servers.

The first column shows you the date that each snapshot was taken, and it also tells you approximately how old it is. These snapshots were all taken around the same time; they are all around 2 hours and 30 minutes old. This number is going to vary depending on when the outage occurs, because we take snapshots every four hours, so it could be anything from minutes to around four hours.

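The helper script itself isn't shown in the transcript. As a rough sketch, a similar listing of recent Gitaly snapshots could be produced with gcloud; the project ID and the disk name filter below are assumptions for illustration.

    #!/usr/bin/env bash
    # Rough sketch, not the actual helper script: list recent Gitaly disk
    # snapshots with their creation time, newest first. The project ID and
    # the "gitaly" name filter are assumptions.
    PROJECT="example-gitlab-production"

    gcloud compute snapshots list \
      --project="${PROJECT}" \
      --filter="sourceDisk ~ gitaly" \
      --sort-by="~creationTimestamp" \
      --format="table(creationTimestamp, name, sourceDisk.basename())"
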
There is an extra option on the script to output a Terraform config, or actually a Terraform variable, that we can easily copy and paste into Terraform. That will help us provision all of these servers in parallel using these snapshots. You can see this variable right here, listing the snapshot for each node. I'm going to copy this and put it into our Terraform config.

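The variable emitted by the script isn't reproduced here; the following is a hypothetical sketch of the copy-and-paste step, writing a per-node source-snapshot map into a tfvars file. The variable name and the node and snapshot names are made up for illustration.

    # Hypothetical sketch: the variable name, node names, and snapshot names
    # are illustrative, not the ones from the real helper script output.
    cat >> gitaly-recovery.auto.tfvars <<'EOF'
    gitaly_source_snapshots = {
      "gitaly-01" = "gitaly-01-data-20230101t0400"
      "gitaly-02" = "gitaly-02-data-20230101t0400"
      # ... one entry per node, 22 entries in total
    }
    EOF
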
So we have 22 nodes, and I've already set the multi-zone node count to 22. What this means is that we're going to launch 22 new servers, starting at node number 1 and ending with 22, for the servers that are in the us-east1-b zone that failed. This shouldn't take very long, so I'm going to go ahead and apply it. One thing to note is that I am not provisioning these in us-east1; I'm launching them in us-east4, and the reason for that is to avoid any capacity issues.

If we were to launch all 22 of these servers in a single zone in us-east1, we've seen that we can run into capacity issues on GCP's side. So, to avoid that, I'm using an alternate region. I'm going to go ahead and apply this now.

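As a sketch of what applying the change could look like, assuming the node count and alternate zone are ordinary Terraform variables (the variable names here are assumptions, not the real configuration):

    # Sketch only: bump the node count and point the new nodes at the
    # alternate region before applying. Variable names are hypothetical.
    cat >> gitaly-recovery.auto.tfvars <<'EOF'
    multizone_node_count = 22
    gitaly_recovery_zone = "us-east4-a"
    EOF

    terraform plan -out=gitaly-recovery.plan
    terraform apply gitaly-recovery.plan
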
Okay, so you can see it's adding a lot of resources. It's going to spin up all 22 of these servers in parallel, at once, using the snapshots from us-east1-b.

The instances were created using the snapshots we specified. I'm going to record the time to see how long that took, but the instances are not fully configured yet; it's going to take a while for them to boot up and get configured by Chef. So what I'm going to do is pause the video and see how long that takes. I expect it'll take around 15 minutes.

So, to recap: we looked at the disk snapshots. They were created about two hours and 14 minutes before the outage began. In about five minutes from the start of the outage, we were able to provision all 22 nodes and restore all of the disks from snapshots, and after that it took about 20 minutes for all of the nodes to configure. That includes installing all of our supporting packages and installing the Omnibus package to get Gitaly fully configured.

Assuming that we didn't lose the database, it would be much more current than that, so it's possible that we would have some out-of-date information and cache problems in Rails, but overall I think it was a pretty successful recovery. The RTO for just this Gitaly data was about 30 minutes, and the RPO was 2 hours and 14 minutes. That concludes the demo. Thank you.