From YouTube: 20200522: Speed Run - Gitaly backups
Description
Share your thoughts here https://gitlab.com/groups/gitlab-org/-/epics/2094
Hi, this is James Ramsay, group product manager here at GitLab, and I'm looking into improving the way we allow you to back up your Git repository data on GitLab. I've got some ideas from reading a couple of issues and feedback from customers, but I thought the best place to start exploring was to actually try and do a backup. So I've created a GitLab instance here and imported a handful of large, popular repositories, and I'm going to try and back them up.
I've gone to the documentation, and I want to use object storage to store the backups, so I've gone and configured Google Cloud Storage, at least I think I have. I did that before the recording so you didn't see my tokens, but this is the bucket I'm aiming to put the data in, and as you can see, there's nothing in it yet. So let's go to the docs and see what they have to say. Okay.
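Roughly speaking, the docs have you point the backup task at the bucket from /etc/gitlab/gitlab.rb; something along these lines, where the keys and bucket name are placeholders rather than the real values from this instance:

    # /etc/gitlab/gitlab.rb -- upload backup archives to Google Cloud Storage.
    # Placeholder credentials; run `gitlab-ctl reconfigure` after editing.
    gitlab_rails['backup_upload_connection'] = {
      'provider' => 'Google',
      'google_storage_access_key_id' => 'ACCESS_KEY',
      'google_storage_secret_access_key' => 'SECRET_KEY'
    }
    gitlab_rails['backup_upload_remote_directory'] = 'example-backup-bucket'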
This node was configured with a Gitaly Cluster, so it's Praefect and three Gitaly nodes, so those large repositories should actually be living on these three nodes. It looks like maybe they've got eight or so gigabytes free, I suppose. Okay.
So here you can see it dumping some wikis.
It's not showing the screen again. Okay, so I just triggered the import back there. If we go to Projects, we'll see what has been imported; that should be quite a lot larger and will have slowed things down. But in the meantime, let's take a look at these options and see what we can exclude from the backup, so we can just focus on repositories, because that's really what I'm interested in: so skip DB and uploads.
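Concretely, a repositories-focused run on an Omnibus installation would be something like this (the SKIP list can name other components too, such as artifacts or LFS):

    # Skip the database and uploads so the backup is dominated by repository
    # data, then let it upload to the bucket configured above.
    sudo gitlab-backup create SKIP=db,uploads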
Okay, so I already configured this one beforehand. So yes, what we're trying to work out here is how we might make these backups faster. Once a repository becomes a certain size, or if you've got thousands of repositories, even if they only take a minute each, or even 30 seconds each, your backup quickly takes hours and hours and hours. And it's not unforeseeable that you would have a lot of large projects, be they forks of open-source projects or huge internal projects that you've developed over many years.
That matters if you're trying to restore from backup, because then you'll find that your Git repository is corrupted and it's going to need manual repair, which is going to be quite stressful. When you're trying to recover from backup, there's generally enough stress involved without added data corruption concerns. So we want to make sure that the backups are consistent.
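That's not something the backup task walks you through end to end, but as a rough manual sanity check, if you have a bundle for a repository you can verify it and fsck a scratch clone of it; the paths here are just illustrative:

    # Check that the bundle is self-consistent and that a clone of it passes
    # Git's own integrity check.
    git bundle verify /backups/myproject.bundle
    git clone /backups/myproject.bundle /tmp/restore-check
    git -C /tmp/restore-check fsck --full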
Let's see what's coming through. It's slowly reporting 3.4 gigabytes used, which seems... ah, this one is the secondary, so it won't have the disk utilization updated until this is complete. So let's take a look at this one.
So I guess one of the ideas would be that we could keep track of these backups, rather than treating them as specific timestamped repository backups, by instead calculating the checksum of the bundle. So when we generate... let me rewind. With Git repositories we have a method, an internal method, for calculating a checksum for the repo, so we can run that and work out if two servers have the exact same state of that repository.
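As a rough illustration of that idea (this is a stand-in fingerprint based on the refs, not necessarily the exact algorithm Gitaly uses internally, and the path is made up):

    # Fingerprint a repository by hashing its refs; if no ref has moved,
    # the reachable history hasn't changed.
    cd /var/opt/gitlab/git-data/repositories/example/project.git
    git show-ref --head | sha256sum | awk '{print $1}'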
So if we stored that checksum with the bundle somehow, then we could compare the backup we last took with the current checksum of the repo. And so, if I'm taking another backup and I just took one 12 hours ago and the checksum hasn't changed, then maybe I don't need to back that repo up again. So, depending on the level of churn, it could cut the number of backup jobs down to a fraction of what they were; maybe it's 5% or 1%.
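A hypothetical sketch of that skip-if-unchanged flow, assuming the previous checksum is stored next to the bundle in the bucket (the bucket name, object layout, and paths here are invented for illustration):

    # Compare the current fingerprint with the one stored alongside the last
    # bundle; only re-bundle and upload when something has changed.
    repo=/var/opt/gitlab/git-data/repositories/example/project.git
    bucket=gs://example-backup-bucket/example/project

    current=$(git -C "$repo" show-ref --head | sha256sum | awk '{print $1}')
    previous=$(gsutil cat "$bucket/checksum" 2>/dev/null || true)

    if [ "$current" = "$previous" ]; then
      echo "checksum unchanged, skipping backup for this repository"
    else
      git -C "$repo" bundle create /tmp/project.bundle --all
      gsutil cp /tmp/project.bundle "$bucket/project.bundle"
      printf '%s\n' "$current" | gsutil cp - "$bucket/checksum"
    fi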
It's likely to be a significant saving. And then, even further than that, we could iterate on taking even smaller snapshots: rather than taking a full snapshot every time the repo changes, we could look at taking an incremental backup. So those are some areas that we're exploring, and we'll need to work out how this plays into not only the backup process but also how we restore.
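To illustrate the incremental idea, one way to do it with plain Git is to record the refs at the time of the last full bundle and exclude everything reachable from them in the next bundle; the file names here are only for illustration:

    # Full backup: bundle everything and record where every ref pointed.
    repo=/var/opt/gitlab/git-data/repositories/example/project.git
    git -C "$repo" bundle create /tmp/project-full.bundle --all
    git -C "$repo" for-each-ref --format='%(objectname)' > /tmp/refs-at-full

    # Later, incremental backup: only pack history that is not reachable
    # from the refs captured at the previous backup.
    exclusions=$(sed 's/^/^/' /tmp/refs-at-full)
    git -C "$repo" bundle create /tmp/project-incr.bundle $exclusions --all

Restoring would then mean fetching from the full bundle first and layering each incremental bundle on top, which is part of what we'd need to design on the restore side.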
We'd also need to work out how the data would be structured in an object storage environment, where we'd presumably typically store these kinds of backups, but then again, these could also just be on any file path.
So those are some ideas that we're evaluating. Yeah, it looks like things are still going quite slowly, even for an instance with only a handful of large repositories.
This thing was really slow. All those repositories were less than a gigabyte, and it's really not that uncommon for large organizations, with historic projects that have been going for 10-plus years, to have much larger projects than this, which would take even longer to back up. So it's important that we solve these problems.