From YouTube: Backing up large amounts of Git data - Git Merge 2018
Description
Presented by Carlos Martin Nieto, Infrastructure Engineer, GitHub
About Git Merge
Git Merge is the pre-eminent Git-focused conference: a full day offering technical content and user case studies, plus a day of workshops for Git users of all levels. Git Merge is dedicated to amplifying new voices in the Git community and to showcasing the most thought-provoking projects from contributors, maintainers and community managers around the world. Find out more at git-merge.com
Hello, everyone. I'm Carlos, I work at GitHub, and today we're going to take a look at what it takes to back up Git data at the scale that we see. On the topic of scale: we're the largest Git host in the world, with over 79 million repositories and over 28 million users.
So, as you can imagine, we have a lot of Git data and a lot of activity behind those numbers. I work on the team that is responsible for essentially providing Git as a service, both to internal consumers, like GitHub the website itself and any other internal clients, and to any Git client that connects via the Git protocol or even via the Subversion protocol. We also handle support escalations on the technical side: we're second-line support for Git performance questions that users might have, and we also provide input on the technical aspects of responses from support.
Let's move on to the topics we're going to cover today. We're going to start with an overview of the problem we're trying to solve with these backups: what we want out of a solution, the solution we had until fairly recently, and what the solution is right now and what we changed. Then we're going to move on to the actual design of the new solution,
which has been running in production for about the last year and has been the one thing keeping your data safe for the past few months. So, everyone tells you: you have to have backups. I mean, they keep telling me that, but why? Once I have these backups, what am I going to do with them? This isn't just to be contrarian, but having a clear picture of what you actually want to do, how you're going to use them, and where you're going to store them,
that helps inform all of these decisions about how you are going to make the backups and how you design the systems. Now, for us, one of the first reasons is that people put their trust in us: once a repository is on GitHub, they expect that it's always going to be available, that they can always go to github.com and download the data they uploaded and worked with. A larger, more critical reason for the company as a whole is disaster recovery and, you could call it, business continuity.
Our infrastructure is resilient against multiple server failures, but what happens if both of the data centers catch fire or become unavailable, or whatever else? We should be able to use these backups to bring up a new site as quickly as possible, so we can keep serving the data. And we also want a safeguard against any bugs that we find in the production code.
If there's a bug that leads to data loss, we want to have these backups so that we can recover the data once we fix the bug. Those are useful things to have for emergencies, but by far the most common use of the backups day to day is just support: restoring from backup a repository that someone emails in about after they regret hitting the big red button that says "delete". So, with that, let's talk about what it means to perform backups at the scale that we see at GitHub.
We want near-real-time, off-site backups: whenever someone pushes to a repository, or edits a file via the web editor, or anything else, we want to queue a background task and move the data off-site as quickly as we can. The rates we see for changes are quite variable; roughly, they go from 20 backups a second on the weekends up to around 80 at peak.
These numbers are similar to the actual number of pushes we get. We also want to be able to deal efficiently with repository networks. If you haven't heard that term before, it's what we call a collection of forks: when you fork a repository on the website, we don't create a completely new repository; we put a repository next to its source and share all the actual Git data.
We want this kind of deduplication in the backups as well, for the same reasons of cost and efficiency. And we don't want to do full backups; we want them to be as incremental as possible, because that helps with storage costs, but it also means that we have to transfer less data, so we can get that data away from the data center and into the off-site storage much quicker.
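The talk doesn't spell out the sharing mechanism, but networks of forks like this are commonly built on Git's standard alternates feature, where a repository's object database lists another directory to borrow objects from. A minimal sketch, with made-up paths:

```python
# Sketch: share Git object storage between a fork and a network repository
# via objects/info/alternates. All paths here are hypothetical.
from pathlib import Path

def share_objects(fork_git_dir: str, network_objects_dir: str) -> None:
    """Point the fork's object database at the shared network objects."""
    alternates = Path(fork_git_dir) / "objects" / "info" / "alternates"
    alternates.parent.mkdir(parents=True, exist_ok=True)
    # Git also searches every directory listed in this file when resolving
    # objects, so the fork only has to store what is unique to it.
    alternates.write_text(network_objects_dir + "\n")

share_objects("/repos/network-1/fork.git", "/repos/network-1/network.git/objects")
```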
So let's look at some simple solutions first, and why they're not quite what we're after. You could say: I'm just going to run tar on the repository and then ship that tarball off-site. Sure, but that means every backup is a full backup. You could use rsync or some other tool to just synchronize the files to a machine that already has an older copy of the backup. But if you ever run git gc, or Git runs it for you,
all of the files now look different, and there's nothing in those tools that's going to help you do anything other than a full backup at that stage. That's not even mentioning the issues around concurrency and the lack of atomicity for multiple updates. And we would have to do this at the network level, so any issue with changes to the shape of the files just gets exploded across all of the different networks.
You can make it much quicker if you take a volume or file system snapshot, but that's really more for the case where a full machine goes away: single-server failures, where you want to restore a new machine with the same data. We have Spokes, which is our replicated Git storage engine, to cover that. So, on to the old backup system.
It looks quite a bit like the live site, because that's what we know how to build. It's a set of Git servers ready to receive pushes, each with its own block storage, with the big difference that these are in the cloud, so we have virtual volumes; this is all in AWS, because that's sort of our default choice. However, the main drawback here is that each machine is its own single point of failure: a repository gets backed up to a specific block device on a specific host.
If that host crashes, someone has to go in and fix it, either by rebooting the machine or by standing up a new machine and attaching the volume. While that's happening, none of the backups that have to go to any of the block devices on that host can actually happen. That has happened a couple of times, and then we have something like a thousand backups queued up, and until a human reacts,
we can't do anything. It also doesn't protect much against bugs, because of all the commonality we have: these hosts are set up as Git servers, so if there's a bug in any of the management code we have for those that leads to data loss, it's not unlikely that it would also lead to data loss in the backups.
We weren't too worried about this, because we had been running the live site for quite a long time without that happening, but it's a drawback to consider when you design something.
So, let's look a bit at what this looks like at a high level. The user pushes, and this magic abstract Spokes box distributes it across the multiple servers; we then queue a backup task and use Git to perform the backup.
We upload it into the one specific repository on the specific host with the block device, and with that we fairly efficiently push data off of our machines and off-site. But these endpoints are just more Git servers: they modify the file systems concurrently, you can log into them, and they are slower than the big, powerful machines with local SSDs that we have in our data center.
So sometimes things would time out, and it would require human intervention to prod them and get them healthy again. That was rather rare, but it means me, or someone else on my team, prodding around, running Git commands, trying to figure out what's going on. If you type a wrong command, you could delete data, and that's bad, because these are the backups.
If you have a typo in some script, or just a bug that leads to data loss, either the same one as in the live system or one in the new script, I could be looking at plain loss of data. So to save ourselves from that, just in case it were to happen, we store nightly snapshots of these volumes. As I mentioned earlier, we don't care about single-server failures in our data center, so we don't strictly need these volumes; but this work came about before we had Spokes, so it made a lot more sense back then.
Without giving an exact figure, we're storing on the order of hundreds of terabytes of Git data on these machines. You can make your own guess at the cost of that, but it's not cheap. On the upside, it did keep the data safe for a very long time; you can just throw money at the problem for as long as that keeps working.
Right, so the old system works, but it is fragile, it's expensive, and it's annoying, requiring my little prodding. So we went in search of a new system. We wanted to keep the near-real-time, incremental-update aspect of it; that's something we really, really liked. But we definitely wanted to remove the single points of failure, which were the biggest headache we had, plus the very slow virtual block devices that we had everywhere. And speaking of those block devices,
something very annoying was that whenever we filled up those block devices, we had to resize them. For a couple of weeks last year, I sat in front of my computer and half my day was just resizing volumes. We have a lot of them, and that is not a useful use of my time, or anybody's for that matter.
And they would keep filling up; like I mentioned, that happened a couple of times while I was working on this, because we just keep growing: people keep sending us data for some reason. So, all right, let's talk about the new, now current, system. We were looking around, and we realized we actually liked a lot of the features of the old system, particularly that we get the data off pretty quickly and efficiently.
We just needed to figure out how to do that without the specific implementation, which was slow, annoying, fragile, and expensive. After all, this is what we do; it's the one thing we know. If we can figure out a way to bypass the need for these Git servers to receive the git push, we can build a system that we can be proud of, instead of dreading the question of "hey, so how do we do backups?"
followed by a long explanation of how fragile everything is. So, the first decision: stop using volumes attached to specific servers and use a generic object storage service instead, one which can adjust to our need for storage space and our write rate. We just want to hit a URL to upload and download, everything else happens behind the veil, and we want something that will be much cheaper to operate. Now, I am describing essentially S3, which is what we ended up using, but what we wanted were these basic characteristics of it.
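As a minimal sketch of that upload/download interaction, assuming boto3 and a made-up bucket name rather than our actual layout:

```python
import boto3

s3 = boto3.client("s3")
BUCKET = "git-backups-example"  # hypothetical bucket name

def upload_backup(key: str, data: bytes) -> None:
    # Everything behind this call (durability, scaling) is the provider's problem.
    s3.put_object(Bucket=BUCKET, Key=key, Body=data)

def download_backup(key: str) -> bytes:
    return s3.get_object(Bucket=BUCKET, Key=key)["Body"].read()
```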
I mean, we could have gone with someone else; we just ended up using S3. And we definitely wanted to keep the incremental updates, so we had to figure out a way to do them, since you can't exactly do a git push into an S3 bucket; that left us lots of wiggle room in one respect. We already had some encryption in the system, in that these volumes were on an encrypted file system, but we wanted to go further and be a lot more proactive about how we secure this data. And one other thing:
we wanted to improve our confidence in the restore procedure. We do perform restores every day, at least ten repositories or so a day that we restore into the live system from backups for people who regret deleting them, but there was no systematic way of making sure that we can still restore data, that the code hasn't rotted. So, the current system: it has fewer moving parts on our side.
It's object storage, plus some metadata about the Git data that we store in a database. So let's see how we perform a backup. What we actually want to do is simulate a git push, because that was working out really well for us, but we need to perform a few different steps, because there is no Git server on the other side.
First, we check what the latest incremental update is for a particular repository. If there is none, then we know we have to do a full backup. If there is one, then we say: okay, I only need to upload the difference, like a push would, and this is what we get from the previous ref state. The previous ref state is the state of the repository's references the last time we did a backup.
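A rough sketch of that decision, not the actual GitHub code: record the ref state with every backup, and use the previous state to ask git pack-objects for only the missing objects, the way a real push would.

```python
import subprocess

def current_refs(repo: str) -> dict:
    """Map ref name -> object id for the repository as it is right now."""
    out = subprocess.run(
        ["git", "-C", repo, "for-each-ref", "--format=%(refname) %(objectname)"],
        capture_output=True, text=True, check=True).stdout
    return dict(line.rsplit(" ", 1) for line in out.splitlines())

def backup_pack(repo: str, previous_refs: dict) -> tuple:
    """Build a full pack (empty previous state) or an incremental one."""
    refs = current_refs(repo)
    want = list(refs.values())
    if previous_refs:  # incremental: exclude everything the last backup had
        want += ["^" + oid for oid in previous_refs.values()]
    pack = subprocess.run(
        ["git", "-C", repo, "pack-objects", "--revs", "--stdout"],
        input="\n".join(want).encode(), stdout=subprocess.PIPE, check=True).stdout
    return pack, refs  # the refs get stored as the next "previous ref state"
```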
This new upload also includes the new ref state, the reference state, so that next time we can do this simulated git push again. Now, if we encounter an error in any of these stages, the task will retry twice, but it will not try to do any kind of rollback; it will just error out, and I'll get to why that's an interesting feature of this design. So, what is the write-ahead log for?
What I want to have is a transaction that covers both the upload to object storage and the insert into the metadata database. That doesn't exist, so I have to figure out a way to simulate it. Inserting an entry into the write-ahead log says "I intend to write data to this location". If we successfully write the data, we then atomically add the entry to the backups table and, at the same time, in the same transaction, remove it from the write-ahead log.
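Here's a minimal sketch of that pattern, with sqlite standing in for the metadata database; the table names and the upload helper are made up for illustration:

```python
import sqlite3

def upload(location: str, data: bytes) -> None:
    """Stand-in for the object storage upload; it may fail at any point."""
    ...

def backup_with_wal(db: sqlite3.Connection, repo: str, location: str, data: bytes):
    # 1. Record the intent to write before touching object storage.
    with db:
        db.execute("INSERT INTO write_ahead_log (repo, location, created_at) "
                   "VALUES (?, ?, strftime('%s','now'))", (repo, location))
    # 2. Upload. If this raises, the log row survives, and the cleanup task
    #    will eventually delete whatever half-written data exists.
    upload(location, data)
    # 3. One transaction: record the finished backup and clear the intent,
    #    so the two tables can never disagree.
    with db:
        db.execute("INSERT INTO backups (repo, location) VALUES (?, ?)",
                   (repo, location))
        db.execute("DELETE FROM write_ahead_log WHERE repo = ? AND location = ?",
                   (repo, location))
```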
Why do we bother with this extra transaction, with this extra table? Well, here's the thing about distributed systems: when you're inside the system, you can't really know what's wrong with it. There could be any number of reasons why the upload might fail. Maybe I can only upload half of the files to the object storage, or maybe the database is too overloaded, or there's some other error and it rejects my update, or the network is acting up again.
So we just don't do a rollback: if we find an error, we go away. What we then also have is a task that makes sure that this data does get deleted eventually. This gets into how to handle complexity at large scale: we want to move the complexity towards the system as a whole and away from the individual steps.
If an entry has been sitting in the write-ahead log for too long, I assume that the backup actually failed and no one cleaned up, because that's the whole point. So this task goes and deletes that old data, and if it hits an error, even as it's trying to delete the data from the object storage system, it will also just error out, and in the next hour another task will try again. This deals primarily with network errors: performing this backup involves at least three servers in two different data centers, so the network, you just don't trust it.
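A sketch of that cleanup task, against the same hypothetical schema as above; the one-hour threshold mirrors the hourly cadence just mentioned:

```python
import sqlite3, time

def delete_from_object_storage(location: str) -> None:
    """Stand-in for the object storage delete; it may also fail."""
    ...

def clean_stale_entries(db: sqlite3.Connection, max_age_s: int = 3600) -> None:
    cutoff = time.time() - max_age_s
    stale = db.execute("SELECT repo, location FROM write_ahead_log "
                       "WHERE created_at < ?", (cutoff,)).fetchall()
    for repo, location in stale:
        # An intent this old means the backup failed and nobody cleaned up.
        delete_from_object_storage(location)
        with db:
            db.execute("DELETE FROM write_ahead_log WHERE repo = ? AND location = ?",
                       (repo, location))
    # No try/except: any error aborts the run, and the next hourly run retries.
```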
We have another kind of periodic task as well, which is the reason we can just abort on error during the backup: one that goes through all of the repositories that we have and checks, hey, is this one up to date? Is this one up to date? Yes, no. If it's not, a task gets queued and we try again. This is how we keep all of the individual steps simple while the system itself is self-healing. It has a goal, namely that everything is up to date and there is no data in the write-ahead log older than an hour, and it just keeps trying: if it finds that the current reality doesn't match what it thinks things should look like, we try to fix it, and we keep trying to fix it for as long as it is wrong.
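That loop might look something like this sketch; every helper here is hypothetical, and the important part is that it only ever re-queues work:

```python
def latest_backed_up_state(repo): ...  # reads the backups metadata (hypothetical)
def live_state(repo): ...              # asks the live Git servers (hypothetical)
def enqueue_backup(repo): ...          # queues a normal backup task (hypothetical)

def reconcile(all_repos) -> None:
    """Self-healing pass: re-queue a backup for anything out of date."""
    for repo in all_repos:
        if latest_backed_up_state(repo) != live_state(repo):
            # Individual attempts may still just abort on error; this pass
            # runs again and again for as long as reality is wrong.
            enqueue_backup(repo)
```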
The rough cases, where we do get paged, are when we have the odd repository that can't actually be backed up.
So this is how we do all of these tiny backups. That's very good, because it means that we can very quickly get data off-site, but it also means that the storage is very inefficient. You might have seen that sometimes you run a command and Git says "Auto packing the repository in background for optimum performance". Well, we have to do the same thing: when you have too many tiny pack files from all of these tiny incremental backups, that's very slow in multiple respects.
It means that when it comes time to restore the repository, any overhead we have per pack gets multiplied by a thousand, by two thousand, by however many backups we've done, and the resulting repository is going to be slower to access and to operate on. And we want to be able to restore into a disaster recovery site in case of an emergency; if everything has thousands of tiny pack files, that's very, very inefficient, and the website would slow down to a crawl.
So we just do the garbage collection ourselves. We go to a machine with lots of storage and we do a restore onto it; then we run GC (it's our own version of Git's GC, not the vanilla one, but it's the same idea). After that is done, we package it back up again and upload it as a full backup, and then any new backups can base off of this one and still be incremental.
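As a sketch, that cycle looks roughly like this; the restore and upload helpers are hypothetical, and plain git gc stands in for GitHub's own variant:

```python
import subprocess, tempfile

def restore_repo(repo_id: str, workdir: str) -> str: ...   # replay full + incremental packs
def upload_full_backup(repo_id: str, repo_path: str): ...  # store the new baseline

def compact_backups(repo_id: str) -> None:
    with tempfile.TemporaryDirectory() as workdir:
        repo = restore_repo(repo_id, workdir)
        # Collapse thousands of tiny packs into one well-packed repository.
        subprocess.run(["git", "-C", repo, "gc"], check=True)
        # Future incremental backups can base off this new full backup.
        upload_full_backup(repo_id, repo)
```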
One other very important aspect of this is that we are constantly, systematically running our restore code: every day we'll restore thousands of repositories. This is very important, because, well, how often have you heard of some company that lost data and only then realized, oops, the backups were empty, or they were backing up the wrong thing, or they don't even know where the backups are or how to restore them? We know our restores work; the last time we tried was today.
Now, quickly, on the topic of encryption, which I mentioned: each backup that we perform has its own symmetric key that we encrypt all of the data with. Then we encrypt this key with a wrapping key, and we put it right next to the data it's encrypting. These wrapping keys are only available on our own machines and never reach the object storage, and in order to decrypt any of this data, you do need the wrapping keys.
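That scheme is envelope encryption. A minimal sketch using Fernet from the cryptography package, which is an assumption; the talk doesn't name the actual primitives:

```python
from cryptography.fernet import Fernet

def encrypt_backup(data: bytes, wrapping_key: bytes) -> tuple:
    """Encrypt with a fresh per-backup key; only the wrapped key goes off-site."""
    data_key = Fernet.generate_key()                      # per-backup symmetric key
    ciphertext = Fernet(data_key).encrypt(data)
    wrapped_key = Fernet(wrapping_key).encrypt(data_key)  # stored next to the data
    return ciphertext, wrapped_key

def decrypt_backup(ciphertext: bytes, wrapped_key: bytes, wrapping_key: bytes) -> bytes:
    # The wrapping key never leaves our machines, so object storage alone
    # is never enough to read a backup.
    data_key = Fernet(wrapping_key).decrypt(wrapped_key)
    return Fernet(data_key).decrypt(ciphertext)
```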
We also regularly generate new wrapping keys, and then we go through, and any symmetric key that is encrypted with an old wrapping key, we re-encrypt. This is so that, in case these wrapping keys were to leak, which they really, really shouldn't, and we don't expect they will, but just in case, we are being proactive and they will only have a very limited lifetime. Even if we only found out months after the leak that someone had them, we would not be scrambling at that point to rotate the keys; we are constantly rotating them.
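A nice property of this layout is that rotation only touches the small wrapped keys, never the bulk data. Continuing the hedged Fernet sketch from above:

```python
from cryptography.fernet import Fernet

def rotate_wrapping_key(entries, old_key: bytes, new_key: bytes) -> None:
    """Re-wrap every per-backup key still wrapped with the old wrapping key.

    `entries` is a hypothetical iterable of backup records, each with a
    `wrapped_key` attribute and a `save()` method.
    """
    for entry in entries:
        data_key = Fernet(old_key).decrypt(entry.wrapped_key)
        entry.wrapped_key = Fernet(new_key).encrypt(data_key)
        entry.save()  # the encrypted repository data itself is untouched
```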
So, in conclusion on the design. Is it more robust? Yes, absolutely; all of the headaches that we had with the old one are gone. Is it more efficient? Yeah, we actually have ninety percent savings in cost, and that's not a typo, and it is also faster to do the backup now. Is it more secure? Yeah, we are much more proactive about securing the data and making sure that, if anything leaks, we're actually less vulnerable than
we were before. And was it worth it? Yeah, both in the sense that it was worth roughly the one year that I was working on this, and because we can now be proud of how we do backups; we can feel confident that we are leaders in this topic. A more general conclusion on complex systems: we actually got very far with just standard tools. The old system that I described, essentially just vanilla Git and rsync, ran for years.
It was expensive, but you can throw money at a problem if it's worth spending your engineers' time on something else. We did reach a size, though, where we recognized that it's actually worth the effort to build something that's very specific to us, where we can use all of these little bits of information about what is and isn't efficient in our own infrastructure, and do the one and not the other.
Finally, I mentioned that this took about a year to build, or it was about a year from starting it to turning off the old system, but it's also been running for about as long. We started by backing up just a few repositories, to test: is this behaving like we expect it to? Are we adding too much load on the servers?
Is the restore working right? Slowly, more and more repositories were being backed up, and eventually we were running both systems at 100%. Then we switched the restoring of deleted repositories for support over to the new system, and initially we had a fallback to the old system, just in case we found an error.
It is a lot of extra work to make sure that both systems can coexist, that we can run the old system and the new system without them stepping on each other, and that we have the data in both. But it is worth it, because it means we were gaining confidence in the new system with no point at which we were devoid of the safety of a backup system. In fact, for roughly the last year, we've had two backup systems keeping the data safe. And, well, that's it.