From YouTube: Git Internals: a Database Perspective - Git Merge 2022
Description
Presented by Derrick Stolee
The inner workings of Git's object database can be a mystery to most users. When you incorporate a database into your infrastructure, you are expected to learn about database internals such as table indexes, query plans, and sharding. Similar features exist in Git, and learning about them can advance your use of Git to the next level.
About Git Merge:
Git Merge is dedicated to amplifying new voices in the Git community and showcasing the most thought-provoking projects from developers, maintainers, and teams around the world. Git Merge 2022 took place at Morgan Manufacturing in Chicago, IL on September 14th and 15th.
I am so excited to be surrounded by like-minded folks such as you today. I want to talk about an idea, and this idea should not be controversial or even surprising to this audience, but I hope that it gives us a framework and a vocabulary to use as you leave this bubble of Git superfans and go back to your own organizations, spreading the good word. The idea is this: Git is the distributed database at the core of your engineering system.
When you think about it, Git is the center of our collaboration infrastructure. Not only does it let multiple developers work concurrently on the same repository, but it also links to our build and test infrastructure. It dictates which versions are deployed or released to customers, and stakeholders tend to watch the repository to measure progress. As a parallel, your application database is the core of your application infrastructure.
The application database is used to store all the data that you are manipulating and serving through all of your services. Your background jobs process data from the database asynchronously from user requests. Your infrastructure probably has some way of doing database backups and failover remediation, and don't forget that your database health is monitored by support and SRE.
So we've already seen three talks today about engineering systems investing heavily in Git to make it work exactly with their needs. These talks from Uber, Twitter, and Canva show what's possible when you make that kind of investment. But, you know, not every engineering system is large enough to really merit that kind of investment; many organizations rely on the Git client out of the box, or on the features of their Git host of choice.
So
here's
my
personal
pairing
of
Concepts
from
application
databases
and
how
they
pair
with
Git
Concepts
at
their
core
application.
Databases,
store
tables
of
data,
get
stores
objects
in
its
Object
Store.
In
order
to
access
or
manipulate
data.
Databases
have
query
languages
gets
crew.
Language
is
its
command
line
interface.
A
A
Now, we can think about Git as a decentralized version control system, so it's sort of like a distributed database. Distributed databases need to have mechanisms to synchronize as concurrent requests come in and manipulate data. Git is disconnected by default, but we still need to be able to synchronize across repositories on user demand through fetches and pushes. And finally, as application databases grow beyond the limits of a single node, they turn to sharding.
This is what I mean by giving you a framework and a vocabulary: hopefully you can use these concepts when you go back to your organizations and talk to your fellow developers. By starting with this common understanding of application databases, this might help bridge the gap to those Git concepts.
The reference name is the primary key, and these are human-created names that give us pointers into the object store. In this example, we have the refs/tags/v2.37.0 ref pointing to an object in the object store. If we load that object's data, we see an annotated tag. Annotated tags have a human-written message, but they also point to another object in the object store using an object ID. That object, in this case, happens to be a commit. Commits have commit messages, and they link to their parents.

But specifically, let's look at this object ID for its root tree. For its contents, we have a tree that contains many tree entries, which correspond to what file or path we see at these different names. So specifically, look at the README.md entry and find that object ID, and when we look at its contents, we see a blob object, which stores the contents of that file.
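[Editor's note: the chain described here can be walked by hand with `git cat-file`. Below is a minimal sketch using a throwaway repository; the tag name v2.37.0 mirrors the example, and the file contents are invented for illustration.]

```shell
# Walk ref -> tag -> commit -> tree -> blob, as in the example above.
set -e
work=$(mktemp -d) && cd "$work"
git init -q repo && cd repo
git config user.email you@example.com && git config user.name You
echo hello > README.md
git add README.md && git commit -qm "Initial commit"
git tag -a v2.37.0 -m "Example annotated tag"

tag_oid=$(git rev-parse refs/tags/v2.37.0)      # the ref is the primary key
git cat-file -t "$tag_oid"                      # prints: tag
commit_oid=$(git cat-file -p "$tag_oid" | sed -n 's/^object //p')
git cat-file -t "$commit_oid"                   # prints: commit
tree_oid=$(git cat-file -p "$commit_oid" | sed -n 's/^tree //p')
git cat-file -t "$tree_oid"                     # prints: tree
blob_oid=$(git cat-file -p "$tree_oid" | awk '$4 == "README.md" {print $3}')
git cat-file -p "$blob_oid"                     # prints: hello
```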
I'll always keep the commit history there at the top, followed by a row of root trees, and then, as you scan down through the entries of those root trees and get deeper down, we see that a lot of things are actually shared. Even though each commit is a snapshot of the worktree at a given point in time, two commits actually share a lot of objects in common. This Merkle tree representation is the first way that Git saves on the size of your object store, even as users are making changes.
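[Editor's note: a quick way to see this sharing for yourself, sketched with a throwaway repository and invented paths.]

```shell
# Two commits: the root trees differ, but the untouched subtree is shared.
set -e
work=$(mktemp -d) && cd "$work"
git init -q repo && cd repo
git config user.email you@example.com && git config user.name You
mkdir docs src
echo guide > docs/guide.md
echo main > src/main.c
git add . && git commit -qm "First snapshot"
echo changed > src/main.c
git commit -qam "Change src only"

git rev-parse "HEAD^{tree}" "HEAD~1^{tree}"   # two different root tree IDs
git rev-parse HEAD:docs HEAD~1:docs           # the same tree ID, printed twice
```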
Here I've highlighted objects that are common across multiple commits, based on where they appear. At the top we have all of our root trees, which correspond to the base of the worktree across all these points in time, and you can also think about these three blobs at the bottom: maybe they correspond to the same source code file.
A
Git
can
expect
that
these
objects
will
actually
share
a
lot
of
data
in
common
because,
as
software
developers,
we
rarely
change
a
huge
amount
of
the
code
at
a
time
instead
making
very
calculated
small
changes
when
we
have
that
kind
of
data
get
can
use.
An
extra
type
of
compression
called
Delta
compression.
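[Editor's note: a hedged sketch of delta compression in action. In a throwaway repository, two nearly identical versions of a file are packed together; `git verify-pack -v` should then report at least one delta chain, though the exact packing decisions are up to Git.]

```shell
# Two similar blobs packed together; verify-pack reports the delta chains.
set -e
work=$(mktemp -d) && cd "$work"
git init -q repo && cd repo
git config user.email you@example.com && git config user.name You
seq 1 500 > data.txt                         # a sizable file
git add . && git commit -qm "Base version"
{ seq 1 499; echo five-hundred; } > data.txt # a small, calculated change
git commit -qam "Small change"
git repack -adq                              # pack everything together

idx=$(ls .git/objects/pack/*.idx)
git verify-pack -v "$idx" | grep "chain length"
```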
In this way, we've reconstructed the decompressed object, but we stored those two objects using significantly less space than if we had stored them both fully decompressed. Now, I mentioned that loose objects can't take advantage of delta compression; instead, we need a different format, and that's where pack files come in. Pack files essentially take all of the object contents, either in base form or in delta form, and concatenate them into a list; they're all packed together.
This is a very efficient way to store the data, but if I come in with an object ID whose contents I want, I can't just parse the pack file and expect that to run quickly; I'd have to scan every object and rehash it to see if it matches. Instead, Git has a custom query index called a pack index. It's a .idx file that matches the .pack file. The first thing Git does when it has this input object ID is use its leading byte to jump, via the index's fanout table, to a small range of the sorted object ID list.
Within that range, we can do binary search to find the exact object ID we're looking for, and once we have that position inside the sorted list, it corresponds to another position inside a list of offsets, and that offset provides us the original position of the object in the pack file. So we've found our object; great, now what do we do with it? Well, the initial segment of that object's data includes a type and a length, which lets us know how much data of the pack file corresponds to this object.
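[Editor's note: a small sketch showing the .pack/.idx pairing on disk; the fast `git cat-file` lookups described here go through that index. Throwaway repository, invented file name.]

```shell
# Every .pack on disk is paired with a .idx that makes lookups by ID fast.
set -e
work=$(mktemp -d) && cd "$work"
git init -q repo && cd repo
git config user.email you@example.com && git config user.name You
echo hello > file.txt
git add . && git commit -qm "Initial commit"
git repack -adq

ls .git/objects/pack/          # a .pack file and its matching .idx
oid=$(git rev-parse HEAD)
git cat-file -s "$oid"         # type/size lookup served via the pack index
```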
If the object is a delta, the contents of that object include an offset value to a base object earlier in the pack file. Git can then take those two objects and do delta decompression to reconstruct the full object contents that match the input object ID. Git does this thousands of times as it's running your commands, so this is a very fast operation that happens repeatedly.
The first thing that happens is the client asks the server for a list of references, and the server provides a list of references and their object IDs at its current point in time. The client scans these references and says, "these are the ones that are important to me," and then also notices, "these are the objects that aren't in my repository." From this point on, though, all the communication will take place via object ID, in case the server has its references move.
The server can now infer that the client actually has all of the objects reachable from that commit, giving this region of objects known to be on the client. Therefore, what we need to do is find the objects that are reachable from the wants but not reachable from the haves in order to satisfy this fetch request; that gives us this region of objects.
So in this way, even though the client may have sent a small list of haves and wants, that may correspond to a very large set of objects known to the server. Now that the server has figured out what the client needs, it can take those object contents, concatenate them together into a pack file, and send that pack file over the wire. Again, this pack file can use full objects, and it can use offset deltas to previous objects in the pack. But we can also use an additional type of compression that's really helpful.
In this case, we can use a thin pack, which adds reference deltas: instead of pointing within the same pack file to a base object, a reference delta can point to an object via object ID, and that object is expected to exist on the client based on that list of haves. This gives us even further compression than we would have had before.
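[Editor's note: the haves/wants exchange can be observed end to end with two throwaway local repositories standing in for client and server; all names below are invented.]

```shell
# A fetch negotiates haves/wants, then transfers only the missing objects.
set -e
work=$(mktemp -d) && cd "$work"
git init -q server && cd server
git config user.email you@example.com && git config user.name You
echo one > f && git add f && git commit -qm "c1"
cd .. && git clone -q server client
cd server && echo two > f && git commit -qam "c2"
new=$(git rev-parse HEAD)        # a commit the client does not have yet
cd ../client
git fetch -q origin              # client sends haves/wants, receives a pack
git cat-file -t "$new"           # prints: commit -- the object arrived
```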
So you might be thinking: it's great that Git has my back and is doing all these complicated things under the hood to make my Git commands fast, but what can I do about that? Well, I'm here to say that you are in control of your repository. You determine its shape, and you can influence the norms of your organization. I'm going to get into some really big-picture items, but first I want to give you a couple of quick tips that you can use to take advantage of some things we've already talked about.
The first is that you should run `git maintenance start` in all of your favorite repositories. This will start running some background maintenance, including hourly fetches to all of your remotes, which makes your foreground fetches a lot faster; in fact, each of those fetches will be fast because it's getting a smaller set of objects. It also repacks your repository's object store incrementally every night, making sure that you are saving data on disk while still having fast lookups for your Git commands.
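[Editor's note: `git maintenance start` registers these tasks with your system scheduler. A sketch of invoking two of the same tasks directly, in a throwaway repository so nothing is actually scheduled; task names assume a reasonably recent Git client.]

```shell
# Run two of the maintenance tasks by hand instead of scheduling them.
set -e
work=$(mktemp -d) && cd "$work"
git init -q repo && cd repo
git config user.email you@example.com && git config user.name You
echo hi > f && git add f && git commit -qm "c1"

git maintenance run --task=commit-graph   # write the commit-graph file
git maintenance run --task=gc             # repack and prune
git rev-parse HEAD                        # repository is intact afterwards
```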
The other thing I want to mention is that you should use good repository hygiene. You really want to take advantage of delta compression whenever possible. The good news is that objects that don't compress well also don't really diff well or merge well, and so they don't present as reviewable changes in your pull requests. So why are they in your source control system?
Now, don't take my word for it, because you saw Emily's talk earlier today about submodules, so I'll refer all questions to her as the expert in that space. The one thing I can say is that it's difficult to move into a model of submodules if you didn't start out that way; it's hard to carve out pieces of your repository that could be treated as independent submodules. So it'd be nice if we could do something where we didn't need to modify the worktree, but we could still do some sort of sharding.
So let's take a look at this repository and imagine it has a huge commit history, but we're going to focus on a single tip at the moment. To make a time-based shard, we can archive this repository and stop writing to it, but create a new repository with a single root commit whose root tree is identical to the root tree of the previous tip. In this way, they would have the same checkout. The only thing is: you've lost all the commit history.
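[Editor's note: a sketch of the time-based shard idea using `git commit-tree`. For simplicity the "new shard" is a second branch in the same throwaway repository; the branch and message names are invented.]

```shell
# Start a "new shard": one root commit reusing the old tip's root tree.
set -e
work=$(mktemp -d) && cd "$work"
git init -q repo && cd repo
git config user.email you@example.com && git config user.name You
echo one > f && git add f && git commit -qm "c1"
echo two > f && git commit -qam "c2"          # stand-in for years of history

# New root commit: no parents, same root tree as the current tip.
tree=$(git rev-parse "HEAD^{tree}")
root=$(git commit-tree -m "Start of new shard" "$tree")
git update-ref refs/heads/shard "$root"

git rev-list --count shard                    # prints: 1 (history starts over)
git diff --stat HEAD shard                    # prints nothing: same checkout
```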
In this way, you can still have that full-history view when you need it, but this isn't a very efficient way to operate, so I don't recommend keeping it this way all the time; it's there when you need it, and as you move forward in the new repository, you'll need it less and less often. The next strategy isn't actually a sharding strategy, but it comes from the idea of data offloading. Databases can offload data that's infrequently used to cheaper storage, and then keep the fast and expensive storage focused on the important pieces.
So if we want to think about this in the Git world, we can think about our object graph. Commits are super cheap to store, and they're really important to many Git commands, so we're going to treat them as critical. Furthermore, their root trees are going to delta-compress really well, and they're also very frequently used by Git commands. So let's consider all commits and root trees as important, critical data.
We can take a copy of the full repository, put it on a read-only network share, and use it as a Git alternate, and have our local repositories, with our expensive local disk, focus only on this critical new data. I think this is a really cool idea; I don't know if anyone has ever done it, so if you're looking for a project, give it a try.
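[Editor's note: the alternates mechanism mentioned here can be sketched locally. A local directory stands in for the network share; the repository names are invented.]

```shell
# Offloading: borrow old objects from a shared archive via alternates.
set -e
work=$(mktemp -d) && cd "$work"
git init -q archive && cd archive
git config user.email you@example.com && git config user.name You
echo history > f && git add f && git commit -qm "old history"
old=$(git rev-parse HEAD)

cd .. && git init -q local && cd local
# Point this repository at the archive's object store (the "cold" tier):
echo "$work/archive/.git/objects" > .git/objects/info/alternates
git cat-file -t "$old"            # prints: commit -- served from the archive
```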
The first is that we've talked about sharding strategies, but the only blessed sharding strategy, in terms of Git features right now, is submodules. Independent multi-repos aren't really supported, because you're going outside the boundaries of Git at that point. But these time-based shards and data offloading are something that maybe we could consider adding to Git as a feature. Maybe we can make a magic button to create these types of shards, or third-party tools could probably create these shards without even modifying the Git client, so there are lots of possible directions here.
The second idea is that I talked about pack files a lot, and how the same format is used for network transfer and for our on-disk format. Now, these pack files are immutable once they're written. Taylor was talking a lot about full repacking being very expensive, and that's because we can't just add a few objects to a pack and move on; we need to essentially repack all of our objects into a new pack and then delete the old ones, which gets really expensive when you have a very large repo.
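[Editor's note: a hedged sketch of one mitigation Git offers for this expense, keeping several packs and indexing them together with a multi-pack-index rather than rewriting one giant pack. This goes slightly beyond what the transcript states; the commands assume a reasonably recent Git client.]

```shell
# Incremental repacks create several packs; a multi-pack-index spans them.
set -e
work=$(mktemp -d) && cd "$work"
git init -q repo && cd repo
git config user.email you@example.com && git config user.name You
echo one > f && git add f && git commit -qm "c1"
git repack -dq                            # pack 1
echo two > f && git commit -qam "c2"
git repack -dq                            # pack 2; pack 1 is left untouched
git multi-pack-index write                # one index over both packs
ls .git/objects/pack/multi-pack-index
```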
So I have limited time today, but I went super deep on all these concepts in a five-part blog series on the GitHub engineering blog. I hope this has inspired you to go take a look, or at least to use it as a reference in the future. And finally, I want to leave you with this. If you didn't learn anything else from this entire talk, make sure it's this one concept: Git is the distributed database at the core of your engineering system.