GitHub Git Merge 2022, 18 Oct 2022

From YouTube: Build-aware sparse checkouts - Git Merge 2022

Description

Presented by Waleed Khan + David Bernadett

Twitter has developed a tool called focus which manages sparse checkouts as defined by targets in Bazel. By carefully defining the dependency model, we can precompute dependency queries such that users can create the sparse checkout without necessarily having to invoke Bazel first in a dense checkout.

About Git Merge:
Git Merge is dedicated to amplifying new voices in the Git community and showcasing the most thought-provoking projects from developers, maintainers, and teams around the world. Git Merge 2022 took place at Morgan Manufacturing in Chicago, IL on September 14th and 15th.

A

Hello, everyone. We're from Twitter, and we're here for the second of three presentations on Bazel and sparse checkouts. Yeah, it's a common problem. So Twitter Focus is our open source software tool for managing build-aware sparse checkouts.

A

My name is David Bernadett. I'm a software engineer on the Source team, and this is my co-presenter, Waleed. Waleed is on the Build team at Twitter and is our senior software engineer, and really brought this whole project together.

A

So our agenda today: first, we're going to talk about the difficulties of the monorepo at Twitter, and then we're going to talk about our solution for these difficulties through our build-aware sparse checkouts. Then Waleed will talk about our different Bazel caching strategies, and then finally we'll talk about adopting Focus for other teams using Bazel and monorepos.

A

So, let's begin with our monorepo difficulties. At Twitter, we have a very large monorepo, again around half a million files. It seems to be kind of common between these companies. And we have a working tree size of 6.4 gigabytes.

A

The size of this working tree causes a bunch of fundamental Git commands to operate slowly, such as git status, git checkout, and git fetch, and also directory traversal tools such as find.

A

So our solution to this large working tree is also sparse checkout, but sparse checkout isn't super easy to deal with directly.

A

So that brings us to build-aware sparse checkouts: aligning checkouts with build graphs.

A

The fundamental problem with sparse checkouts is that, when you're trying to build a particular target with a handmade sparse checkout, it's really, really common that you run into file-not-found errors. In an ideal world, we would have a sparse checkout that is aligned exactly with our build graph, so that when you do run bazel build, you always build a successful package and avoid any missing files. So how do we get to this aligned sparse checkout?

A

The answer, in the Bazel world, is bazel query. bazel query can calculate the sparse checkout for us. It just needs an outlining tree, and we consider an outlining tree to be this minimal sparse checkout with only the files needed by bazel query. Concretely, this is mainly BUILD files and .bzl files.
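The step from query output to a sparse checkout can be sketched as follows. This is a minimal illustration in Python, not Focus's actual implementation (Focus is written in Rust). It assumes the dependency labels have already been produced, for example by something like `bazel query 'deps(//some:target)'`, and the helper names `label_to_directory` and `sparse_directories` are hypothetical.

```python
from typing import Iterable, List, Optional


def label_to_directory(label: str) -> Optional[str]:
    """Map a Bazel label such as //tweets/edit:button to its package directory.

    Labels in external repositories (@repo//...) are skipped, because they
    do not correspond to files in the monorepo's working tree.
    """
    if not label.startswith("//"):
        return None
    package = label[2:].split(":", 1)[0]
    return package or "."  # "//:target" lives at the repository root


def sparse_directories(labels: Iterable[str]) -> List[str]:
    """Collapse dependency labels into a minimal sorted list of directories."""
    dirs = {d for d in map(label_to_directory, labels) if d is not None}
    if "." in dirs:
        return ["."]  # the root directory covers everything
    kept: List[str] = []
    # An ancestor directory always sorts before its descendants, so a single
    # pass is enough to drop directories that are already covered.
    for d in sorted(dirs):
        if not any(d.startswith(k + "/") for k in kept):
            kept.append(d)
    return kept
```

A list like this could then be handed to `git sparse-checkout set` in cone mode. Whether Focus works at directory granularity in exactly this way is an assumption of the sketch.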

A

We've seen great impact from sparse checkout. We've seen status latency go down to three seconds; originally, it was around 32. Our typical user will see that their working tree size is about 18 percent of the original, and the typical checkout time has decreased from about 40 seconds to eight.

A

Unfortunately, it isn't just one big win. We find that bazel query is extremely slow, and, ideally, users run bazel query on every checkout to align with the build graph. Five minutes to just do a checkout makes the tool kind of unusable. So now Waleed will talk about our solution to this problem, including caching bazel query.

B

Thanks, David. So, caching bazel query. Like David was saying, we need to cache this, because five minutes to check out is pretty much unusable. I'm going to be discussing two methods today that we use to cache bazel query. The first is a coarse-grained cache and the second is a more fine-grained cache.

B

So normally, when you want to build a sparse checkout profile, you have a set of targets that you want to build, and then you query Bazel to see what the dependencies are for these targets, which requires you to have your working tree ready to service Bazel queries. With these two things, you can create a sparse checkout profile, which Git will accept and use to create your sparse checkout.

B

So this is exactly the function we want to cache: a set of targets along with the state of the working directory, so that's a commit hash, for example. And the value of this cache is the sparse checkout profile.
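As a rough illustration of the coarse-grained cache just described, here is a minimal Python sketch. It is not Focus's code: the names `coarse_cache_key` and `CoarseCache` are hypothetical, the in-memory dictionary stands in for whatever storage is really used, and `compute_profile` stands in for the slow Bazel query step.

```python
import hashlib
from typing import Callable, Dict, Iterable, List


def coarse_cache_key(targets: Iterable[str], commit_hash: str) -> str:
    """Hash the commit and the (order-independent) target set into one key."""
    h = hashlib.sha256()
    h.update(commit_hash.encode())
    for target in sorted(set(targets)):
        h.update(b"\0")  # separator so adjacent fields cannot run together
        h.update(target.encode())
    return h.hexdigest()


class CoarseCache:
    """Maps (targets, commit) to a sparse checkout profile with one lookup."""

    def __init__(self, compute_profile: Callable[[List[str], str], List[str]]):
        self._compute = compute_profile  # the expensive step, e.g. a Bazel query
        self._store: Dict[str, List[str]] = {}
        self.misses = 0

    def profile_for(self, targets: List[str], commit_hash: str) -> List[str]:
        key = coarse_cache_key(targets, commit_hash)
        if key not in self._store:
            self.misses += 1
            self._store[key] = self._compute(targets, commit_hash)
        return self._store[key]
```

Because the commit hash is part of the key, any new commit produces a different key and therefore a miss, which is the trade-off with this kind of cache.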

B

So yeah, the key here is the targets and the commit hash, and the value is the sparse checkout profile. The advantage of this is that it's pretty simple to implement. You only need to do one lookup in your cache to find the necessary data, so that's extremely fast. But the disadvantage is that even a small change to your working tree can cause a cache miss. So if I make a change to a single source file, then your entire cache key will change, and you won't be able to access the cache and get your sparse checkout.

B

So this approach is much better when you have a set of commits that are well known and a set of targets that you know ahead of time, that you can build these caches for. So, for example, if you are pushing commits to your main branch, those are good candidates for commits to build caches for. And the second approach that I'll cover is a fine-grained approach.

B

So, instead of caching the sparse checkout profile for the entire commit or the entire working tree state, we instead cache data at the granularity of a single target at a time. So pretend I have this target called edit tweet that implements the tweet edit button, and it has some dependencies: for example, a dependency on the button target, on the tweet target, and then maybe there's a non-Bazel dependency that's just a bunch of boilerplate files.

B

So, in order to analyze this and produce a fine-grained cache, we look at the BUILD file that's associated with the edit tweet target. This BUILD file declares the dependencies for this target, and it also loads some .bzl files, which contain definitions that are used to determine the dependencies. We access this through a parse function, which takes the BUILD file and traverses the .bzl dependency tree to get all the loaded files, and from this it produces a cache key by combining all of this data into a single hash.

B

After we have this cache key, we can then just query Bazel with a function called bazel analysis that takes a target and returns the actual dependencies for that target, and this together is enough to make our cache. The parse function, critically, doesn't require you to actually evaluate a bazel query. You can calculate its result just using textual content from the BUILD files and .bzl files, and this forms the cache key.

B

The cache value is the actual dependencies that we got from Bazel.
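The fine-grained scheme can be sketched along these lines. This is a simplified Python illustration under several assumptions, not the actual Focus implementation: files are modeled as an in-memory dictionary from path to text, `load(...)` statements are found with a regular expression rather than a real Starlark parser, only `//pkg:file.bzl` and `:file.bzl` labels are resolved, and the names `resolve_load`, `loaded_files`, `cache_key`, and `dependencies` are hypothetical. The `bazel_analysis` parameter stands in for the expensive query mentioned in the talk.

```python
import hashlib
import re
from typing import Callable, Dict, List, Optional, Set, Tuple

# Matches the first argument of a load("...", ...) statement.
LOAD_RE = re.compile(r'\bload\(\s*"([^"]+)"')


def resolve_load(label: str, package: str) -> Optional[str]:
    """Turn a load label into a repository-relative path, or None if unsupported."""
    if label.startswith("//"):
        pkg, sep, name = label[2:].partition(":")
        if not sep:
            return None
        return f"{pkg}/{name}" if pkg else name
    if label.startswith(":"):
        return f"{package}/{label[1:]}" if package else label[1:]
    return None  # external repositories and other forms are ignored here


def loaded_files(path: str, files: Dict[str, str],
                 seen: Optional[Set[str]] = None) -> Set[str]:
    """Collect a BUILD file plus every .bzl file it loads, transitively."""
    seen = set() if seen is None else seen
    if path in seen or path not in files:
        return seen
    seen.add(path)
    package = path.rpartition("/")[0]
    for label in LOAD_RE.findall(files[path]):
        dep = resolve_load(label, package)
        if dep is not None:
            loaded_files(dep, files, seen)
    return seen


def cache_key(build_file: str, files: Dict[str, str]) -> str:
    """Hash the text of the BUILD file and its loaded .bzl files; no Bazel call."""
    h = hashlib.sha256()
    for path in sorted(loaded_files(build_file, files)):
        h.update(path.encode() + b"\0" + files[path].encode() + b"\0")
    return h.hexdigest()


def dependencies(target: str, build_file: str, files: Dict[str, str],
                 cache: Dict[Tuple[str, str], List[str]],
                 bazel_analysis: Callable[[str], List[str]]) -> List[str]:
    """Return cached dependencies for a target, querying only on a miss."""
    key = (target, cache_key(build_file, files))
    if key not in cache:
        cache[key] = bazel_analysis(target)
    return cache[key]
```

In this sketch, editing an ordinary source file leaves the key unchanged, while editing the BUILD file or any transitively loaded .bzl file changes it and forces a fresh query for that one target only.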

B

So here are some performance numbers for my tool. The first command is using the fine-grained cache, and the second one is without any cache whatsoever: we clear the cache before we actually try to synchronize the working tree with the sparse checkout profile.

B

So in the first case, you can see that it took an average of a little under five seconds to run on a typical project in our monorepo to synchronize a working tree. In the second case, where we actually have to query Bazel because we've cleared out the cache ahead of time, it takes a little over 30 seconds. So that's a performance improvement of a factor of about six, which is pretty significant, and we do expect to be able to optimize further on the time it takes with the cache.

B

Twitter Focus is not just the one sparse checkout feature. It's a lot of other things that developers will be using to integrate sparse checkouts into their workflows.

B

So, for example, by default it does shallow clones to reduce load times and reduce load on your servers. It has facilities for managing old branches in your shallow clones. You can upload and download caches. So these caches we were talking about, maybe your CI machine will generate them and upload them, and your clients will actually download them so that they can have a warm and ready cache. We also have a UI for discovering projects that you may want to use, so you can easily find them and add them to your sparse checkout.

B

So we'd like to invite everyone here with a big Git monorepo and a Bazel build graph to consider adopting Focus. It is open source on github.com/twitter/focus. It is written in Rust. It includes a tutorial to build Bazel using Focus, so you use Focus to check out a part of Bazel so that you can build Bazel using Bazel with Focus.

B

It's also extensible to other build systems, so not just Bazel. If you have a different one, you can add support for that. And I'd like to extend an amazing thanks to all the people on our team who helped us get Focus to where it is today over the last year. So thank you all. They're not here, but they did a lot of work to get us to this state. And thank you all as well for coming to listen to our talk, and we hope that you'll consider using Focus.

A

It looks like we'll have time for one or two questions.

C

This may be a little off topic, but could you speak to the advantages of using a monorepo?

B

Yeah, so with monorepos, you have all your dependencies in one repository. Some examples of things you can do are codebase-wide refactorings, which are a lot harder to orchestrate across a lot of different, smaller repositories.

B

Ideally, you would have just one version of every dependency, which simplifies deployments and builds. That might not always be possible: at Twitter, we have multiple versions of various dependencies in our monorepo. And, generally speaking, it shifts a lot of the tooling pain that you would have from dealing with many different repos and centralizes it to where a single team for source control can start to address those problems for everyone, rather than everyone having to use some other tools to deal with it.

A

And I think most companies also find that they have more than just one repo. There are always little edge cases where people want to mirror an external repo, or maintain a fork, or maintain some open source project like this. You will not get access to our monorepo if you want to use this tool, so it's separate on GitHub.

B

So the question is: how approachable is Focus to new developers? You've been a developer for a little over two months, is that right, and you work with a monorepo at your company? Yes? Okay. So we're certainly hoping it will be approachable.

B

The documentation is in its early stages, but we do have a tutorial, and you're absolutely welcome to post on the discussion board on GitHub. We will try to work with you to, for one thing, get you where you need to be with Focus and, for another, improve the documentation to a point where everyone can successfully use it on their monorepos. So we're very happy to work with you on that.

A

I would say if you have a fairly passing familiarity with Bazel to start, and with Git sparse checkouts, you'll be in pretty good shape. And then, if you wanted to extend it to some particular build system, it's going to require just a little bit more Rust knowledge. But again, we would be happy to help with whatever you're trying to extend it to.

A

Thank you. I think we can probably do one more question. Yeah, there in the back.

C

Hi, thank you. I don't know how far along adoption of Focus is within Twitter, but have you found that dev teams are optimizing their build graph or their dependency graph to take more advantage of Focus? Is that something you've seen happen?

A

We have not seen that happen, and probably the first thing we'll do is try and optimize our initial sparse checkout. There's a bunch of mandatory things you need to make sure that the build system works, and just minimizing that will also improve things, probably more dramatically than having any individual teams try and self-manage their build graph. Thank you.

B

There's actually just a slide here, which maybe we can look at. This is just some other miscellaneous data. At the top are some targets: there's this thing called strata that does code generation, there's GraphQL, and these change every 10 to 50 commits, and then the projects that people are actually working on might change every 100 or 200 commits. So we want to optimize that stuff at the top, the core infrastructure that everyone depends on, to improve the Bazel build graph.

A

Cool. Thank you, everyone.