From YouTube: GitLab 13.2 Kickoff - Create:Gitaly
Description
Planning issue: https://gitlab.com/gitlab-org/create-stage/-/issues/12671
Welcome to the Gitaly group 13.2 kickoff call. My name is James Ramsay, and I am the group product manager for the Create stage of the DevOps lifecycle here at GitLab, and product manager for the Gitaly group in the 13.2 release.
We're continuing some really exciting work that we've been doing on a feature called Gitaly Cluster. Gitaly Cluster allows you to store multiple copies of the underlying repository data on multiple servers. This is really useful for improving the fault tolerance of your Git repository storage.
This is particularly relevant for large GitLab instances that have a real need to be always up, even if a single server were to fail, and it makes high availability configurations possible. We released the very first version of Gitaly Cluster in 13.0, and it is very much a first version.
So there's lots of work improving it that we're doing, and the primary items that I'd like to share with you, that we're going to be working on in the coming release, 13.2, continue to focus on the reliability of Gitaly Cluster.
We've observed some situations where replication can fail, the queue of replication jobs backs up, and the replicas are prevented from catching up. We've observed replication failing in our tests on GitLab.com, so we're looking to resolve that and continue to improve the reliability of Gitaly Cluster. That's really our top priority: making sure the feature is reliable and stable.
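To make that failure mode concrete, here is a minimal sketch of a retrying replication queue, assuming hypothetical names and types (this is an illustration of the concept, not Praefect's actual queue model): a job that keeps failing gets retried, and work for the same replica piles up behind it, so the replica never catches up.

```go
package main

import (
	"errors"
	"fmt"
)

// replicationJob is a hypothetical queued instruction to bring one
// replica up to date for one repository.
type replicationJob struct {
	repo     string
	target   string
	attempts int
}

const maxAttempts = 3

// processQueue retries failed jobs. While a job keeps failing, later
// jobs for the same replica back up behind it and the replica falls
// further behind the primary.
func processQueue(queue []replicationJob, replicate func(replicationJob) error) []replicationJob {
	var backlog []replicationJob
	for _, job := range queue {
		if err := replicate(job); err != nil {
			job.attempts++
			if job.attempts < maxAttempts {
				backlog = append(backlog, job) // retry later
			} else {
				fmt.Println("job dead-lettered:", job.repo, "->", job.target)
			}
		}
	}
	return backlog
}

func main() {
	queue := []replicationJob{
		{repo: "group/app", target: "gitaly-2"},
		{repo: "group/lib", target: "gitaly-2"},
	}
	// Simulate a persistently unresponsive target node.
	failing := func(j replicationJob) error {
		return errors.New("target unresponsive")
	}
	backlog := processQueue(queue, failing)
	fmt.Println("backlog size:", len(backlog)) // the queue backs up
}
```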
The other area is, I guess, improving the characteristics of the feature itself. The first feature that we've been working on, and hopefully will ship a beta version of in 13.2, is a first version of strong consistency. That means when you push a change or make a change to a repository, it's replicated in multiple places. We've got a very rough development version that we're testing and experimenting with; it works for certain write operations and requires all nodes to agree completely.
That consensus requirement is a bit of a problem, because if one node is having a problem, if one Gitaly node becomes unresponsive, it could take down the whole cluster: no write transactions would be committed, because consensus can't be reached. So we're looking at improving the reliability of that, changing the behavior a bit to make it easier to test and deploy on GitLab.com, and we need to document how to turn it on as a beta feature.
So that's strong consistency. The key improvements, which you'll see in this list, are working on the transaction hook, which improves coverage of which write operations are covered by strong consistency. That change should cover all write operations; currently we only cover a subset of them in the development version, so that's really critical. And then we also need to improve the voting strategy, as I mentioned, avoiding the requirement for full consensus.
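As a hedged illustration of the direction described here, assuming hypothetical names (this is a sketch of the idea, not Gitaly's actual voting implementation): a quorum-based voting strategy commits a transaction once a majority of replicas report the same result, so a single unresponsive node can no longer block every write the way full consensus does.

```go
package main

import "fmt"

// vote is a hypothetical record of one replica's result for a write,
// e.g. a hash of the resulting reference state.
type vote struct {
	node   string
	result string
}

// quorumDecision returns the winning result once any result has a
// majority of votes, rather than waiting for all nodes to agree.
func quorumDecision(votes []vote, clusterSize int) (string, bool) {
	counts := make(map[string]int)
	quorum := clusterSize/2 + 1
	for _, v := range votes {
		counts[v.result]++
		if counts[v.result] >= quorum {
			return v.result, true
		}
	}
	return "", false // no majority yet: keep waiting or fail the write
}

func main() {
	votes := []vote{
		{node: "gitaly-1", result: "abc123"},
		{node: "gitaly-2", result: "abc123"},
		// gitaly-3 is unresponsive and never votes.
	}
	if result, ok := quorumDecision(votes, 3); ok {
		fmt.Println("commit write with result", result)
	} else {
		fmt.Println("no quorum; write not committed")
	}
}
```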
So those are the key improvements coming to strong consistency, and hopefully, in combination with other improvements that we're working on, like the performance of Praefect's proxying, they will make a version stable enough to call beta. In parallel to that, we've been working on read distribution: if all the replicas are up to date, so my repository is in the same state on three different servers, reads can be distributed across them.
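A minimal sketch of the idea behind read distribution, with hypothetical names (the real routing lives in Praefect and is more involved): route each read to a randomly chosen replica that is known to be up to date, and fall back to the primary otherwise.

```go
package main

import (
	"errors"
	"fmt"
	"math/rand"
)

// replica is a hypothetical view of one Gitaly node's copy of a repository.
type replica struct {
	node     string
	upToDate bool // true when this copy matches the primary's state
}

// pickReadNode distributes reads across up-to-date replicas at random,
// spreading load instead of sending every read to the primary.
func pickReadNode(replicas []replica) (string, error) {
	candidates := make([]string, 0, len(replicas))
	for _, r := range replicas {
		if r.upToDate {
			candidates = append(candidates, r.node)
		}
	}
	if len(candidates) == 0 {
		return "", errors.New("no up-to-date replica available")
	}
	return candidates[rand.Intn(len(candidates))], nil
}

func main() {
	replicas := []replica{
		{node: "gitaly-1", upToDate: true},
		{node: "gitaly-2", upToDate: true},
		{node: "gitaly-3", upToDate: false}, // still catching up; excluded
	}
	node, err := pickReadNode(replicas)
	if err != nil {
		fmt.Println("fall back to primary:", err)
		return
	}
	fmt.Println("serve read from", node)
}
```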
The number one priority is getting the baseline feature a little bit more reliable in certain edge cases, and then working on strong consistency and read distribution, which are two very significant and important improvements that a lot of customers are looking for in this cluster feature. And then, finally, there are some administration areas that we're working on as well, particularly around detecting data loss and knowing the state that all the repositories on the different replica nodes are in today. Currently we use the replication queue to determine status, and we're evaluating improvements on that front.
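As a hedged sketch of what data-loss detection could look like conceptually, assuming hypothetical names and a simple generation counter (not the actual Praefect model): compare each replica's last applied change against the primary's latest, and flag replicas that are behind.

```go
package main

import "fmt"

// replicaState is a hypothetical record of how far a replica has applied
// changes, expressed as a monotonically increasing generation number.
type replicaState struct {
	node       string
	generation int
}

// staleReplicas reports replicas behind the primary's generation, i.e.
// copies that would lose data if the primary disappeared right now.
func staleReplicas(primaryGen int, replicas []replicaState) []string {
	var stale []string
	for _, r := range replicas {
		if r.generation < primaryGen {
			stale = append(stale, r.node)
		}
	}
	return stale
}

func main() {
	replicas := []replicaState{
		{node: "gitaly-1", generation: 42},
		{node: "gitaly-2", generation: 40}, // behind: replication pending
	}
	fmt.Println("stale replicas:", staleReplicas(42, replicas))
}
```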
So those are all the improvements to Gitaly Cluster. I'm really excited to continue to see this feature move forward and improve release after release. The other area that we're beginning to work on is adding new ways for you to back up your repository data in self-hosted environments. Today there is a single script that you can run on the server.
It's documented as the backup task; here's a screenshot of running it. There's this gitlab-backup command, and you can pass in a bunch of arguments to say only back up repositories, or to include or exclude different things. This is the documented way to do a backup, but it doesn't work for large GitLab instances.
When we have many gigabytes of repositories, it gets very slow, so it's not really feasible for most real-world large GitLab instances to use the documented backup script for backup and restore of Git repository data. So the first improvement that we're working on as part of this is to add concurrency support. At the moment the backup process is serialized: we take a backup of one repository, and then we move on to the next, and then the next, and then the next.
This means that in a lot of cases a highly provisioned and powerful server is underutilized: all the cores might not be used, or the memory might not be used. We really want to make backup faster, and so we're working on adding concurrency support so that you can control the concurrency factor.
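To illustrate the concurrency factor idea (a sketch with hypothetical names, not GitLab's actual backup code), a counting semaphore caps how many repository backups run at once instead of strictly one after another:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// backupRepo stands in for the real per-repository backup work.
func backupRepo(name string) {
	time.Sleep(100 * time.Millisecond) // simulate I/O-bound backup work
	fmt.Println("backed up", name)
}

// backupAll backs up repositories with at most `concurrency` running
// concurrently, controlled by a buffered channel used as a semaphore.
func backupAll(repos []string, concurrency int) {
	sem := make(chan struct{}, concurrency)
	var wg sync.WaitGroup
	for _, repo := range repos {
		wg.Add(1)
		sem <- struct{}{} // blocks once `concurrency` backups are running
		go func(r string) {
			defer wg.Done()
			defer func() { <-sem }()
			backupRepo(r)
		}(repo)
	}
	wg.Wait()
}

func main() {
	repos := []string{"group/app", "group/lib", "group/docs", "group/infra"}
	backupAll(repos, 2) // concurrency factor of 2
}
```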
We're also continuing to document more workflows, like how to undo a partial clone once you've done one. Long term, we're still working on how we offload large files from the high-performance storage that most parts of the repository should be stored on, and move those large files onto more cost-effective storage mediums like object storage, doing that transparently in the backend.
So we're working on some ideas on that front, and our objective for 13.2 is to continue to firm those up and start getting feedback from the broader Git community, because this would likely involve changes to Git itself.
So those are our three priorities. The biggest and most substantial priority, of course, is Gitaly Cluster, and then we're spinning up the work on incremental repository backups.