From YouTube: GitLab 14.10 Kickoff - Enablement:Database
Description
Kickoff for the Database Group for the GitLab 14.10 release
Planning issue: https://gitlab.com/gitlab-org/database-team/team-tasks/-/issues/242
Database Group Past Kickoff Videos: https://youtube.com/playlist?list=PL05JrBw4t0KqP3MYrcoQHrqPUqn_jJZSN
14.8 Kickoff video with a more detailed presentation of features: https://youtu.be/eek6Sn0p5gc
Presentation by: Yannis Roussos, Sr. Product Manager, Memory and Database Groups
Hi, I'm Yannis Roussos, the product manager of the Database Group, and I'd like to take you through what we're trying to achieve in GitLab 14.10, which is scheduled to be released on April 22nd, 2022.

As a quick recap of what we are doing: the core mission of the Database Group is to build the application code, tools, and frameworks that allow every GitLab feature to interact with the database in the most reliable and performant way possible.
We also build the tools and product features that allow any GitLab team member to efficiently develop code that interacts with the database, test against production-grade data sets, and make informed, data-driven decisions before submitting any update to the GitLab project.
So let me take you through our top priorities for 14.10. The first one is batched background migrations. I have already provided a few in-depth presentations of this new framework in past kickoff videos.
Batched background migrations is a new framework that we are building from the ground up with a focus on reliability, self-monitoring, and auto-tuning. We anticipate that it will be a major step forward for the stability and availability of GitLab instances of any size, while making most background data operations complete considerably faster in most cases. There is a lot more information in the related epic.
Our work in 14.10 is driving us towards the general availability of the batched background migrations framework. We are already working with a few groups internally that use batched background migrations for very large, difficult migrations.
We even had a small incident yesterday on GitLab.com related to write-ahead log (WAL) replication. At that moment we had a batched background migration running, and because it was a batched background migration, the engineers on call were able to pause the migration, let the system recover, and then restart the migration without causing any problems and without a hiccup. So in 14.10 we are driving towards general availability very quickly.
First of all, support for multiple databases. This is something we have been working on for the whole of 14.9,
and it should be completed in a few days. Then we are going to work on a few last core updates: for example, updating our scheduling and auto-tuning algorithm to be able to figure out, for background jobs that fail, how they fail, whether they error out or they time out. If they time out, which means that the batch size may be too large, we want to be able to split them into smaller chunks and execute them as smaller jobs; a minimal sketch of that idea follows.
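This is a sketch of the timeout-based splitting just described, assuming a failed batch is simply retried as two half-sized jobs; `FailedBatch`, `split_on_timeout`, and `MIN_BATCH_SIZE` are illustrative names, not GitLab's implementation.

```ruby
require 'timeout'

MIN_BATCH_SIZE = 100

FailedBatch = Struct.new(:start_id, :end_id, :error) do
  def size
    end_id - start_id + 1
  end
end

# Errors other than timeouts are re-raised: shrinking the batch only helps
# when the batch was too large to finish in time.
def split_on_timeout(batch)
  raise batch.error unless batch.error.is_a?(Timeout::Error)
  raise 'batch already at minimum size' if batch.size <= MIN_BATCH_SIZE

  mid = batch.start_id + batch.size / 2
  [[batch.start_id, mid - 1], [mid, batch.end_id]] # two smaller jobs to enqueue
end

p split_on_timeout(FailedBatch.new(1, 10_000, Timeout::Error.new))
# => [[1, 5000], [5001, 10000]]
```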
Similarly, adding the ability to disable the auto-tuning algorithm, so that for very sensitive, very delicate background jobs whose batch size we don't want changed, we can limit them to whatever we set during scheduling. Of course, also developer documentation, which is very important for the general availability of the framework. And finally, additional optimizations to our batch optimizer, so that we can support more use cases of data operations, like partial updates, by adding custom batching strategies, etc.; a toy illustration of such a strategy follows.
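To give a feel for what a custom batching strategy could look like, here is a toy sketch that batches only the rows that actually need a partial update, so untouched regions of the table are skipped. The class and method names are invented for illustration and are not GitLab's API.

```ruby
class PartialUpdateBatchingStrategy
  def initialize(batch_size:)
    @batch_size = batch_size
  end

  # Returns the [start_id, end_id] of the next batch after `after_id`,
  # or nil once the rows needing the update are exhausted.
  def next_batch(ids_needing_update, after_id)
    batch = ids_needing_update.select { |id| id > after_id }.first(@batch_size)
    batch.empty? ? nil : [batch.first, batch.last]
  end
end

ids = (1..25).select(&:odd?) # pretend only odd ids need the partial update
strategy = PartialUpdateBatchingStrategy.new(batch_size: 5)

cursor = 0
while (batch = strategy.next_batch(ids, cursor))
  puts "update ids #{batch.first}..#{batch.last}"
  cursor = batch.last
end
```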
Our second top priority is a health check mechanism for heavy data updates, which can cause a system to start failing and having multiple issues. The idea here is to build a mechanism that will monitor the health of the production database and respond to it by lowering the rate of updates, throttling the updates, or even pausing them. This is very important for us; it is the last missing piece of the puzzle
on top of what I just discussed for batched background migrations. That's why, when we implement this mechanism, we are going to introduce it as part of the auto-tuning layer of batched background migrations. The idea there is that we are going to have real-time monitoring of the production system, with various mechanisms watching various signals that will tell us if there is a problem with the production system, and then either lower the batch sizes and the rate of updates, or even completely stop a batched background migration, wait for the system to recover, and then restart it without requiring any manual intervention. Our first step in 14.10 will be to identify those signals: what are we going to measure? Is it the rate of updates? A sketch of the overall loop follows.
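Here is a minimal sketch of the monitor-and-respond loop described above, using replication lag as the single example signal; all names and thresholds (`HealthCheck`, `LAG_SLOW_DOWN`, `LAG_PAUSE`) are assumptions for illustration, not GitLab's implementation.

```ruby
class HealthCheck
  LAG_SLOW_DOWN = 30   # seconds of replication lag at which we shrink batches
  LAG_PAUSE     = 120  # seconds of lag at which we pause entirely

  def initialize(lag_probe)
    @lag_probe = lag_probe # a callable returning the current replication lag
  end

  def advise(batch_size)
    lag = @lag_probe.call
    return [:pause, batch_size]      if lag >= LAG_PAUSE
    return [:shrink, batch_size / 2] if lag >= LAG_SLOW_DOWN

    [:proceed, batch_size]
  end
end

# Simulated lag readings: healthy, degraded, then critical.
readings = [5, 45, 150].each
check = HealthCheck.new(-> { readings.next })

batch_size = 10_000
3.times do
  action, batch_size = check.advise(batch_size)
  puts "#{action}: next batch size #{batch_size}"
end
# proceed: next batch size 10000
# shrink: next batch size 5000
# pause: next batch size 5000
```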
Our third top priority is automatic database testing using clones. This is a project that is already fully running internally in GitLab, and it is an amazing capability: we are shifting left our ability to preemptively find database-related regressions and performance issues by testing all database updates against a production clone of the GitLab.com database. So this is amazing: any update that we make right now is automatically executed and tested in a CI pipeline against a clone of our production system, and we have already completed most of the coverage.
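To make the idea concrete, here is a hypothetical sketch of the core of such a pipeline step: run each pending migration against a freshly provisioned clone and capture its wall time for the report. The migrations here are just callables standing in for real DDL and data updates.

```ruby
require 'benchmark'

migrations = {
  'add_index_on_notes' => -> { sleep 0.2 },  # stands in for index creation
  'backfill_routes'    => -> { sleep 0.05 }, # stands in for a data update
}

report = migrations.map do |name, migration|
  seconds = Benchmark.realtime { migration.call }
  { name: name, seconds: seconds.round(2) }
end

# In the real pipeline, a report like this is posted back on the MR.
report.each { |r| puts "#{r[:name]}: #{r[:seconds]}s" }
```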
We have covered regular migrations and post-deployment migrations; those are migrations that run after the code deployment in no-downtime updates. The next step, which we already started working on in 14.9 and completed the first part of, is support for background migrations.
We have already completed support for regular background migrations, our existing background migrations, and in 14.10 we are going to add support for testing batched background migrations. Then we are going to update the way we return results on the MR to the GitLab developers. And finally, there will be another step of testing that will allow us to estimate how long a batched background migration will run, because there is a problem you can run into:
you may block the auto-tuning, cap the migration with a max batch size, and by mistake set it to a very low value, let's say a thousand records per two minutes. This is very low, and it could result in the migration running for 30 days, and we don't want migrations running for 30 days. So we will add checks for that as well and inform developers that what they are doing may look safe, but it will result in an update that lasts for 30 days, and they will have to wait 30 days for it to finish. A rough version of that estimate is sketched below.
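The arithmetic behind that check is simple: one batch runs per interval, and rows divided by batch size gives the number of batches. At 1,000 records every 2 minutes, that is 720,000 records per day, so a table of about 21.6 million rows (an invented figure, chosen so the example comes out even) takes 30 days.

```ruby
# Rough runtime estimate: one batch per interval, rows / batch_size batches.
def estimated_days(rows:, batch_size:, interval_minutes:)
  batches = (rows.to_f / batch_size).ceil
  (batches * interval_minutes) / (60.0 * 24)
end

puts estimated_days(rows: 21_600_000, batch_size: 1_000, interval_minutes: 2)
# => 30.0
```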
Finally, in addition to all the other steps, we are also going to test the background migration scheduling logic.
This is a very special use case, and we have seen a few problems with it in the past, because when you're scheduling a background job, the job is created and enqueued, and you can set when it gets out of the queue.
So we have had some problems in the past with very specific parameters on background jobs that passed and could execute without issues on development or testing environments, but when the migration reached production, they failed for very specific reasons. So this is one more layer of protection for background migrations that are shipped to production.
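For context, this follows the usual Sidekiq pattern GitLab builds on: a job's arguments are serialized when it is enqueued and only used when a worker picks it up later, which is exactly the window where environment-specific parameters can go wrong. `perform_in` is real Sidekiq API; the worker and its arguments below are invented for illustration.

```ruby
require 'sidekiq'

class BackfillWorker
  include Sidekiq::Worker

  # Arguments arrive exactly as serialized at scheduling time, so they must
  # hold for production data, not just for a development database.
  def perform(start_id, end_id)
    puts "backfilling rows #{start_id}..#{end_id}"
  end
end

# Enqueued now, executed two minutes later: the parameters are frozen into
# the queue here, long before the job actually runs.
BackfillWorker.perform_in(2 * 60, 1, 10_000)
```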
Finally, our last top priority is the data dictionary. I have also discussed it in past kickoff videos. Very quickly: we have more than 400 tables, and we want to label them using some metadata, so that we can record who the owner is, that is, which group owns the table, along with additional metadata like a description or other data classification metadata.
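As a sketch of what an entry and the lookup it enables could look like (the fields and values are assumptions, not the final schema):

```ruby
# Hypothetical shape of a data dictionary entry.
DICTIONARY = {
  'ci_builds' => {
    owner_group: 'Verify::Pipeline Execution',
    description: 'Jobs executed as part of CI/CD pipelines',
    classification: 'internal',
  },
}.freeze

# During an incident, the lookup we need is table name -> owning group.
puts DICTIONARY.dig('ci_builds', :owner_group)
```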
We want to start by labeling the tables themselves. This is the most impactful change for us, because in case of an incident or a bug report it will help us very quickly figure out who the owner of the table with the domain-specific knowledge is, and address the issue as fast as possible. So we're going to work on adding support for labeling in 14.10 and add some documentation.