From YouTube: GitLab 13.10 Kickoff - Enablement:Database
Description
Kickoff for the Database Group for the GitLab 13.10 release
Planning issue: https://gitlab.com/gitlab-org/database-team/team-tasks/-/issues/134
Integers are 4-byte integers. They have a maximum value of about 2.15 billion numbers that they can describe. So in GitLab, almost all tables, up to two years ago, used integers as primary keys, and we have started observing some of those tables going above one billion records.
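For reference, the arithmetic behind that 2.15 billion figure, and the capacity-used percentage mentioned next, can be sketched like this (the 1 billion current id is an illustrative number, not a measurement):

```python
# A 4-byte signed integer tops out at 2**31 - 1.
INT4_MAX = 2**31 - 1
print(INT4_MAX)  # 2147483647, i.e. ~2.15 billion

# Hypothetical: capacity used by a table whose largest id is 1 billion.
current_max_id = 1_000_000_000
capacity_used = current_max_id / INT4_MAX
print(f"{capacity_used:.0%}")  # ~47%
```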
So, as you can see here by taking the capacity used as a percentage: why is that important for us to monitor, and why do we want to address it? If we reach the maximum integer value, that's a catastrophic event: we won't be able to add any more records to those tables. So we want to address those problems as early as possible. When I'm discussing this, it's not like this is a problem that we're going to have next week; this is the trend for the past month.
We can see that they go up, but they are not going up at a crazy rate. Still, because this can lead to a catastrophic event, we want to address it as early as possible. The second part there is that when we are discussing converting the tables, converting the primary keys, this is not only about the primary keys; it is also about all the foreign keys that reference those primary keys. That means, for example, in ci_builds, that we have to convert those as well.
So we already have the migration helpers, the tooling necessary to do that. But what we are trying to address right now is the infrastructure, our framework for running background migrations. At the gitlab.com scale, whenever we are discussing such a table, those are terabyte-sized tables, more or less. So the only way to approach those updates is to batch them, let's say 1,000 records at a time: split those updates into 1,000 or 10,000 record batches and have a separate job make the update for each batch.
So what we are doing right now, in general, with this framework: we are generating the batches, creating background jobs, scheduling them, and letting our queuing and scheduling framework throttle and run them. What's the problem? When you have a billion records, even if each job is updating ten thousand records at a time, you will need something like a hundred thousand jobs. That's a lot of jobs that have to be scheduled, wait in the queue, and then be processed one at a time.
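The batch arithmetic above can be sketched as follows; `batch_ranges` is a hypothetical helper for illustration, not GitLab's actual migration code:

```python
def batch_ranges(max_id, batch_size):
    """Yield (start, end) id ranges covering ids 1..max_id."""
    for start in range(1, max_id + 1, batch_size):
        yield start, min(start + batch_size - 1, max_id)

# A billion-row table in 10,000-record batches needs ~100,000 jobs.
jobs = sum(1 for _ in batch_ranges(1_000_000_000, 10_000))
print(jobs)  # 100000
```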
We want to have a new way of scheduling migrations that will not require a hundred thousand jobs to be scheduled beforehand and the batches pre-computed. More importantly, we want to find a way to dynamically define the batch size, because one more very difficult problem there is that, when you start, depending on the hardware, you may think that the best batch size to work on is 10,000 records, and then find that you could run 30,000 records at a time and it would be okay.
A
You
don't
want
to
lose
that
the
difference
or
you
may
find
out
that
10
000
records
at
the
time
is
too
much
and
could
bring
a
production
system
down.
So
you
have
to
very
quickly
go
down
to
1
000
at
a
time,
so
we
want
to
make
those
migrations
configurably
dynamic,
so
that
we
can
change
migrations,
the
bad
sizes
on
the
fly
and,
more
importantly,
we
also
want
to
make
those
parallelizable.
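A minimal sketch of what dynamically adjusting the batch size could look like; the `next_batch_size` helper, its thresholds, and its limits are illustrative assumptions, not the actual implementation:

```python
def next_batch_size(current, runtime_s, target_s=1.0,
                    min_size=1_000, max_size=30_000):
    """Grow the batch while updates run fast; shrink hard when they run slow."""
    if runtime_s > 2 * target_s:
        proposed = current // 2        # back off quickly under load
    elif runtime_s < 0.5 * target_s:
        proposed = int(current * 1.5)  # ramp up while there is headroom
    else:
        proposed = current
    return max(min_size, min(max_size, proposed))

print(next_batch_size(10_000, 0.2))  # fast batch -> 15000
print(next_batch_size(10_000, 5.0))  # slow batch -> 5000
```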
That means: why wait for the jobs to run one after the other when, updating for example primary keys, you can update 10 different chunks of that table in parallel, as those do not affect each other?
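The idea of updating disjoint chunks in parallel can be sketched like this, with an in-memory dict standing in for a table (this is an illustration, not the real migration framework):

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical table: id -> row, with a bigint column to backfill.
table = {i: {"id": i, "id_bigint": None} for i in range(1, 101)}

def update_chunk(start, end):
    # Each job touches a disjoint id range, so jobs cannot conflict.
    for i in range(start, end + 1):
        table[i]["id_bigint"] = table[i]["id"]

# Split 100 rows into 10 disjoint chunks and run them in parallel.
chunks = [(start, start + 9) for start in range(1, 101, 10)]
with ThreadPoolExecutor(max_workers=10) as pool:
    for start, end in chunks:
        pool.submit(update_chunk, start, end)

assert all(row["id_bigint"] == row["id"] for row in table.values())
```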
A second thing is that everything we are working on is in order to make online updates. If we could take all of gitlab.com offline, things would be much simpler, but we are always focusing on online updates. You can find more details in our epics and in this specific issue. So, a couple of additional things we want to work on: we also want to add an issue for forecasting the integer overflow date for all relevant tables.
We want to have this forecast, based on our current growth, automated and repeatable on gitlab.com for all tables, so that we are sure we can monitor and know exactly how things go. And the final issue that we plan to address during 13.10: to be better prepared, an emergency plan was also created. Our plan is to address everything beforehand and never reach the limit, but let's say that something happens and we do reach the limit: how can we address it?
We have researched approaches that range from using negative primary keys, to using table inheritance (which, in our case, is not very useful), to infrastructure solutions, like whether we can set up a second cluster and use logical replication for the updates, and more things that we want to research and figure out. Those are contingency plans, but we have to have them; better prepared, better safe than sorry.
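For reference, the headroom that the negative-primary-keys contingency would buy can be sketched as:

```python
# A signed 4-byte integer spans -2**31 .. 2**31 - 1; ids normally only
# use the positive half, so the negative half is untouched headroom.
INT4_MAX = 2**31 - 1   # ids 1 .. 2_147_483_647 in normal use
negative_ids = 2**31   # ids -2_147_483_648 .. -1, normally unused

# Falling back to negative keys would roughly double the id space.
assert negative_ids > INT4_MAX
```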
So that's it on the primary key conversion. The second epic that we are focusing on is the automated database migration testing. This is our top internal feature priority. We are moving very fast: we have moved at the pace we had planned in 13.9, we already have an initial MVC available, and we have a plan to extend its use to all database maintainers by the start of 13.10.
So let me show you, for example, our current state before we start working on 13.10. You can already see this working in the gitlab-org/gitlab project: for all migrations that are included in a merge request, we can check how much time they took to run, whether they ran successfully, and also what the database size change was. So, for example, here is a big migration that dropped the audit_events table, the old events table.
It took some time to run, because dropping such a large table takes time, but we can also see the effect on our database size: we gained something like 160 gigabytes. Or here you can see an index update on another table that took 15 seconds and added 70 megabytes. And this is the case where everything works as expected. We also have the case where something is not working, so we can see here (and this is a real-world example) what we figured out.
We found out that adding a specific index was breaking things. You can see here that we get feedback that it's not working, and we understood why: a database maintainer like myself can go behind the scenes to what runs, and if I check the db migrations job I can see that, for example, when we tried to add this index, we were getting a statement timeout. Those 15 seconds, for example, is the statement timeout in GitLab's production; we had issues and had to iterate with another covering index in order to solve that problem. And we are very optimistic about the new cases, about how we are going to use it; we are going to iterate very fast.
We also want to add some summaries with all the queries that were executed. For example, here we have them per migration, but we want to also get some summaries: these were the queries, and this is how much time each query took.
That's the average time, that's the total time, because you executed that query 10,000 times. Those are our plans for 13.10, together with extended testing and integration using the internal team and all the database maintainers, so that we can iterate and improve, and then start adding more features like, for example, background migration testing and more.
What happens there? Starting in Postgres 12, the CTEs (the WITH clauses in SQL, where you say WITH something and then use it) changed. Until Postgres 11, those CTEs were materialized. What does that mean? Postgres will create a temporary table and then use it in the rest of the query. That is not always optimal, but it can also be used as an optimization fence in order to force Postgres to create the correct plan.
In some cases, Postgres will now inline those CTEs. That can have amazing effects in some cases, but in other cases it can break query plans. So we want to mark all CTEs as MATERIALIZED, so that we keep our current behavior, and then we can iterate and switch some of those to run inlined as we go forward.
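As a sketch of the behavior being discussed, Postgres 12 makes the choice explicit with a single keyword; the `issues` table here is a hypothetical example, not a specific GitLab query:

```sql
-- Pre-12 behavior, kept explicit: the CTE is computed once up front,
-- acting as an optimization fence.
WITH recent AS MATERIALIZED (
    SELECT id FROM issues WHERE created_at > now() - interval '7 days'
)
SELECT count(*) FROM recent;

-- Postgres 12 default for a CTE referenced once: it may be inlined
-- into the outer query and planned as a whole.
WITH recent AS NOT MATERIALIZED (
    SELECT id FROM issues WHERE created_at > now() - interval '7 days'
)
SELECT count(*) FROM recent;
```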
So that's it. You can check more issues in our epic. Thank you so much for watching, and talk to you next month.