From YouTube: GitLab 15.3 Kickoff - Enablement:Database
Description
Kickoff for the Database Group for the GitLab 15.3 release
Planning issue: https://gitlab.com/gitlab-org/database-team/team-tasks/-/issues/259
Database Group Past Kickoff Videos: https://youtube.com/playlist?list=PL05JrBw4t0KqP3MYrcoQHrqPUqn_jJZSN
Presentation by: Yannis Roussos, Sr. Product Manager, Memory and Database Groups
Hi, I'm Yannis Roussos, the product manager of the Database group, and I'd like to take you through what we're planning for GitLab 15.3, scheduled to be released on the 22nd of August 2022. There is not a lot of new work on our side; we keep working on our top priorities. While we are pretty low on capacity, our top priority continues to be batched background migrations, and we are almost done.
The most important piece is the effort to improve batch handling. This is important for two reasons. The first is that we want to cover all cases and make batched background migrations feature complete, so that we are able to run all the types of background migrations we want to execute. The second is that batching is the core task performed by batched background migrations: those asynchronous jobs that run in the background and update data.
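To make the batching idea concrete, here is a minimal sketch of such an asynchronous batched update, assuming a psycopg2 connection and a hypothetical events table with an old_payload/new_payload column pair. GitLab's real framework lives in its Rails codebase, so this only illustrates the technique:

```python
import time
import psycopg2

BATCH_SIZE = 1_000      # rows scanned per batch (illustrative)
SUB_BATCH_SIZE = 100    # rows updated per transaction (illustrative)
PAUSE_SECONDS = 0.1     # throttle between sub-batches

def backfill(conn, start_id, max_id):
    """Walk the table in primary-key ranges and update one small
    sub-batch per transaction, sleeping between sub-batches so the
    migration never monopolizes the database."""
    for batch_start in range(start_id, max_id + 1, BATCH_SIZE):
        batch_end = min(batch_start + BATCH_SIZE - 1, max_id)
        for sub_start in range(batch_start, batch_end + 1, SUB_BATCH_SIZE):
            sub_end = min(sub_start + SUB_BATCH_SIZE - 1, batch_end)
            with conn, conn.cursor() as cur:
                # Example data operation: copy one column into another.
                cur.execute(
                    """
                    UPDATE events
                       SET new_payload = old_payload
                     WHERE id BETWEEN %s AND %s
                       AND new_payload IS NULL
                    """,
                    (sub_start, sub_end),
                )
            time.sleep(PAUSE_SECONDS)
```

The small per-transaction sub-batch keeps row locks short-lived, and the sleep between sub-batches is what later lets a runner throttle or pause the whole job in response to database health signals.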
So, that they can use batched background migrations: what's the problem here? We have a lot of helpers and libraries, for example for doing database operations and partitioning; helpers that rename columns, move columns, copy data from one column to another, change the data type of a column, and many more. All those database operations that also perform data operations are not using batched background migrations right now. So the idea, now that the batched background migrations framework has matured and we really trust it, is to rewrite all our database libraries to use batched background migrations, which will make all our operations way more efficient and also way better monitored.
The final one is improving the self-managed experience. This is important for us: whatever we can do to make running a GitLab instance as seamless as possible, and also to help GitLab instance administrators address issues during updates, etc.
So this is our work on monitoring production database clusters and responding in real time to changing conditions: by throttling our background jobs and the rate at which we update data, or even stopping and pausing those background jobs when we see that there is a big problem on a production system, so that we don't affect it any more.
We have two initiatives that we were working on throughout 15.1 and 15.2, and they are ready to be released. The first action that we are going to release is pausing a migration while autovacuum is running on a table. Autovacuum is the most important maintenance operation that a PostgreSQL server runs.
We don't want to run data migrations at the same time as a vacuum runs, so the idea is: check whether autovacuum is running, pause any data migrations on that table, wait for it to finish, and then restart. This is completed; we have merged the code.
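As a rough illustration of such a check (my own sketch, not GitLab's implementation), PostgreSQL's pg_stat_progress_vacuum view reports running vacuums, including autovacuum workers, so a migration runner can poll it and wait:

```python
import time
import psycopg2

def autovacuum_active(conn, table_name: str) -> bool:
    """Return True if a (auto)vacuum is currently running on table_name.

    pg_stat_progress_vacuum has one row per running VACUUM, including
    those started by the autovacuum daemon."""
    with conn.cursor() as cur:
        cur.execute(
            """
            SELECT 1
              FROM pg_stat_progress_vacuum
             WHERE relid = %s::regclass
            """,
            (table_name,),
        )
        return cur.fetchone() is not None

# A migration runner could pause while the check returns True:
#   while autovacuum_active(conn, "events"):
#       time.sleep(60)
```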
The second one: we want to pause the migration and give the system and the archival process some time to archive all the remaining WAL segments. This is also almost done; it is in review. We expect to release it, test and validate it in 15.3, and then release it to self-managed instances.
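A hedged sketch of estimating the archiver backlog: compare the WAL file currently being written with the last file pg_stat_archiver reports as archived. The segment arithmetic below assumes the default 16 MB WAL segment size and a single timeline; the actual check GitLab ships may differ:

```python
import psycopg2

WAL_SEGMENT_SIZE = 16 * 1024 * 1024            # PostgreSQL default
SEGMENTS_PER_LOG = 0x100000000 // WAL_SEGMENT_SIZE  # 256 for 16 MB

def _segment_number(wal_file_name: str) -> int:
    # WAL file name layout: TTTTTTTT XXXXXXXX YYYYYYYY
    # (timeline, log number, segment within log), all hex.
    log = int(wal_file_name[8:16], 16)
    seg = int(wal_file_name[16:24], 16)
    return log * SEGMENTS_PER_LOG + seg

def pending_wal_segments(conn) -> int:
    """Approximate archiver backlog in WAL segments."""
    with conn.cursor() as cur:
        cur.execute(
            """
            SELECT pg_walfile_name(pg_current_wal_lsn()),
                   last_archived_wal
              FROM pg_stat_archiver
            """
        )
        current_wal, last_archived = cur.fetchone()
    if last_archived is None:
        return SEGMENTS_PER_LOG  # nothing archived yet: report a big backlog
    return _segment_number(current_wal) - _segment_number(last_archived)

# A migration runner might pause while pending_wal_segments(conn) > 32.
```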
On new initiatives: we also want to add code that pauses migrations when the Patroni apdex drops below an SLO. Patroni runs our database servers, and we have an apdex that checks their health: how well are things going on our database servers?
A
This
is
the
main
monitoring
signal
we
have
that
things
are
not
going
well
and
it's
a
very
early
lead
indicator.
So
the
moment
we
see
that
patronio
objects
going
down
so,
for
example,
in
a
gitlab.com
below
99.99.
we want to immediately stop everything, because this leading indicator means something is happening with the production system. Let the production system recover, and then restart the migrations.
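For illustration only, a stop-the-world check along these lines could poll a Prometheus-style endpoint; the endpoint URL and metric name below are hypothetical, not GitLab's actual monitoring wiring:

```python
import requests

PROMETHEUS_URL = "http://prometheus.example.com"  # hypothetical endpoint
APDEX_QUERY = "avg(patroni_apdex_ratio)"          # hypothetical metric name
SLO = 0.9999

def patroni_apdex_ok() -> bool:
    """Return False when the (hypothetical) Patroni apdex drops below
    the SLO, signalling that background migrations should pause."""
    resp = requests.get(
        f"{PROMETHEUS_URL}/api/v1/query",
        params={"query": APDEX_QUERY},
        timeout=5,
    )
    resp.raise_for_status()
    result = resp.json()["data"]["result"]
    if not result:
        return False  # no data is itself a bad signal: pause
    apdex = float(result[0]["value"][1])
    return apdex >= SLO
```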
Finally, we also want to throttle migrations when the WAL rate exceeds a threshold. If the rate of generating WAL segments exceeds the threshold, that is also an indicator that there are too many writes. There is something happening there; it's not about an incident, but some other process wants to do a lot of writes.
A
Let's
pause
the
background,
the
jobs
that
execute
the
synchronous
updates,
let
them
finish
they'll,
have
a
better
a
higher
priority
and
then
continue.
That's
it
on
this
effort.
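As a sketch of measuring the WAL generation rate (the threshold and sampling interval are illustrative assumptions), one can sample pg_current_wal_lsn() twice and convert the difference to bytes per second with pg_wal_lsn_diff:

```python
import time
import psycopg2

WAL_RATE_THRESHOLD = 50 * 1024 * 1024  # bytes/second, illustrative

def wal_bytes_per_second(conn, interval: float = 5.0) -> float:
    """Sample the current WAL insert position twice and return the
    rate at which the server is generating WAL, in bytes per second."""
    def current_lsn(cur):
        cur.execute("SELECT pg_current_wal_lsn()")
        return cur.fetchone()[0]

    with conn.cursor() as cur:
        first = current_lsn(cur)
        time.sleep(interval)
        second = current_lsn(cur)
        cur.execute(
            "SELECT pg_wal_lsn_diff(%s::pg_lsn, %s::pg_lsn)",
            (second, first),
        )
        return float(cur.fetchone()[0]) / interval

# A runner could throttle while wal_bytes_per_second(conn) exceeds
# WAL_RATE_THRESHOLD, and resume once the rate drops back down.
```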
Finally, we have this issue that we're going to finish in 15.3, about a bug in the database load balancer: when we mix writes and an idle-in-transaction timeout fires between the writes, the transaction of course rolls back on PostgreSQL.
But there is an edge case scenario where the load balancer decides to retry the transaction starting from the middle, and this is, of course, a big problem for the consistency of the stored data. We are almost done: we have a theory about why this is happening and we have a fix. We want to release it, monitor, and check whether the fix we have implemented solves the problem.
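To illustrate why retrying from the middle is dangerous, here is a generic sketch of the safe pattern, not the load balancer's actual code: on failure, the retry must re-run the whole transaction from its first statement:

```python
import psycopg2

def transfer(conn, src, dst, amount, max_retries=3):
    """Safe retry: re-run the ENTIRE transaction on failure.

    Retrying only the statement that failed (i.e. resuming from the
    middle) would re-apply the later writes without the earlier ones,
    corrupting the stored data."""
    for attempt in range(max_retries):
        try:
            with conn, conn.cursor() as cur:  # one atomic transaction
                cur.execute(
                    "UPDATE accounts SET balance = balance - %s WHERE id = %s",
                    (amount, src),
                )
                cur.execute(
                    "UPDATE accounts SET balance = balance + %s WHERE id = %s",
                    (amount, dst),
                )
            return  # committed: both writes applied together
        except psycopg2.OperationalError:
            conn.rollback()  # discard everything, then start over
    raise RuntimeError("transfer failed after retries")
```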
Finally, an internal one: we want to add one additional node to our internal testing database cluster, for testing against the new decomposed database. At the moment, we have both databases on the same database server; this is a matter of allowing us to scale our testing capabilities.