From YouTube: GitLab 16.1 Kickoff - Database
A: Hello everybody, my name is Roger, and this is the Database group 16.1 planning kickoff meeting. With me today is Alex. Hello. So, let's go through the issue for 16.1. The Database group is going to be operating largely at full capacity. We continue to have one full-time-equivalent engineer assigned as a stable counterpart to support data store solutions for AI-related initiatives.
A: In the last milestone, we supported the deployment of a pgvector data store for experimental AI embeddings. I think this is currently being used by a large variety of teams to develop AI features for GitLab SaaS.
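(As an aside, a minimal sketch of what an experimental pgvector-backed embeddings store like the one mentioned above could look like. The table name, columns, and vector size are hypothetical, not GitLab's actual schema; only the pgvector extension and its distance operator are assumed.)

    # Minimal sketch, assuming PostgreSQL with the pgvector extension installed.
    # "ai_embeddings" and its columns are made up for illustration.
    import psycopg2

    conn = psycopg2.connect("dbname=embeddings_demo")  # hypothetical DSN
    with conn, conn.cursor() as cur:
        cur.execute("CREATE EXTENSION IF NOT EXISTS vector;")
        cur.execute("""
            CREATE TABLE IF NOT EXISTS ai_embeddings (
                id        bigserial PRIMARY KEY,
                content   text NOT NULL,
                embedding vector(3)   -- real embeddings are much wider, e.g. 1536 dims
            );
        """)
        cur.execute(
            "INSERT INTO ai_embeddings (content, embedding) VALUES (%s, %s::vector);",
            ("example document", "[0.1, 0.2, 0.3]"),
        )
        # Nearest-neighbour lookup by L2 distance to a query embedding.
        cur.execute(
            "SELECT id, content FROM ai_embeddings ORDER BY embedding <-> %s::vector LIMIT 5;",
            ("[0.1, 0.2, 0.25]",),
        )
        print(cur.fetchall())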
A: Moving on into 16.1, we have two top-priority items. Both of them are continuing our overall theme of helping to support database scalability. The first item here is partitioning strategies. Our overall goal here is to reduce the impact of individual table sizes, because we know that large tables consume large amounts of CPU. Alex?
B: Yeah, so as Roger mentioned, our primary concern at this point is vacuum activity, and table size correlates pretty well with that. Still, we're going to take a look at some of the vacuum activity metrics and make sure that we've got our priorities in the right order. We do know for sure that the events and merge request diff tables are high priority even with those metrics in mind, but we'll be using them to re-evaluate priorities for any of the other tables.
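(A minimal sketch of the kind of vacuum-activity check being described here, using PostgreSQL's built-in statistics views; the ordering and column choice are illustrative, since the exact metrics the team uses aren't spelled out in the meeting.)

    # Minimal sketch: rank tables by dead tuples and recent autovacuum activity
    # to sanity-check partitioning priorities. Assumes plain PostgreSQL stats views.
    import psycopg2

    QUERY = """
        SELECT relname,
               n_dead_tup,
               n_live_tup,
               autovacuum_count,
               last_autovacuum,
               pg_total_relation_size(relid) AS total_bytes
        FROM pg_stat_user_tables
        ORDER BY n_dead_tup DESC
        LIMIT 20;
    """

    conn = psycopg2.connect("dbname=gitlabhq_production")  # hypothetical DSN
    with conn, conn.cursor() as cur:
        cur.execute(QUERY)
        for row in cur.fetchall():
            print(row)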
B: For the events table, we identified a DRI working with the Create stage, so we're excited to work with them to help get the events table partitioned. And then with the Code Review team, we determined we can delete a substantial amount of data from those tables, bringing them well within our 100-gigabyte guideline. So we'll...
A: Yeah, for sure. And then this next point here, updating our primary keys to bigint, has been something we've been working on for the last few milestones. I think we're making decent progress, but again, we are seeing some concerns around the same events table we referred to in our last bullet point. Just due to the size of this table alone, it's pretty hard to make meaningful changes quickly. I think, overall, we were going to work on improving monitoring for this in the last milestone, but some of this work has spilled over because of some current incidents that we are working to resolve.
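(A minimal sketch of the usual zero-downtime integer-to-bigint conversion pattern being discussed: add a shadow bigint column, keep it in sync with a trigger, backfill in batches, and swap later. The SQL below is a generic illustration of that pattern, not GitLab's actual migration helpers.)

    # Minimal sketch of an int -> bigint primary key conversion done online:
    # 1) add a bigint shadow column, 2) sync new writes via trigger,
    # 3) backfill existing rows in small batches, 4) swap columns/indexes later.
    # The "events" table and function names are illustrative.
    import psycopg2

    SETUP = """
        ALTER TABLE events ADD COLUMN id_convert_to_bigint bigint DEFAULT 0 NOT NULL;

        CREATE OR REPLACE FUNCTION sync_events_id_bigint() RETURNS trigger AS $$
        BEGIN
            NEW.id_convert_to_bigint := NEW.id;
            RETURN NEW;
        END;
        $$ LANGUAGE plpgsql;

        CREATE TRIGGER events_id_bigint_sync
            BEFORE INSERT OR UPDATE ON events
            FOR EACH ROW EXECUTE FUNCTION sync_events_id_bigint();
    """

    BACKFILL = "UPDATE events SET id_convert_to_bigint = id WHERE id BETWEEN %s AND %s;"

    conn = psycopg2.connect("dbname=bigint_demo")  # hypothetical DSN
    with conn, conn.cursor() as cur:
        cur.execute(SETUP)
        conn.commit()
        cur.execute("SELECT COALESCE(MAX(id), 0) FROM events;")
        max_id = cur.fetchone()[0]
        batch = 10_000
        for start in range(1, max_id + 1, batch):
            cur.execute(BACKFILL, (start, start + batch - 1))
            conn.commit()  # keep each backfill transaction short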
B: I'll also note, on the number there: we wrote this issue a couple of weeks ago, and we're actually at 37 now for the events table migration, after...
A: Sorry, go ahead. Yeah. So then, in addition to our top priorities, there are a few focus items our team is continuing to work on. First up here is removing old migrations. I think this is really what's caused some of our recent interruptions and unexpected delays. Do you want to touch on some of those, Alex?
B: Yeah, you know, in 15.11 we merged one of the migration squashes that we thought was going to be safe, but it turns out we accidentally squashed a more current migration with a very old timestamp that got merged very late. So we've had to revert that and then reapply it, and now we're re-evaluating the squashing technique to make sure that we don't do that again in the future.
B: So John has had some good ideas, and there are some good ideas from the team, about how we can make that really safe going forward, so I think we're on our way out of the woods there. But we're still wrapping up the latest upgrade bugs, so, yeah.
A: I mean, we're getting there, I think, generally. The idea here is we made some assumptions, and we found that those assumptions do not hold in all scenarios. So now we're just expanding our coverage to have a very robust system going forward. Right, cool. And then next up here, Matt has been working on automated database testing for some time. I know this has been a little bit delayed due to some competing priorities from the AI work, which Matt is our stable counterpart supporting as well.
B: And the other person who was working on this at the moment is John, who has been delayed by the migration incident. So we're a little short-staffed on that right now, but we're hoping to make good progress on this again in 16.1.
A: Cool. And then lastly, these two also kind of relate to improving our overall system stability. I think background processing is one of these things; some of the areas we've seen incidents in is when we have background workers processing a large amount of data, so this is an opportunity we see to improve some of these things. And similarly, replacing post-deployment migrations and how migrations go out of order. Alex, do you want to touch on some of the technical implications of how these areas will help us with stability?
B: Yeah. So for the safe background processing, what we're hoping to do is take a throttling mechanism we introduced for background migrations and move it into the general background worker queue. That will help us prevent a class of incidents we've had, where background workers are processing large amounts of data, sometimes updating records or deleting records, like cleanup workers and that sort of thing. As a result, the dead tuple accumulation can cause replication lag, and when the replication lag gets super high, we actually see problems. One example of this: a background worker recently went through and marked a bunch of things as okay to clean up, another worker came through and cleaned them up, and because we deleted so much data so quickly, all of the read replicas got too far behind and were removed from the pool. Once they were removed from the pool, all traffic went to the primary database, and it promptly ran out of CPU and crashed. So what we're hoping to do is use the same throttling mechanisms that we have for background migrations on this kind of worker. That way, we can watch table statistics to make sure that the WAL pending queue isn't getting too high, or that sort of thing, and we can really make sure that we're operating them safely.
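(A minimal sketch of the kind of throttled batch worker being described: before each batch it checks replication lag reported by the primary and backs off when the replicas are falling behind. The table name, thresholds, and DSN are hypothetical, not GitLab's actual throttling implementation.)

    # Minimal sketch: a batch-deleting worker that pauses whenever replica lag
    # exceeds a threshold, keeping each transaction short.
    import time
    import psycopg2

    MAX_LAG_SECONDS = 60     # hypothetical back-off threshold
    BATCH_SIZE = 1_000

    def max_replica_lag_seconds(cur):
        # replay_lag is reported per standby in pg_stat_replication (PostgreSQL 10+).
        cur.execute("""
            SELECT COALESCE(MAX(EXTRACT(EPOCH FROM replay_lag)), 0)
            FROM pg_stat_replication;
        """)
        return float(cur.fetchone()[0])

    def delete_batch(cur):
        # Illustrative cleanup: remove a small batch of rows per iteration so
        # dead tuples accumulate gradually and vacuum can keep up.
        cur.execute("""
            DELETE FROM expired_records
            WHERE id IN (SELECT id FROM expired_records ORDER BY id LIMIT %s);
        """, (BATCH_SIZE,))
        return cur.rowcount

    conn = psycopg2.connect("dbname=gitlabhq_production")  # hypothetical DSN
    with conn, conn.cursor() as cur:
        while True:
            if max_replica_lag_seconds(cur) > MAX_LAG_SECONDS:
                time.sleep(30)           # back off and let the replicas catch up
                continue
            if delete_batch(cur) == 0:   # nothing left to clean up
                break
            conn.commit()                # commit per batch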
A: So the overall theme here, I think you're hearing, is that we have a large amount of data on GitLab. We want to be able to change and move and update it very rapidly, but we do have fundamental technical constraints on how quickly we can move. We hope that with these types of improvements and system stability investments, we're able to make quicker changes and bigger changes more safely, and as a result improve our overall velocity of development, deployment, and scalability over time.