GitLab Sharding Group, 17 Sep 2021

From YouTube: GitLab 14.4 Kickoff - Enablement:Sharding

Description

Kickoff for the 14.4 release for the Sharding group.

Main epic: https://gitlab.com/groups/gitlab-org/-/epics/6168

A

Hello, everyone. My name is Fabian Zimmer, I'm a group product manager in Enablement, and this is the kickoff call for the Sharding group for GitLab 14.4, GitLab's next release. The format here is a little bit different from others, because it's a big project and there are many things happening more or less continuously.

A

So I'll walk you through all of those. But first, a reminder: what are we doing right now, and why are we doing it? The main problem that we're trying to solve is that GitLab's database, and specifically GitLab.com's database, is growing very rapidly, and we need more headroom to accommodate that growth.

A

There are multiple strategies to consider for this, including longer-term strategies that we are still discussing. But in the short term, we've settled on a strategy that we call functional decomposition, which essentially means we're going to identify a feature set, in our case CI, that contains many tables in our database, and we're going to move these tables to a separate database cluster.
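As a rough illustration of functional decomposition, the sketch below maps each table to the logical database that would own it. The table names and the `database_for` helper are assumptions made for this example, not GitLab's actual code.

```python
# Hypothetical sketch: decide which logical database owns a table.
# The table list is illustrative only, not an authoritative inventory.
CI_TABLES = frozenset({"ci_builds", "ci_pipelines", "ci_stages"})


def database_for(table: str) -> str:
    """Return "ci" for tables in the CI feature set, "main" for everything else."""
    return "ci" if table in CI_TABLES else "main"
```

Every query against a table would first consult such a mapping, so that CI traffic ends up on the second cluster.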

A

And why are we choosing CI? One of the limitations of our database is the number of writes that are coming to the database. Those are very hard to scale, and almost 50 percent of our writes are actually caused by our continuous integration features. So by moving all of our CI tables to a second database cluster, we essentially double our ability to scale writes, and that also increases the overall headroom.
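The "double" figure follows from simple arithmetic. The sketch below assumes, purely for illustration, that each cluster can absorb as many writes as the current single cluster and that write volume is the only bottleneck; `write_headroom_factor` is a hypothetical helper.

```python
def write_headroom_factor(ci_write_share: float) -> float:
    """Factor by which total write volume can grow after the split.

    Assumption: both clusters have the same write capacity as the
    original single cluster, so the busier cluster is the limit.
    """
    if not 0.0 <= ci_write_share <= 1.0:
        raise ValueError("ci_write_share must be between 0 and 1")
    busiest_share = max(ci_write_share, 1.0 - ci_write_share)
    return 1.0 / busiest_share
```

With a CI share of 0.5 the factor is 2.0. If the split were less even, for example 0.4, the factor would drop to about 1.67, because the main cluster would still carry 60 percent of the writes.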

A

This is a project that has been going on for a while, because there are many different things that we need to figure out to actually make that possible.

A

Roughly speaking, there are two big chunks that we have to address. One is making the application itself ready for running and supporting many databases, because at the moment we have only one database cluster, so we don't need to handle that complexity. The other is that in production on GitLab.com, we need to run two database clusters. We need to monitor this and have all of the logging and alerting in place, and this is an infrastructure problem.

A

It's important to note that what we're talking about here is, at this moment in time, relevant for GitLab.com and has no impact on self-managed customers. So if you're on self-managed, you don't need to worry about this. We'll provide migration strategies that will be very easy to follow and not disruptive. So you are not required to do this, but for us to actually scale to 10 million users and beyond, we need to think about scaling. Okay, so for 14.4, what are we doing?

A

I'll walk through the specific work items and where we are. The first one I'll highlight very briefly, because I think it's really important.

A

One of our SREs has actually worked on creating a benchmarking environment to test our migration strategy from a single database cluster to two clusters, and all of the specific steps that are required to actually do this with minimal downtime in production.

A

So that's already in place, but there are a few things to be ironed out, so this is ongoing. Secondly, and I think this is important, there are quite a few items along the lines of "fix broken Enablement features", "fix broken security features", or "fix broken Ops features". These exist because we have to make the application ready for two databases, or many databases.

A

Some of the existing features in GitLab are actually breaking, so we fanned out many of these issues to the relevant teams, and we are supporting them to actually solve these issues. For example, if we look at the Enablement features here, which we also talked about a little in the database kickoff call, many of them have to do with our database tooling. So, for example, we have to fix our database helpers to ensure that they actually support many databases. We have to make sure that all partitioning works with many databases.

A

These are ongoing work items that are going to continue into 14.4.

A

Next up, and I think this is really exciting, we have an ongoing project in infrastructure to create repeatable deployments of database environments. This is not specific to decomposition, but it is really important to have. What Jarv has also done for us is make it possible to provision a second database cluster, so that we can use it to test some of these features in the sandboxing environment.

A

This is going to be really important. For many of the changes that we make now, we can of course see what breaks in our CI pipelines, but we need a first sandbox where we can actually have two database clusters, configure that, run GitLab, and see how all of that behaves. That's the first step to actually get us into our staging environment, and then from staging into production.

A

So this should be ready at the beginning of 14.4, and we'll start spinning up the sandboxing environment. Next up, I'll zoom into the support for many databases, because that's at the heart of what's going on. This is often a little bit more general: the CI tables are the specific functional set that we are targeting because of their properties, but the many-database support is more general. It could be any feature.

A

Many of the things that we need to do are a little bit more generic, and there is a tension between finding more general solutions and making sure that in some cases we also take a small, iterative approach specific to CI. This is something that the team is often discussing. Specifically here, with support for many databases, there are two items that I would like to highlight.

A

One is generalizing our load balancing usage to allow many databases. This is something that one of the team members, uric, is working very actively on: essentially adapting our load balancing code in such a way that it knows there are many databases, or in our case two databases, one main database and one CI database, so that we understand exactly how that happens and the connections can actually be balanced between these different databases.
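To make the idea concrete, here is a minimal sketch of load balancing that is aware of more than one database: each logical database gets its own balancer with its own primary and replicas. The class, host names, and round-robin policy are all assumptions for illustration, not a description of GitLab's load balancing code.

```python
import itertools


class LoadBalancer:
    """Hypothetical per-database balancer: writes go to the primary,
    reads rotate round-robin over that database's own replicas."""

    def __init__(self, primary: str, replicas: list[str]):
        self.primary = primary
        # Fall back to the primary when no replicas are configured.
        self._replicas = itertools.cycle(replicas or [primary])

    def host_for(self, write: bool = False) -> str:
        return self.primary if write else next(self._replicas)


# One balancer per logical database (host names are made up).
BALANCERS = {
    "main": LoadBalancer("main-primary", ["main-replica-1", "main-replica-2"]),
    "ci": LoadBalancer("ci-primary", ["ci-replica-1"]),
}


def route(database: str, write: bool = False) -> str:
    """Pick a host for a query against the given logical database."""
    return BALANCERS[database].host_for(write)
```

The point of the design is that a read against a CI table can never land on a replica of the main database, and vice versa.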

A

This is being worked on and is really critical, and we hope to actually finish up some of the most important work in 14.4.

A

Another thing is a work item that has to do with transaction handling and two-phase commits.

A

I think this is actually quite interesting if you think about it, and janis would probably shake his head at some point when I'm oversimplifying what's going on here. If you're working within a single database, you have transactions and you have guarantees for these transactions. But now, with two databases that were one database before, you still have foreign keys and you have deletes, and you need to manage a situation where, for example, you are deleting an entry in a table on one database that something in another database depends on. That needs to be managed, and we need to make sure that these things actually happen, and happen in a way that is scalable and performant.

A

But you're essentially now dealing with these two separate databases, and the application itself needs to manage some of that complexity.

A

That's really critical. For example, this is relevant for deletes that usually cascade through different tables: you delete, let's say, a project, and that then deletes a few CI things in the other database. That needs to be addressed and handled, and so Adam is working with others in the team to actually create some of this tooling. This is really important and we're working on it right now. Lastly, we also have an ongoing effort to remove the foreign key constraints between CI tables and the rest of the database.
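One possible way to handle such a delete without a cross-database foreign key is to record the deletion in the same local transaction and clean up the other database afterwards. The sketch below uses two in-memory SQLite databases as stand-ins for the main and CI clusters. The schema, the `deleted_records` table, and the function names are assumptions for this example; the talk does not say which mechanism the team will use.

```python
import sqlite3

# Two independent connections stand in for the two clusters.
main_db = sqlite3.connect(":memory:")
ci_db = sqlite3.connect(":memory:")

main_db.execute("CREATE TABLE projects (id INTEGER PRIMARY KEY)")
main_db.execute("CREATE TABLE deleted_records (table_name TEXT, record_id INTEGER)")
ci_db.execute("CREATE TABLE ci_pipelines (id INTEGER PRIMARY KEY, project_id INTEGER)")


def delete_project(project_id: int) -> None:
    """Delete the project and record the deletion in ONE local transaction."""
    with main_db:  # commits on success, rolls back on error
        main_db.execute("DELETE FROM projects WHERE id = ?", (project_id,))
        main_db.execute(
            "INSERT INTO deleted_records VALUES ('projects', ?)", (project_id,)
        )


def cleanup_ci() -> int:
    """Background step: remove CI rows whose parent project was deleted.

    Safe to retry: a record is only dropped after the CI delete committed.
    """
    pending = main_db.execute(
        "SELECT rowid, record_id FROM deleted_records WHERE table_name = 'projects'"
    ).fetchall()
    for rowid, project_id in pending:
        with ci_db:
            ci_db.execute("DELETE FROM ci_pipelines WHERE project_id = ?", (project_id,))
        with main_db:
            main_db.execute("DELETE FROM deleted_records WHERE rowid = ?", (rowid,))
    return len(pending)
```

The trade-off in this sketch is that the CI rows disappear eventually rather than atomically with the project, so readers of the CI database must tolerate rows whose parent is already gone.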

A

We also have a proof-of-concept MR that we try to keep green and rebased on master. That tells us how things are actually breaking and where the areas are that we still need to fix, and it's an ongoing effort to shepherd it.

A

And it's really important also to prevent new things from going wrong, because we're making all of this effort to separate out the CI tables, and we need to give developers the tools to know that certain things are no longer possible, so that they don't create new features that are then incompatible with the decomposed world. This is something that we're also investigating.
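One form such developer tooling could take is a check, run in tests, that rejects any query touching tables from both databases. The sketch below is a deliberately naive illustration: the table list, the regular expression, and `check_query` are assumptions, and a real tool would need a proper SQL parser.

```python
import re

# Illustrative table list only.
CI_TABLES = frozenset({"ci_builds", "ci_pipelines"})

# Naive: picks up the identifier after FROM or JOIN; ignores subqueries,
# quoting, schemas, and comments.
TABLE_REF = re.compile(r"\b(?:FROM|JOIN)\s+([a-z_][a-z0-9_]*)", re.IGNORECASE)


class CrossDatabaseQueryError(Exception):
    """Raised when a single query spans the main and CI databases."""


def check_query(sql: str) -> set[str]:
    """Return the logical databases a query touches; raise if more than one."""
    tables = {name.lower() for name in TABLE_REF.findall(sql)}
    databases = {"ci" if table in CI_TABLES else "main" for table in tables}
    if len(databases) > 1:
        raise CrossDatabaseQueryError(f"query joins across databases: {sorted(tables)}")
    return databases
```

Wired into a test suite, a check like this would fail a new feature that joins `projects` to `ci_pipelines` before it ever reaches production.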

A

I spoke about the dependencies on other stages, and that is ongoing. There are still a lot of things to fix, and many of them are fanned out.

A

One of the things that we are actively thinking about right now is also how we can actually move into staging. Staging is the environment that we deploy changes to before going into production, but it is really important to keep staging stable as well, so we can't break it. So we need to think about strategies for taking the changes that we're making, moving some of them into staging, and testing some of them out, while having the confidence that we can roll back.

A

So that's an ongoing discussion, and that's also for 14.4. As you can see, there are many things going on. It's a very iterative, continuous way of working, and the team is making really good progress. To wrap it up: at the end of all of this, we are hoping to have rolled out two database clusters in production, where one database, the main database, holds all of the tables for the application except for the CI tables, which will live on a separate database cluster. That's the end goal, and all of these steps lead in that direction.

A

Thank you very much for listening. That's the kickoff call for the sharding group. Thank you.