From YouTube: 2021 09 08 Sharding Group AMA
A: And we have two questions in the doc. What I'm going to do is at least vocalize the first one, then we'll see. If there are more questions, there is a slide deck that actually provides an overview of our current favored strategy for scaling our database, which is decomposition, and we'll see how it goes. If folks have specific questions we'll do that; otherwise I may go through some slides and provide more context.
A: Okay. So the first question is from Kenny, and he asks: where could we read more about the plan for decomposition? What work is required, when is it expected to be completed, and how will it be rolled out? There is a single epic, 6168; I linked it here and I also put it through the links. That is the entry point for all of the work that is being done by the Sharding group right now. It also has linked work items.
A: It's a dedicated epic and, as an example, there are also work epics for specific sections (so, for example, dev or ops) where the work is being tracked for issues that are being found, so go check that out. It's all in the epic tree. And then the second question is from Nathan. Are you here to vocalize your question?
A: Yeah, we'll follow up. I can give you the high-level understanding that I have, and I'll follow up with Andrew Thomas, who's the infrastructure product manager.
C: I have the next question. This is for the layperson, potentially: is sharding, or decomposition, a project, or is this a new ongoing approach? It's the first time I've seen the word in print, when I got invited to the AMA.
D: So my answer to that is: we identified this as the best way to move forward to buy immediate headroom, like 4 to 10x. This is what we are estimating for the decomposition, and really it depends on the outcome of these projects.
D: We may decompose another aspect, or we may consider starting sharding, because what we are doing right now is really the first step in the approach to improving the scalability of GitLab. Overall, we are heading towards more horizontal scalability of GitLab, but in order to get to that point we need to have some contained complexity of the existing components, to figure out what is the best way to approach it.
D: So there is some ongoing work on a blueprint that tries to figure out and describe where decomposition and sharding sit on the spectrum of the database scalability work; that may answer some of your questions.
C: Partially, so let me ask it a different way: can the word "sharding" become past tense? Like: great news, we went through this and we sharded GitLab; it was a tremendous project and a great success, and it was a significant effort to get us ready, and now we don't have to do it again. Or is it, from now on, in perpetuity: we will either choose one of these and it'll just be the way we design and develop the product? So, if I could just rephrase my own question, does that help?
A: That's a great question. I think you touch on something really important here. Ultimately, what we are concerned with is scaling GitLab based on our growth, and so we are evaluating several strategies for how we can scale GitLab, and sharding and decomposition are our strategies to address the scalability of our database. So with what we're doing right now, which is essentially focusing on a set of tables for our continuous integration and moving them somewhere else, there's an end date to that. We can speak of that in the past tense at some point and say: we've done this, right? We've accomplished our goal here. And at that point we can evaluate again, or do it in parallel, and say: okay, do we want to take another part of our product and do the same thing over again, or do we need to think about sharding? Sharding essentially means we select a namespace, for example, something that loosely corresponds to a customer, and we start horizontally fanning that out. And that will also have an end date.
A: If you add new customers, for example, you can do that for a very long time, and that scales; that may get us to 100x. Whereas if you decompose: CI is a very significant part of our writes to the database, so once we move that somewhere else, that's kind of done. We can't do that again for the next CI, because we only have so many functions.
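The horizontal fanning-out by namespace described here can be sketched as deterministic shard routing. This is a hedged illustration only, not GitLab's actual design; the shard count, shard names, and hash function are all assumptions:

```python
# Hypothetical sketch of hash-based shard routing by namespace.
# SHARD_COUNT, the naming scheme, and the hash choice are illustrative.
import hashlib

SHARD_COUNT = 4

def shard_for(namespace_id: int) -> str:
    """Map a namespace (roughly, one customer) to a stable shard name."""
    digest = hashlib.sha256(str(namespace_id).encode()).hexdigest()
    return f"shard_{int(digest, 16) % SHARD_COUNT}"
```

With a scheme like this, new customers spread across existing shards, but changing the shard count later requires rebalancing data, which is part of what makes sharding a multi-year effort.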
E: Yeah, so I was asking how the two kind of interrelate, right? How will decomposition impact, you know... I know we've been talking about sharding for a long time, probably ever since I've been here, which is four years, and I knew of the, you know, issue with the CI table growth, but this is the first I heard of decomposition. So, kind of like Tim: not the first I heard of sharding, but the first I heard of decomposition. So I was wondering how the two things will impact each other in the future, and if we expect our self-serve customers, self-managed customers, to use either or both of decomposition or sharding in the future.
D: So we don't know about the sharding architecture and what it's going to look like in the future. For example, in Fabian's presentation you could see that we may use application-level sharding, like horizontal, or we may want to use some database technology like Vitess or maybe Citus.
D: Vitess is usually a pretty heavy stack; it lets you run sharded, horizontally scaled databases. So I cannot answer what the future is; I can answer how we see the decomposition. The decomposition uses exactly everything that your Omnibus uses today, so from the installation perspective that's what we want this to look like: we already have a Postgres server.
D: So for vanilla installs, the simple installs, we will just provision two logical databases on the same Postgres server.
D: That already allows you to scale out very easily, to have significantly more headroom into the future. But from our perspective, it also makes it aligned with how we run it on GitLab.com, and, by design, it will be more performant because of the separation of concerns, the separation of complexity and the separation of datasets. So that's really our idea behind running it on premise: what we run on premise is what goes to GitLab.com.
D: We don't use any custom Postgres today, so we don't really have this kind of problem of running the decomposed CI on an on-premise installation. It's more about figuring out how we want to migrate existing installations over some period of time, to make it actually executable on those.
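The two-logical-databases setup described above might look roughly like the following sketch. The table list and connection URLs are hypothetical, and this is not GitLab's actual configuration:

```python
# Illustrative only: the application talks to two logical databases,
# "main" and "ci". On a vanilla install both URLs point at the same
# Postgres server; "ci" can later be repointed at a separate server.
# The table names and connection URLs below are made up for the sketch.

CI_TABLES = {"ci_builds", "ci_pipelines", "ci_job_artifacts"}

DATABASES = {
    "main": "postgresql://localhost:5432/gitlabhq_main",
    "ci":   "postgresql://localhost:5432/gitlabhq_ci",
}

def database_for(table: str) -> str:
    """Pick the logical database that owns a given table."""
    return "ci" if table in CI_TABLES else "main"
```

On a simple install both entries point at one Postgres server, which is why the installation story stays the same; scaling out later is a matter of pointing "ci" at different hardware.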
E: That answers the question, it does. I, much like Tim, also have a follow-up, though, and I was typing it as you were finishing up, Kamil. Thank you. Is there any advantage to the two logical databases beyond the scale issue of the number of records that we're seeing in CI? Like, would a self-managed customer have a reason to separate those logical databases onto separate servers, do you see, unless they had what we have, which is hundreds of thousands of users and, you know, millions and millions of job minutes a day?
D: That's a very interesting question. Today, CI versus the rest is like one third of the data stored and about half of the writes, so by splitting like this, main versus CI, we are kind of mostly splitting our database in half.
D
So,
as
you
can
see,
as
for
gitrap.com,
we
can
take
very
long
way
till
we
need
to
run
to
different
databases.
To
be
honest,
if
our
customers
will
need
that,
I
don't
know
really.
It
really
depends
like
about
their
their
workflow.
If
they
trigger,
let's
say
10
million
beers
a
day.
D
The
answer
is
not
conclusive,
but
it's
there.
B: Okay, great, thanks. Brendan, you mentioned Slack on slide 10, and you said it took three-plus years. I just started reading the blog. Is this the expectation for GitLab, and have we learned anything from what Slack did to help?
A: That's a great question. So Slack actually did something that helps, you know, which is using a different database technology called Vitess, which uses MySQL, and then moved to a system where Slack is essentially sharding its database by customer. So it's a horizontal approach; it's not decomposition, but that may be in our future. So we are actually looking at Vitess, for example, as a potential future strategy to get us, you know, to 100x.
A
I
think
the
main
learning
here
is
that
this
is
not
an
easy
feat
and
we
have
to
be
very
mindful
on
like
going
down
down
that
path,
and
maybe
there
are
simpler
things
that
we
can
that
we
can
do
to
avoid
going
on
a
multi-year
year
journey.
So
for
like
one
thing,
for
example-
and
that's
that's
important
to
note-
is
that
vtes
exists
and
uses
mysql,
we
don't
we
use
postgres
and
v-test
does
not
support
postgres,
so
that
immediately
leads
to
questions
like
okay.
A: So, given that no such system like Vitess exists for Postgres, should we maybe think about adding Postgres support to Vitess? That's a really big thing, and those are the considerations. So it's mainly, I think, for me at least, a reminder that some of these things take a long time and it's worth evaluating the options when you want to go forward. Right now I think we're in a spot where we have to make smaller iterations, because we need to move forward, you know, if we take three years to scale our database.
B: Thanks. I took Mark's question a little bit more literally, about three years. So for this project, for what we originally focused this AMA on, which was decomposition, we expect that to take six months, you know, give or take, and especially on the "give or take" part. So that'll buy us, we think, you know, four to 10x headroom on the database, and buy us some more time so that we can look at sharding for the 100x future scale, as Fabian mentioned.
D: Maybe adding to that: regarding sharding technology, we were trying that for some time, I think for two years or even more, trying to see if there is some database sharding technology that we could use. And in a monolithic, highly tangled set of data, that has proven to be a super hard thing to really get right, with sharding and the horizontal scaling of, let's say, sharding by customer or some other key.
D
Like
some
other
factors
like,
we
didn't
discover
that
like
we
would
have
to
clone
a
lot
of
the
current
database.
It's
like
with
all
the
like
figuring
how
to
look
like
all
of
these
queries
like
how
they
would
behave
across
crossing
like
different
physical
servers
and,
like
this
alone,
like
kind
of
sparked
us.
That's
like
it's
like
a
multi-year
project
to
get
it
right
and
it's
more
like
almost
all
or
nothing
to
some
extent.
So
we
were
trying
to
find
something
that
is
significantly
simpler.
D
So
this
even
like
by
thinking
that
about
the
relation
between
these
tables
is
making
this
problem
much
more
easier,
even
like
to
map
in
your
head.
So
the
composition
is
really
like
a
way
to
break
the
complexities
and
prepare
the
application
to
something
more
sophisticated
in
the
future,
and
we
are
kind
of
finding.
What
is
this
so
sophisticated?
D
D
D: While they were doing that... if we had to do it like that, sharding the whole of GitLab, I would not.
D
I
think
three
years
can
be
accurate
or
optimistic
if
you,
if
you
consider
us
like
as
a
whole
product
project,
like
a
moving
target,
that
you
have
constantly
teams,
adding
new
features
on
which
you
need
to
kind
of
like
like
fix.
While
you
are
doing
this
project.
So
these
three
hours
is
basically
due
to
the
complexity
of
the
product
of
they
were
trying
to
execute
and
doing
that
completely
transparently
to
the
customers,
even
though
they
were
doing
something
completely
opposite.
A: Super. I have a few finishing words. I think something that is important to note for folks here in the room is that scaling our database is a great problem to have. It is a consequence of our growth, so it's hard but also exciting. But it is also only a part of what we need to do as a company. So this is a strategy to buy us time and headspace, but there are many other things that also need to happen.
A: So the working group for database scalability is also working on defining blueprints and best practices for our developers, so that they can utilize functionalities that allow us to scale better. So a good example of this is time decay: if we are accumulating, for example, log data that we only need to keep for three months, then we can drop it after that time, and that allows us to scale a lot better than storing that data for years and bloating tables.
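The time-decay idea can be sketched as dropping whole monthly partitions that fall outside a retention window. The partition naming scheme, the three-month window, and the helper below are assumptions for illustration, not GitLab's actual mechanism:

```python
# Illustrative time-decay sketch: keep only the last RETENTION_MONTHS of
# monthly partitions (named like 'logs_YYYYMM') and list the rest for
# dropping. Names and window size are hypothetical.
from datetime import date

RETENTION_MONTHS = 3

def month_index(year: int, month: int) -> int:
    """Count of months since year 0, for easy comparison."""
    return year * 12 + (month - 1)

def partitions_to_drop(partitions: list[str], today: date) -> list[str]:
    """Return the partitions older than the retention window."""
    cutoff = month_index(today.year, today.month) - RETENTION_MONTHS
    drop = []
    for name in partitions:
        year, month = int(name[-6:-2]), int(name[-2:])
        if month_index(year, month) < cutoff:
            drop.append(name)
    return drop
```

Dropping a whole partition is a metadata operation in Postgres, which is what makes this so much cheaper than deleting rows out of an ever-growing table.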
A: So I think that's an important thing to consider: that decomposition, and later on sharding, is, you know, a really important part of scalability, but there are many other things that we also need to do, and those are parallel tracks. And that's also exciting, because there's still, I think, a lot of room in the system for those kinds of improvements; I'm very positive about that. And then lastly, oh yeah, from Kamil, he has a comment on this that I think puts this into perspective.
A: Thank you. I think, lastly, I really appreciate all of your time. We are available in the Slack channel that I posted in here if you ever have any questions, or you don't get something, or you are concerned, or you don't know if your team is impacted or how your customers are impacted. Please reach out; we are available and happy to help. And I believe we are also pretty much at time, at least GitLab time. I'll look to our official timekeeper, Craig. Is that correct?