From YouTube: 2021-04-19 Database Scalability Working Group
A
So we can go ahead and get started. This is the Database Scalability Working Group; today is April 19th.

A
I have agenda item 1a, which is basically: the db-12 upgrade from this weekend was delayed, so we're shifting responsibilities around, and I have an MR there, basically to take over facilitator responsibilities, at least for the next month. Eric, I think I assigned you as a reviewer, though; if you want to merge it, feel free, or give me an approval, because I think that's what you were asking me for on Friday.
B
Sure, yeah, I'll get that merged for you. Thanks.
B
Yeah, from the daily stand-up that we do immediately before this meeting, I've got a question about how much time we think we just bought ourselves. The consensus right now is that in about six months, or October 15th, we're sort of out of database capacity again. So we have some ways to update that estimate.
B
But really, you know, time for analysis is kind of done, and we have to go with our best guess to make sure we focus the majority of ourselves on execution. So it feels like we should be launching the first aspect of sharding by the end of this upcoming quarter, which means July 31st, and that could mean a number of different things.
B
We have to figure that out as quickly as possible. It could be a sharding shim that is aware of only one shard but does no harm, and that would then allow us to multi-shard after it. It might be something that relieves some pressure off the namespaces table, or something. But it feels like that's an ambitious date to shoot for, and then, working back from that, the question would be: could we actually pick our sharding key?
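The "shim aware of only one shard" idea can be pictured with a short sketch. This is purely illustrative Python, not GitLab code; the names `ShardRouter` and `shard_for` are hypothetical:

```python
# Illustrative sketch of a "do no harm" sharding shim: it is aware of
# exactly one shard today, so every lookup resolves to the same database,
# but callers already go through a routing interface that can later map
# keys across many shards. All names here are hypothetical.

class ShardRouter:
    def __init__(self, shards):
        # shards: ordered list of connection names; today it has one entry.
        self.shards = shards

    def shard_for(self, sharding_key):
        # With a single shard this is a constant function (does no harm);
        # with N shards the same call becomes a real key-to-shard mapping.
        return self.shards[hash(sharding_key) % len(self.shards)]

# Today: every sharding key routes to the single existing database.
router = ShardRouter(["primary"])
assert router.shard_for("some-top-level-group") == "primary"

# Later: the same call sites keep working once more shards exist.
future = ShardRouter(["shard-0", "shard-1", "shard-2"])
assert future.shard_for("some-top-level-group") in {"shard-0", "shard-1", "shard-2"}
```

The point of such a shim is that application call sites adopt the routing interface before a second shard exists, so adding shards later is a data move rather than a code rewrite.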
C
Yes, I guess, like I mentioned, I see two main ways we can approach our database scalability program; I'm just not sure which one of them is easier and faster to execute. I mean, ultimately we likely need both of them to be implemented.
C
As for the timeline and sharding key, my perception is that I don't want to rule out any of the approaches, so I think committing now may be premature. But my goal is to iterate as quickly as possible, like in the next weeks — we can define that to be four weeks or whatever other number.
C
Also, how GitLab would behave in these different approaches, to actually validate the most sensible approach. Once we have that, I guess it's gonna be much easier to also figure out iterations, because, like you proposed, by the end of Q2 — if we pick one of these approaches, at least for the application sharding, I'd probably have some good iterations that would allow us to relieve the pressure. But I think one of the challenges right now is:
C
There is a lot of complexity related to how we shard our application and, to be honest, I don't know if the namespace approach is gonna be the easiest to do. I don't know how it turned out for others who did this. So I guess my perception is still that it appears the most sensible, but I want to validate them quickly, and we're gonna have people starting — basically, they are starting this week.
C
— on how we can use database-level partitioning and how we can use application-level partitioning, because each of them has a lot of associated complexities. And then maybe we will be able to very quickly show how GitLab actually runs on these approaches, so we could actually test that and so on, and get to understanding which of them is gonna be easiest to implement and gonna give us the best iteration.
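The two options being compared — database-level partitioning versus application-level sharding — can be sketched roughly as below. Both snippets are hypothetical illustrations (the table, the `DATABASES` map, and `database_for` are made up for this example), not GitLab's actual schema:

```python
# 1) Database-level partitioning: one logical table; the database engine
#    itself routes rows to partitions. PostgreSQL declarative DDL:
DATABASE_LEVEL_DDL = """
CREATE TABLE audit_events (
    id         bigserial,
    created_at timestamptz NOT NULL
) PARTITION BY RANGE (created_at);

CREATE TABLE audit_events_2021_04 PARTITION OF audit_events
    FOR VALUES FROM ('2021-04-01') TO ('2021-05-01');
"""

# 2) Application-level sharding: the application decides which physical
#    database a row lives in, e.g. keyed by the top-level group.
DATABASES = {0: "db-alpha", 1: "db-beta"}

def database_for(top_level_group: str) -> str:
    # Deterministic routing: the same group always lands on the same
    # database (a toy byte-sum stands in for a real consistent hash).
    digest = sum(top_level_group.encode())
    return DATABASES[digest % len(DATABASES)]

# The same group name always maps to the same physical database.
assert database_for("gitlab-org") == database_for("gitlab-org")
```

In the first approach the complexity lives in the database (partition management, constraints); in the second it lives in the application (routing, cross-shard queries, rebalancing), which matches the trade-off being discussed here.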
C
So I think that locking ourselves into picking the sharding key so quickly may lock us out of the better approaches — and I don't want to rush on that. I want us to understand how these different approaches connect with each other and pick the one that gives us the most scalability room. Because even if we pick the top-level group right now —
C
If you look at my iterations, we're not gonna migrate data from the existing database right away; it's gonna be a lengthy process. We're still gonna have the problem with the current database. There is a lot of complexity in how we can migrate this data online, which can rely on writing some additional extensions to perform logical replication or similar. So even if we picked top-level sharding today, it may not be the single answer that says we're gonna have headroom in six months.
C
It may mean that in six months we're gonna be able to start writing new information and significantly reduce the amount of new data coming to the current database, but it will not magically move the data from the current database yet. So I want to evaluate these solutions from the perspective of how quickly we can iterate and how easily and reliably we can do it on GitLab.com without introducing more problems, but also how well it answers —
C
— your estimates. Because in six months, I guess, if we continue growing data in the current form, we're gonna hit limits. But if we start sharding by the top-level group and we cannot efficiently move data between shards, by that time we're still gonna hit those limits, and it's not gonna solve that problem.
C
So that's why I kind of mentioned these two alternative proposals. One is database-level partitioning: we know that some tables are big; maybe we can approach these tables to change how they behave slightly, and maybe exactly that, kind of concurrently, gives us this headroom on the current database, while we have this other aspect that we are working on, which is more like the application-level sharding.
C
That's gonna give us a lot of headroom and much more even performance for all of the current GitLab, and also all the new people coming to GitLab in the six months from now. So my perception is that it's very likely there is no single solution for that problem. I think we need to figure out which of these solutions gives us the most headroom in that given time frame that we mentioned.
B
So I think the counter-argument is: one, we can't allow analysis and execution to go past that date where we think we're going to run out of capacity. So we can base the schedule off of what we think the most responsible way to make the decision is, but at the end of the day we have to fit this in the time box. So there has to be a solvable solution that fits in the time and resources and people we have available.
B
Two, the outside feedback we've got from, for example, Postgres.ai — and we have functional decomposition and sharding in our glossary — is that the big companies they work with do both. Just pick one and do it as quickly as possible. And if it's about plan A versus plan B, and both are better than where we are today, then pick one and do it, and maybe we need to do the other plan after the fact. But we can't —
B
I can't allow the scenario where we analyze and we wait and we delay and we run out of capacity again. So we need a bias for action. So, if you can, Camille, I want to understand: in the world you're putting forward, what is the work-back schedule? When do you think that decision is made and can be executed against? I'm open to seeing that.
C
Okay. I'm hoping that we can really make this decision in the next four weeks, basically, and stick to it, because I assume that's gonna be enough time to find all the complexities of the tools that we need and to reach a reasonable understanding of how the situation should look. Going back to your question —
C
I think it may be even more sensible to do both of them at the same time, but I'm not sure if we can do both of them with the same group. We cannot — but the CI DBs are something that needs to be worked on anyway by the CI team. So maybe the other aspect is:
C
So maybe, instead of doing that in sequence, maybe we should enforce this on ourselves — that we actually need to work on them right now. And then, as part of the sharding group, we can focus on how we can shard, maybe by the top-level namespace, but then the other group can focus on how we can partition.
B
I was going to say, I like that plan: yes, we're doing some analysis, but in parallel we're starting to execute one or maybe two other tracks. That feels much more comfortable. And if we happen to choose the wrong track based on our instincts now, the analysis can tell us a month from now — but we're not waiting, we're not pausing on taking action. That feels a lot more comfortable to me.
A
And just to be explicit, Camille, when you were talking about CI: if there's a particular area or group that we need to basically tell, "hey, part of your requirements now is to go and work on this," we don't have to have the sharding team working on that — we could have that team working in parallel to it. That is totally open for us to do, if we need to, with CI as an example.
C
So I guess the sooner we start, the faster we're gonna have much more headroom. And my biggest worry about the application sharding is that I just see a lot of complexities, and I just worry that we say it's gonna reduce the pressure on the GitLab.com main database in six months — it may reduce it, but it's not gonna —
C
— be immediate, because there's always a lot of back and forth and moving data around to rebalance, and rebalancing is a very important aspect, and a really risky thing to do on a living database that is constantly being written to. So I think, with the application sharding —
C
If this is the approach that we pick, we definitely can get to the point in six months where, very likely, the new projects — the new top-level groups — will be written to the new database. But it still doesn't resolve the problem of the current database and its sheer size; that's still not gonna go away. So whatever solution we take to make this problem smaller today, during this time frame, is gonna give us more headroom from both sides, basically.
D
I don't believe that the 12-day time frame is enough to do that, but I do think that we need to reach these decisions as quickly as we can, and as iteratively as we can, because I completely agree: we could probably analyze this for another six months and not actually get anywhere. I don't believe that is the intention. Camille has already produced a couple of proposals that will allow us to move in that direction.
D
I think the team that can actually execute on them is being formed this week, and I think we will work as fast as possible to make sure that happens. The other thing to highlight here — I think Christopher said this as well — is that there may be other dimensions that can happen in parallel, right? Sharding is one of the scaling patterns, right?
D
Yeah — and Camille, correct me if I'm wrong, but I think that is also the approach that we're going to go for, right? We have approaches that we want to evaluate and understand, and those will then influence the decision that we're going to make.
C
Second, GitLab is very, very complex in its data model, and not all of the approaches will fit — we're gonna need to discover exactly which parts of the application would be broken. And third, we need to actually have something that is easy for people to use later, knowing that sharding is being used, which is also part of our ability to iterate. So that's why —
C
We need to model these to understand how complex they are and what problems they have, and ensure that we understand very well exactly how they fit into the current GitLab.
C
I mean, I'm kind of for it. There is sharding on the application level; I know that different companies do it differently. I have my personal bias on the solution, but I also want to ensure that others can provide some insightful feedback on it. So distributing this prototyping, which would actually give us different perceptions of how different people see these approaches — and maybe someone is gonna propose something much better — I think that's essential.
C
I'd really like to time-box that, to not make us do it for months. I think it's reasonable to say that we spend four weeks on this; at the end of that time, we stop whatever we're doing, and we just pick and stick with the best solution out of that, to continue actually implementing it. So I guess my perception is to time-box that very heavily.
B
So Sid made a concrete proposal, though I'm not quite hearing a yes or no in answer to this question. I think his proposal is: go with our instincts now, rather than the output of any to-be-determined analysis — start prototyping the namespace thing, and then do the analysis. And maybe the analysis tells us it's something else, and then we change the prototype, or we add a prototype thread or something like that. But why not start tomorrow on the namespace prototype?
A
I think the short answer is that we're trying to build confidence, and whether we'll have confidence in 12 days, or whether we build towards that confidence in a month, is what we're talking about here. But if you asked today, "what are we doing?" — we're committed to namespace until we can find something that breaks us there. That's another way to say it, from that objective. But that's —
A
Moving on: Sid had a question about a follow-up on Elasticsearch. The team did a short write-up associated with it, and I just kind of put in my read on it. This decision was made largely three years ago, it looks like — or two and a half years ago. It didn't necessarily have great reasoning in the merge request, but, you know, the answer that I've heard is that it was the smallest unit. So then it was basically the bin-packing problem.
A
If you have smaller units, then in theory you can pack better. And the other aspect that I saw was that Elasticsearch is not a code change as much as it's a configuration change — from my perspective, though I know it's not that trivial.
E
Okay. The issue created a week ago gives two questions, and both relate to GitLab region.
E
I'm missing the kind of approach — and maybe that's missing for a good reason, and I only read the description — but I assume that if we now do per-project, then if you do a query across a group, you have to combine these results in GitLab the application, and that will give you lower-fidelity results than if they had been combined in Elasticsearch.
E
Yeah, but it's also now certainly a lot less important to change it, because I think I understood this as sharding between Elasticsearch clusters, but I understand now that it's sharding between Elasticsearch and something else.
F
To clarify: right now we only run one cluster. So if we are going to change to a multiple-cluster approach, then we have to do work at the application level. Now, the topic is not new — you created that issue like two years ago, and the team also had a similar issue created a few months back to investigate whether sharding per namespace will get us better search performance, especially on group-level search.
F
Yeah, so we have been working on performance from a different perspective. For example, the ongoing performance work we have been doing right now is to split the data not by namespace but by the data type. Before, we had everything in one single index; now we have it split into code, issues, merge requests, notes. So after we separated this — let's say, if you search —
F
Yeah, still one cluster, but issue search, even at the group level, is much faster than before. And just to quote: this morning we received feedback from one customer — I think it's a big customer that I won't name, because this is going to be a public recording — saying that after they re-indexed their whole cluster, it is working as intended: searches are fast, latency is way better now, and everything has improved overall. So that's just an example of some good results from the recent performance improvements on search.
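The index split described here — one monolithic index becoming per-data-type indices — comes down to routing each document type, and each search, to a smaller type-specific index. A minimal sketch, with hypothetical index names:

```python
# Illustrative only: route documents and searches to per-type indices
# instead of one monolithic index. Index names here are hypothetical.

MONOLITHIC_INDEX = "gitlab-production"

PER_TYPE_INDICES = {
    "code": "gitlab-production-code",
    "issue": "gitlab-production-issues",
    "merge_request": "gitlab-production-merge_requests",
    "note": "gitlab-production-notes",
}

def index_for(doc_type: str) -> str:
    # A group-level issue search now scans only the (much smaller)
    # issues index rather than every document type at once.
    return PER_TYPE_INDICES.get(doc_type, MONOLITHIC_INDEX)
```

Note this is still a single Elasticsearch cluster, as clarified above; only the index layout changes, which is why group-level issue search can get faster without any cross-cluster sharding.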
E
Okay, thanks for that — my misunderstanding. Then there's no need to re-evaluate the sharding key or anything like that; this is all fine. I didn't understand that it was still the same cluster; I thought they were separate clusters. Thanks.
A
So, in the interest of time — I want to be respectful — I'm going to jump past what's been done, because I think those items are mostly to read. Eric, I want to get to your comment on 4c, just kind of organizationally: did you want to move to a flat structure?
B
Yeah, I noticed it last meeting, and in past meetings, that when we have this highly opinionated agenda structure, inevitably something important doesn't quite fit into it, or things lead to discussion, as they did today. And so I wonder if just a simple flat list — letting the facilitator be free to determine what the most important stuff is that we should be talking about — tends to work a little bit better than trying to cram your thing in somewhere and then not quite getting to it later, and whatnot.
A
Okay, so we'll move to the flat structure, and I'll make sure to organize it in a fashion that gets the most critical items, hopefully, at the front — though I'm looking for feedback from Sid and Eric on whether something needs to potentially be reorganized. And then I just want to cover what's happening next. As we mentioned, there's a kickoff happening on Wednesday, Chen.
A
I don't know whether it's worth doing, but it may be worth looking to see if we can potentially move it to tomorrow, just because we do have a Wednesday working-group session, so it seems like that'd be better. And then the other question I had: it feels like the other thing we want to make sure happens —
A
— next is kind of our first prototype plans, and if, right now, it's just namespace, then that's fine, but at least we have to list it. Though I think, Camille, when you're talking about prototype plans, you're also talking about time decay, which we also need to do as well — that's kind of independent of the sharding discussion point. So does that sound reasonable, or do folks have feedback?
E
I'll try to see if we can move that sync meeting to tomorrow morning. I have a conflict with the database team meeting, but we can probably use the same time, because they're both database-related. And on moving this meeting — the sync meeting — to a cadence: we are trying to explore some other alternatives, because the team is distributed.
A
Cool, all right. Do you think we could at least have what we have so far in the prototype plans on Wednesday, just so that we can kind of articulate, "here are the plans"?
A
All right, I don't think there's anything else, unless there are any other questions. All right — thanks.