GitLab Database Scalability Working Group, 12 Apr 2021

From YouTube: 2021-04-12 Database Scalability Working Group

A

Let's begin with Adam's item, which was a leftover from last week.

B

Thanks. So this is just a short one about how we do pagination within GitLab. We know that the current way we are doing pagination can be problematic when, let's say, an API user visits a really high page number. We had some problems where an API user requested, I don't know, a really large page size and it spent like 10 to 15 seconds in the database. The other part is the first page load, when you visit a group-level or project-level feature that loads a list of items.

B

The first page load is slow because we are doing an extra count query and we are actually sorting all the other records in memory to provide the first 20 items on the page.

B

So, to replace offset pagination, we could do keyset pagination. We are doing this already in GraphQL, and the idea is that we could maybe adopt it more in our existing features, and maybe for new features, to use keyset pagination to avoid these problems in the.
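
Illustration (not part of the recording): the difference between offset and keyset pagination described above can be sketched as follows. This is a minimal example under stated assumptions: the `issues` table, its data, and the helper names `offset_page` and `keyset_page` are hypothetical, and SQLite stands in for the production database. It is not GitLab's implementation.

```python
import sqlite3

# Hypothetical table and data, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE issues (id INTEGER PRIMARY KEY, title TEXT)")
conn.executemany(
    "INSERT INTO issues (id, title) VALUES (?, ?)",
    [(i, f"issue {i}") for i in range(1, 1001)],
)

PAGE_SIZE = 20


def offset_page(page_number):
    """Offset pagination: the database has to walk past every skipped row."""
    offset = (page_number - 1) * PAGE_SIZE
    return conn.execute(
        "SELECT id, title FROM issues ORDER BY id LIMIT ? OFFSET ?",
        (PAGE_SIZE, offset),
    ).fetchall()


def keyset_page(last_seen_id=0):
    """Keyset pagination: seek through the index to the row after the cursor."""
    return conn.execute(
        "SELECT id, title FROM issues WHERE id > ? ORDER BY id LIMIT ?",
        (last_seen_id, PAGE_SIZE),
    ).fetchall()


# Page 40 via offset skips 780 rows; via keyset it starts right after id 780.
page_by_offset = offset_page(40)
page_by_keyset = keyset_page(last_seen_id=780)
```

Both calls return the same rows here. The keyset variant needs the client to pass the last id it saw instead of a page number, which is why it cannot jump to an arbitrary page but also cannot be asked to scan a very high offset.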

A

And I actually had some follow-up to this item.

C

Yeah, that makes a lot of sense. We had an attempt to implement keyset pagination in the REST API earlier that got stalled, so we only have it on a couple of endpoints. But it's really the way to mitigate, let's say, the attack surface on the API, because it's so easy to just send requests asking for very high offsets and, like Adam mentioned, having very long database queries. So this is great. Thanks for the write-up on this.

C

I'm still reading into that, but that's definitely very useful.

D

Yeah, maybe we can simplify this.

D

I was kind of worried that there was no affinity. As far as I could see, we're storing projects in Gitaly without affinity. I think I wanted to do that by sharding, or by saying which Gitaly server you use based on the top-level namespace.

D

Had a great remark that, hey, what happens if you move a project between namespaces? It won't happen that often, but it would be very, very sad if you then have to move the whole top-level namespace or something like that. So I think I'm convinced. I'd love it if someone would do a write-up or something like that, but I think I.
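
Illustration (not part of the recording): the affinity idea and the objection to it can be sketched like this. The shard count, the hashing scheme, and the project paths are all hypothetical; the point is only that when placement is derived from the top-level namespace, a project's location changes whenever its namespace changes.

```python
import hashlib

SHARD_COUNT = 10  # hypothetical number of storage shards


def top_level_namespace(project_path):
    """Return the first path segment, e.g. 'acme' for 'acme/backend/api'."""
    return project_path.split("/", 1)[0]


def shard_for(project_path):
    """Derive the shard from a stable hash of the top-level namespace."""
    key = top_level_namespace(project_path).encode("utf-8")
    digest = hashlib.sha256(key).digest()
    return int.from_bytes(digest[:8], "big") % SHARD_COUNT


# Projects under the same top-level namespace always share a shard.
before = shard_for("acme/backend/api")
same_group = shard_for("acme/frontend/web")

# After a move to another top-level namespace the shard is recomputed,
# so the project's data may have to be physically relocated.
after_move = shard_for("globex/api")
```

With this scheme the mapping needs no lookup table, but the location is not independent of the namespace, which is the cost raised in the remark above.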

D

Oh cool, sorry, I.

A

Did it this weekend and I posted it today. It's under point 2, but there's a write-up that talks about sharding and tenancy, how these two things relate, and how project sharding would essentially allow us to do tenancy in many different ways, like by namespace or customer or whatever. So Kamil and I were chatting, and he put it in a way that I need to add to that write-up, which is: a project.

A

It's a small enough unit that we can move around fairly easily, depending on what's happening further up in the stack. And then, if we are doing sharding by project, there isn't just one way to determine where the project lands on a database shard.

A

We can take into account wanting to balance the shards for performance, but we can also say: well, this project belongs to this namespace or this user, so these are the shards that service that namespace and that user. So think of it as tenancy through a kind of virtual lens, where we're not building walls right at the beginning.

A

We're just saying, if these shards are for this customer, we're going to make that decision way up in the stack, and as it traverses down, things are going to land in the right places. And then we can have this alignment from Gitaly to the database to the top of the stack that allows us to say: yeah, we can put your namespace in a specific place, or we can have you as a customer that has this dedicated infrastructure.

A

So if you have comments on the write-up, which I hope explains this thing that I just mumbled out, let me know if that actually makes sense for how we can achieve your goals doing this and still remain very scalable and highly balanced on the back end.
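
Illustration (not part of the recording): one way to read the project-level placement described above is as a directory that maps each project to a shard, with the placement decision made by a policy that can either honor tenancy or balance load. The class `ShardDirectory`, its methods, and the shard names are hypothetical and are not taken from the write-up.

```python
from collections import Counter


class ShardDirectory:
    """Maps each project to a shard; placement is a policy, not a formula."""

    def __init__(self, shards):
        self.shards = list(shards)
        self.project_to_shard = {}
        self.tenant_shards = {}  # namespace -> shards dedicated to it

    def dedicate(self, namespace, shards):
        """Reserve shards for one namespace or customer."""
        self.tenant_shards[namespace] = list(shards)

    def place(self, project_id, namespace):
        """Pick the least-loaded shard among those the namespace may use."""
        self.project_to_shard.pop(project_id, None)  # ignore old placement
        dedicated = {s for group in self.tenant_shards.values() for s in group}
        shared_pool = [s for s in self.shards if s not in dedicated]
        candidates = self.tenant_shards.get(namespace) or shared_pool
        load = Counter(self.project_to_shard.values())
        shard = min(candidates, key=lambda s: load[s])
        self.project_to_shard[project_id] = shard
        return shard

    def lookup(self, project_id):
        return self.project_to_shard[project_id]


directory = ShardDirectory(["shard-1", "shard-2", "shard-3"])
directory.dedicate("eu-customer", ["shard-3"])

first = directory.place(1, "acme")          # balanced across the shared pool
second = directory.place(2, "acme")         # lands on the emptier shared shard
tenant = directory.place(3, "eu-customer")  # lands on the dedicated shard

# Moving project 1 into the dedicated namespace re-places only that project.
moved = directory.place(1, "eu-customer")
```

The trade-off in this sketch is that every request needs a directory lookup, in exchange for being able to move a single project without touching the rest of its namespace.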

D

I'd like to see in the blueprint, like, I think the blueprint is a good start, but I'd like to see how we solve the problems, as.

D

If you do the database per project, how do I get analytics over multiple projects? Do I now have to query multiple databases? How does that work? If I search across multiple, if I have multiple tenancies, aggregating that search is super hard because Elasticsearch cannot do the weighting, so you get the output of Elasticsearch and you have to get that together at the application layer, which is not going to be as good. So those are the problems I'd like to see addressed in these proposals. Yeah.
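
Illustration (not part of the recording): combining results at the application layer, as described above, usually means querying every shard and merging the partial answers. The sketch below uses made-up per-shard result lists and a hypothetical helper `top_k_across_shards`. It assumes the scores from different shards are directly comparable, which is exactly the weighting problem raised in the question and may not hold in practice.

```python
import heapq
from itertools import islice

# Hypothetical per-shard search hits as (score, project), each list
# already sorted by descending score by the shard that produced it.
shard_results = {
    "shard-1": [(0.97, "project-12"), (0.61, "project-7")],
    "shard-2": [(0.88, "project-40"), (0.85, "project-41"), (0.10, "project-44")],
    "shard-3": [(0.93, "project-90")],
}


def top_k_across_shards(results_by_shard, k):
    """Merge the sorted per-shard lists and keep the k best hits overall."""
    merged = heapq.merge(
        *results_by_shard.values(),
        key=lambda hit: hit[0],
        reverse=True,  # inputs are sorted from highest to lowest score
    )
    return list(islice(merged, k))


top_hits = top_k_across_shards(shard_results, 3)
top_projects = [project for _score, project in top_hits]
```

Every cross-shard query in this style pays for a round trip to each shard plus the merge, so its latency is bounded by the slowest shard rather than by a single database.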

A

We can do that, and you're right. If you're trying to aggregate projects that happen to be on different shards, then that becomes a harder problem to solve, and there are ways to solve it.

A

But in the write-ups, in the blueprints, we talk about the CAP theorem, and we're going to bump our heads into it. So you may not be able to get the analytics right away, but you can get them eventually. And so this is where we start choosing what is important to us: that this query returns immediately, which is true for some queries, or that this query is slightly delayed but more accurate.

A

So I will add a section about that and some of the ways in which I've seen this solved in the past.

D

Yeah, I don't like how much we're on the theory and how little we are on the practice here. I agree, CAP theorem, like, who cares? We have just a database on GitLab.com that we need to shard across 10 database servers, plus we have some customers who want to make sure that their data never leaves the EU. That's what we have to solve, and the CAP theorem tells you, like.

D

Everything has a disadvantage, and I know of one of our main competitors that is having a lot of trouble right now, because they've done the ultimate thing and the theoretical thing, and you know what? The applications are dog slow. Right now they are unusable and all their customers are complaining.

D

So, "oh, it will work, but it's just a little bit slower", that can be like a company-ending thing. Company-ending is a bit stark, but that can be a huge problem for multiple years. So let's not overdo it. Let's do what we need to do to make GitLab.com scale to 10 million people, which is the goal of this working group.

D

But if we do more than that, we have to be sure that it doesn't come with disadvantages. And if that disadvantage is speed and latency, that's a big thing. GitHub is still faster than us, and we've seen with some competitors what happens if you overdo it. So let's not make that same mistake, and I'm very worried if we're starting to go so impractical.

A

I do want to address that because.

A

I like to believe that I'm a very practical engineer. That doesn't mean that I discard theorems or whatever they're called. The whole idea of the CAP theorem here is that we're smart about things, not that we just wave the flag of "hey, I read about the CAP theorem and it's great". My point here is, when we're making decisions, we should understand what the trade-offs are. That's simply what it is, and that there is some engineering and science behind them.

A

It doesn't mean that I want to build some random, perfect thing, but I do want people to understand that as we're making these decisions, some queries are going to be faster, some are going to be slower, and for those that are slower there are ways to get around that. But we should walk into the query understanding that it's going to be slower, because what I'm trying to avoid is for us to walk into building this thing without understanding what we're building and just saying: gosh, it's slower.

A

Let's all rally together to make this query faster. That's being reactive! So I'm just trying to say, when we're thinking about, hey, we need to solve this analytics problem that you just brought up: well, what do we understand about the problem and how are we going to get around it? So maybe we need to add some caching, or maybe we need to do these things. So it's not like everything I'm going to write is going to be "the CAP theorem says", but it is.

A

It is something where I want to have some guidance on how we solve the problems, and I want people to understand that today all queries are nearly instantaneous, for some value of instantaneous, because it's a single instance. But as soon as we have multiple instances, per project, per namespace, per whatever you want, that changes, and we just need to have some awareness about this. So I am fully with you.

A

I don't intend to build a time machine, but I do need some guidance to understand how we're going to work around it. So I'll do my best to be very practical and very pragmatic, and ensure that we're building the right thing, that we're doing iterations, and that we're trying to produce. But it's not going to be just writing code.

D

No, no, look. I want this solved more than anyone. This is the second working group. This is one and a half years that we could have had to get this right, and we're now with our back against the wall because we have to get this done fast. On the other hand, it's a very important decision, and if you say, look, some things are now instantaneous and they're going to be slower, I'm very worried about that, because I don't think we need to shard a database per project.

D

I think we can do it per customer and keep all those queries that are important to them, namely "how's my organization doing?", as fast as they are now and not take that extra hit. So, Kamil.

E

I think there are a lot of sides to that.

E

Probably the most important aspect from my side is that the blueprint that Gerir is working on right now makes it possible to put us on the same understanding of the problem. But, to be honest, I kind of expect that we're gonna hit so many dead ends with the solutions that we're gonna be testing that there is really no way around testing these different things. And, to be honest, we're actually gonna be starting.

E

The engineering work next week, because there are gonna be additional people joining, and I asked them that really the first thing we're gonna start doing is running the GitLab.com database in different ways, in a sharded way, and see how many of these things are broken and how many things we would have to fix in these different scenarios. Because now it's a lot of theory, and in practice.

E

Some of these things, I'm not sure if they're gonna work properly. So I think the way to quickly iterate on that is really to get this kind of fundamental thinking about the problem shared between us, but then we start testing these different approaches. We're probably gonna be testing per namespace. We're gonna be testing per project.

E

We're gonna be testing how we can move this data freely and what impact it's gonna have, to pick the one that is the path of least resistance and the fastest to implement, because GitLab is built with this kind of single database in mind, and it's gonna take a lot of tinkering to really disconnect this kind of thinking.

E

So personally, I'm not sure which one of these ways is gonna be the most efficient, even now. I just know that we need to test them, and we need to test them very quickly. And I know that if I start working with the people who are joining the team next week, Gerir's work is actually gonna be helpful to keep them on the same page about what we want to achieve.

E

As for the guarantees that we want to offer, that's kind of my perception of how we should approach this.

D

Yeah, I think it's really important to me that we keep the two options open. So I'm okay with the proposal as Gerir wants to make it, but I also want to see the alternative proposal to shard the database per top-level namespace, and then probably Redis and Elasticsearch as well. That's a major decision, and I want to see the pros and cons of both, and I think we want to keep our minds open.

F

Yeah, Sid, this is Fabian. I've laid out, and it's drafty right now, essentially the pros and cons for both proposals. This is the point after yours, the draft MR under point 8, to summarize pros and cons.

F

So I think what I would be looking forward to is saying: okay, Gerir has proposed this, we are in agreement that we're going to do this first, and then we are actually going to start, like Kamil suggested, testing these things out and seeing how it works, while keeping our minds open. If we hit a wall and find out that it doesn't work, we're able to iterate on that and move forward, because I think there are still a lot of things that are not quite known.

F

So, as far as I know, for example, we know that some cross-shard queries will be slow, but I don't think we understand how many of those cross-shard queries actually exist for any given entity. These kinds of things need to be worked on, I think.

D

Thanks for that.

D

We've been working on this for one and a half years, and both proposals that we're discussing in this meeting have been made over the weekend. It feels like we're starting from scratch, even though there's been all this time spent. So I'm a bit puzzled by that. Thanks for the proposal, Fabian, I think that is great.

D

I'm sold, after Kamil, on Gitaly being per project. I think someone was.

D

I think that right now we're already doing search per project as well. Is that the current case? Because that would be a good way to look at the pros and cons.

B

Yes, search is currently sharded per project. We have an open issue discussing alternatives, such as sharding per namespace. I think both approaches have pros and cons that we need to investigate.

D

Where can I find more information about what's gotten harder, maybe, now that we shard per.

A

Project on elastic.

B

So we know this.

B

Yeah, I linked the issue about the discussion above here, and I'm happy to collaborate and communicate, maybe asynchronously. I know this is the database sharding working group.

F

Sid, this is Fabian again. I linked an issue. I just wanted to say that I think there's actually a lot of work that was done. It is a little bit scattered across quite a few items and handbook entries. As far as I can tell, and I'm relatively new to this, there's been a lot of prior art on these proposals.

F

But I think the challenge we have here is to synthesize this information, and I think Gerir did a great job on that, saying: okay, let's take it and move forward. I've spent a few days reading through a lot of deep thought on many of those things, so I think people have experimented. But I think it's important to also find sort of an exit point and say: okay, this is proposal A, let's try and iterate on it and see what walls we hit.

D

Thanks for that. I'm looking at "shard GitLab by root namespace". I'm puzzled even by the name, like, the root namespace: there's one root namespace, right? That was going to be our workspace. People use root namespace and workspace interchangeably. The other thing that I think people mean when they say root namespace here is top-level namespace.

C

You're right, that relates back to what we wrote a year ago, I think, and at that time we were talking about namespace, but that has changed, I guess, over time.

D

Yeah, so this means this should be changed to top-level namespace now.

D

Okay, there are many different proposals, many different things, and I think even the most recent proposals don't have a clear articulation of, like, this is.

D

These are the problems we'll run into. This is the trade-off we have to make, et cetera. So I think we've got a lot of work to do in very little time.

E

Yeah, this is very correct. We have a lot of work, and also a lot of people are going to be joining to help out with this being.

A

Yes, they all want to work, and there is a ton of information, and as the DRI for the working group, I will make sure that it's sorted and dealt with appropriately.

F

Gerir, I'm also happy to support and summarize these proposals in a digestible way, so that we have a good overview of pros and cons, because I think a lot of the information is already there. But there are also, as Kamil said, a lot of unknowns that I think will only surface by moving forward.

A

Yeah, and to add to that, the team that is going to be focusing on this is starting to be constituted this week. So Kamil is already starting to think about this essentially full-time.

A

So we will have some answers for you, some more concrete answers for you shortly.

D

Cool. What I'm super interested in is, like, the.

D

I'm sold on Gitaly being per project. I'm not sold on Redis, Elasticsearch and the database being per project. I think it should be per top-level namespace. But it's a big decision, so we should have that trade-off clear, and that should include things like: what is the percentage of queries, how much slower will they be? But also the advantages of sharding per project, like how much easier is it to move, et cetera.

D

Of that we already.

A

Have, like, the latter stuff, we've already outlined. The earlier stuff, you know, percentage of queries, et cetera, we need to go and find and put in writing, so we will get that.

D

Cool, yeah. And then also, like, hey, Elasticsearch: if we have to combine results in the application layer, the results get this much slower, this much worse, et.

A

Okay. Point e I take to be a part of the conversation we just had, in terms of regional, which I think is another vector for the sharding conversation. We have two minutes. Do we want to go through f and g real quick?

G

Yeah, actually, we have a few architectural issues from the rapid actions recently. So I just created a search here, so we can probably investigate whether they fit into one of our work tracks, like the data patterns.

G

One of those may be a good place to hold these items. So, just to bring it to your attention: the DRIs from the work tracks, from the data patterns research, can take a look and consider whether any of those issues will fit into one of our data patterns to work on here. Otherwise, we may find another solution to host these issues.

B

Okay: okay, thank.

F

To vocalize g very quickly: if you want to read something really interesting, the paper linked here is about the bias of people to solve problems by adding new things rather than by removing them. I was just wondering, in a general sort of scalability discussion, if there are opportunities in GitLab to just remove certain things from the database, or at other points, to make scaling easier. It's more of a thought-provoking provocation.

F

Whatever the English word is. Go read it, it's cool.

A

Excellent, all right, time. Thank you, everyone. See you.
