From YouTube: Database Sharding v. Service Extraction
A
This meeting is about how we should deal with our growing database. The database is growing and it's not sustainable; it's approaching 10 terabytes. I think that we should either shard it or extract services, and the current proposal of the team is to extract services. It's a big decision, so I wanted to make sure I was fully informed, and I left a bunch of suggestions on the document.
B
Yeah, and there's still... so, this was kind of a late-breaking discussion to move down the service extraction route. There's still some research that would need to be done to figure out what the impact of that is on headroom. We have some new services coming in: Praefect and the container registry are the ones most frequently talked about as having potential to be extracted as new services. And then from existing functionality, we've talked about taking a look at CI/CD, the daemon, and pull mirror scheduling is another one that we've talked about, and I...
A
Would research on that change this statement, or would that just lead to another statement? Does it get easier or harder to do the next step? I think it gets harder. You're not sure... we don't have to have that conversation, you're still working on it, whatever you think. It's still only buying you that headroom. Oh, this is... sorry, this is one suggestion with two things; I thought I'd make my suggestions row by row.
A
Small tests get easier. For example, if you run tests against the main Rails app, it takes a while just to load the test. I think this is more in reference to: if you want to do an integrated test now, not only do you have to set up the application under test, you have to set up all the other services as well in order to do kind of an end-to-end test. So maybe I'm shooting down testing here.
B
I mean... sorry, I struggle with brainstorming, because you're not supposed to shoot down ideas, but I would think that would also help us test our setup, to make sure that it's seamless for customers, right? So I still see some advantages to it. You're right, I mean, there are more configuration settings, there's more surface area that you need to configure and set up and test. So I see pros and cons on both sides. I think on the next one, the observability and debugging becomes harder... to just pick...
C
Picking up on the testing thing before we move on from it: yeah, so this is the route... it's a different technology, but it's the same approach that we've used with Redis, right? We have four different Redis servers, and as far as customers and self-managed are concerned, it's one; sort of, you know, we use the same actual physical instance. But on GitLab.com we divided it up into four physical servers, and it doesn't have any impact on our testing.
C
It doesn't have any impact on, you know, the way we've been able to scale self-managed; it hasn't had a big impact there either. And I think if we took that same approach here, we could do it in the same way, where it's not a big impact when testing, I don't think.
A
I think that would alleviate a lot of the concerns and the questions I have. That's not how I read the service extraction proposal, but if that's the case, that for self-managed it would still work in the same way... is that the understanding of everyone?
D
It's definitely my understanding that we have to behave the same way. If we have, like, dual code paths, as an example, that's going to be really problematic; I'm really against any kind of dual code path solution. You know, the intent would be that the customer would have the same experience, albeit maybe with slight overhead, because we'd obviously have the processes on the same host in that solution.
A
I'm a bit... maybe let's get down to the first example: the first example of service extraction, the container registry. As I understood the proposal, it was to make a new database for the container registry that the Rails app wouldn't directly talk to, so from the Rails app it would be hard to see what containers you have and what the versions are, things like that. But now we're saying, no, no, it would work the way Redis works. So can someone maybe talk me through how that works technically?
C
So obviously we're talking about logical databases. On a self-managed instance, we're primarily talking about logical databases on the same physical database server. So the only difference is that it's like a namespace on the server, and we're saying this is the CI namespace, and this is the, you know, Rails namespace, or the monolith (we don't call it that, but something along those lines). And so over time we will extract these services, and each one will have a logical database name on GitLab.com.
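To make that concrete, here is a minimal sketch of logical databases living side by side on one physical Postgres server, which seems to be what's being described; the database names are illustrative, not from the proposal:

    -- one physical Postgres server, several logical databases
    CREATE DATABASE gitlabhq_production;   -- the monolith / Rails "namespace"
    CREATE DATABASE gitlabhq_registry;     -- extracted container registry data
    CREATE DATABASE gitlabhq_ci;           -- extracted CI data

    -- each service connects to its own logical database on the same host:
    --   psql -h db.internal -d gitlabhq_registry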
C
You know, we've got three Redis instances at the moment; self-managed does all of the same on one. We've got three, and one of them we're going to take out and break out into clusters soon, hopefully. And that's the abstraction of Redis, right; this is the same thing, behind a single Postgres abstraction. If...
A
If I have a self-managed instance today, I just set up a single Redis service, with a single kind of... there's no authentication in Redis, so I don't have to do that. If you have multiple databases, my Rails app would certainly need multiple kinds of accounts, and if I set it up... oh, I have to start streaming multiple databases. So I don't see how you're going to do it the same way with...
F
You know what was even more interesting: we have two ways of doing that, right? There's what you'd call a logical database within the instance, but there's also this concept of a schema, so you can have a single database with different schemas. That's another way we might tackle this problem, where the container registry might have its own schema in the gitlabhq production database, but the only thing that can actually physically see it is the container registry; or, if we wanted to let the Rails app see it, we could do that too. It doesn't limit us, I guess.
F
It doesn't have to, though; we could restrict it. I don't know... the question is whether we want it to actually be able to dig into the layers and things like that. If we do, then if it's in the same schema you can do that; but if you don't, or we want to restrict that, then you can also change it.
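For reference, a rough sketch of the schema-per-service variant described here, with visibility optionally restricted to the registry's own database role; all names are illustrative, not from the proposal:

    -- inside the existing gitlabhq_production database
    CREATE SCHEMA registry;
    CREATE ROLE registry_svc LOGIN;

    -- registry-internal tables live in their own schema
    CREATE TABLE registry.manifests (
        id            bigserial PRIMARY KEY,
        repository_id bigint    NOT NULL,
        digest        bytea     NOT NULL
    );

    -- only the registry service can see its schema; the Rails role simply
    -- isn't granted USAGE (or is granted read-only access later if wanted)
    REVOKE ALL   ON SCHEMA registry FROM PUBLIC;
    GRANT  USAGE ON SCHEMA registry TO registry_svc;
    GRANT  ALL   ON ALL TABLES IN SCHEMA registry TO registry_svc;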
G
Go ahead... sorry. Yeah, I think from a user perspective we could look at it as being sort of similar, in the way that users look at these things. That doesn't really account for a lot of the ways that the container registry is interacted with by users, but the data we're actually talking about creating a separate schema for is the manifest data; it's a low-level breakdown of what's actually in images. It's actually not the user-facing storage currently... sorry, not the user-related tables of what they see in terms of the UX.
G
What we're doing to get that right now is actually hitting the container registry's API and returning data, which is one of the reasons why the performance of that is not the best in real time right now for our users, and something we're trying to solve. So that data is the data that we would expect to end up as user facing, and it does relate more closely to our monorepo, but I would expect that to end up in the same schema that we have right now; the communication is via an API, and that might continue.
G
Now, the proposal might say we want to keep these all separate right now, but what I'm really concerned about is that at the core level we're trying to implement online garbage collection for our container registry, which hasn't been done anywhere, and so we're trying to find the optimal solution for that. And I'm really worried about the idea that this internal structure that we're creating to represent this internal system, as well as the other systems we're trying to support, OCI and so on...
G
...would be negatively impacted if we choose to sort of put it on the Rails side, and in how we actually implement that. It's a core-level system; it's a core-level representation of data for the container registry that's intended to reflect only that. Again, we can put it in there and keep the user-facing stuff in this separate schema, or we could put it in Rails; we could share access at some point. It's really about ownership of the schema, and I'm really worried about it. Yeah.
A
That makes total sense, but that's not what this brainstorming document is leading the reader to believe. It reads like: oh yeah, the container registry, it's complex, and we'll have the list in one database and the manifest in another, and this is how we're going to join those two. Because I assume you have to do joins, like you have to see, for this container that's in my list, what's in the manifest, things like that. So, okay.
E
So I think maybe the confusion is this: the logical database represents the functional part, the container registry; it's a fully functional part. You cannot really split that from the manifest, because we're kind of going to be putting what is today on object storage, some of that data, into the database. But it stays together, because it's in the same logical database, to efficiently query the data.
E
The question is whether we want this data to be accessible by going directly to the database from Rails, or whether we prefer it to go through the API. But now, since it's a separate logical database, the container images part, which is separated out, can be fully managed. From an administrator perspective it's fully transparent, because it's a logical database within the same instance they are running; it's replicated automatically with the current mechanisms; nothing really changes in what it offers.
E
It gives us flexibility: if we know this container registry consumes a lot of selects on the database and it's very slow, the user, I mean the administrator, can split this database out, or may decide, I'm just going to move the data onto another instance. But that would not be the default; the default would be that we start on the same server and we replicate everything. It will just create separation between the different data.
A
We have the wider community contributions, which come out of open source, and we have the advantages of a single application, and that leads to things like: if you use an additional stage, you're three times more likely to convert; our namespaces are three times more likely to convert. This is the essence of our business. We are only successful because we have a complete DevOps platform delivered as a single application; that is essential to more licensed users and higher revenue per user, which leads to more IACV. This is core to GitLab. What I'm...
A
What I'm afraid of is: okay, well, the container registry, you know, it's a separate thing. Okay, now I want to see... say there's an attack on the company somewhere, there was an account compromised. I certainly want to see, of all the people who were last spotted in the Netherlands, where I'm from, who accessed containers. Now I've got to do a join between the users table and the last-access table and the container registry, and I can't do it because they're in separate databases. This is a contrived example.
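As a concrete (and, as noted, contrived) version of that concern, here is the kind of join that works inside one logical database but stops working once the registry tables move out; the table and column names are made up for illustration:

    -- possible while everything shares one logical database:
    SELECT u.username, a.accessed_at, r.path
    FROM users u
    JOIN container_access_events a ON a.user_id = u.id
    JOIN container_repositories  r ON r.id = a.repository_id
    WHERE u.last_seen_country = 'NL';

    -- once the container_* tables live in a separate logical database,
    -- plain SQL can't join across them; you'd need an API call, a foreign
    -- data wrapper, or application-side stitching instead.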
A
I'm sure that there are simpler examples, but suddenly we can't do it, because you can no longer do a join. Or like, hey, one of the main benefits of GitLab is that it's easy for me as a user to switch from source code to CI. If I now, instead of querying a database, need to access another app, the latency is going to increase; I don't know whether it's 2x, 10x or 100x, but in many cases it might be slower. So now it doesn't quite work as well anymore; it's more like GitHub and Travis CI.
C
If you take a look at the example, sorry to keep going back to it, but the example of Redis (I'm even wearing the t-shirt): with that example, what we were able to do is tune these different machines that have different workloads and get better performance, rather than slowing down by breaking up the data across multiple instances. So we could say, well, the CI... sorry, the cache instance has got...
C
...its own workload, and the other instances have theirs, and over time we've seen that. If anything, I think the performance improves, because you can sort of match the machines to the workload that they're running. And so for the first example, I don't... I'd do some of that investigation, and, you know, most of the time I'd base that sort of thing on logs rather than an online transaction processing database.
A
Investigation, or just small numbers, things like that. So I think... the Redis split was a big success, but one should acknowledge that it's different: for Redis there's no config change, self-hosted keeps them in the same, the actual same namespace. What we're proposing, the proposal for the manifest, is to put it in a different namespace, and that is different from Redis.
C
You'd have to kind of touch half of the application every time you try to iterate on that; ultimately you slow things down. And so at some level of complexity, having those bulkheads between different parts of your application can help speed things up to some degree, and, you know, it obviously depends on the case, and there are a lot of different cases, but if we did service extraction right, I think that would be an advantage, not a disadvantage, and it helps. If we did... is that...
A
That's a big if: if we do anything right. If I'm an NBA basketball player that scores well, I could get a great contract, yes; however, getting it right is what my inquiry is about. And what I think is happening a little bit is that this team is responsible for running all our data stores, so logically you'll focus on that, and on making these things easier and more reliable to run.
E
So I think in the end we're going to have to shard; the main Rails app database is going to have to be sharded, and likely any of these extracted databases are going to have to be sharded in the long term. A lot of the problems with this come down to: you need to figure out the best possible partitioning, and people using cross-joins across different tables makes this partitioning and sharding less efficient.
E
If, when we access the database, we can access all the information about a given project in a single request, similar to how we do that within the application, it's going to make a significant performance improvement; not by sacrificing anything, but by actually making you more aware of creating a proper split between these different responsibilities. So instead of hurting sharding, it actually kind of helps it: it makes it easier to shard the database, because these cross-joins are anti-sharding.
A
So if I can summarize your case: we have to shard at some point; when we shard, we can't do the big joins anyway, so we need to go to an API approach. I very much disagree with that. If we shard by root namespace, people only care about the people in their organization. We have customers, and customers care about... we have users, they belong to an organization; they care about that organization, they care about the containers in that organization, they care about productivity in that organization.
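A rough illustration of that point, assuming sharding by root namespace (the schema here is simplified and hypothetical): as long as every table carries the root namespace as the shard key, the joins users actually run stay inside a single shard.

    -- every row carries the root namespace (the shard key),
    -- so this join never has to leave one shard:
    SELECT r.path, m.digest
    FROM container_repositories r
    JOIN container_manifests    m ON m.repository_id = r.id
    WHERE r.root_namespace_id = 42
      AND m.root_namespace_id = 42;

    -- only cross-organization features (the explore page, instance-wide
    -- admin views, ...) would need to fan out across shards or go via an API.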
A
We have very few things that cut across organizations, and those things I agree we should treat like APIs, or the explore page, things like that. But 99.9 percent of the actual usage of GitLab is within a group, basically. So I think we have a great sharding mechanism, and I don't agree that sharding would lead us to a whole bunch of API things, where what used to be database joins now become API calls. I think making all those API calls would lead to a disastrously slow user experience.
A
I don't want to... and I think that sharding is our future. We've got to nail that, and we learned with Geo that it was super important to make prototypes. I've seen that there's been a sharding working group for months now, but I've not seen any lines of code, and I think as soon as we get down to that level it will be much easier to have these conversations, because we'll have real examples; otherwise it's always like discussing programming patterns.
A
We know that sharding is our future. I'd put all our chips on sharding, because if we're able to shard, I think we'll get to 100x. I've seen your comments that roughly 20% of our data can't be sharded; I'd love to dive into that and look at what that actually is, because I think sharding is going to be amazing for us, and I think we should bet on that. I don't... I think the detour of spinning out services... look, we can discuss that, but we need to... wait.
B
On the topic of the sharding working group: most of the time has been spent investigating Citus, because it showed a lot of promise early on, and then we ran into the licensing roadblock, both on the enterprise and the community side. So we're done with our Citus evaluation, and now we'll switch over to what's built into PG 11 and foreign data wrappers, and what the various permutations are there.
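For context, a minimal sketch of the built-in Postgres 11 machinery being referred to: declarative hash partitioning plus postgres_fdw, so a partition can live on another server. Table and server names are illustrative only.

    -- PG 11 declarative hash partitioning on a candidate shard key
    CREATE TABLE projects (
        id                bigint NOT NULL,
        root_namespace_id bigint NOT NULL,
        name              text
    ) PARTITION BY HASH (root_namespace_id);

    CREATE TABLE projects_p0 PARTITION OF projects
        FOR VALUES WITH (MODULUS 4, REMAINDER 0);

    -- a partition can also be a foreign table on another server, which is
    -- one of the "permutations" mentioned above (a user mapping would also
    -- be needed before it can actually be queried)
    CREATE EXTENSION IF NOT EXISTS postgres_fdw;
    CREATE SERVER shard1 FOREIGN DATA WRAPPER postgres_fdw
        OPTIONS (host 'shard1.internal', dbname 'gitlabhq_production');
    CREATE FOREIGN TABLE projects_p1 PARTITION OF projects
        FOR VALUES WITH (MODULUS 4, REMAINDER 1)
        SERVER shard1;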
A
That sounds really exciting. Intuitively, I have a hard time picking between database sharding and application-level extraction, but database sharding for sure is ideal for the speed of iteration at GitLab. If we can make that work, that would be amazing. It's going to be tougher, but I'd love to explore that, and I think we should totally focus on that and put all our chips on that as the way to get there. Thanks, everyone, for the conversation; strong opinions, loosely held, and I do recognize that.
B
On the database team we have some ideas on what to do next; some of it is mitigation of the current database size and some things that we can do, and then we'll get into more detail on what we can do going forward with partitioning and sharding iteration, with what's built into Postgres 11. I will summarize those in the brainstorming doc and then make sure they're somewhere in an epic or an issue as well. So I'll follow up with a summary. Thanks.