From YouTube: 2020 03 02 Database Sharding Working group
A
You know, working on feedback from Ben's questions from last week, as far as the timeline scale for a Citus proof of concept, I added the issue there. Currently Andreas is researching the features that we use that Citus does not support, and it's in a comment. If you scroll down on the issue, some of the features that we're currently using are in the first comment here.
A
So, Andreas can speak to this if I am speaking incorrectly, but we're still doing some research on what Citus can provide us, what some of the blockers will be, and how that's going to lead into subsequent work, and we'll get to the other trade-offs about maybe application-level sharding as well. But it feels like a scale of weeks just on the current research for Citus, and then the overall POC implementation timeline is going to depend on the outcomes from the Citus research.
A
B
Yeah, that's right. I think we already know that there are blockers in some of Citus's support for us, some key features that we are already using. So it's not a matter of just exchanging the database. We also have to figure out how we do that in the application, and even more, apart from those features, I think it's also about the application design. So how do we?
B
How do we design or redesign the application to basically work well with Citus DB? And for my part, I have to look and understand more about Citus DB first to even start thinking about a POC. So going forward, if we want to explore that option, I would like to have some time to learn more about Citus DB first.
A
Okay, thank you. And then there was a follow-up item to talk about partitioning timelines. So there's going to be a trade-off between focusing on the sharding implementation and partitioning performance improvements. They kind of called out what it would take to do the partitioning work. We have a small team of developers on the database team, so we need to get some clear guidance, probably from Christopher, on which to focus, because at some point in time there's probably going to be a sacrifice.
A
B
A
C
I think the goal of this was to try and get to isolation, right, so we can reduce the blast radius on GitLab.com if we do have a database problem, or if we have a performance issue that affects the database, for example. That was my understanding, but I think there's something else. Obviously there are some tables that are very large, and partitioning could help those, and also, as I understand, it was a prerequisite to actually get sharding done.
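For context on the partitioning idea mentioned here, below is a minimal sketch of what splitting one very large, append-mostly table might look like with PostgreSQL declarative range partitioning, driven from Python. The table, columns, and DSN are invented for illustration, not taken from this meeting.

```python
# Hypothetical sketch: monthly range partitioning of a large table.
# Table name, columns, and connection string are invented examples.
import psycopg2

DDL = """
CREATE TABLE audit_events (
    id         bigserial,
    created_at timestamptz NOT NULL,
    payload    jsonb,
    -- the partition key must be part of the primary key
    PRIMARY KEY (id, created_at)
) PARTITION BY RANGE (created_at);

CREATE TABLE audit_events_2020_02 PARTITION OF audit_events
    FOR VALUES FROM ('2020-02-01') TO ('2020-03-01');
CREATE TABLE audit_events_2020_03 PARTITION OF audit_events
    FOR VALUES FROM ('2020-03-01') TO ('2020-04-01');
"""

conn = psycopg2.connect("dbname=example")  # assumed local database
with conn, conn.cursor() as cur:           # one transaction, commits on exit
    cur.execute(DDL)
```

Queries filtered by `created_at` would then touch only the matching partition, which is one way a very large table's operational blast radius can shrink without full sharding.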
A
I grouped that under availability, Joshua, and called it out in the description, so, better isolation of database outages. I will ping product and engineering leadership to make sure that this is accurately reflected, so we're moving forward with the right priorities in mind.
D
Yep, Shopify did a kind of shared-nothing architecture, right, where they basically have pods; everybody has their own playground, and that works for them because they're really independent, like every customer has their own shopping cart. GitLab may not be as much like that. There may be some of that, like you could have a top-level organization, and they only have access to that, and that's its own pod. But then, if they want to search across GitLab.com, that's, you know, they either can't do that or we have to.
D
You know, make that a slower, less available feature, right? But that may be the trade-off, right? If you want it enterprise-grade, maybe it should be their own playground; maybe they shouldn't be subject to the rest of GitLab.com. I'm not sure. It is kind of a product decision as well: do we want to put every customer in its own isolated sandbox? There's an argument to be made that's a good thing, right?
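To make that trade-off concrete, here is a small, hypothetical sketch of why a cross-pod search becomes the slower, less available feature: it has to fan out to every pod, so its latency is set by the slowest pod and any pod outage degrades the results. All names and the simulated failure are invented.

```python
# Hypothetical sketch of cross-pod "scatter-gather" search.
from concurrent.futures import ThreadPoolExecutor

PODS = ["pod-1", "pod-2", "pod-3"]  # each pod: own app servers + database

def search_pod(pod: str, term: str) -> list[str]:
    if pod == "pod-2":                       # simulate one pod being down
        raise ConnectionError(f"{pod} unavailable")
    return [f"{pod}: result for {term!r}"]   # stand-in for a per-pod query

def global_search(term: str) -> tuple[list[str], list[str]]:
    results, degraded = [], []
    with ThreadPoolExecutor() as pool:
        futures = {pool.submit(search_pod, p, term): p for p in PODS}
        for fut, pod in futures.items():
            try:
                results += fut.result(timeout=1.0)
            except Exception:
                degraded.append(pod)         # partial results, not a hard fail
    return results, degraded

print(global_search("terraform"))  # results from two pods, one degraded
```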
B
I feel the same way about it. That's what I meant: we also have to think about the application, how the application behavior matches what we have in the database, because if we keep building features that are cross-shard, it's probably not going to work with any of the database sharding solutions, right? I created a small issue to discuss that, or to start that discussion. Think about what...
B
If we take top-level namespace, for example, I think it's the natural key to the application's data today, right? Most of the data that we have always lives in a namespace. There are a few exceptions to that. Maybe we can look at what these exceptions are and figure out how we want to treat them going forward.
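As a concrete illustration of that idea, here is a minimal sketch of declaring namespace-keyed tables as distributed in Citus. `create_distributed_table` and `create_reference_table` are Citus's actual functions, but the table list and the assumption that each sharded table carries a denormalized `root_namespace_id` column are hypothetical.

```python
# Hypothetical sketch: top-level namespace as the Citus distribution key.
import psycopg2

NAMESPACE_KEYED = ["projects", "issues", "merge_requests"]  # illustrative
EXCEPTIONS = ["users"]  # data that lives outside any namespace

conn = psycopg2.connect("dbname=example")  # assumed DSN
with conn, conn.cursor() as cur:
    for table in NAMESPACE_KEYED:
        # co-locate all rows of one top-level namespace on one shard
        cur.execute(
            "SELECT create_distributed_table(%s, %s)",
            (table, "root_namespace_id"),
        )
    for table in EXCEPTIONS:
        # small namespace-less tables can be replicated to every node
        cur.execute("SELECT create_reference_table(%s)", (table,))
```

The exceptions B mentions are exactly the tables that have no such natural key; reference tables are one Citus answer for the small ones.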
D
A
It's not... so the original thought, I guess, or approach, was that Andreas and I were going to sit down and go through incidents and see where DB sharding may have benefited. But once we both kind of looked at it, it made sense to have infra involved, and that's what we're asking for here. Chun added an assignment for Jerry to take a look and see if there's anybody who can be involved and run through the incidents, because that might give us more feedback that, hey, you know, X percent of these would actually benefit from DB sharding.
A
Most of the incidents would have benefited from a higher-level application sharding, and it's quite possible we're not even looking at the right list of incidents here. So, just getting someone from the infrastructure side who was closer to the incidents and getting their feedback on what they think, maybe what architectural changes, or what kind of feedback they can give us on the right direction to take.
E
A
And so I can make sure that we have a follow-up. I'm here to get the right people involved, and, if we need to, have a synchronous working session where we just share a screen and run through them real quick and do like a gut check, because there's, I think, five hundred incidents listed in the filter for this issue: run through them and see what kind of breakdown we get on where we think database sharding would have helped, or application sharding, or maybe it's something altogether different, right?
A
Maybe we run through these incidents and figure out that caching would solve all of our problems. Don't quote me on that, I'm just throwing a third alternative in there, and it was the first one that came to mind. But I will follow up and make sure that somebody is on this, that we schedule the time to either run through them synchronously or asynchronously, and get some metrics from it.
C
Thanks for raising this. I thought we'd sort of identified that sharding was the solution to our problems, because of the sharding working group we have, but yeah, we don't know for sure. We should definitely make sure we're doing the right thing. Do we have a... I guess I cannot find the issue for the genesis of this working group. That may also be helpful. I imagine we started it for a reason.
A
I don't think there is. Well, I don't know, and Chun, maybe you can correct me on this, whether there was an issue for the genesis of this working group. I think it was thrown out there that sharding will solve some of our problems, and the original focus was on database sharding, but it could be that this morphs into a more general sharding solution for solving availability and performance issues.
E
Our team is definitely the first one we identified to work on this, but still, I believe over the last several weeks we have been trying harder to understand what problems we are trying to solve. At this point, I think the best we know is that we want to reduce the blast radius, but it's still a little bit ambiguous at this moment what that...
E
...is, and how database sharding would help. So I think this action item, to review the past incidents and identify whether sharding would help here, is very solid homework to understand if the solution is the right path or we need to focus on something else. So I think this is very important while we are, you know, doing the POC of Citus. This is a very good question to validate first.
E
A
A
C
A
Sharding will solve... well, it will improve our scalability story, right? Implementation details aside, it will definitely help with some level of scalability, and we've gotten feedback from Andreas and from some of our other database expertise on site that sharding is something that we need to look into implementing for long-term scalability. So there's no doubt that sharding the database would help scale our .com offering. We're now getting into the implementation details: is this the only solution for scaling? That's why we're asking those questions. Maybe there's another sharding implementation?
A
E
A
B
B
By reviewing those incidents, I think we can also better understand what the problems are that we'd like to see solved with it, because we have a lot... we've had a lot of problems with PgBouncer, for example, as well, the load balancing in front of the database, and if we exchange the database with Citus DB, we're still going to have these problems. So I think we would get a lot of clarity from looking at the existing incidents.
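For reference, the PgBouncer layer mentioned here is a connection pooler that sits in front of Postgres and stays in the picture regardless of which database engine is behind it. A minimal, hypothetical pgbouncer.ini sketch; all hosts and values are invented examples, not GitLab's settings.

```ini
; Hypothetical PgBouncer sketch; values are invented examples.
[databases]
; clients connect to "gitlabhq" here; PgBouncer fans into the real server
gitlabhq = host=10.0.0.5 port=5432 dbname=gitlabhq_production

[pgbouncer]
listen_addr = 0.0.0.0
listen_port = 6432
pool_mode = transaction   ; reuse server connections between transactions
max_client_conn = 2000    ; many application connections...
default_pool_size = 100   ; ...multiplexed onto few server connections
```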
A
Let's see, Jose, on the call: is this something you can take back to Jerry, or, I don't know, maybe even volunteer to help out and run through the incident list? I wouldn't spend a ton of time on each; I think it's just a gut check. If we have 500 issues in there, just run through them and say, yeah, application sharding, database sharding, or something else altogether would help with all of these, and we can list them out in that issue that we've created.
E
B
B
Maybe just to recap what they did: there is an issue with a link to the article they wrote on their blog. So what they did was, like Stan explained already, they have a multi-tenant application. So it's very easy to say that, you know, we can...
B
You can easily shard that by customer, for example, because there's no feature that goes across shards. What they ended up doing is creating what they called pods, where each pod contains application servers and also a database, basically a full environment, if I understand that correctly, and the database contains data for a subset of the customers only. And yeah, you can see that in that picture.
A
A
Yeah, sure, right.
A
B
B
Yeah, that's right, okay. And in our case, if we assume for a second that we have a similar model, we can say, by comparison, that a top-level namespace is a boundary for a customer, right? So typically, if we look at our organization, we have gitlab-org as a top-level namespace, whereas another company would create their own namespace, and most of the data lives inside that namespace.
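Under that assumption, routing in a Shopify-style pod model could be as simple as a lookup from top-level namespace to pod. A hypothetical sketch, not an actual GitLab design; the mapping and names are invented.

```python
# Hypothetical sketch: route each request to the pod that owns the
# customer's top-level namespace.
POD_BY_ROOT_NAMESPACE = {
    "gitlab-org": "pod-1",
    "acme-corp": "pod-2",
}

def root_namespace(full_path: str) -> str:
    # "gitlab-org/gitlab/issues" -> "gitlab-org"
    return full_path.split("/", 1)[0]

def route(full_path: str) -> str:
    pod = POD_BY_ROOT_NAMESPACE.get(root_namespace(full_path))
    if pod is None:
        raise LookupError(f"no pod assigned for {full_path!r}")
    return pod  # the app would then talk only to this pod's database

print(route("gitlab-org/gitlab/issues"))  # -> pod-1
```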
B
G
The downside to doing that level of customer sharding is, if that shard has difficulty, that whole customer is down. And when you do more general sharding, for example, if you're using a full range shard or full key sharding, like something Cassandra does, a small amount of every customer is affected instead of all of a single customer. We have this...
G
We have this problem with Gitaly, because with Gitaly we're doing exactly this: with Gitaly we can isolate a specific customer to a specific Gitaly shard. We did this for gitlab-org, and it has the downside that, if that Gitaly node is down, that whole customer pod is affected. And it's actually worse from a scaling perspective, because if we have a large customer and we put them onto a custom shard, we now have a component where that one customer shard...
G
It has to be maintained separately. Versus, if that customer comes on board and they have, say, 50 projects, and we have 50 Gitaly servers, we could have that customer spread evenly across all of our Gitaly fleet, and if one storage pod goes down, that customer is not completely affected. So there's a trade-off there.
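The trade-off described above can be put in rough numbers. A toy sketch with invented figures: the same large customer pinned to one shard versus hashed across a 50-node fleet, and the fraction of that customer affected when one node fails.

```python
# Toy sketch of the blast-radius trade-off: customer pinned to one shard
# vs. their projects hashed across the fleet. Figures are invented.
NODES = 50
PROJECTS = [f"project-{i}" for i in range(50)]  # one large customer

pinned = {p: 0 for p in PROJECTS}                # all on node 0
spread = {p: hash(p) % NODES for p in PROJECTS}  # hashed placement

def affected(placement: dict[str, int], down_node: int) -> float:
    hit = sum(1 for node in placement.values() if node == down_node)
    return hit / len(placement)

# Node 0 goes down:
print(f"pinned: {affected(pinned, 0):.0%} of the customer affected")  # 100%
print(f"spread: {affected(spread, 0):.0%} affected")  # ~1/NODES on average
```

The flip side, as noted earlier, is that with hashed placement every node failure touches a little of every customer, while pinning confines each failure to one customer.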
B
A
G
C
Yeah, I think that's a different question. Obviously one concern with the full pod model is that, latency-wise, it's not ideal if you have load balancers in some portion of the country, you know, some portion of the world, and you're getting served elsewhere. And so it gets into the question of, it starts to feel more like a true second instance.
G
C
And there's a discussion with Geo right now, since Geo is running on staging and we're ready to start, you know, considering moving .com, and we're sort of trying to figure out these questions here, which is: do we do disaster recovery? Do we use Geo, like as a sync, and do we get that data? A whole bunch of questions here, because, you know, replication pops up in there as well.
A
That also goes into Ben's comparison, the Ben and Andreas conversation about the sharding implementation: do we want to isolate customers, or do we want to have, like, an even separation of isolation? So in the example that Ben gave, you know, if we have a shard go down, do we want only a subset of features and functionality to go down with it, or do we want an entire customer to go down? It seems like a product management question.
C
G
It's the easy way, but it still has, like we were saying, other implications. You know, if you look at something like YouTube, how YouTube solved their sharding problem: they did the Citus DB equivalent, aka Vitess. They wrote a...
G
They wrote a horizontal sharding system so that they could generically shard everything across all users, and they built that into a robust system. And eventually, now, of course, YouTube has moved to a completely different method, which is, as far as I've heard... YouTube is now entirely on...
C
F
C
I mean, so I think it's sort of an engineering question, like what's the best way to achieve isolation, which is one of the main goals, and then there are other considerations, like, you know, Gitaly right now is going down the road of, well, you have one Gitaly server, or in the future one Gitaly HA server.
C
You know, or two Gitaly servers and the Praefect server for a set of customers, and sharding that way, so we're kind of already headed down that road. But I think the business requirements are obviously isolation and availability, and how we achieve them is, you know... obviously there's a cost question on the business side, but whether we shard it how they currently are, with the Gitaly model, or more of a sort of distributed model, is, I think, an engineering decision in many respects.
A
F
I'm sorry... oh, there are three cases here now; hopefully that brings a few laughs. So I'm going to take some pages from the GCP migration, and I think we really need to call out what are the steps we're going to be taking. And what I'd like this group to come up with is: if it was successful, great, we move to the next step; but then we anticipate, if it fails, who's going to be dealing with it, what's...