From YouTube: 2022-02-28 Database Scalability WG Weekly
A: Okay, it's time, so let's get started. Welcome, everyone. Today is February 28th. This is the Database Scalability Working Group weekly update, and I'll get started from the agenda. The first item is significant updates since the last meeting. The first one, bullet number eight, is the update from the sharding group. Very excited to acknowledge that the decomposition team ran a fully decomposed pipeline and it turned green. This is actually a precursor of Phase 7, the last phase of the rollout.

A: So this is a precursor; we're still not there yet, but it's really great to see that we can run the fully decomposed pipeline. This essentially demonstrates that we can already run GitLab with two completely different databases, and our solution works as designed. The MR is now merged; it's just a couple of lines changed, but there's tons of work behind it. So, good news to share. Any questions?
A: Okay, then moving on to the next few items from the sharding group. For Phase 4, the next step is documented, and we are planning to tackle this in a staging environment first; that's the first part of this Phase 4. However, we're yielding priority to the Phase 3 production deployment right now; we'll mention that later. Then, we need to think about the disaster recovery mechanisms and strategy, and a task force was formed for this brainstorming.

A: The task force consists of Camino, Fabian, Jose Raphael, and Gonzalo. The last item from the sharding group is that performance concerns surfaced for a Phase 6 issue, actually concerning the viability of the current approach.

A: So the problem is that a deep hierarchy of groups causes the performance problem. Progress is being made, now targeting 14.9. Any questions?
B: Jen, let me ask about the disaster recovery issue: is Geo being talked about in there? I didn't see it mentioned explicitly in the issue.
A: Phil, this is more about operations, or how we have a strategy and a mechanism to react to some disaster in the production environment, like how we can restore the service and how we can recover the data if something happens.
C: You know, that should be fairly quick, but we have to establish, if something goes wrong in that process, what do we do? How do we go back, how do we recover from this?
D: Yes, the idea is that when we're going to be deploying, we're definitely going to be doing some kind of validation phase with testing, but we're also going to leave the door open.
D: So say we split the databases, mainly the CI one, and we see errors in the fifth minute after the deployment, but our rollback window is defined as 30 minutes: we do the rollback. Now, we know roughly what type of data we'd lose, probably some kind of CI traces and some other machine-generated data, and the question, before we define this disaster recovery plan, is going to be:
D
How
are
we
are
willing
to
lose
data
to
speed
up
iteration
of
rolling
out
changes,
because
we're
gonna
be
deploying
that,
probably
in
that
time,
there
is
like
there
is
amount
of
the
traffic
we're
gonna
be
defining
the
times
the
windows
for
this,
so
we're
gonna
be
actually
evaluating
risk
versus
benefit
versus
the
cost
associated
and
the
likelihood
of
this
occurring
to
figure
out
actually
like,
basically
our
willingness.
D
If
we
are
finally
flew
some
amount
of
the
machine
generated
data
or
if
we
are
not
fine
and
what
is
the
criteria
that
would
trigger
the
rollback?
What
kind
of
metrics
to
look
at
this
is
exactly
the
type
of
the
disaster
recovery.
It's
right
after
the
deployment
of
the
axe
for
the
composition
to
the
production.
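The rollback criteria described above can be sketched as a simple threshold check. This is a minimal illustration only: the metric names, thresholds, and window constant are invented for the example, not GitLab's actual tooling.

```python
# Minimal sketch of a post-deployment rollback decision.
# All names and threshold values here are hypothetical illustrations.
from dataclasses import dataclass

ROLLBACK_WINDOW_MINUTES = 30  # past this point, rolling back loses too much data


@dataclass
class DeploymentMetrics:
    minutes_since_deploy: int
    error_rate: float              # fraction of failing requests
    replication_lag_seconds: float


def should_roll_back(m: DeploymentMetrics,
                     max_error_rate: float = 0.01,
                     max_lag_seconds: float = 60.0) -> bool:
    """Roll back only inside the agreed window, and only when a metric breaches its threshold."""
    if m.minutes_since_deploy > ROLLBACK_WINDOW_MINUTES:
        return False  # outside the window: fix forward instead of rolling back
    return m.error_rate > max_error_rate or m.replication_lag_seconds > max_lag_seconds


# e.g. errors seen five minutes after the deployment, well inside the window
print(should_roll_back(DeploymentMetrics(5, 0.05, 2.0)))   # True
# same errors, but 45 minutes in: past the window, so no rollback
print(should_roll_back(DeploymentMetrics(45, 0.05, 2.0)))  # False
```

The point of writing it down ahead of time, as the discussion suggests, is that the window and thresholds are agreed before the deployment rather than debated during an incident.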
B: Yeah, I like defining all that stuff ahead of time. That's really smart. And if you generate a plan and you want someone to look at it and punch holes in it, I recommend Jerry, who I think is on the call, to see if he spots any potential gaps.
D: Yes, because I think in the end we're definitely going to give our recommendation, but it's really probably a product / DRI / executive decision how we actually want to tackle that, given the anticipated likelihood of the scenario occurring and the risk associated with it. So we're definitely going to be discussing that extensively, to figure out the best approach for all of us.
E: Yeah, there is an issue in the infrastructure queue that we opened to participate in this as well. I don't have it handy; I've been looking for it. Ken is driving that, but yes, do add me to that task force.

E: Yeah, the difficult part of this conversation is going to be deciding what the acceptable risks are, right? We can think of strange things that can happen, but along the lines of making the decision to move forward or roll back, we need to figure out what those thresholds are, and those are the really important ones. Luckily, we've done this in the past with the GCP migration, so we have some prior art to go back to, to at least guide how we're going to think about this problem.
C: Yeah, I think my preference would be that we think about all of the possibilities ahead of time and make a recommendation, like: this is what I think we should do, but there's always going to be risk associated with it. And this is what Eric and the other executive stakeholders need to understand at some point, and then say: yes, this is acceptable, or we need to actually do something different.
B: Right, and the risk is the severity if something were to happen, multiplied by the likelihood that it's going to happen. If we can get both of those things, it's an easier process.
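The severity-times-likelihood framing above can be written down as a tiny scoring exercise. The scenarios and numbers below are invented purely for illustration; they are not the team's actual risk assessments.

```python
# Risk = severity x likelihood, used to rank hypothetical failure scenarios.
# Scenario names, severity scores (1-5), and likelihoods (0-1) are illustrative only.
scenarios = {
    "lose a few minutes of CI traces": (2, 0.10),
    "lose user-authored data":         (5, 0.01),
    "extended read-only downtime":     (4, 0.08),
}

# Rank by risk score, highest first, so mitigation effort goes where it matters.
ranked = sorted(scenarios.items(),
                key=lambda kv: kv[1][0] * kv[1][1],
                reverse=True)

for name, (severity, likelihood) in ranked:
    print(f"{name}: risk = {severity * likelihood:.2f}")
```

In this toy example, a moderately severe but comparatively likely outage outranks the catastrophic-but-rare data loss, which is exactly the kind of counterintuitive result that makes writing the scores down worthwhile.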
A: As far as I know, there is just one more issue remaining for Phase 4 from the ops team, so they're working on that, and on the dev team all the Phase 4 issues were resolved. So there are still Phase 6 issues outstanding on both the ops and the dev side, but we can wait a bit more time. Is Steve here for the infrastructure update? Nope. Okay, I'll follow up offline with Steve, but we do have a few items below here.
A: Moving on to the rollout progress review: our confidence level in the timeline still remains at fifty percent. I have two bullets here. Development outstanding issues continue trending down, and there the confidence level is high. On the other hand, the deployment is behind, so see the risks below in bullet four. But overall, our confidence level remains unchanged at fifty percent.
A: Okay, then the chart here shows our issue burn-up. You can see the green line; the green line is actually the remaining outstanding issues, and it continues to drop, from a development perspective.
A: Okay, then the risk section. We have two risks here. The first one is the primary PG cluster reboot, which is a blocker to our Phase 3 production deployment, and the deployment has already been delayed by two weeks. So I'm asking if there is a new ETA for the reboot, so that we can move forward with the production deployment.
E: Yeah, I'll follow up with Ken, who actually was the squad lead for the database. I know there was a change issue submitted for last Thursday to do this, and for some reason it was cancelled. I don't have the details, but Ken should know when this is going to happen.
A: Thank you, Gary, and also thank you for pinging Ken. The next one is actually a request: Jose suggested having a dedicated SRE moving forward, because there is a lot in the backlog now and there are plenty of deployment tasks we can work on, and that will likely be a sustainable workload going forward. So it's better to have a dedicated SRE riding along with the DBREs. I'll also raise this with Stephen and Ken. Thank you, Jerry, for your suggestion.
A: Okay, moving on, the last bullet is the end-of-month timeline review, which basically repeats what we are looking for help with from infrastructure: the Phase 3 production deployment is behind schedule, so we'll continue to work on that. Also, the Phase 4 staging deployment can basically start now. There are just one or two more outstanding issues, but those are corner cases, and they will not slow down the Phase 3 production deployment.

A: We can start to work on that now. So, Fabian, do you want to verbalize your comment?
C: Yeah, it's minor. We still have a few outstanding issues, like two or three issues in review, and they're going to close very soon. We're also already deploying to the Phase 4 50k reference architecture. So it's looking pretty good; we're quite happy about the progress that the team has made.