GitLab Infrastructure Group, 22 Dec 2022

From YouTube: Discussion about Gitaly Disaster Recovery plans

Description

https://docs.google.com/document/d/1sQo-ccDd_Ix3ASZjQjaMaW3p_Kr90R86aNmxHUY9Rb0/edit

A

It is December 22nd, and we are going to have a chat about Gitaly disaster recovery. To get started, we have an agenda, and I'm going to go ahead and share my screen.

A

Okay, so I have the first item. This is the issue that we're checking for this discussion, which is, for this quarter, to come up with a plan to see what we can do for Gitaly DR. For a little bit of background, the current DR solution for Gitaly is to use snapshots. We have established that the RPO for snapshot recovery is six hours. RTO is about 30 minutes.

A

RPO is the amount of potential data loss, and the reason for that is the frequency at which we create disk snapshots, which can be up to six hours apart. More typically, I think snapshots are as old as four hours. For our test, we had a snapshot that was, I think, between one and two hours old. It takes about 30 minutes to restore from a snapshot. That's the time to create the disk and reconfigure Gitaly.
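As a rough illustration of how snapshot age translates into data loss, here is a minimal Python sketch. The six-hour RPO and 30-minute RTO are the figures quoted above; the function names and the example timestamps are hypothetical and exist only for illustration.

```python
from datetime import datetime, timedelta

# Figures quoted in the discussion.
RPO = timedelta(hours=6)      # worst-case age of the newest snapshot
RTO = timedelta(minutes=30)   # time to create the disk and reconfigure Gitaly


def data_loss_window(last_snapshot: datetime, failure: datetime) -> timedelta:
    """Return how much written data is lost when restoring the last snapshot."""
    if failure < last_snapshot:
        raise ValueError("failure time precedes the snapshot time")
    return failure - last_snapshot


def within_rpo(last_snapshot: datetime, failure: datetime,
               rpo: timedelta = RPO) -> bool:
    """Check whether the data loss stays inside the stated RPO."""
    return data_loss_window(last_snapshot, failure) <= rpo


if __name__ == "__main__":
    # Hypothetical timestamps: the snapshot is four hours old when the node fails.
    snapshot_taken = datetime(2022, 12, 22, 8, 0)
    failure_at = datetime(2022, 12, 22, 12, 0)

    print("data lost:", data_loss_window(snapshot_taken, failure_at))
    print("within RPO:", within_rpo(snapshot_taken, failure_at))
    print("service restored by:", failure_at + RTO)
```

The point of the sketch is that the data loss depends only on how old the newest snapshot is at failure time, while the RTO is added on top as restore time.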

A

So we have a whole bunch of options laid out for how to improve this, but so far we haven't really narrowed in on a solution, because none of the approaches so far are ideal, at least for the short term. So I wanted to do this recording to discuss the options with Steve. Steve, you have the next item.

B

Yeah, so by next item, do you mean b.i or 2.a? 2.a, okay. Yeah, I think you already covered that one. That's the RPO and RTO of Git data, which is six hours and 30 minutes. That makes sense. Do you want to go through the proposals?

A

Yeah, we can go through each of them. So we have the GCP backup appliance. This is a relatively new feature that GCP offers. It's an agent that's installed on each server, plus a backup appliance that Google runs on your behalf.

A

Like I said, it's fairly new. I had a discussion with Google about this, an asynchronous discussion through a Google Doc, to answer some questions. At this time, I don't think it's going to be suitable.

A

One of the main reasons why is that there's no gcloud or Terraform support yet, which makes it a little bit difficult to automate. There's also an additional cost, which I approximated at around 500 a month per storage node. That's in addition to the storage cost, so you would have the storage cost in object storage for the data that we're backing up, plus 500 per month for each storage node.
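The cost structure described here (a flat agent fee per storage node on top of the object storage cost) can be sketched in a few lines of Python. The 500-per-month fee is the approximation given above; the node count and the per-node object storage cost in the example are made-up placeholders, not real fleet figures.

```python
# Approximate monthly agent fee per storage node, as quoted in the discussion.
AGENT_FEE_PER_NODE = 500.0


def backup_appliance_monthly_cost(storage_nodes: int,
                                  object_storage_cost_per_node: float,
                                  agent_fee_per_node: float = AGENT_FEE_PER_NODE) -> float:
    """Total monthly cost: per-node agent fee plus per-node object storage cost."""
    if storage_nodes < 0:
        raise ValueError("storage_nodes must be non-negative")
    return storage_nodes * (agent_fee_per_node + object_storage_cost_per_node)


if __name__ == "__main__":
    # Hypothetical inputs: 10 storage nodes, 200 per month of object storage each.
    total = backup_appliance_monthly_cost(storage_nodes=10,
                                          object_storage_cost_per_node=200.0)
    print(f"estimated monthly cost: {total:.2f}")
```

Because the fee is charged per source disk, the total grows linearly with the number of storage nodes.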

B

So this backs up to Cloud Storage, and the 500 a month is only to install the agent and have it run, yeah?

A

And maybe there's some flexibility there, but from my discussions with them, they charge per source disk, so.

B

As we add more nodes. When we say no gcloud or Terraform support, which part of that are you thinking of? Is it the agent installation, or is it something else?

A

Just to create the backup appliance.

B

A

Could figure it.

B

A

One of the things that attracted me to the solution was that with GCP snapshots, we don't have any control over when full or incremental snapshots are taken. We had a conversation with Google about that, and they said there's really nothing that we can do to control when that happens. That's the reason why our RPO can be so long, because these full snapshots take a significantly longer amount of time compared to the incrementals. With the backup appliance, we have a lot more control over that. So that's one benefit.

B

Yeah, so it's 500 per storage node, okay. I'm thinking about cost as well, because I'm looking at the regional provisioned storage one as well, and that's also going to cost us a bunch of money, if we go with SSD based or balanced based and whatnot.

A

Yeah, I think we could do the cost calculation. If we had tiered storage, we could probably compare the cost of this versus the cost of regional.

B

With tiered storage, does this mean that we're going to have a different RPO and RTO depending on the tier that you are on? And would the business be okay with that?

A

I don't know.

B

I think that is a big caveat with regional disks: it's going to be way too expensive to roll it out everywhere, and if we only roll it out on a license basis, it's going to be different RPOs and RTOs per license.

A

Yeah, I mean, I'm kind of going in with the assumption that we would advertise a different RPO and RTO depending on the plan, and that would probably be okay, but maybe we need to explore that a bit further. So for B2, you had some comments here, Steve.

B

Yeah, I guess we do need to benchmark whether there's any difference between regional PD balanced and PD based, like performance hits or anything like that, since they're going to be different systems. So we're going to need to look at how a Gitaly cluster reacts to a regional disk. These two proposals at the moment seem very high level, which makes sense, but I feel like before we take a decision we need to look at the full cost of each type. So for the regional one, how much more is this going to cost us with the current setup that we have, and how much is the backup appliance going to cost us? Because if it's something where we have to do a few clicks once, it's not going to be that big of a deal until GCP automates that through the API.

A

B

Yeah, so I guess we'll have to do a full, detailed cost analysis of the current setup, if that makes sense, before we can make a decision on this, and also have some benchmarking for regional PD balanced: if we deploy a Gitaly node with regional PD balanced, how quick is it going to be to recover if something goes wrong there, and things like that.

A

Obviously the backup appliance still has an RPO issue. When I asked GCP how long these incrementals will take, they said they're probably going to take around the same amount of time as snapshots. I guess we need to get an exact number for that, but we're still looking at some data loss. I assume that for the regional balanced provisioned storage it's going to be much less.

B

Yeah, a little loss. For the action items, I'm writing something down, if you scroll down on your screen. Maybe we need to create some criteria for the best option, which would be cost, of course, for the current Gitaly fleet. So let's say, for example, the backup appliance is going to cost us x amount of dollars per month for what we have right now. Then RPO and RTO, ignoring the user's license. So let's say we deploy regional balanced disks to everything and ignore what license the user is on: how much is that going to cost us, and what would be the RPO and RTO?

B

Is there anything else that we need to look at here, or is it mostly cost and RPO that we want?

A

B

A

Right, yeah, I guess performance impact needs to be taken into consideration.

A

And that includes the agent, probably, for the backup appliance, to see if that.
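The criteria named here (cost across the current fleet, RPO, RTO, and performance impact) could be captured in a small comparison structure. The Python sketch below is one possible way to do that; the class name, field names, thresholds, and every number in the example are hypothetical placeholders, since none of these figures have been measured yet.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class DrOption:
    """One disaster recovery proposal, scored on the criteria from the meeting."""
    name: str
    monthly_cost: float        # fleet-wide cost, ignoring the user's license
    rpo: timedelta             # potential data loss
    rto: timedelta             # time to restore service
    perf_overhead_pct: float   # measured performance impact, in percent


def rank_options(options, max_rpo, max_rto, max_overhead_pct):
    """Drop options that miss a target, then sort the rest by monthly cost."""
    eligible = [
        o for o in options
        if o.rpo <= max_rpo
        and o.rto <= max_rto
        and o.perf_overhead_pct <= max_overhead_pct
    ]
    return sorted(eligible, key=lambda o: o.monthly_cost)


if __name__ == "__main__":
    # Placeholder values only; real numbers would come from the cost
    # analysis and the benchmarks discussed in this meeting.
    candidates = [
        DrOption("backup appliance", 7000.0, timedelta(hours=4),
                 timedelta(minutes=30), 2.0),
        DrOption("regional balanced disks", 12000.0, timedelta(minutes=5),
                 timedelta(minutes=15), 5.0),
    ]
    for option in rank_options(candidates,
                               max_rpo=timedelta(hours=6),
                               max_rto=timedelta(hours=1),
                               max_overhead_pct=10.0):
        print(option.name, option.monthly_cost)
```

Filtering on the targets first and ranking by cost second is only one way to weigh the criteria; a weighted score would work equally well once real numbers exist.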

B

Yeah, that's true. Yeah.

A

I think probably the best option here would be to use the DR testing environment, and we can just start testing out the different options.

B

Yeah, because I'm only focusing on the first two. The last two are pretty much non-starters: increasing snapshot frequency is not possible, and the Gitaly Cluster re-architecture is way ahead of us at the moment. So focusing on those two, to me anyway, and I'm not sure about you, neither of them seems like a decent option, but we're still not clear which one is the best option.

B

So maybe we can get some more data on that, like an Excel sheet or whatever: this is how much it's going to cost us for all of them. And then, if we want to make regional balanced disk storage even more attractive, we can say here's going to be the tiered storage, with the caveat that we need to implement all that logic, which is also a bunch of work that needs to be done, and it's more of a cost optimization than anything else, I guess.

B

A

B

I'm not sure how you feel about that.

A

Yeah, no, I think that sounds good to me, and I think those three things make sense. I think the first two might be a little bit more straightforward than the performance impact. We're going to have to, I guess, set up a test bed.

B

Yeah, and I'm not sure how to test that, because benchmarking is really expensive, not just money-wise but time-wise, like getting a baseline.

A

B

I know the Gitaly team is working on creating a bunch of canned repositories, like the Android repository, the Chromium repository, and things like that, to get better benchmarking. Will is working on that, so I can ping them on the issue, if you want me to, so we can see what their progress is on that.

A

B

Will you open up issues to start, like, creating a cluster with PD regional and seeing how that goes, and things like that?

A

Yeah, I think we need to decide whether we want this to be done in this quarter or next quarter. I'm thinking it might bleed into next quarter, depending on the amount of work, so yeah, you can start opening up issues.

A

Just another thing that occurred to me: for the backup agent, it's possible that this can be done without any downtime, but regional storage is going to require a migration, so there'll be a migration cost that needs to be taken into account.

B

Yeah, that's a that's a good point.

A

B

Okay, so the epic, or rather the issue, that you have now, with a plan to reduce RPO: the original goal was to close this in Q4 and come up with a plan in Q4, right?

A

Yeah, I was thinking that the output of this issue would be this new epic with a plan. I mean, it's more like a plan to come up with a plan, but I think that's probably all we want to do in this quarter. We can probably start in January, but I'm not seeing us getting...

B

Yeah, so what if we create an epic to benchmark the proposed options? We have two proposed options, using the backup appliance and the regional disks, and then we can go through that. Another item I have in the agenda, that's on 2C: have we gone through sessions with GCP on this? Because I know that in the past we've had architecture design sessions with them to see what's best, and they did help us out as well, specifically talking about the CI architecture that we had done around two years ago.

A

Yeah, the sessions we've had with them have all been async and are captured in the docs. This was them going to the product teams to answer questions that we had with regard to snapshot frequency and the backup appliance. So that's pretty much all we've done so far. I think maybe in this epic, once we have a very high-level idea of the cost for the two options and, I guess, the impact, then we can take that to them, and maybe we can get the solutions architect to see if there's anything we're missing.

B

I would say maybe we can go to a solutions architect now and tell them: hey, we're looking into the backup appliance, we're looking into regional provisioned storage, and our main goal is to improve the RTO and RPO. Maybe they can provide us any other hints that they can give us.

A

Yeah, I guess we could. I mean, our conversations with them so far have led us to this backup appliance and using snapshots, and that's pretty much it, but I don't know if we've really talked to a solutions architect, you know, someone who is knowledgeable in this sort of thing, so maybe that's worth doing, yeah.

B

Yeah, and I think so far we've just been asking them questions on specific features. We never really asked them about the problem that we're trying to solve. So maybe we can first do that. That's the least amount of work we can do in general, because I don't think we're going to get the call scheduled next week. So yeah, maybe let's do that.

A

Is there anything else we want to discuss here? I think there probably isn't much to discuss for the Gitaly Cluster re-architecture.

B

Yeah, I feel like that is very far ahead. And my problem with regional balanced disk storage is that, if we want to go with that, we're going to end up deploying it fleet-wide, not as tiered storage, because tiered storage just makes it a lot more complicated.

A

Yeah, of course, yeah.

B

A

Okay, sounds good. So I'm going to take the action to create the epic and add issues. I'll also add an issue for this meeting with the solutions architect, so we can track it. I'm expecting that maybe some of this we can start in January, but it's definitely going to bleed into the next quarter.

B

A

B

Anything else that you need from the Gitaly stable counterpart on this, or is it mostly support? Do you want us to execute on any of this, basically?

A

I think that's going to be up to whether you have bandwidth to help. I probably don't, at least probably not in this quarter, but maybe we can work together on some of this stuff. And from the Gitaly team, I think what I would like to understand is where we are with the Raft architecture and where we are with tiered storage, if that's on the roadmap, but it's so far out, maybe it doesn't even matter.

B

Tiered storage right now, from what I understand, so the Raft architecture is a few quarters from being developed. You know what that means: it's another few more quarters to deploy it on GitLab.com. So I would say it's a year or two away, to be honest, for it to be mature on GitLab.com, and then that implements tiered storage as well. So yeah, I feel like that is very far away, unfortunately.

A

Okay, sounds good. Awesome.

B

Thank you so much, Sean.

A

Thank you. Ciao.