From YouTube: 2021-03-03 Geo DR technical iteration sync
Description
Discussion about parallel tracks of work that can be done on the technical side.
A
Cool. So, as I was just summarizing, I see that there have been discussions last week in issues and also in the document about…
B
My connection is bad. Is it better now?
B
Let's send it over to my phone, maybe. But I put some notes into the working group doc just for this meeting, because I thought through all of this, and I think there are a few things which make it fairly easy to decide right now. I mean, we have hard facts. We have this constraint of, let's say, $80k per month as a budget. And one hard fact is that the cheapest solution for everything is storage snapshots; that we know.
B
Another
heart
is:
if
you
use
this
snapshots
for
restoring,
then
that
would
always
be
way
below
above
one
hour
for
rto
target,
and
if
we
want
to
provide
something
better
for
our
premium
customers,
then
we
need
to
work
on
a
sync
solution
right
and
even
if
we
don't
decide
about
the
solution
right
now,
I
I
think
what
we
always
will
come
to
as
work
that
is
needed
is
that
we
need
to
build
a
scaled-down
copy
of
gprot
there's
no
way
around
it,
because
we
need
to
make
things
be
working
there
and
feeding
just
up.
B
It
meets
the
vertical
a
real
g
prod,
so
it
always
needs
to
be
something
like
that.
So
this
work
needs
to
be
done
anyway.
Then
we
need
to
build
in
for
automation
around
failing
over
failing
back
orchestrating
all
of
this
and
scaling
up
and
down.
This
always
will
be
needed
as
work,
and
we
need
to
work
that
we
support
deployments
on
the
secondary
side
without
database
migrations,
probably
but
still,
and.
B
Right, yes, so that's clear. And then, if we look into this business goal of supporting a below-one-hour RTO target for premium customers, then we need to build something which synchronizes Gitaly directly over to the secondary site. The snapshots are not good enough for that one, so we either need to use Geo, or set up streaming from dedicated Gitaly shards, or something like that. And that always means we need support for selectively syncing, or for putting customers on selected Gitaly nodes by plan, right?
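To make the selective-sync idea concrete, here is a minimal sketch assuming plan-based shard placement; the shard names, plan names, and the modulo routing are all illustrative assumptions, not GitLab's real assignment scheme:

```python
# Hypothetical sketch: place paid customers' repositories on Gitaly
# shards that are synced to the secondary site, and everyone else on
# shards covered by snapshots only. All names and the routing rule
# are illustrative, not GitLab's actual scheme.

PREMIUM_PLANS = {"premium", "ultimate"}

SYNCED_SHARDS = ["gitaly-dr-01", "gitaly-dr-02"]            # replicated to the DR site
SNAPSHOT_ONLY_SHARDS = ["gitaly-std-01", "gitaly-std-02"]   # snapshot coverage only

def pick_shard(plan: str, repo_id: int) -> str:
    """Route a repository to a shard pool based on the customer's plan,
    spreading repositories across the pool by id."""
    pool = SYNCED_SHARDS if plan in PREMIUM_PLANS else SNAPSHOT_ONLY_SHARDS
    return pool[repo_id % len(pool)]

if __name__ == "__main__":
    print(pick_shard("ultimate", 7))  # -> gitaly-dr-02
    print(pick_shard("free", 7))      # -> gitaly-std-02
```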
B
It would be super cheap to achieve something, at least in the short term, because even if you increase the frequency of the snapshots, we only pay for the incremental size of the snapshots, and if you have a higher frequency, then the increments get smaller. So we would stay below $10k additionally, even if we do it every half hour, I estimate. So that would be a very cheap solution for building something for all customers.
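A back-of-envelope check of that cost claim: only the 16 TB data size and the sub-$10k ceiling come from this discussion; the churn rate, snapshot price, and retention window below are illustrative assumptions.

```python
# Why higher snapshot frequency barely changes cost: increments are
# priced on changed bytes, so half-hourly snapshots just split the
# same daily churn across more, smaller increments.

TOTAL_DATA_TB = 16          # Gitaly data size mentioned in the meeting
DAILY_CHURN_FRACTION = 0.02 # assumed: ~2% of data changes per day
PRICE_PER_GB_MONTH = 0.026  # assumed snapshot storage price, USD
RETENTION_DAYS = 14         # assumed: keep two weeks of increments

daily_churn_gb = TOTAL_DATA_TB * 1024 * DAILY_CHURN_FRACTION
retained_increment_gb = daily_churn_gb * RETENTION_DAYS
monthly_cost = retained_increment_gb * PRICE_PER_GB_MONTH

print(f"daily churn: {daily_churn_gb:.0f} GB")
print(f"retained increments: {retained_increment_gb:.0f} GB")
print(f"approx. monthly snapshot cost: ${monthly_cost:,.0f}")
# With these assumptions the result lands far below the $10k/month
# ceiling discussed above, even before tuning any of the inputs.
```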
B
But
then
we
would
have
the
problem
of
being
the
database
and
the
goodly
charts
being
out
of
sync
when
we
fail
over.
So
that's
the
big
downside
of
that
one.
Any
other
solution
for
everybody
would
be
something
with.
I
don't
know
an.
C
Whether or not that's even a viable solution… you know, we brought down a database system because we did a snapshot that was live. That was a mistake on our part, but all of our file servers are technically live objects. If we do a live snapshot, we risk doing something bad to a specific repository, potentially.
A
Yeah. Like, the first thing: I wouldn't focus on achieving the RTO and RPO targets. That is so far off right now that, at this moment, for the technical work we need to do, I don't think we should constrain ourselves, neither for the stuff we want to do immediately nor for the stuff that we need to plan and get others to participate in.
A
That
needs
to
happen
in
parallel
right,
so
one
track
is
going
to
be
staying,
standing
up
the
site
standing
up
like
and
playing
what
can
be
done
with
the
things
we
have
right
now,
even
if
they're
slow,
even
if
they
are
not
optimized,
even
if
they
are
going
to
take
hours,
if
not
days,
to
complete
that
is
okay,
while
things
are
being
built
in
the
background
right.
So.
C
So
we
could
focus
our
efforts
in
that
realm
and
at
least
getting
everything
stood
up
and
ready
to
go.
At
that
point,
it's
just
a
maybe
we
could
start
testing
some
of
these
other
items,
such
as
the
snapshot
recovery
option
just
to
see
where
that
lands,
what
is
required
for
it
and
what
we
would
need
to
maybe
mount
that
to
servers
that
are
stood
up
in
that
area.
C
That
way,
it's
not
something
we're
duplicating
the
information
on
the
secondary
site,
but,
like
I
don't
know
what
that
looks
like
because
you're
going
to
have
a
file
server,
that's
not
syncing
data,
and
then
we
say:
hey
remount
it
with
this
data
and
all
of
a
sudden.
You
have
data
available
to
you
and
then
it's
a
weird
thing,
but
I
think.
B
I would think that we just don't have any Gitaly nodes on the other side. We just have database replication, and in case of failover, we spin up everything that is needed and mount the disks that we create from snapshots. So there would be no working GitLab site on the other side, and there's no Geo involved in this. It's just database replication, plus infrastructure which can be scaled up on demand, plus automation to get the snapshots built into disks and mounted to the right servers.
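A minimal sketch of that failover flow, assuming a cold secondary with snapshot-backed Gitaly disks; every function here is an illustrative stub, not an existing GitLab or cloud-provider API:

```python
# Sketch of the failover flow described above: nothing runs on the
# secondary site until failover; then compute is scaled up on demand
# and the latest snapshots are turned into mounted disks. All
# functions are placeholder stubs, not real APIs.

def provision_compute(servers, zone):
    print(f"scaling up {len(servers)} servers in {zone}")

def promote_database_replica(zone):
    print(f"promoting the streaming database replica in {zone}")

def create_disk_from_snapshot(server, zone):
    # This is the 20-40-minutes-per-terabyte restore step
    # discussed a little further down.
    disk = f"disk-{server}-{zone}"
    print(f"building {disk} from the latest snapshot of {server}")
    return disk

def mount_disk(server, disk):
    print(f"mounting {disk} on {server}")

def switch_traffic(zone):
    print(f"pointing traffic at {zone}")

def failover(gitaly_servers, dr_zone):
    provision_compute(gitaly_servers, dr_zone)  # infrastructure, scaled on demand
    promote_database_replica(dr_zone)           # the DB was replicating all along
    for server in gitaly_servers:               # snapshots -> disks -> mounts
        mount_disk(server, create_disk_from_snapshot(server, dr_zone))
    switch_traffic(dr_zone)

if __name__ == "__main__":
    failover([f"file-{i:02d}" for i in range(1, 4)], dr_zone="europe-west4")
```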
B
Yeah, I mean, it will take time to get all the servers up, with this amount of servers and also a high amount of disk snapshots. Building disks out of snapshots for 60 servers, that can take a long time. I know that for databases which are one terabyte, it takes 20 to 40 minutes to build a disk out of a snapshot, and we have 16 terabytes for Gitaly, so it will be around an hour.
B
I guess. But it depends on a lot of things, like from which region we get the snapshot, and maybe other things, like how many increments we had on the snapshot. So we really need to test this timing a little bit. But for an RTO above one hour, that's fine, I think. Like, if you say we want to be back within four hours, I think that should clearly work.
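A rough worked version of that timing estimate: the 1 TB in 20-40 minutes figure and the 16 TB across 60 servers come from the discussion, while the parallelism assumption and the largest-disk size are illustrative.

```python
# Back-of-envelope restore timing for the snapshot-based failover.

TOTAL_TB = 16
SERVERS = 60
MIN_PER_TB, MAX_PER_TB = 20, 40  # minutes to build a disk from a 1 TB snapshot

tb_per_server = TOTAL_TB / SERVERS

# If all restores run in parallel, wall-clock time is set by the
# largest disk, not the sum. Assume (illustratively) that the
# biggest shard is ~1.5 TB.
largest_disk_tb = 1.5
par_lo, par_hi = MIN_PER_TB * largest_disk_tb, MAX_PER_TB * largest_disk_tb

# Fully serial restores would be far worse:
ser_lo, ser_hi = TOTAL_TB * MIN_PER_TB / 60, TOTAL_TB * MAX_PER_TB / 60

print(f"average data per server: {tb_per_server:.2f} TB")
print(f"parallel restore, largest disk: {par_lo:.0f}-{par_hi:.0f} min")
print(f"serial restore, worst case: {ser_lo:.1f}-{ser_hi:.1f} h")
# Parallel restores land near the "around an hour" estimate above;
# serial restores would blow well past a four-hour RTO.
```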
B
That's why I think we can clearly do the work that needs to be done anyway: just standing up new infrastructure in a different region and working on the automation to scale this up and down as needed. And then we can find the solution to sync the data; I mean, we can leave that out for now.
A
Let's do a bit of playing here. Let's say we're not going to do any of the disaster recovery working group work, but we're going to focus on the Kubernetes migration: we're going to focus on moving the API nodes and the web nodes, and then have Kubernetes handle the actual scale-up and scale-down.
A
If at that point, after we have migrated that, we have a secondary site with VMs for the data stores, but a single Kubernetes cluster, like, you know, a cold cluster basically, waiting for something to happen, would that mean less work for us to actually stand a bunch of those nodes up, or does it mean it's pretty much the same amount of work?
A
So, and then, we're just talking here, right, we're not making any decisions here. Let's say, in this hypothetical, that we decided that Gitaly is also in Kubernetes, Gitaly itself. The storage is somewhere else: it could be, you know, persistent volumes, it could be, I don't know, wherever. But Gitaly has the option of you saying: hey, my storage is somewhere else, over the network or whatever else.
B
Because whether we spin up just Terraform, which we can copy over from the existing Terraform and adjust, or we spin up a Kubernetes cluster, it always means that we need to set something up, get the configurations right, and test it. So I think it will always be some amount of work, and I think there will not be such a big difference.
A
But there is a difference, Henry. There is a difference in how much work it is to move from the moment where you actually have something to test to the moment where you actually have the numbers, the RTO and RPO, going down. Even if it's the same amount of build-out work. Okay, let me try to explain a bit. Let's go back to the original plan: everything is VMs on the secondary site, right? So we are going to build one API node, we are going to build…
A
Whatever
else
is
necessary
there,
and
by
the
time
we
finish,
we'll
be
able
to
say
all
right.
We
have
this
24-hour
target
that
we
can
play
with
now
getting
down.
The
target
is
going
to
be
much
harder
because
by
the
time
we
fail
over
and
then
scale
up
and
restore
and
do
all
of
that
stuff.
It's
going
to
take
a
long
time
and
a
lot
of
effort
to
get
things
done,
but
if
we
play
the
hypothetical
and
say
that
a
secondary
site
is
majority
kubernetes,
so
the
scale-up
itself
is
fast
enough.
A
So the only thing we need to focus on is the storage itself. Then you have only one item to focus on, and that item is figuring out how to restore from a snapshot, while the rest is basically just waiting for that. So you've moved the focus onto one thing, instead of five different things that need to happen at the same time. Now, for that secondary option of Kubernetes, the build-out cost is still the same, I agree with you.
A
You still need to connect a lot of these things, but you get to the point where you can play with the numbers, the RTO and RPO targets, much quicker from the moment you have the build-out done.
A
So if we focus on starting that build-out, with whatever we agreed on for the sizing of the VMs and the automation that needs to happen, that will give us enough work for probably a week or two to get those things up and running. In parallel, you're doing the API migration, right? So the next thing we can do is play a bit with Gitaly and figure out… I'm talking about, like, a test node somewhere; you have some data on a test…
A
…node. Stand up a Kubernetes cluster with Gitaly inside, try to connect it to whatever storage is out there, snapshots, yeah, and just see how difficult it is to actually get that done. If it's only a configuration change, that gives us step number three, which is: all right, we need to make a decision whether we are going down the VM route or the single-cluster route, and that allows us to build out whatever we need there.
A
All the configuration we need there, like you said, Kubernetes or VMs, is going to be the same; you just place it in a different place. And another reason why I'm suggesting this thing for consideration, and this is for both of you, Scarbeck and Henry: I know that you're having problems with Chef and Omnibus reconfigures and Geo in that whole thing, where one is overriding another and it's really hard to connect all of those things.
C
I will say that we have an open issue on the GitLab board, because we don't have it documented how to do a failover for Geo if everything is Kubernetes. Like, if you have two Kubernetes GitLab instances with Geo, there's no documentation as to how to perform a failover. I suspect I know how to do it, based on reading up on the configurations and the work I've been doing recently, and it's…
C
Nicely
the
orchestration
is
the
important
part,
because
we
are
missing
that
inside
of
the
omnibus
multi-node
configuration
for
jl.
So
I
enjoy
us
thinking
about
the
kubernetes
route
and
a
story
about
the
storage.
Obviously,.
A
Yeah, so those are, like, a couple of different parallel tracks. And then, in parallel to all of that, we have the talk with, well, the rest of GitLab basically, which is: how are you going to sync that data? Right, like, you need to give us: how are you going to automatically sync only private, sorry, only paid customers? So that can happen in parallel, but we already have this couple of tracks that we need to do regardless of the storage story, right? Yep.
B
…a cluster in staging with GET, right? But this wouldn't help us, at least not for anything, because we can't use GET to stand up a secondary site; it's not usable for our deployments and everything else. So it maybe helps the Geo team to test some things, but it's not really related to standing up a DR site.
A
It does help, it does help with that last part of the story that I just mentioned, which is, right: if they stand up a secondary site right now on staging, that allows us to actually talk with them and the rest of the teams, Gitaly and so on, about how they are going to sync smaller portions of data, and it allows us to get the practice of: now fail over, how does the application behave? So this is on a tiny bit of a different level.
A
So
there
is
a
level
of
infrastructure
where
you
do
the
failover.
You
do
all
of
that
stuff,
so
that
practice
is
going
to
become
a
bit
useless,
but
the
level
above
of
how
does
the
application
behave?
What
kind
of
testing
apps
do
we
have?
What
kind
of
feature
gaps
do
we
have?
That
is
still
going
to
be
useful
for
that
part
of
the
work
you
know
you
know
what
I
mean
like.
A
No, no, no, no, no. GET, yeah, that's not going to happen any time soon for us, at least right now. But we can always be the ones providing feedback and saying: look, if you want us to use GET, we need this to stand up a cluster easily. We have that right now; we can do that relatively simply. But if you want us to use GET, then go green.
C
…is actively working on trying to make GET bring up a Helm cluster with GitLab running as part of it. So…
A
Based on the tracks I just explained, basically, by the time we get to the point of deciding for the cluster or not, we might actually even have something to test. That's true. The database and Redis we still need to build anyway. So, like I said, the database thread is something we need to figure out and do right now, regardless of everything. That's not in the automation.
C
I enjoyed this approach. Not for this meeting, but I would love to maybe chat more about Redis and see if we could exclude that from the need to build out, primarily to start. But I do enjoy this line of thinking; I think this would be a good way to start off the meeting today.
C
I don't think we've really discussed how we want to approach this at all; it's been more about trying to figure out our business requirements. But I don't think the business really understands what the consequences of certain decisions are. I think once we start testing and showcase what our capabilities are, they'll have a better understanding of what that looks like in the future, and then we can figure out what we need to do with budgeting at that point.
A
Yeah, that's why I also said, in that point number three: we don't have anything right now. So talking about a sub-one-hour RTO or, you know, anything below that, honestly… even 24 hours is nothing right now.
A
Because it's so clear in my head, these different types of work streams, I'm asking the two of you to write down what we talked about, these separate streams, because you writing it down is going to open up, like, so many different questions. Maybe we have to answer them, maybe we don't. But if you are able to explain how we want to parallelize some of this work, and who tackles what, and when I say who, I don't mean Henry or Scarbec, I mean infrastructure, Geo, Gitaly and so on, product and so on.
A
Then it's gonna be much easier for us to have a larger discussion, because if I keep talking all the freaking time, then that's not gonna go well.
A
An issue is fine, just document it, and let's open it up with that. And certainly, just to be clear, I know this is a recorded call.
A
Yeah, and that's why I also want us to focus on the first two steps right now when it comes to what we in infra need to do, because next week something might come down the line and say: hey, disaster recovery is number five on the list. Which happens; it has happened before. So I wouldn't… all right, I'll upload…
A
This
recording
and
post
it
in
the
channel
and
yeah
go
out,
go
and
create
those
issues
and
ping
me
if
you
have
anything
unclear
about
what
we
talked
about
happy
to
help
right
need
to
drop
off
to
the
next
call.
Okay,.