From YouTube: Exploring DR flows
A
Hello to everyone who may or may not be listening. We are going to chat about some disaster recovery strategies and user journeys, and we should summarize this call later on in the issue as well. I've prepared a Mural board; I don't know if either of you has worked with that before. So this is essentially just a drawing or collaboration tool: you can put stickies on things, and you can draw out flows.
A
B
Okay, can I share what I would like to talk about? Sure. I think maybe first the big picture: what would we, or can we, expect from a failover, and what is it about? Then maybe what technologies are available for those kinds of things, and then maybe looking at what of that we could apply. Those would be the things I would look into, because I think we need to start a little bit from the beginning for all of this.
A
B
Let me start by talking about what I think about the requirements that we have set and why. I think when the discussion started, even before the working group was founded, we set some RTO and RPO goals and some considerations about cost and implications, and I think those considerations made some sense; they were set in this issue in the handbook for disaster recovery.
B
Yeah, and I think there was this consideration of how much it would cost, and what of that we can achieve without, you know, paying too much, and things like that. But if you start at the very beginning and think about what disaster recovery is: I think the disaster recovery we are talking about here is about a region failing, right — everything being down, all of GitLab — and what to do about that then. And if you think about that, this is always a major event.
B
A
A
B
Yeah, that's the point: how quickly can we recover, and when do we make the decision to fail over, right?
B
If you have something to fail over to — and the expectation of our customers. I think we should try at first not to be too ambitious with what we can reach for our customers, because I think most customers, even paying customers, would understand that if a geolocation fails for whatever reason, this is a major event and it can take hours to come back — but it shouldn't happen very often, right? So for RTO and RPO targets, at least at the beginning.
B
You know, an RTO and RPO target — and it is easier for us to achieve this fast if we don't set too high a target for that, right? Say we commit to being back within 10 hours, for instance; we can maybe reach that solution within a few months, so we would have something that we can already deliver as a promise. The lower we go with these targets, the harder it will be to accomplish this. So I think.
A
No,
I
think
my
interpretation
here,
though,
is
that
these
are
sort
of
ambitious,
rto
and
rpo
targets
that
we
want
to
sell
at
some
point
right.
I
would
not
write
that
down
anywhere
before
we
are
very
certain
that
we
can
really
deliver
on
that
right
because
otherwise,
you're
also
legally
liable
and
all
sorts
of
other
things
right,
but
I
think
the
journey
to
that
point
is
likely
the
one
that
you
describe
right.
A
It's
not
going
to
be
step,
one
perfect
solution,
five
minutes
full
failover
right,
it
will
take
a
lot
longer
and
but
something
is
probably
better
than
nothing
right
and
I
think
at
the
moment
we
had
nothing
as
in
you
know
the
best
effort.
We
could
probably
pull
something
off,
but
nobody
knows
exactly
what
that
means
is
sort
of
my
understanding.
B
Yeah
and
if
you
look
at
what
can
we
achieve
within
a
short
amount
of
time,
I
think
then,
if
you
look
at
technical
solutions,
then
the
main
problem
always
is:
how
can
we
get
all
the
data
like
the
state
over
to
a
second
site
right,
and
how
can
we
spin
up
infrastructure
to
be
able
to
serve
everything
from
another
geolocation
yeah.
B
That's
the
solution
to
that
that
we
are
aiming
for
and
and
also
if
you
look
at,
that
there
are
coast
implications.
Can
we
do
this
for
everybody
or
do
we
need
to
select
somebody
for
that?
One,
and
this
again
has
implications
of
our
application,
can
support
the
selective
thing
right.
So
it's
it's
not
easy
and
hard
to
make
decisions,
and
if
you
go
with
the
decision
that
what
was
taken
for
using
premium
plus
customers,
then
we
rightfully
like
the
discussion
discussion
and
the
issue
was
come
to
the
conclusion.
C
The consequences at the beginning — we can't, yeah, I don't think we can do that at all. So yeah, I think that's part of that discussion and actually what I would like to explore.
C
I sort of want to leave any sort of customer banding out of this initial — not this discussion, but our initial implementations — because it is additional work to be selective. And I think where we actually need to be selective is in two areas. One is the recovery point — how close the delta is between the failure and the stuff that we have saved — I think that's something that we could.
C
We would want to effectively scale that based on, as a feature of, different customer plans. And then the second one is who gets access first — the recovery time — because one of the challenges I brought up in the call this week was that we're going to delay large customers' workflows, which are usually programmatic: shared CI runners, other things they're doing, hitting the registry programmatically for various different things.
C
So we should also probably have not a wide grand opening, where everybody can come in and get their stuff again, but a scaled opening: like, okay, you're this type of customer, you're going to get in slightly faster, so that you can get back to your day-to-day, because it's going to be impactful — okay, we have to get caught up — and then start scaling in the other folks. But I think it needs to be everybody, yeah.
C
B
A
A
B
So I think that everywhere we are using those attached PD-SSD disks. There's an option to use local SSDs to get even higher performance — that was discussed for databases, for instance — but I think we still have those network-attached disks, okay.
B
Special things that Google provides as a storage layer. And I think we do these snapshots every 24 hours; I'm not sure if they are multi-region, but by default the snapshots in GCP are multi-region if you don't change that, so we have something like that. We should of course check all of that, yeah.
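As a rough illustration of that "check all of that" step — assuming the gcloud CLI is installed and authenticated, and using an illustrative project name — the snapshot resources can be listed and their storage locations inspected:

```python
# Hypothetical helper to verify where our PD snapshots are actually stored.
# Field names follow the GCE snapshot resource (creationTimestamp, sourceDisk,
# storageLocations); the project name below is a placeholder.
import json
import subprocess

def list_snapshot_locations(project: str) -> None:
    out = subprocess.run(
        ["gcloud", "compute", "snapshots", "list",
         "--project", project, "--format", "json"],
        check=True, capture_output=True, text=True,
    ).stdout
    for snap in json.loads(out):
        print(snap["name"],
              snap.get("creationTimestamp"),
              snap.get("storageLocations"))  # multi-region vs. regional location

if __name__ == "__main__":
    list_snapshot_locations("gitlab-production")  # illustrative project name
```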
A
Yeah, that's fine, but you know, that was my first fear. But if I just paraphrase what we said — because I think we may be closer to complete agreement on some of those things, at least here — it's like: for me, not copying certain data from free users is not an option. I think we can't lose data, right? That's a no-go.
A
So if we, for example, had a scenario where you fail over to a different region and you open it up for premium and free users — free users don't have their Git data, but they have the database data, they start pushing stuff, and then, a day later, you have your other side up again, and all of a sudden you have a split-brain situation. You can't really reconcile any of that. That's not something we can do, I think.
A
A
We can make it available at a later point in time — ideally hours later, not weeks — and I think that path needs to be laid out and be part of the initial implementation, because I don't think it is feasible for us to not have that. That's my personal opinion. Something like 95 percent of our data and people using this are on free. I understand that we may not want to make hard guarantees for this, but we can't leave them behind.
B
That's
why
we
are
taking
the
snapshots
right.
I
mean
yes,
try
to
protect
data
for
our
customers,
all
of
our
customers.
We
have
that
in
place
and
if
they
are
already
mighty
region,
then
we
already
know
the
course,
and
if
you
want
to
have
a
better
rpo
than
24
hours,
then
you
need
to
calculate
how
expensive
it
would
be
to
do
that
more
often,
yeah,
okay,
but
the
basics.
Basics
is
at
gcp.
B
The
way
to
have
multi-region
disaster
recovery
is
to
use
cloud
storage.
That's
the
thing.
We
could
also
try
to
do
something
with
object,
storage
or
something
else.
But
cloud
storage
is
the
thing
at
gcp,
mainly
for
for
being
safe
and
multi-original
disasters
right
and
with
the
snapshots.
That
would
be
one
way
which
would
work
for
goodly.
I
think
it
would
work
best.
This
way.
C
Yeah-
and
I
agree
with
that-
and
I
want
to
add
one
other
thing
for
consideration,
is
you
mentioned
split
brain?
I
I
think
we
should
also
consider
not
failing
back
like
you
know.
If
we
have
a
dis,
you
know
if
we
have
something
that
moves
us
from.
You
know
us
east
one
to
you
know
maybe
the
west
coast
or
somewhere
else
in
the
in
the
united
states
for
gitlab.com.
C
We
just
stay
there
and
then
the
former
primary
site
becomes
the
secondary
site.
If
it
ever
comes
back.
A
Yeah
no,
but
I
think
that
also
means
that
the
requirement
is
that
we
are
able
to
eventually
scale
up
the
disaster
recovery
side
to
the
same
size
as
it
was
before
right
yeah,
and
I
think
this
is
this-
is
how,
like,
I
think
in
my
mind,
you
know
the
process
looks
like
this
at
this
point
with
those
requirements
right,
you
have
a
catastrophic
event.
You
know
you
need
to
determine
that
you,
the
the
best
way
forward
here,
is
to
actually
fail
over
to
your
recovery
sites.
Right,
let's
say
you
are
able
to
make
that
determination.
A
A
You know, to have most of their data, with a relatively ambitious RPO target, and we must be able to let them in then. Those folks need to be able to essentially continue working, having lost ideally no data, or very little data. Following this, we must have the ability to open up the platform gradually, ideally, for everybody else who is not in sort of the first wave — which means we need to scale up the infrastructure. We need to reattach that data.
A
B
That's not that hard to accomplish. I mean, it's still some work, of course, but scaling up via Terraform isn't hard.
C
B
It's a lot of data, but it's in the range of one hour, maybe, for such a huge disk to be restored and attached to a new instance — I would say, as a rough estimate, yeah.
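For concreteness, that restore path could look roughly like this — a minimal sketch assuming the gcloud CLI, with all disk, snapshot, instance and zone names invented for illustration:

```python
# Sketch of the restore step discussed above: create a new PD-SSD from the
# latest snapshot in the DR region and attach it to a freshly provisioned
# Gitaly VM. All resource names below are placeholders.
import subprocess

def run(*args: str) -> None:
    print("+", " ".join(args))
    subprocess.run(args, check=True)

def restore_and_attach(snapshot: str, disk: str, instance: str, zone: str) -> None:
    run("gcloud", "compute", "disks", "create", disk,
        "--source-snapshot", snapshot, "--type", "pd-ssd", "--zone", zone)
    run("gcloud", "compute", "instances", "attach-disk", instance,
        "--disk", disk, "--zone", zone)

if __name__ == "__main__":
    restore_and_attach("gitaly-01-data-20240101", "gitaly-01-data-dr",
                       "gitaly-01-dr", "us-west1-b")
```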
A
Because
I
think
for
me
totally
feasible
yeah.
So
if,
if
we
go
down
here
right,
it's
like,
I
can
also
share
my
screen
if
you
like,
but
I
made
like
this
like
little
mini
diagram
here
yesterday
evening
with,
like
stateless
services,
yeah
exactly
this
one
here.
A
Visiting
squid
is
approaching
so
the
way
I
I
saw
this
and
you
know
this
is
all
very
like
it's
still
a
high
level
right,
but
I
think
it's
maybe
enough
to
talk
about
at
first
like
and
henry
you
can
correct
me
here.
It's
like.
We
have
sort
of
stateless
services,
api
nodes
web,
all
of
those
things
right
and
we
have
them
in
our
cost
estimate
already
right.
We
don't
need
to
replicate
anything
there
right,
it's
like
they.
A
They
have
no
state
right
and
we
can
scale
them
down
as
much
as
possible
on
a
secondary
site
right.
So
yes,
that's
fine.
If
we
are
terraform,
we're
also
able
to
scale
them
up
or
it's
in
kubernetes
right
and
it
does
it
automatically
right.
That's
that's
cool
right.
It
still
work,
but
we
don't
really
have
an
issue
here.
Then
we
have
database
stuff,
you
know
which
is
mainly
our
postgres
database
right,
so
that
does
need
to
be
replicated
either.
A
You
know
to
a
secondary
site
with
a
reduced
node
number
for
sort
of
the
first
wave
or
you
know
already.
We
just
say
this
is
fine
from
a
cost
standpoint.
We
just
mirror
this
right
and
there
is
a
technology
with
patrony,
right
and
standby
clusters
that
allows
you
to
do
that
right,
but
that
that's
the
thing
that
we
need
to
do
and
we
need
to
figure
out
right.
Then
there
is
right.
Then
there
is
object.
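For reference, the Patroni standby-cluster mechanism mentioned here is configured roughly as below — a minimal sketch following the Patroni documentation, with placeholder hostnames, expressed in Python only to keep the snippets in one language:

```python
# Minimal illustration of the Patroni "standby cluster" idea: the DR region
# runs a full Patroni cluster that replicates from the primary region's leader
# instead of electing its own writable leader. Keys follow the Patroni docs
# (bootstrap.dcs.standby_cluster); host/port values are made up.
import yaml  # PyYAML

standby_dcs = {
    "bootstrap": {
        "dcs": {
            "standby_cluster": {
                "host": "primary-db.gprd.example.internal",  # placeholder
                "port": 5432,
            }
        }
    }
}

print(yaml.safe_dump(standby_dcs, sort_keys=False))
```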
A
Then there is object storage — all the data that we have in there — but that is, I think, also already cross-region. So the object storage itself takes care of the replication; we don't really need to do anything there. What we do need to understand is what the RPO targets are for the object storage. I think it's 10 minutes guaranteed, as in, across regions there is some guarantee from GCP that they will re-replicate to the other region within a certain amount of time.
A
I
don't
know
what
that
is,
but
we
can
look
that
up
right.
That's
important
to
know,
because
if,
if
that's
something
that
you
can
choose
and
pay
more
right,
then
we
need
to
consider
what
that
is
or
if
the
default
is
enough.
We're
also
okay
right.
So
that's
that's
object.
Storage.
I
think
there's
not
much!
That
needs
to
be
done
there
and
then
the
more
important
thing
is
sort
of
all
of
the
gitterly
gettingly
data
right
here
between
these
two
nodes,
because
this
is
where
most
of
the
cost
sits
at
the
moment.
A
I
think
right.
If
you
had
live
italy
nodes
on
both
ends
right,
then
the
amount
of
money
that
costs
apparently
is
troublesome,
even
though
you
know
there's
maybe
a
separate
discussion
to
be
had
it's
like.
If
we
could,
you
know,
eat
that
cost
for
like
three
months
and
actually
have
a
working
disaster,
recovery
solution
right
and
then
we
work
on
making
this
better.
A
Is
that
maybe
acceptable,
because
we
mitigate
the
risk
of
complete
failure
right
so
but
that's
another
another
discussion
now
you
know
what
I'm
saying:
it's
like
the
cost
of
not
doing
something
and
saving
the
money
every
month.
Essentially
right
still
exposes
us
to
the
risk
of
a
disaster
right,
that's
a
business
calculation.
A
In
any
case,
so
I
think
the
only
thing
we
really
need
to
figure
out
only
thing
at
the
moment
is
how
to
split
out
this
here
at
the
moment,
and
I
think
I
think
what
you
what
you
said
is
there
like.
I
can.
I
can
think
here
of
I'm
getting
up
my
stickies
again
here.
I
can
think
of
three
different
ways
of
of
handling
this
right
and
correct
me.
If
I'm
wrong,
we
just
use
everything
disk
snapshots,
right,
which
I
think
I
don't
know
what
the
the
fastest
snap
incremental
snapshot.
Time
is.
A
B
And even then, I would need to see if we can do this, because we saw that when we take the snapshots we slow down disk reads for a short moment — we saw this on our databases — so we'd have to see what the implications are. I mean, we do this daily now without issues, but something like a snapshot every hour — and then see how much that would cost, of course, yeah.
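To make the hourly-snapshot experiment concrete, a schedule along these lines could be attached to the Gitaly data disks — a sketch only, with placeholder policy, disk and location names, and assuming GCE snapshot schedules via resource policies:

```python
# Not something we run today -- just an illustration of what an hourly snapshot
# schedule for a Gitaly data disk could look like.
import subprocess

def run(*args: str) -> None:
    print("+", " ".join(args))
    subprocess.run(args, check=True)

def create_hourly_schedule(policy: str, region: str) -> None:
    run("gcloud", "compute", "resource-policies", "create", "snapshot-schedule",
        policy, "--region", region,
        "--hourly-schedule", "1",          # one snapshot per hour
        "--start-time", "00:00",
        "--max-retention-days", "2",
        "--storage-location", "us")        # keep snapshots multi-regional

def attach_to_disk(policy: str, disk: str, zone: str) -> None:
    run("gcloud", "compute", "disks", "add-resource-policies", disk,
        "--resource-policies", policy, "--zone", zone)

if __name__ == "__main__":
    create_hourly_schedule("gitaly-hourly-snapshots", "us-east1")
    attach_to_disk("gitaly-hourly-snapshots", "gitaly-01-data", "us-east1-c")
```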
A
A
B
Yes, one complication, of course, is that we would stream and replicate the database, and the database would be within a minute of the point when we switched over, while the data in Gitaly would be up to one hour behind the point we failed over. So they would not be in sync anymore; people would need to re-push their things to get back up to the state they were at before, and you would see problems where the database points to MRs which are not on the Gitaly nodes, because we used the snapshot, and things like that.
A
Okay, but that's one way of doing this — you just do it in that way. That's maybe the simplest solution, and it doesn't use anything else.
A
A
This assumes we know where the free stuff is, right? We essentially say we have specific nodes — or disks — that only host free data, and others that host Premium-plus data. I don't think that's true at the moment, I don't, yeah.
B
But I did some calculation based on the fact that 95 percent of the repository storage space is used by free users, so only five percent of repository size is used by Premium-plus users. That means we could try to just migrate all the premium users to three dedicated Gitaly nodes — or maybe four or five, just to be safe, because they are causing more traffic, I think — and then we would have an easy infrastructure.
B
An easy split, syncing things for premium customers, because we know it's these three or four nodes that we need to sync via Geo or something — yup — and so that would be a very simple and boring way to separate those.
A
Yeah
I
I
actually
agree
with
that.
I
think
what
we
can
do,
then
is.
We
should
still
snapshot
these
things
right
as
backstab
backups
and
that's,
I
think,
that's
nice
to
do,
but
we
can
then,
for
example,
say
to
geo
these
things
here
need
to
be
replicated
as
fast
as
possible.
Right
and
geo
is
async,
so
it's
not
100
in
sync.
But
if
you
do
this,
you
can
probably
you
know
stay
within
you
know
we
would
have
to
measure
right,
but
I
would
estimate
within
a
minute.
B
Yeah-
let's,
let's
not
bring
you
into
this
right
now
because,
let's
just
say,
we
need
to
sync
those
synchronously.
If
you
can
right,
I
mean
there
are
several
technical
options
for
that
and
maybe
geos
even
the
best
one,
because
it's
already
there
and
working,
but
there
are
also
things
like
set
of
a
streaming
application
and
other
ways,
I
think,
to
sync
file
systems.
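If the Geo route were taken, selective sync is exposed through GitLab's Geo nodes API — a hedged sketch of restricting a secondary to the Gitaly shards holding Premium-plus repositories; URL, token, node ID and shard names are all placeholders:

```python
# Illustrative only: restrict a Geo secondary to a set of repository shards.
import requests

GITLAB_URL = "https://gitlab.example.com"
TOKEN = "glpat-REDACTED"            # admin personal access token (placeholder)
SECONDARY_NODE_ID = 2               # id of the DR secondary in /geo_nodes

resp = requests.put(
    f"{GITLAB_URL}/api/v4/geo_nodes/{SECONDARY_NODE_ID}",
    headers={"PRIVATE-TOKEN": TOKEN},
    json={
        "selective_sync_type": "shards",
        "selective_sync_shards": ["premium-01", "premium-02"],  # illustrative
    },
    timeout=30,
)
resp.raise_for_status()
print(resp.json().get("selective_sync_type"),
      resp.json().get("selective_sync_shards"))
```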
B
B
C
A
So if we do it in this way, then essentially we have two — sort of two streams. And this is all, as you can see, a bad drawing, but it helps me to write this up later on. We essentially have two streams of data into the secondary site.
A
We have the free stream, which is a little bit longer, and we have some other mechanism that separates out our premium users, to be able to deliver this at the different RPO target — which may be more expensive, because we can't use some of this technology and we need to keep the SSDs running. But that's kind of how I see this.
B
A
Which, I think, is the main benefit: this here is the free thing, and this is the premium thing. Because — and this is again something where my personal knowledge stops a little bit — we would have a live Gitaly instance that has these things available on the secondary site, and it essentially could use them immediately as soon as we fail over, whereas the other approach probably means we need to stand up more Gitaly servers, attach the disks, restore them.
A
Is that correct in this scenario? Because the other approach is the disk snapshots, which still, I think, have some cost for restoration — I don't know what that is — but you wouldn't have to do it on a continuous basis.
C
A
Region, yeah. So in this scenario we do assume a certain capacity, which means this only works if we are able to essentially lock out free users until a point in the future.
A
That's
also
something
we
shouldn't
forget
right,
it's
like
if
you
would
make
like.
If
you
just
stand
up
your
secondary
site,
you
repoint
dns.
You
know
everybody
is
happy,
but
you
let
three
users
in
they
wouldn't
have
their
they
get
data.
So
there
must
be
some
kind
of
mechanism
that
allows
us
to
say
we're
operating.
A
You
know
in
a
degrade,
degraded
state
right.
Currently,
you
know
we're
still
restoring
free
users
and
those
people
can't
actually
log
in
and
when
they
or
get
any
access
to
anything
really
right,
because
they
we
need
to
still
restore
their
their
git
data.
Mainly
all
of
the
other
stuff
would
be
there.
The
database
is
there.
The.
B
A
True, yeah — I mean, fair. We can talk about how to do this. Luckily, I think, we are just shipping maintenance mode, and this would be an interesting additional feature for the future that we could build on top of that: essentially a mechanism to regulate who can log in and who can't. But I don't know — I would have to talk with the teams that handle this kind of stuff. Okay.
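As a rough illustration of the degraded-state idea, GitLab's application settings API can toggle maintenance mode with a banner message, which blocks writes while the remaining data is restored; URL and token are placeholders, and the per-tier gate discussed here would still need to be built:

```python
# Sketch: put the recovered site into maintenance mode with an explanation.
import requests

GITLAB_URL = "https://gitlab.example.com"
TOKEN = "glpat-REDACTED"  # admin token (placeholder)

resp = requests.put(
    f"{GITLAB_URL}/api/v4/application/settings",
    headers={"PRIVATE-TOKEN": TOKEN},
    json={
        "maintenance_mode": True,
        "maintenance_mode_message": (
            "We are recovering from a regional failure. "
            "Free-tier repositories are still being restored."
        ),
    },
    timeout=30,
)
resp.raise_for_status()
print("maintenance_mode:", resp.json().get("maintenance_mode"))
```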
C
Yeah,
because
the
one
of
the
dangers
here
is
like
well,
I
mean
it's
dangerous
to
the
the
users
the
free
users
like
they
could
just
like
push
all
their
stuff
and
create
new
repos,
and
then
they
would
have
basically
duplicated
duplicate
english.
Today,
duplicate
repos,
going
on
so
yeah
maintenance
mode
or
some
some
mechanism
to
notify
them
that
they're
in
a
queue
of
some
sort.
Whether
or
not
we
actually
can
tell
them
where
they're
at
in
the
queue,
but
that
we're
you
know
still
waiting
on
restore
for
their
stuff.
A
A
It's
like
essentially
like
if
we
keep
the
site
at
a
really
minimal
level
right,
even
probably
seeing
more
traffic
from
premium
only,
and
I
don't
think
we
know
exactly
what
that
looks
like
at
the
moment
right,
but
you
would
have
to
like
sort
of
increase
the
size
initially
already
right
for
for
premium
users
to
get
something
out
right,
and
then
you
probably
have
to
scale
it
up
even
more
to
start
to
get
to
like
dot
com
levels.
A
C
B
A
Yes, okay. So this to me looks like sort of a path forward here that is pretty boring, but possible.
A
A
Another option: you don't do any of this disk snapshotting, and Gitaly manages all of this via object storage.
A
This is something that has been floating around in the Gitaly team for a while, where — not only for GitLab.com but also for customers — they would want to provide an option for folks to store Git data in object storage, so that, as a mechanism, Gitaly would just load this from object storage and keep it in memory.
A
I
don't
think
this
is
necessarily
the
solution,
but
it
may
be
something
that
could
happen
in
the
in
the
future
right,
and
that
means
we
probably
wouldn't
have
to
separate
anything-
and
you
know
anymore
because
you
know
like
object.
Storage
is
relatively
cheap,
but
I
personally
this
sounds
to
me
like
a
rather
significant
amount
of
work.
That
will
not
happen
like
in
the
next
two
months
right
so.
B
Yeah
also
without
knowing
the
details,
but
I
would
wonder
about
the
performance
implications
of
that,
because
I
mean,
if
you
run
from
ssds
and
git,
is
very
much
tailored
to
work
with.
Five
is
a
system
right
and
I
don't
know
how
you
would
translate
this
over
to
object,
storage
and
how
that
would
look
like
then.
C
Yeah,
I
mean
sorry,
I
I
could
see
a
world
where,
like
we,
we
do
use
object,
storage
for
ultimately
like
the
keeping
the
state
of
of
the
repos,
but
then
we're
doing
some
sort
of
behind
the
scenes
streaming
between
the
object,
storage
and
a
local
ssd
for
the
file
notes
themselves,
so
that
they're
highly
performant.
But
so
you
know
a
whole
other
scenario
of
like
okay.
C
We
need
to
to
reduce
like
the
the
time
between
what
what's
on
the
the
local
disc
and
that,
but
I
think
that
would
dramatically
improve
our
our
rpo
because
of
the
fact
that,
like
we
might
be
in
milliseconds
or
even
zero
for
those
types
of
things,
if
we
were
storing
an
object,
storage,
ultimately,
there's.
A
Actually
also
like
there
is
another
discussion
going
on
that.
I
I've
acquired
the
backup
and
restore
category
a
year
ago
and
have
told
people
that,
unless
we
hire
more
folks,
we
can
actually
work
on
this,
which
is
still
true.
But
there
is
some
very
interesting,
like
I
think,
opportunity,
for
example,
to
build
sort
of
an
incremental
backup
solution
for
for
git
right,
which
ties
in
with
some
of
those
things
right.
A
If
you
are
able
to,
for
example,
you
know
like
push
all
of
the
like
actual
changes
that
are
made
on
a
repository
into
object,
storage
right
and
keep
other
state
in
in
memory
or
on
ssd
right.
You
could
have
a
system
where
you
continuously
like
push
these
changes
to
to
repost
into
object,
storage
incrementally,
which
I
think
would
be
really
cool
for
lots
of
folks,
doesn't
even
need
to
be
only
object
storage,
but
it's
just
also
a
way
to
provide
easier
ways
to
backup
data
for
people
that
are
not
in
gcp
and
don't
have.
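A toy sketch of that incremental idea — bundle only the commits added since the last backup and push the bundle to object storage; repo path, ref and bucket names are illustrative, and a real solution would live in Gitaly rather than a script like this:

```python
# Pack everything reachable from HEAD that the last backup did not contain,
# then upload the bundle to a bucket.
import subprocess

def incremental_bundle(repo_path: str, last_backup_ref: str, bucket: str) -> None:
    bundle = "/tmp/incremental.bundle"
    subprocess.run(
        ["git", "-C", repo_path, "bundle", "create", bundle,
         f"{last_backup_ref}..HEAD"],
        check=True,
    )
    subprocess.run(
        ["gcloud", "storage", "cp", bundle, f"gs://{bucket}/incremental.bundle"],
        check=True,
    )

if __name__ == "__main__":
    incremental_bundle("/var/opt/gitlab/repo.git", "refs/backups/last",
                       "gitlab-git-backups")  # all placeholders
```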
B
Okay, now that we are at the possible solutions —
C
B
— let me talk about the challenges that I still see here. One thing is, I'm not very sure about how best to sync Redis; maybe there's an easy way to do that, even if it's just, you know, having —
C
A
I have a note here — it is Geo-specific, but I just want to mention it: if you stand up the secondary site as a Geo secondary, we don't actually replicate Redis at all, because I think the caches and everything would need to be regenerated anyway. I don't think there is a need to really sync those things; I think you would probably have performance issues, but we don't do that right now.
B
I think we have state in Redis; I think we would lose all Sidekiq jobs and things like that.
C
C
Yeah, we have two Redis clusters: one — that, Fabian, you had mentioned — with a general cache, which we would likely just rebuild; and then the other one is where we actually have a bunch of queues — work queues, Sidekiq queues, as Henry mentioned — and if we were to lose that, we would lose all these running or queued-up jobs. So, I mean, okay.
B
For a first iteration, that's something we need to think about — we should note it as an open question for research. Then the other thing is: how do we want to separate user groups?
C
B
B
Yeah, like, how do we selectively sync? The two approaches we mentioned are via Geo, by enabling it to sync for premium customers only, and the other way would be to migrate premium customers over to dedicated Gitaly nodes and then sync those by whatever means we see fit and that make sense — and maybe there are other approaches, I don't know. Then the other issue that we really need to look into is the gap between database state and Gitaly storage state after we restore from snapshots, because there will be a gap.
B
How do we deal with that? And — there was something else in my mind, I forgot about it, maybe it comes back later — oh yeah, the other thing is: how do we test whether the secondary site works? Because with this approach of only partially having customers synchronously replicated over to the site, you can't really do a failover test, right, and —
C
B
Testing whether it works is really a challenge in this setup. And, as I mentioned already, I would love to see something like shooting part of the traffic over to a DR site to see that it's working all the time, but that's hard, I think, because we would need to make certain changes in our application for this to work — but for the far future I would like to see something like that. In general it really is a question: how could we test the failover?
A
I
have
two
ideas.
Actually
I
have
a
few
more
so
I
I
think
I
really
personally,
I
really
like
the
idea
of
rooting
traffic.
This
is
actually
something
that
geo
would
like
to
do
at
some
point,
for
other
reasons
as
well
like,
for
example,
we
have
this
like
read-only
secondary
web
interface.
That
is
pretty
not
user-friendly
right
and
so
for
things
like
this.
Actually,
a
mechanism
where
you
go
to,
like
you,
have
a
geo
aware,
load,
balancer
right,
you
go
to
the
site
that
is
closest
to
you,
you're
being
presented
with
the
web
page.
A
You
make
a
change
right,
it
routes
back
to
the
primary
web
like
database
and
then
comes
back
to
your
site.
Doing
that
you
know
loop
is
something
I
would
really
like
to
do.
We
don't
know
how
you
know.
That's
there's,
there's
issues
with
this
right,
latency
and
race
conditions
and
all
that
kind
of
stuff.
A
But
that's
that's
definitely
something
interesting.
I
think
when
we
say
we
need
to
test
this,
I
think
we
also
distinguish
what
we
are
actually
trying
to
test
right.
It's
like
are
we
trying
to
establish
that
users
would
be
able
to
use
the
secondary
site,
as
is
right?
That
is,
I
think,
what
we
accomplished
by
rooting
traffic
right.
A
We
also
need
to
be
able
to
test
all
of
these
other
associated
procedures
right,
like
restoring
snapshots
from
disk
right
and
being
able
to
like
scale
up
the
infrastructure-
and
you
know
like
turning
this
into
a
rewritable
instance
and
repointing
dns,
all
of
this
other
like
stuff
that
we
need
to
do
on
top
of
this
actually
like
working
right,
and
I
think
that
would
require
a
probably
some
sort
of
bubble
test
right
where
you,
you
essentially
like
isolate
your
disaster
recovery
site.
A
So
for
the
dr
side,
it
looks
as
if
your
primary
site
is
catastrophically
unavailable
right.
You,
you
do
all
of
the
things
that
you
would
do
well
in
case
of
an
actual
disaster,
but
you
don't
tell
users
that
this
is
happening
and
everybody
is
still
working
on
the
primary
side
right.
So
there
is
no
user
impact,
but
you
can
run
through
the
entire
process
and
see
you
know
if
it
actually
does
what
you
think
it
should
right.
A
Yeah,
that's
exactly
what
what
this
is
and
what
that
tests
is,
that
your
promotion's
low
and
your
entire
disaster
recovery
flow
is
actually
functioning
right,
which
I
think
is
is
also
really
important.
A
And
I
think
there
are
maybe
I
think
there
are
a
couple
of
other
like
more
nitty
gritty
issues
here
right
the
whole
like
petroni
standby
thing
right,
it's
like
petroni,
runs
on
a
non-omnibus
version
on
production
right.
So,
as
far
as
I
know,.
B
A
Know
that
it's
fairly
well
known
how
it
works,
but
I
think
there's
still
intricacies
to
this
right.
It
does
work,
though,
like
we
have
it
running
in
ngo
now
in
elsa
or
beta,
but
I
think
it's
still
like
work
to
set
all
of
this
up.
B
Yeah-
that's
that's
true,
but
but
we
know
that
this
works.
I
mean
this
is
an
often
used
scenario,
and
you
know
that
this
kind
of
technology
works
and.
A
Assumes
well,
I
think,
then,
the
the
other
thing
like
at
least
work
wise
that
I
can
come
up
with,
is
just
the
automation
right.
It's
like
associated
with
you
know
like
all
of
the
like
processes
here
where
it's
like,
because
I
think
this
is
this
is
still
you
know.
Even
if
we
scale
down
everything
right,
it's
still
going
to
be
a
bunch
of
servers.
You
don't
want
to
do
a
lot
of
manual
things
right,
so
there's
at
least
work
in
setting
all
of
this.
A
B
We will have to automate something. I think it's not as bad as you think, because we will essentially need to mirror our production site to another region, and that means we more or less just copy our Terraform over to that side — using different credentials, maybe, and things like that — and then just scale down the node numbers, which is just changing a few numbers in a file if we do it manually. Of course, there are some more details to that.
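A minimal sketch of that "just changing a few numbers" idea — driving a hypothetical DR-region Terraform root with small node counts day-to-day and full-size counts during a failover; the directory and variable names are invented for illustration:

```python
# Wrapper that applies the DR Terraform root with different node counts.
import subprocess

DR_DIR = "environments/gprd-dr"  # placeholder Terraform root

def apply(node_counts: dict) -> None:
    cmd = ["terraform", f"-chdir={DR_DIR}", "apply", "-auto-approve"]
    for name, count in node_counts.items():
        cmd += ["-var", f"{name}={count}"]
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

WARM_STANDBY = {"api_node_count": 2, "web_node_count": 2, "sidekiq_node_count": 1}
FULL_SIZE    = {"api_node_count": 40, "web_node_count": 30, "sidekiq_node_count": 20}

if __name__ == "__main__":
    apply(WARM_STANDBY)   # normal operation: keep the DR region small
    # apply(FULL_SIZE)    # during failover: scale to production capacity
```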
B
Maybe, but the main thing really isn't that hard to accomplish: scaling the stateless infrastructure up and down is really not that hard. In my opinion, most of the work would go into setting this up once, and then maybe writing automation to make the scaling up and down a little bit easier.
C
B
C
That's where I might have a bit more concern about the work. So yeah, I think there's a big spike of work in terms of having this mirrored Terraform and what have you, in maintaining the DR site and adjusting its size. But I would also love to have automation around: okay, we need to fail over, so all the tasks that we would normally be doing manually — like, oh, DNS.
C
We need to point it over here, and we need to figure out which IPs — so there are a lot of safety mechanisms where removing a human from the equation in a stressful situation would be helpful to have automated. DNS is the only thing I can think of right now, but I know there's a handful of other things where we want to programmatically kick off things in a sequence. So I think there is a slew of automation here.
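One of those steps, scripted, might look like the following — a sketch that assumes the zone lives in Cloud DNS; the zone name, record and IPs are placeholders, and a real runbook would fetch the current record value first:

```python
# Repoint the main A record to the DR load balancer via a Cloud DNS transaction.
import subprocess

def run(*args: str) -> None:
    print("+", " ".join(args))
    subprocess.run(args, check=True)

def repoint(zone: str, record: str, old_ip: str, new_ip: str, ttl: str = "300") -> None:
    run("gcloud", "dns", "record-sets", "transaction", "start", "--zone", zone)
    run("gcloud", "dns", "record-sets", "transaction", "remove", old_ip,
        "--zone", zone, "--name", record, "--type", "A", "--ttl", ttl)
    run("gcloud", "dns", "record-sets", "transaction", "add", new_ip,
        "--zone", zone, "--name", record, "--type", "A", "--ttl", ttl)
    run("gcloud", "dns", "record-sets", "transaction", "execute", "--zone", zone)

if __name__ == "__main__":
    repoint("gitlab-com", "gitlab.example.com.", "203.0.113.10", "198.51.100.20")
```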
C
C
We would also need to automate updating our list of IPs that our traffic will come from, for the different customers that need to change that within their own infrastructure. So there's a whole bunch of fun things. I mean, honestly, we should probably do a test once we're feeling ready with this, with a couple of marquee customers that are willing to help us — at least a tabletop exercise, like: hey, what would you need to do if we did this? Oh, we would need to work with our IT department to update this allow list of IPs for your new range, et cetera. So yeah, a lot of devil-in-the-details type stuff. Henry, you're right.
A
Okay, but I think, if I look at this, there are at least two things that we just can't do at the moment, one of which is selecting the premium customers' data for Gitaly — I think that capability does not exist, as far as I'm aware.
A
B
Yeah, what would need to be done is to enable the application to route premium customers to certain Gitaly nodes that we specify. The migration of already existing customers could be done by infrastructure — we have ways to do that — but the application would need to route new repositories of premium customers to dedicated Gitaly nodes, and that needs to be built into the product somehow.
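The migration half of that already has an API surface: GitLab's repository storage moves API can schedule moving a project's repository onto a named Gitaly storage. Routing new premium projects there would still need product work, as noted above. This is a hedged sketch; URL, token, project ID and storage name are examples:

```python
# Illustrative only: schedule moving one project's repository to a dedicated
# Gitaly storage.
import requests

GITLAB_URL = "https://gitlab.example.com"
TOKEN = "glpat-REDACTED"  # admin token (placeholder)

def move_project(project_id: int, destination: str) -> dict:
    resp = requests.post(
        f"{GITLAB_URL}/api/v4/projects/{project_id}/repository_storage_moves",
        headers={"PRIVATE-TOKEN": TOKEN},
        json={"destination_storage_name": destination},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()

if __name__ == "__main__":
    print(move_project(1234, "premium-01"))  # move one premium project
```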
A
Okay — I don't know how that works, I just don't, but I think that's something that can be figured out, at whatever level makes the most sense. Okay, so that was kind of exactly what I was hoping for: to get a little bit more into the details here.
C
Sorry, I think there's one more problem — and maybe I glanced over you mentioning this — but I do see an issue with the database potentially being out of sync with some of the snapshots, because that's going to be, at worst, a horrible user experience; it might be really confusing. And one of the concerns I think we should have here is customer support, because we can't scale people, and so with a lot of additional, like —
C
Oh,
I
like
it,
says
the
mrs
are
here
but
they're
not
like.
I
know
we
can
have
templates
and
stuff
for
that,
but
I
think
we
should
figure
out
like
a
way
to
I
don't
know
be
able
to
for,
like
our
data
store,
to
have
knowledge
of
like
what's
actually
there
and
and
some
system
that
can
rectify
and
reconcile.
I
guess
the
the
delta.
A
Yeah
and
like
I
think
this
is
probably
the
most
difficult
thing
like
I
don't
know
much,
but
I
know
that
the
like
the
overall,
like
compartmentalization
of
our
data
in
the
database,
is
not
very
clear-cut.
I
think
so.
B
Yeah, and I think it's going to be different for each of the Gitaly nodes, because they might have snapshots which are older or newer, so I think that's super hard to get right, and I think it will never work perfectly, even using Geo for syncing, because that's not fully synchronous.
A
So you will have some instances — essentially, as soon as you lose any kind of data in one source but not the other, you will have some brokenness. I think that's —
B
I think we need to count on the fact that we need to advise customers, in these cases, to re-push their local repository data if they see issues, because the good thing is that normally all of the customers still have local copies.
A
A
Well, I think the only way to really do this is if you had a Gitaly Cluster setup that allows you to have consistency across regions, but I think that comes with —
B
Praefect isn't supporting this, and the Gitaly team was saying that Praefect isn't built for that, and the latencies would be too high because they try to be in sync — which means that latency would be.
A
B
A
But, you know, I think there are many other things that we could put down as future work. At the moment I like your approach, and we are saying something is better than nothing, right? If, say, a hurricane destroys the data center for GitLab.com, and otherwise we would have three weeks of downtime, but instead we are actually up and running in —
A
— you know, a couple of hours, let's say, for everyone, and people have to re-push their repositories because they lost one MR — I think that's better, and there's obviously a lot of room for future improvement.
C
Yeah,
I
don't,
I
don't
think
we'll
solve
for
the
issue
that
I
I
called
out.
I
think
we
just
need
to
be
prepared
with,
like
a
maintenance
mode
or
whatever,
to
explain
this
to
literally
everybody
that
hey
you're
going
to
see
some
discrepancy
here
and
here's
how
you
can
solve
for
it.
So,
yes,
absolutely.
I.
B
Completely
agree,
one
last
point
I
I
still
also
want
to
mention
is
for
infrastructure.
Is
that
if
gitlab.com
fades
in
a
region,
then
we
need
to
make
sure
that
infrastructure
like
chef,
the
ops
instance
and
all
the
things
that
we
need
to
manage
gitlab.com
and
the
the
starstore
recovery
site
is
also
still
working.
I
mean
we
need
to
really
make
sure
that
we
don't
have
a
single
point
of
failure
in
this
infrastructure.
A
Because
I
think
there
was
a
discussion
like
many
months
ago
about,
for
example,
having
a
secondary
site
for
for
ops.
I
don't
think
that's
a
thing
at
the
moment,
because
people
realize
that
if
that
instance
goes
down
you're
unable
to
actually
do
what
you
would
need
to
do
elsewhere,
I
don't
know
how
that
is.
Centered.
B
You
have
ops
in
a
different
zone
or
region,
and
maybe
I
think
that
was
the
thing
that
we
have
set
up,
but
I
need
to
look
it
up.
Yeah.
B
It
into
a
different,
a
set
for
sure
or
maybe
in
a
different
region,
and
look
into
that
yeah.
A
But
okay,
so
I
think
because
we
we
are
almost
out
of
time
for
this.
What
I
can
do,
maybe
even
today,
is
I'll.
Take
all
of
these
notes
and
create
an
issue
and
say
these
are
some
of
the
the
scenarios
and
how
we,
how
we
see
this,
and
these
are
some
of
the
potential
solutions
for
for
this
right.
A
These
are
some
of
the
likely
pain
points
and
what
we
need
to
do
and
then
I
think
that's
already
a
more
concrete
picture
compared
to
before
right,
and
I
think
then
we
need,
I
think
once
like.
If
we
actually
like
have
some
agreement
on
on
this,
then
we
can.
We
can
look
how
how
to
do
that
right
and
I
yeah,
I
don't
think
we
can
get
around
using
some
cloud
service
provider.
A
Specific
implementations
like
the
disk
snapshots
right
that
that's
with
the
cost
constraints,
and
I
don't
think
I
don't
think
we
will
get
around
that
which
is
fine
right
and
I
think
that's
also,
but
it
means
there
will
be
some
some
specificity
for
how
gitlab.com
solves
some
of
these
problems
versus
maybe
some
of
our
self-hosted
customers
right.
But
that's,
okay,.
B
At
a
certain
size,
every
customer
will
come
to
the
same
conclusion.
I
think
because,
of
course
yeah
I
think
a
good
thing
would
be
to
do
the
calculation
for
the
snapshots
every
hour
or
maybe
something
like
that
for
okay.
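The shape of that calculation could be something like the following back-of-the-envelope, with entirely made-up numbers — the point is the structure (incremental snapshots only store changed blocks, so churn rate and retention dominate), not the figures:

```python
# Back-of-the-envelope for the "snapshot every hour" question. All inputs are
# assumptions for illustration, not measured values or real prices.
TOTAL_GITALY_TB = 500                 # assumed total PD-SSD footprint
HOURLY_CHURN_PCT = 0.2                # assumed % of blocks rewritten per hour
RETENTION_HOURS = 48                  # keep two days of hourly snapshots
SNAPSHOT_PRICE_PER_GB_MONTH = 0.026   # illustrative multi-regional snapshot price (USD)

full_baseline_gb = TOTAL_GITALY_TB * 1024
incremental_gb = full_baseline_gb * (HOURLY_CHURN_PCT / 100) * RETENTION_HOURS
stored_gb = full_baseline_gb + incremental_gb

monthly_cost = stored_gb * SNAPSHOT_PRICE_PER_GB_MONTH
print(f"~{stored_gb:,.0f} GB retained -> ~${monthly_cost:,.0f}/month")
```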
C
A
A
A
A
I don't know — it's a little bit silly on some level, because it mimics — what is the word — skeuomorphism, right? It looks as if it were a physical thing. But in my previous jobs, which were not always working from home, I personally really enjoyed the engineering sessions where you just have whiteboards and stickies and you start drawing these things out; when you capture that and then write it up, I personally find it quite fun.
A
It feels like a good thing to do when you have something amorphous that you need to get into, because otherwise you spend ten weeks in an issue. Okay — I think what obviously interests me is this: it is probably a fair statement to say that most of these things can be accomplished without using Geo at all.
A
Some of these things may work better with what we have already, and I would personally want to consider where we can come in, because we do have this experience — the promotion sequences for a secondary site, putting it into a read-only mode, these things — which is, I think, a separate bit from all of the replication logic, and I think that would be quite —
A
I would really like that to be part of this, because there would be so much good feedback on how to do this at scale that would feed back into the product, even for folks who run the 50k reference architecture, because many of these problems are very similar. I think that would be quite exciting.
C
B
A
Yeah, I actually have an epic open for the team, and I'm meeting with them next week about sort of an MVC to only sync data for specific customers.
A
Customer types, right. The initial thoughts from Mike and Douglas were that we would need to manage a specific sort of relationship in the code, but that it is actually possible. They were concerned about the complexity with, for example, all of these other data types like job artifacts and whatnot, but that's not even required, because that's all in object storage. So we are really only talking about Git data, which is projects, design repositories, snippets — a bunch of different types of stuff that you need to concern yourself with, but —
B
C
B
B
I mean, we have something in our database which says, for each customer, on which shard the repository is located, right. So yes, because —
A
— already. So maybe — I never really understood how storage shards work in Gitaly, to be perfectly honest.
A
Yeah, but in any case, I think what we want to do from the Geo team side is: if we can come to a conclusion on the path we want to take, then we can start thinking about what we can do in the product to support you, and I think we would also be excited to help you set some of these things up.
A
You know, looking at the automation, we have quite a few plans. My biggest plan for Geo overall is to make the promotion sequence more automatic and easier, and I think there are some interesting ways of potentially doing this that are, let's say, more generic — you could handle it with Chef or with Consul or whatever, but the difficulty there is how to manage the state of all of the individual gitlab.rb files. So, cool.