From YouTube: 2021-02-03 GitLab.com disaster recovery working group
A
All right, I think this is pretty much what we'll see for attendance, which is good, because we have enough people around. Skarbek, do you want to take Henry's item?
B
Yeah, we did a test on the 2nd, yesterday. There is a slight issue that flared up again regarding the secrets that prevented us from being able to authenticate to the database upon promotion.
B
As Henry noted, it appears to be a problem from when he was attempting to resolve the initial issue. The second part of this is that the geo-postgresql service ends up being disabled, so when we come back around to demoting the node, our documentation is simply incorrect about this, and he's got an issue to address it. Other than that, the promotion went just as fine as it has in the past. So we still have a few things that we need to work on at this point, but hopefully nothing major for a single-node instance.
B
Perfect. So this is mostly a notification that there's going to be a little bit more strain on the staging environment. The delivery team has an OKR: they're going to be doing some testing of the ability to perform rollbacks, and potentially some automation around that. Because of this, staging is probably going to be blocked at certain points in time so they can perform some testing.
B
So
if
we
need
to
perform
some
testing,
we're
probably
going
to
want
to
coordinate
with
them
at
certain
periods
of
time
just
to
make
sure
that
we're
not
testing
on
top
of
each
other.
C
And I think also, before that, the Geo team is also testing the maintenance mode. So there will be a little bit more, you know, conflict here.
D
So you will probably not do another staging test? I think we are probably just done with it.
B
You have the discussion point — I'm gonna stop putting my name next to things. Brent has been trying to get a conversation started about this, so I created an issue to help jump-start the conversation, because we, infrastructure, are going to need guidance on how we need to build out the secondary site in a multi-node configuration. We all know we have a single node, and we all know that that's not sufficient.
B
There are pluses and minuses to both, which I briefly highlighted — though I probably could have done a better job when I wrote the issue — and both of these will involve a lot of infrastructure time and effort to deploy, no matter which decision we go with. So it would be wise if we could figure out what we need to make this decision as soon as possible, so that we are not constantly blocked in trying to progress forward. That's the purpose of this working group. I would appreciate guidance, either from Marin or Brent, to help guide this.
A
Okay, so first thing: we have 10 people in this meeting. Can people take notes a bit while we are talking? Because it's hard to actually manage the meeting, plus take notes, plus answer questions — so I'd appreciate the help. Skarbek, this is a really good question, and I can make a decision on the spot — that's not the problem — it's just that no one is going to like that decision, so I think we need to talk a tiny bit more about it.
A
I do have a couple of questions for you that we might be able to answer here in this call. First of all, has Geo used GET at all so far?
A
That makes sense. And what kind of resources do you need? Let's say, hypothetically, that we asked the Geo team to actually help out with the setup of GET, right — to build out a secondary site. What kind of access would you need? You would probably need access to the staging GCP account, I'm guessing, to boot up the nodes, but what else?
A
So, Nick — I mean the other Nick, the engineering manager for Geo.
A
So if we provide the service account and the base things that you need to build out the secondary site, and then we take over the part of actually configuring, right, the primary and the secondary sites — and work with you to figure out exactly what is necessary — that would help move this forward a lot, and it wouldn't drag us in the direction of needing to figure out whether we can even use GET within the environment we have. Would that even be an option?
G
Yeah, I definitely think it would. I guess I want to ask Nick W. here as well what he thinks, since I think we'd also be leaning pretty heavily on his knowledge and work with GET to spin all that up.
A
All right, I think the interesting part there will be to figure out also whether we can fit this in our infrastructure as it is now, Skarbek, right — that's going to be the larger challenge. But if two sides are working on this, right — you know, GET's Nick, and someone from infra who knows how to tie this in — then it becomes a significantly easier decision to make: just let's use what we ship to people.
E
But I think we shouldn't underestimate that part as well. I personally, from the product perspective, would very much like us to use something that we also recommend to our customers — because then, if we have a rotation, we do this regularly. You know, I think it's just good on many different levels. Plus — and I think this is maybe a long-term item.
G
And I think this also provides a good opportunity to maybe use GET in somewhat more of a real-world situation, because, you know, there may be a lot of customers who already have a multi-node GitLab setup that's not Geo'd, where they want to add Geo to it. So we can maybe make some improvements to GET to accommodate that scenario.
A
Yeah, that sounds exactly like probably the most common case, actually, right — they already have something that they configured with whatever they use. Skarbek, how do you feel about this?
B
…met with expectations within various team members — so that's the only reason why I shy away from GET. Otherwise, I welcome this and I would highly encourage it, because it would be interesting to see what we could do here, and maybe transition from our current existing infrastructure methods to using this full Ansible setup entirely.
C
Yeah — does it make sense to invite Grant Young to be in this group? Because he is the main, or DRI, of that tool, and he basically implemented all of GET.
C
Not all — I mean, there are other QE members — but Grant is the DRI of this tool, and, to be honest, this will be the direction moving forward: we want to make this the standard automation for deployment, yeah.
A
That also makes sense. I think Nick can currently be that representative while we make the decision, right — you can advise us, Nick — and then we can pull in Grant when necessary. But that's a good idea.
A
Skarbek, on the infrastructure side — I'm not calling the shots here, unfortunately, or fortunately, however I want to say it — but this is one of the opportunities where infra has a chance to start at ground zero, because GET is not really out yet; it's beta. Which means if we get involved right now, everyone will learn over time, and that would also mean that we'll improve it for everyone else.
A
So that's going to be an easy thing to solve, I think. The hard part — for me at least, it seems — is going to be: how do we actually take something that is made to manage the whole lifecycle and have it not manage the whole lifecycle, because we have other things that manage the lifecycle.
B
…in this issue first, because there's a lot of stuff that our Chef infrastructure currently does that I don't know if GET handles appropriately. You know, we do a lot of configuration; there's our own tooling that we put in place on all the servers, and there's all of our SSH keys, etc. There's a lot of stuff that we have already built inside of Chef, and we would have to supplement GET to cover that infrastructure.
A
We'll continue talking in the issue, but I would really, really like us to take that issue and turn it into "what is necessary for us to run GET", then do a gap analysis, and then ship that to whoever owns GET and tell them: this is why we can't use it, and in the interest of time for the Geo — oh sorry, for the disaster recovery working group — we will move on with whatever we move on with, unless you help us out to get this in now. Right? That's an approach.
E
I also think it's important, though, to separate these two things, even though they're very related, right. I think the difficulty here is to decide on the tooling, right, to actually create this environment — the one that runs Geo, in this instance — but, you know, GET can do non-Geo configurations as well. That's sort of the thing that we need to determine. And then, on top of that, this is where Geo functionality actually kicks in, to say: hey, we have a 3k reference architecture.
E
Now we actually need to do the failover, right — how do we do that? And I think this is, for this working group, probably the part more directly aligned with what we're trying to prove. But the other thing is, I think, a great thing to discuss now, because it also highlights how these things are interconnected, right. For us, for our customers, the difficulty may not even be executing the Geo commands. It may be setting all of this up.
A
Yeah, exactly — and in that case, Fabian, it will then be easier to make the case that the primary site also needs to be rebuilt in GET, in order for us to get access to those commands that will allow us to manage the lifecycle of a Geo installation, right. So that's also another transition, and I know it's not strictly related to Geo right now, but it kind of is — yeah, kind of half of the work. Half of the work is actually setting it up, configuring it.
A
Action items from here: let's maybe summarize some of this discussion in the issue, and I'll add an item for my sync with Brent tomorrow to discuss exactly what we need to do here, and how. And it would be good, Nick, to involve you in that as well — async, obviously — to just see what kind of time commitment we are talking about here from the Geo side, and we can hopefully find common ground.
A
Maybe we can start with item number two. I see Davis is in here, so I just wanted to maybe cover some things if necessary — if not, Davis, feel free to call out and say no, everything is written down. Or you can also tell us go or no-go on the 3k architecture in staging; that would be the most useful thing.
H
Yeah, so I just joined because me and Andrew were talking about this, and I guess I'll link issue 34 in here — or did you already link that in here? Okay. So in that issue I'm trying to go over all the different components of disaster recovery, because we're going to need to put this in the plan for next year, because it's definitely going to be over and above our forecast. So we're basically trying to figure out how much each component is going to cost.
H
If we can't separate out paid users and things like that, are there other things we can do to reduce that cost? So, basically, just trying to lay out what the biggest-cost services are, that we'll probably have to bring down, and then, for the ones like web and API that probably aren't as big of a deal, just figuring out how many you actually need, to get a better idea of the cost.
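The costing exercise described here can be sketched as a simple per-component model. Everything below is illustrative: the component names, node counts, and hourly prices are invented placeholders, not actual GitLab.com or GCP figures. The point is only the shape of the calculation — the large stateful services dominate the bill, while stateless web/API nodes are comparatively minor.

```python
# Hypothetical cost model for sizing a DR environment.
# All node counts and hourly prices are invented placeholders,
# not actual GitLab.com or GCP figures.

HOURS_PER_MONTH = 730  # average hours in a month

# component -> (node_count, hourly_price_per_node_usd)
components = {
    "gitaly":   (4, 2.40),  # large stateful storage nodes dominate
    "postgres": (3, 1.60),
    "redis":    (3, 0.40),
    "web":      (2, 0.30),  # stateless, cheap to scale down in DR
    "api":      (2, 0.30),
    "sidekiq":  (2, 0.30),
}


def monthly_cost(comps):
    """Total monthly cost across all components."""
    return sum(
        count * price * HOURS_PER_MONTH
        for count, price in comps.values()
    )


def cost_breakdown(comps):
    """Per-component monthly cost, largest first."""
    rows = {
        name: count * price * HOURS_PER_MONTH
        for name, (count, price) in comps.items()
    }
    return dict(sorted(rows.items(), key=lambda kv: -kv[1]))


if __name__ == "__main__":
    for name, cost in cost_breakdown(components).items():
        print(f"{name:10s} ${cost:,.2f}/month")
    print(f"{'total':10s} ${monthly_cost(components):,.2f}/month")
```

A model like this makes the trade-off discussed above concrete: you can see at a glance which components are worth scaling down in the DR site and which are cheap enough to leave at full size.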
A
I guess what I wanted to find out in this call — and we are almost running out of time — is: would a 3k architecture in staging cause a problem for infra right now? The 3k architecture has a list of nodes — exactly what kind and type you need, and so on. So if we can get unblocked on that, then we can talk in parallel about the larger production story.
H
Yeah, if I remember right, I think that one should be fine — I can talk to Joey about putting it in, just from today. We're trying to figure out the timeline because, for instance, if we needed to make major changes in Gitaly, my question was whether it even makes sense to start testing now, if we're not gonna be able to roll anything out because it'll get blocked by other things in production. That's a good point. It's not as—
E
I'm strongly in favor of having all of these operational and testing things figured out well ahead of time, even if the product development — if that is the case — lags behind. I think we'd learn so much about many other things that it is well worth it. But that's my two cents. It is very likely, though — and I need to talk to Mark Wood, the PM for Gitaly — that we need significant feature development to reduce the cost of Gitaly.
E
I think, overall, that would then enable the DR site as well. And if that is not possible, we may want to resort to selective sync and some configuration bits — but I'm cautioning here that I don't know yet what the best solution for this is. I know what the problem is, and I think that's well understood, but I will need to do more research, also in collaboration with those teams — because, you know, that may be something that, for example, Geo shouldn't actually handle, right; it may be better kept in Gitaly. We don't know yet.
E
So I think the question at hand for me would be: can we deploy a 3k reference architecture in staging — which is, I think, two thousand dollars a month or something like that — for the foreseeable future, for our testing endeavors?
H
Yeah, I think that should be fine. I don't think we'll get any questions asked about that, so I can just talk to Joey about that for now, and then once we know when we're going to deploy Geo to production, we can also put that in. But yeah, that was my only remaining question — whether it still makes sense to do staging if we expect delays in actually pushing to production at some point.
E
I think the other thing — and Marin can correct me, or Skarbek, if this is wrong — is that I thought for a long time that staging is not important: it's just our staging environment, right, so why would we care and have DR for it? But I think the opposite is true. If staging becomes unavailable, we can't go to production; it blocks the entire company. So having the DR capabilities for staging separately is actually also very valuable, right. So it's not that the only value is realized by having Geo in production.
A
Like, it's as simple as that — whether they know it or not, right. Everyone stops if staging is not working.
A
Cool. Would you mind writing that in the issue — and when I say "the issue" I mean issue 34 — just, like, there is a discussion about staging in our architecture. If you give us the go-ahead, we can move on that.
A
All
right
well
I'll,
take
that
as
a
no.
We
have
some
action
items
that
will
take
and
please
continue
doing
the
failovers
in
staging
just
for
your
own
practice.
This
is
more
for
henry
who
might
watch
the
recording
as
well
and
we'll
see
each
other
next
time
have
a
good
one.