GitLab Geo Group, 23 Sep 2020

From YouTube: GitLab Geo Preso

Description

Nick Nguyen, Engineering Manager for Geo, provides an overview of GitLab Geo

A

Thank you all so much for joining us for another exciting week of the Customer Success Skills Exchange. This has been a hotly requested topic: GitLab Geo. I think it's our number one upvoted topic, so I'm really excited to dig into this today. We're so lucky to have Nick and team with us to chat a little bit more about it in depth. We'll be starting with a Geo overview and then diving into some more technical topics, including HTTP and HTTPS, load balancing, failover (so promoting that secondary node), Geo as a backup strategy,

A

Failover runbook reviews, and installation and upgrading fun. So without further ado, Nick, the stage is yours.

B

Hey, thanks for having us here. We're excited to talk some more about Geo. I didn't realize this is the number one upvoted topic, so hopefully we do it justice and answer everyone's questions. I'm Nick Nguyen, engineering manager for Geo. I'm also joined by Fabian, our product manager.

B

I believe Mike, one of our staff engineers, is on the call, as well as, oh, I see Gabriel just joined, and Catalin, a support engineer specializing in Geo, is also with us. So there are plenty of people here to help answer any questions that y'all might have. I know there is probably a varying level of Geo familiarity on this call, so I wanted to do an overview.

B

I wanted to do an overview of Geo real quick and then dive into some of the more specific technical topics that were brought up in the original planning issue for this, and then we can get into any more Q&A. So I will share my screen.

B

All right, can everyone see the presentation?

B

Awesome. All right, so what is Geo? Geo, at its most basic level, is a way to provide a read-only secondary GitLab instance of a customer's main GitLab instance. We use the term Geo primary: that is the main GitLab installation that you can read and write to. Geo then provides the ability to set up secondary Geo sites. A secondary is a read-only mirror of the primary GitLab instance.

B

There can be multiple secondary sites, but there can only be one primary. You can perform Git operations on the secondaries for increased performance. If you look at the graphic here, our standard example is that you have maybe an office in, say, San Francisco and then a European office in the Netherlands, so you have two development teams, one at each of those sites, and maybe your primary GitLab instance is in San Francisco.

B

But you might have large repos that you want the development team in the Netherlands to have quicker access to. So you can set up a Geo secondary in that office, and operations like Git pulls and browsing the GitLab web UI will be quite a bit faster because of the proximity.

B

You can do some operations that might resemble a write on the secondary, but those are proxied from the secondary to the primary. For example, you can do a git push to the secondary and that will be proxied to the primary. Another thing to understand about Geo is that it's eventually consistent, meaning that there can be replication lag between the primary and secondary.
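
A minimal sketch of the proxied push and the replication lag described above. This is a toy in-memory model, not GitLab code: the `Primary` and `Secondary` classes, the `sync` method, and the branch and commit values are all hypothetical names chosen for illustration.

```python
from collections import deque


class Primary:
    """Toy primary site: the only place where writes are applied."""

    def __init__(self):
        self.refs = {}         # branch name -> commit id
        self.events = deque()  # changes not yet shipped to the secondary

    def push(self, branch, commit):
        self.refs[branch] = commit
        self.events.append((branch, commit))


class Secondary:
    """Toy read-only secondary: serves reads locally, proxies pushes."""

    def __init__(self, primary):
        self.primary = primary
        self.refs = {}

    def fetch(self, branch):
        # Reads are answered from local data, which may be stale.
        return self.refs.get(branch)

    def push(self, branch, commit):
        # Looks like a write to the user, but is forwarded to the primary.
        self.primary.push(branch, commit)

    def sync(self):
        # Apply every pending event; afterwards the secondary has caught up.
        while self.primary.events:
            branch, commit = self.primary.events.popleft()
            self.refs[branch] = commit


primary = Primary()
secondary = Secondary(primary)

secondary.push("main", "abc123")
stale = secondary.fetch("main")  # still None: the change has not replicated yet
secondary.sync()
fresh = secondary.fetch("main")  # now "abc123": the secondary caught up
```

The gap between `stale` and `fresh` is the replication lag: the push landed on the primary immediately, but the secondary only sees it after syncing.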

B

It's not an active-active setup, so they're not always in sync. The secondary can be a little behind the primary, but if you waited, things would eventually catch up. All right, so what are the primary use cases for Geo? One is geo-replication, the scenario I just described. Maybe you have multiple offices and you want your developers to have better performance when using GitLab.

B

We call that geo-replication, and it can reduce the time it takes to fetch and clone repositories and increase developer productivity for distributed teams. The other major use case for Geo is disaster recovery. For customers that have business continuity and fault tolerance needs, Geo can provide a warm standby that customers can fail over to in the event of a disaster.

B

So that is another key reason why customers choose to go with Geo.

B

A real quick overview of the architecture. A lot of this is just stuff that I'm pulling from our docs and summarizing, so you can get a lot more in depth in the documentation. If we look here, we have a Geo primary and secondary.

B

We utilize Postgres streaming replication to replicate the entire database of the primary to the secondary node. For customers that use LDAP, we recommend that there's also a replica of that server on the secondary site. And then for data that's not stored in the main Postgres database, things like attachments, CI artifacts, LFS objects, as well as Git repositories,

B

We use JWT-based authentication to transfer those from the primary to the secondary. If we look here, there's a Geo tracking database that helps the secondary know what needs to be synced in terms of these files and repositories, and there's a Geo Log Cursor process that runs on the secondary to facilitate that syncing. If you see here, there's a little line that says FDW. Prior to, someone correct me

B

If I'm wrong, GitLab 13.3, we used foreign data wrappers for cross-database queries between the Geo tracking database and the main Postgres database. As of GitLab 13.3, the need for foreign data wrappers has been removed. So that was a really exciting achievement for the Geo team. We still need to update this architecture diagram but, as a result,

B

There's been a pretty nice performance improvement in some of these queries, and Geo is also easier to set up and maintain. So just a note there. Now some limitations and notes about Geo. One big one is that not all data types are replicated or verified yet.

B

That's one of the big focuses of the Geo team right now, but I think it's important to understand these limitations so that, when talking with customers, we know whether we're replicating everything that they care about. There's this table, which I've linked to here, that we keep updated with all of the possible data types that someone might care about replicating. As new ones are added,

B

We update this table in the documentation and also try to link to the issue for implementing those, so that people can get a sense of when we plan to add replication, because it can sometimes be the case that a new data type is added but replication won't be supported until a few milestones down the line.

B

Another thing to consider is that initially setting up Geo and doing failovers are highly manual, multi-step processes right now. That is another thing that the Geo team is trying to improve.

B

We do support Postgres HA clusters on Geo primaries, but right now doing that on the secondary node is not supported. There are also limitations and risks when it comes to failovers of Postgres clusters on the primary node: if you fail over, the Geo secondary doesn't necessarily know to follow the new leader right away. Those are also some issues we're working through. Selective sync is available as part of the Geo solution, so you can specify a group or a storage shard to replicate on the secondary.

B

This helps with the amount of non-database data that needs to be replicated, but it will still replicate the entire Postgres database, so there's currently no solution for only replicating parts of the main database. And then, as I mentioned before, active-active is not supported. It's a warm standby that's eventually consistent, so there's going to be some replication lag between the primary and secondary.
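
To make the selective sync limitation concrete, here is a small sketch, assuming a simplified model in which each project has a group and a storage shard. The `Project` type and both function names are hypothetical and only illustrate the behavior described: the filter narrows the non-database data, while every database row is still replicated.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Project:
    path: str
    group: str
    shard: str


def files_to_replicate(projects, groups=None, shards=None):
    """Non-database data (repositories, files) honors the selective sync filter."""
    if groups is not None:
        return [p for p in projects if p.group in groups]
    if shards is not None:
        return [p for p in projects if p.shard in shards]
    return list(projects)  # no filter configured: replicate everything


def database_rows_to_replicate(projects):
    """The main database is streamed in full, whatever the filter says."""
    return list(projects)


projects = [
    Project("eu/app", group="eu", shard="default"),
    Project("us/app", group="us", shard="default"),
    Project("eu/lib", group="eu", shard="storage2"),
]

synced_files = files_to_replicate(projects, groups={"eu"})
synced_rows = database_rows_to_replicate(projects)
```

Here `synced_files` holds only the two `eu` projects, while `synced_rows` still holds all three, which is why selective sync reduces storage and transfer for files but not for the database.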

B

And I want to talk about the Geo roadmap. This is a very dynamic area in that we're constantly working to improve the Geo solution, so things that may be limitations right now, we hope will not be limitations going forward. Those are things we're working really hard on.

B

So one of those is disaster recovery viable maturity, and this, I believe, is about 80% of the way there. We use epics to track our category maturity, and that epic, I believe, has about 80% of the issues closed out.

B

One of the big things that we did as part of this was to implement a self-service framework to make it easier for developers, not just in Geo but outside of the team, to add support for Geo replication to new data types. So we're trying to make progress and make it easier to get that data types table to have a lot more yeses instead of nos.

B

We implemented the initial self-service framework, and then we've been working on adding new data types, such as external MR diffs and versioned Terraform state files. Currently we're working on replicating snippet repositories. After that, as part of disaster recovery complete maturity, we want to add more data types. We want to move some of the existing data types that we replicate over to the self-service framework.

B

We also want to implement verification.

B

Other complete maturity things that we want to do: we want to improve the failover documentation and make it more linear, with runbook-style instructions. We're currently working on a read-only maintenance mode, which isn't available right now. We're also working on being able to support high-availability PostgreSQL on Geo secondaries using Patroni. Postgres 12 is going to be available.

B

It already is available as an opt-in, and I think it's going to be the required version in GitLab 14. Along with that, repmgr is not supported, so Patroni is the preferred solution for HA Postgres going forward. And then the other big thing that we're working on for complete maturity is

B

We want to make it a lot easier to promote a secondary, especially for customers with very large reference architectures that have a lot of nodes for all the services. Right now, they have to

B

Go into each of those and manually promote those nodes, update the gitlab.rb file, change configuration, and reconfigure. We want to make that a lot easier, and the way we're doing that is to start with just a single command that will promote a node, and then go from there.

B

We also have a lot of UI/UX improvements in the works, and after the disaster recovery work, we want to work on things that we would consider part of geo-replication complete maturity.

B

One of the big components of that is essentially what we call secondary mimicry, where we hide some of those details about the secondary, so that a user who's using, say, the web UI from a secondary site doesn't get that big message that they're on a read-only instance and can only do read-only operations.

B

We want it to be like they're using GitLab on the primary, even if some of the operations that they're doing there are proxied back to the primary.

B

So that's a big component of the complete maturity work that we hope to get to soon. All right, so that was the Geo overview. Now, I've made some slides for each of the technical topics that were listed out in the issue. Please feel free to jump in with any questions as I'm going through them.

B

One question that was asked was: is it possible for Geo to communicate via HTTP, or does it have to be HTTPS? Yes, it's possible. It's generally a bad idea, because you're going to transfer data between Geo nodes unencrypted. You can certainly do that if you're just setting up a demo instance or something like that, but for any production Geo deployment we definitely recommend that traffic is encrypted.
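
As a small illustration of that recommendation, the sketch below flags node URLs that would carry traffic unencrypted. It is a hypothetical pre-flight check, not part of GitLab; the function name and the example hostnames are assumptions.

```python
from urllib.parse import urlparse


def unencrypted_nodes(node_urls):
    """Return the node URLs whose scheme is anything other than HTTPS."""
    return [url for url in node_urls if urlparse(url).scheme.lower() != "https"]


# Hypothetical node URLs for a demo setup.
nodes = [
    "https://primary.example.com",
    "http://secondary.example.com",
]
flagged = unencrypted_nodes(nodes)
```

For a production deployment the `flagged` list should be empty; for a throwaway demo instance, a plain HTTP entry may be acceptable.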

B

There's a link to some setup documentation there on the slide. Load balancing also came up and, yes, it is possible to have multiple secondary nodes behind a single load balancer. Currently, putting a primary behind the same load balancer as Geo secondaries isn't supported; that would require secondaries to be writable. We do have an issue open for it. You can also provide GitLab users with a single location-aware Git remote URL, so that users will automatically use the Geo node nearest to them.

B

We also have some instructions about how to set that up in the docs.
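
One way to picture a location-aware URL is a resolver that sends each client to the closest site. The sketch below picks the nearest site by great-circle distance; the hostnames, coordinates, and function names are assumptions for illustration, and the actual setup described in the docs may rely on a DNS service rather than code like this.

```python
import math

# Hypothetical sites with approximate (latitude, longitude) in degrees.
SITES = {
    "primary.example.com": (37.77, -122.42),  # San Francisco
    "secondary.example.com": (52.37, 4.90),   # Amsterdam
}


def haversine_km(a, b):
    """Great-circle distance in kilometers between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    h = (
        math.sin((lat2 - lat1) / 2) ** 2
        + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2
    )
    return 2 * 6371.0 * math.asin(math.sqrt(h))


def nearest_site(client_location, sites=SITES):
    """Return the hostname of the site closest to the client."""
    return min(sites, key=lambda host: haversine_km(client_location, sites[host]))


paris_client = nearest_site((48.85, 2.35))
los_angeles_client = nearest_site((34.05, -118.24))
```

With these example coordinates, a developer in Paris resolves to the secondary and a developer in Los Angeles resolves to the primary, while both use the same remote URL.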

B

Failover and promoting a secondary node. As mentioned, this process involves quite a number of manual steps. I believe some customers have built their own tooling to automate the process, but we don't currently have an official GitLab automated tool to help do this.

B

One project that has the potential to provide some automation around failover is the GitLab Orchestrator project. If you haven't seen it yet, I definitely encourage you to check it out. It's a project that the Distribution team is working on and is currently getting ready for an alpha release, so there's still lots of development ongoing there and it's not ready for production usage yet.

B

But it's definitely on the roadmap to be able to do things like easily upgrade GitLab deployments on larger reference architectures. I think being able to mostly automate a failover is on the roadmap as well. I think a user would still have to trigger the failover.

B

It's not on Orchestrator's roadmap to monitor a GitLab installation and know if a failover is needed. And yes, we have work in progress to create a command that promotes a secondary to primary and to simplify the failover documentation. Right now, it's a little jumpy.

B

You have to look at multiple pages. We just merged an MR to add runbook-style documentation for single-node Geo installations, and now we're working on that style of documentation for a multi-node Geo install. And then, finally, I linked to a single-node failover demo from January 2020.

B

So it's still pretty current. We do regular upgrade demos, and doing more regular demos of failovers and Geo installs is something that we've been discussing.

B

All right, Geo as a backup strategy. So this came up. I believe the question was: if you're using Geo, do you still need to take backups of your GitLab installation, or is Geo enough? Geo can definitely be part of a backup strategy. It's a warm standby.

B

So if something does go wrong, you can fail over to the secondary and most, maybe all, of the data you care about will be there. But Geo does not replace the need to keep regular backups. Even if Geo replicates everything a customer cares about, Geo follows the primary very closely. So a potential scenario where Geo would not be great as a backup strategy

B

Is if, say, someone performed an upgrade of their GitLab version on the primary and a bug causes tables on the primary to be erased. Well, the secondary can follow pretty quickly behind the primary, so you would also replicate this change to the secondary. In that case, your secondary would be in a bad state and you would have wanted a regular backup of your primary instance.

B

So Geo is definitely part of a backup strategy, but we still recommend keeping regular backups of the primary instance. And then, of course, there's an attack scenario too, where customer servers are compromised by a third party and all the servers in the same network start encrypting data. This would affect the primary and the secondary. So that's another situation where you would still want a backup that's independent of your Geo secondary.
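
The difference between a replica and a backup can be shown with a toy model, assuming plain dictionaries stand in for the instances. The `replicate` function and the table names are hypothetical; the point is only that replication copies damage along with everything else, while a point-in-time snapshot does not.

```python
import copy

primary = {"projects": ["app", "lib"], "issues": [1, 2, 3]}
backup = copy.deepcopy(primary)     # independent point-in-time snapshot
secondary = copy.deepcopy(primary)  # replica that keeps following the primary


def replicate(source, replica):
    """Make the replica match the source exactly, including deletions."""
    replica.clear()
    replica.update(copy.deepcopy(source))


# A faulty upgrade wipes a table on the primary...
primary["issues"] = []
# ...and replication faithfully carries the damage over to the secondary.
replicate(primary, secondary)

lost_on_secondary = secondary["issues"] == []
recoverable_from_backup = backup["issues"] == [1, 2, 3]
```

Both flags end up `True`: failing over to the secondary would not bring the erased table back, but restoring from the independent backup would.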

B

Something that wasn't a topic anyone brought up, but that I wanted to highlight, is that the Geo team is happy to review failover runbooks for customers with complex architectures. Recently, two large customers successfully completed disaster recovery tests, and that was a really nice accomplishment and a really great thing to see.

B

We don't often get to see Geo being used in the real world for the purpose it was created for, so this was a great example of that happening: a planned failover test, successfully completed. To help prepare, Geo team members and support engineers reviewed the runbooks to raise any potential concerns. So we just wanted to put it out there that if anyone has customers that are planning a failover, please open an issue following the Geo support process.

B

We're happy to review these plans. It helps us reduce potential day-of issues and helps us improve the product, just by providing some more insight about how customers are using Geo.

B

Installation and upgrading. As mentioned, installation can be quite an involved process and, of course, gets pretty complex when you're trying to install Geo to mimic a larger reference architecture. One way that

B

We have to spin up a single-node Geo deployment quickly on GCP is our Ansible GitLab Geo playbooks, so I definitely recommend those if you ever need to just spin up a quick demo instance. The GitLab Orchestrator is also working on being able to spin up different reference architectures, and Geo will be a part of that. It's currently focusing on GCP, and AWS support will follow. So those are two ways to quickly deploy Geo.

B

I believe the Performance Environment Builder can also do that, and they're adding support for upgrading Geo. One direction that we want to head in eventually is to have all of these tools to set up Geo converge on a single solution, and GitLab Orchestrator is aiming to be that solution eventually, but right now it only handles certain cases.

B

So there is a need for these different tools, but we're hoping that eventually GitLab Orchestrator will be able to handle all the common deployment scenarios for Geo. I linked to a recent UX showcase that was done about the installation experience.

B

You can see some of the challenges that the customers we've interviewed have had setting up Geo, and some of the ideas and plans for improving that experience. We also do regular upgrade demos. We're actually a little behind: our most recent demo is from 12.8 to 12.9, and we record these. We're a little behind because we discovered some issues with doing zero

B

Downtime upgrades on the Geo setup that we were using, so we wanted to take a little step back and open up some issues to investigate why that's happening. But the plan is to keep up and do these upgrade demos with each GitLab version that comes out. All right, and then other resources. First of all, I'll just put in a plug for the Geo Slack channel. We're always happy to help answer questions there.

B

I think our engineers and support team are quite responsive there, so if you have any questions, please don't hesitate to reach out on Slack in the Geo channel. I've also linked to the administrator docs, the development team handbook page, which also has some more information about how best to engage with the Geo team for support, and our epics roadmap, which is nicely organized by product maturity and then drills down from there.

B

So you can get a good overview of what we're working on. I believe that does it for the presentation. Thank you so much for your attention, and I guess I'll turn it over and see if we have any questions.

A

We have lots of questions.

C

Awesome. Sorry to interject, this is Fabian. I have to cruise and go to another call relatively soon, but I just wanted to say thank you so much for giving us the opportunity to talk about Geo.

C

One of the things that I benefit a lot from, and I think the product benefits from, is getting feedback from you. So if you have customers or prospects or any questions, please ask them in the Geo chat. Also, if there are customer conversations that you would like us to join, reach out. This is not bothering us or putting extra workload on us; this is one of the major things that allows us to build the product that folks actually want to use.

C

So we really appreciate any feedback and all the questions. I tried to answer many of the questions async in the doc, so I hope my words and the words of the team help. I unfortunately need to sign off, but you're great. I've had the opportunity to work with many of you already, and thanks so much.

B

I'm signing off. Thank you.