Description
John Jarvis talks about deployer and patcher for GitLab.com
Before we begin, I just wanted to say: if you could, ask questions in the meeting agenda. There are also some great questions from the last presentation we did earlier, so maybe read those as well, so we don't ask the same things.

As an introduction: I'm John, a senior SRE on the delivery team, working on deployment and patching. This presentation serves as an overview of what we've been doing in the last month to improve deployment and patching.
We are pretty selective about the changes that we make, and this means there's a bit of manual work right now for the release team, the release managers, and the delivery team to choose which changes go into a release. We then move these release candidates through our pipeline, which right now is: staging, then QA, then canary and canary QA, and then production. I also added a bullet here that mentions that not all of our release candidates make it to production.
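As a rough illustration, that stage ordering could be expressed in a GitLab CI pipeline definition like the following; the job names, stage names, and scripts here are made up for illustration, not taken from the actual deployer project:

```yaml
# Hypothetical sketch of the deployment stage ordering; the real
# deployer's .gitlab-ci.yml will differ.
stages:
  - staging
  - staging_qa
  - canary
  - canary_qa
  - production

deploy_staging:
  stage: staging
  script: ["./deploy.sh staging"]

qa_staging:
  stage: staging_qa
  script: ["./run-qa.sh staging"]

deploy_canary:
  stage: canary
  script: ["./deploy.sh canary"]

qa_canary:
  stage: canary_qa
  script: ["./run-qa.sh canary"]   # QA run against canary only

deploy_production:
  stage: production
  when: manual                     # promotion to production is a deliberate step
  script: ["./deploy.sh production"]
```

Because each stage must pass before the next starts, a failing QA job stops a release candidate from progressing, which matches the point above that not every candidate reaches production.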
What we actually release is the Omnibus package, and this is the same package that we give to the self-managed community. It's a six-hundred-megabyte Debian file for Ubuntu. Of course, we build this package for all different types of platforms, and it contains everything you need in one package to run GitLab: the Rails code for GitLab, plus Workhorse, Gitaly, GitLab Shell, Pages, the container registry, Sidekiq, and GitLab Monitor.
It also contains other binaries that you might need to run GitLab, like PostgreSQL, Redis, Prometheus, and Alertmanager. For GitLab.com, what we do is install this Omnibus across the entire fleet and then selectively enable or disable the services that each node needs. One caveat: while we still use the Omnibus Redis right now, since we moved our HA database solution to Patroni we stopped using the Omnibus Postgres, and we have our own deployments of Prometheus and Alertmanager.
In addition to these monthly releases, we also have security releases, and those have to be backported to the last three releases; that's something that happens every month. We'll also be talking a bit in this presentation about what we do for patches. When I talk about release patches, what I mean are the post-deployment patches that we create when the code has already been deployed to production and we find an incident or a security issue.
So here is the deployment pipeline. You can see that it's broken into environments: we have staging, we have canary, and we have production. What we've done in the last month is transition off the previous deployment tool, called takeoff, to a new tool called deployer. It leverages GitLab CI/CD and allows us to create a pretty flexible pipeline for deploying the Omnibus. One nice thing about the new deployer is that it's flexible enough.
It goes all the way from staging to production. Each environment has pre-checks that look for things like critical alerts, health checks on the load-balancer tier, version mismatches, and so on. Also, at each environment we've added QA smoke tests that run a set of tests using GitLab QA at the end of each stage. For the canary stage, what we do is run GitLab QA with the canary cookie set, so that it exercises just the canary infrastructure.
So, basically, when we deploy to a fleet of machines, there is a sequence of operations that happens. To start off, a pipeline is initiated with two variables, the deploy version and the deploy environment, and then we deploy to a lot of different fleets, sometimes in parallel, and we do that in batches of 10%.
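As a sketch, the two inputs that kick off such a pipeline might look like this; the variable names and values here are assumptions for illustration, not the deployer's actual ones:

```yaml
# Hypothetical pipeline-trigger variables for the deployer:
# one selects what to deploy, the other selects where.
variables:
  DEPLOY_VERSION: "12.9.202003092150-abc123.def456"  # made-up Omnibus version string
  DEPLOY_ENVIRONMENT: "gstg"                          # e.g. staging
```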
We can also, when we initiate this pipeline (I mentioned this earlier), pass in multiple environments, which would create a longer pipeline that goes, say, from staging through canary, but we don't currently do that. Now, every Omnibus installation goes through the same sequence, which I've illustrated in the diagram. This is the deploy playbook, the definition for deploying to a single fleet: it works in ten-percent batches, and we first drain connections from HAProxy.
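A minimal Ansible sketch of that per-fleet sequence might look like the following; the role variable, helper scripts, and group names are hypothetical, and the real playbooks are more involved:

```yaml
# deploy.yml (illustrative): roll through one fleet in 10% batches,
# draining each node from HAProxy before upgrading it.
- hosts: "{{ deploy_role }}"
  serial: "10%"                    # upgrade at most 10% of the fleet at a time
  tasks:
    - name: Drain connections from HAProxy
      command: /usr/local/bin/haproxy-drain {{ inventory_hostname }}  # hypothetical helper
      delegate_to: "{{ item }}"
      loop: "{{ groups['haproxy'] }}"

    - name: Install the new Omnibus package
      apt:
        name: "gitlab-ee={{ deploy_version }}"
        state: present

    - name: Re-enable the node in HAProxy
      command: /usr/local/bin/haproxy-enable {{ inventory_hostname }}  # hypothetical helper
      delegate_to: "{{ item }}"
      loop: "{{ groups['haproxy'] }}"
```

The `serial: "10%"` keyword is what gives the batching: Ansible completes the whole task list for one batch of hosts before starting the next.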
A
It's
gotten
a
bit
better
in
the
last
month,
because
we
have
new
tooling.
That
makes
it
one
it's
a
little
bit
more
self-service
for
developers
and
too
it's
automated.
You
see,
I,
see
I,
see
I,
see
D,
which
means
that
we
don't
have
to
manually
patch
the
fleet
anymore.
If
you
do
need
to
create
a
post
deployment
batch-
and
this
is
usually
for
an
s1
or
s2
incident.
What you do is submit an MR to the patches repo. It's versioned: there are directories in there for each release version. You can test the patch on staging from your branch, and then, once you merge it, it gets deployed in a pipeline all the way to production. I put some bullets here that explain the benefits of this new patching tooling; for example, applying patches used to be entirely manual.
Now patches are only applied through CI/CD. Also, and this is fairly recent, patches can be reapplied automatically on new deployments. So if we're deploying a new version of the Omnibus and we need to keep a patch in place (a fairly unusual situation for us), the new deployer tool handles it. Another nice benefit is that the patching tooling shares the same configuration with the deployment tooling, so it's all using the same configuration code, which makes things a bit cleaner. I put a link here.
This slide describes a little bit about the deploy tool's development. We chose Ansible because it suits this idea of orchestrating SSH across many different fleets of servers very well, and it's batteries-included, meaning it ships with a lot of the things that we typically do, like installing packages and dealing with HAProxy.
This gives you a brief summary of the different projects, the repositories that we use for the deployer and patcher. There are essentially three projects right now that have pipelines: the deployer, the patcher, and the registry. We started these projects just to contain a .gitlab-ci.yml, and that's pretty much it; they have Git submodules pointing at the actual tooling repository that contains the Ansible code. And then there's also the patcher project, which contains not only the .gitlab-ci.yml but also the post-deployment patches themselves.
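A thin wrapper project like that might contain little more than a .gitlab-ci.yml along these lines; the submodule path and job definition are assumptions, though `GIT_SUBMODULE_STRATEGY` is a real GitLab CI variable:

```yaml
# Hypothetical .gitlab-ci.yml for one of the wrapper projects: pull in
# the shared Ansible tooling as a Git submodule and run it.
variables:
  GIT_SUBMODULE_STRATEGY: recursive   # fetch the tooling submodule in CI

deploy:
  stage: deploy
  script:
    - cd ansible-tooling               # assumed submodule directory
    - ansible-playbook deploy.yml
```

Keeping the Ansible code in one submodule shared by all three wrapper projects is what lets the deployer and patcher stay in sync on configuration.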
I wanted to add some links here for future improvements. One thing that we're working on now is directing all requests made to the gitlab-com and gitlab-org groups to canary by default, with the ability to opt out. The advantage of this is that it allows us to test more internal traffic on canary. It also allows us to test traffic to other services, like Git over HTTPS and the registry, that use the gitlab-com or gitlab-org paths. This does not mean that all GitLab traffic is going to go to canary; it just means that we'll have the ability to send more internal traffic there.
We're also looking at daily deployments to staging, and we have a proposal in flight for adding a one-box environment: essentially, before we deploy to all of production, we can deploy to a single node in each cluster, which gives us a little more data and metrics before we decide to promote the release to the entire fleet. And I've linked here to all the things that we're doing to reduce the number of manual steps during releases.
So we essentially deploy to fleets by their Chef roles, and in the Ansible config you specify a role. This is actually done in the .gitlab-ci.yml: for each of the jobs we set a role, that role is passed to Ansible, and it ends up in the hosts list for the deploy. It begins with a deploy.yml file, and then that maps to a bunch of hosts.
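That role plumbing could be sketched like this, with assumed job, variable, and role names on both sides:

```yaml
# In the wrapper project's .gitlab-ci.yml (illustrative): each job
# pins the role it deploys to and hands it to Ansible.
deploy_web:
  stage: deploy
  variables:
    DEPLOY_ROLE: gitlab-web          # hypothetical Chef role name
  script:
    - ansible-playbook deploy.yml -e "deploy_role=$DEPLOY_ROLE"

# In deploy.yml, the role becomes the hosts pattern, which Ansible's
# inventory resolves to the machines carrying that Chef role:
# - hosts: "{{ deploy_role }}"
#   ...
```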
Are there any other questions? Otherwise we'll just end early. I can take the next one. So I have two questions. The first is about rollback: I'm glad to see that we have a naming convention for rollback now. Are we still doing a roll-forward mechanism, or is it a true rollback? What does that look like?
There's actually an open issue for handling rollbacks a bit better than we do right now. For now, there are two kinds of rollback. One is for post-deployment patches: to roll back a post-deployment patch, you just change the file extension to .rollback, and then the pipeline will roll back the patch. The other type of rollback is when we've actually deployed an Omnibus and we need to roll back the Omnibus itself.
For that, what we want to do is deploy the previous version gradually across the fleet to ensure that it's done safely. If we ever get all the way to production, past the post-deploy migrations, and we have to roll back, usually we're trying to roll back as fast as possible. But we are looking into safer ways to roll back.
Thank you. My last one is around post-deployment patches that involve a database migration. For context, we cleaned up some database discrepancies last quarter. Are there any processes that make sure that if a post-deployment patch has a database migration, that migration makes its way into master, so that we don't have database schema discrepancies between production and what's in code?
So if there's an S1 or S2 performance issue we need to address in production, we submit a post-deployment patch that involves a migration or a schema change in the database, and it goes into production. Are there any guardrails to make sure that the change makes its way back into master? We had to clean up some of this last quarter, and I think it's just a matter of making sure the process is there so that the change makes its way back into master, specifically for all schema changes. Yeah.
You're right, John, I think your question makes sense, because let's say we had to submit a merge request that had a database migration that actually changed the schema; we haven't actually run into that case yet. I think there was one case this last release where the approach wasn't right, and some of those columns were reset manually rather than in a migration. So it's a good question; I think we should open up an issue to figure it out.
Sounds good, and it kind of reminds me of the current issue with post-deployment migrations, because oftentimes we consider them the point of no return, yet they happen during our deployment pipeline for production. Post-deploy migrations may delete data from the database, like evicting a column in a table or something like that. So these are open discussions as well, in the context of rollback: when should those post-deployment migrations be run?