From YouTube: Overview on the GitLab release process
Description
Amy Phillips (Engineering Manager, Delivery) and Daniel Fosco (Senior Product Designer, Release) go over the release process for the GitLab application, and how the Delivery group is working towards full CD on GitLab.
Agenda & Notes (internal): https://docs.google.com/document/d/12plxvquQvhXie038FqjGmJ2CdDjPFlGrAvYn-U7hZng/edit
A: All right, so I'm here with Amy Phillips, Engineering Manager of the Delivery group. We're having a chat about how we at GitLab deliver the GitLab application itself, and to what extent we use our own release features in that life cycle. Amy, I'm sorry, I missed your request to look at the agenda, but I hope you were able to take a look.
B: Yeah, of course. So we're deploying gitlab.com with a continuous delivery process: we try to deploy gitlab.com as frequently as we can, which at the moment is generally around three or four times a day; that's the current schedule. What we have set up is scheduled branches that get created, and they're created at times that suit the release managers' working hours, basically.
B: We'll cut a branch from master, which is an auto-deploy branch that gets created, and that will build and head into staging; it's an auto-deploy onto our staging environment.
B: If the tests and the pipeline succeed, we also deploy to canary. Canary is a subset of production that shares the database but has its own servers, and if the deployment to the canary environment is successful, we leave it sitting on canary for an hour.
B: People within GitLab are generally using canary, and some external users have also opted in to use canary, so we get a reasonable amount of traffic. If everything's looking healthy on canary, we can choose to do a manual promotion to production.
B: We coordinate all of these things. At the moment we're running a hybrid environment, so we're using VMs as well as Kubernetes, and we're coordinating all of our deployments using the bridge jobs feature that just came out a couple of milestones ago. Would it be helpful for me to show you that bridge job?

A: I think so.
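The "bridge jobs" mentioned here are GitLab CI's trigger jobs for downstream, multi-project pipelines. A minimal sketch of how deploys could be chained across environments this way; the project paths and job names below are hypothetical, not GitLab's actual deployer configuration:

```yaml
# .gitlab-ci.yml sketch. 'strategy: depend' makes each bridge job
# mirror the status of the downstream pipeline it triggers.
stages:
  - staging
  - canary

deploy:staging:
  stage: staging
  trigger:
    project: ops/staging-deployer   # hypothetical downstream project
    branch: master
    strategy: depend                # wait for the downstream pipeline result

deploy:canary:
  stage: canary
  trigger:
    project: ops/canary-deployer    # hypothetical downstream project
    strategy: depend
```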
B: So we run pretty much everything via ChatOps, and the announcements channel is our main source for keeping track of what's happening. Let me take you back to the beginning of the day: for me, as the release manager in the EMEA time zone, early in the morning the first package starts getting created.
B: We can ignore this one, but what happens here is we go into staging (we had a staging failure here) and we pass through to canary. We can see the package numbers: every time we pass through an environment, it posts out here with our package, so we can see those. But what we're actually doing the coordination with is the coordinated pipelines, so I'll show you this one.
B: So what we have here is that we tag on the auto-deploy schedule, and that will go for a build.
A: On that manual job, what's the interface for you to approve it? Do you do it through ChatOps as well?
B: We do it here: we come in here and hit the play button. And what's happening alongside, which is one of the reasons some of our things are different, is that every milestone, every release, we auto-generate a release issue, and this is what the release managers use throughout the whole month. So we have one here.
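The play button Amy hits is GitLab CI's manual job. A sketch of what a promotion gate like this could look like, with a hypothetical helper script and variable standing in for the real promotion tooling:

```yaml
# 'when: manual' renders a play button in the pipeline view; nothing
# past this job runs until a release manager clicks it.
promote:production:
  stage: production
  when: manual
  environment:
    name: production
  script:
    - ./bin/promote-package "$PACKAGE_VERSION"   # hypothetical helper and variable
```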
B: We have steps for the normal things around creating the monthly release, but we also add in any special steps that need to happen that month. For example, we'll add in additional steps that we need to run around Family and Friends Day: we go into a soft production change lock on Family and Friends Day, so we'll pause the deployments for those times.
B: We post a comment here, and we record who did the promotion and which package they promoted. And we have a couple of extra things that we check before we do a promotion: we're checking there are no active incidents running on production, we're checking there are no active change requests of severity one or two, and we check the health status as well. If any of these fail, the promotion to production will fail and it will be logged on here. We probably have some, actually.
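These pre-promotion checks live in GitLab's internal release tooling, so the following is only a plausible shape for such a gate, with hypothetical helper scripts: a job that fails fast and thereby blocks the promotion behind it.

```yaml
# If any check exits non-zero, this job fails and the production
# promotion that depends on it never runs. Helper scripts are hypothetical.
check:production:
  stage: checks
  script:
    - ./bin/check-active-incidents                   # no open production incidents
    - ./bin/check-change-requests --criticality 1,2  # no active C1/C2 change requests
    - ./bin/check-health-status                      # environment health
```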
A: Right, one good question: what's a change request in this context?
B: If there's a change that needs to happen (I think we have one open, actually), if we want to make a change to production or any environment and it's not automated, we have a handbook process: we open a change request, and this allows us to record the process that we're following.
B: I'll pop it under the validation, because it's sort of related. This allows us to detail out the process we'll follow and any approvals that are needed. So it's just another issue, and it gets marked up with a change severity, sorry, criticality. They have different approval levels around these things, so we're checking anything that's a criticality one or two; our deployment process assumes that's risky enough that we wouldn't want to automatically deploy to production.
B: Here we go. So last week, when I triggered the deployment, the production checks failed. I'll show you: this is one of the first steps of the production promotion pipeline, and it failed because this change request was in progress, and it was updating an index on the database.
B: So in this case, because of that failure, the deployment didn't go through. But what we do have is the option of overrides. On this one, which I put through ChatOps, we have an override command where we can add on an override flag.
B: We add on a reason for why we're overriding, so all the details are here, and then the SRE who's on call also adds a comment here to approve it, so a release manager can't just override a production check and then go ahead and do a deployment. This is the additional compliance piece that we track around what is being deployed to production and who's doing those deploys.
B: Yes, so those are the additional checks. I'll just show you: the deployment is in progress right now, so let's take a look.
B: So this is what the production pipeline looks like. We have various checks that we do here, and we warm up the environment. This is where we check whether we're in a change lock: say, if we try on a Friday to just push a deployment out, it will fail at this stage.
B: It also applies over the weekend: we have a change lock that runs between Friday evening and Monday morning, so that we can't automatically push over the weekends as well, since we don't have many SREs available if needed. We have a prepare job, and the prepare job is generally checking those production checks: is the production environment healthy, are there other change requests in action, are there other deploys still going on, so we don't deploy over each other.
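One way to express a weekend change lock as a pipeline check, simplified (the real lock also covers soft locks like Family and Friends Day and runs until Monday morning); the job and its logic are a hypothetical stand-in:

```yaml
check:change-lock:
  stage: checks
  script:
    # Refuse to deploy from Friday evening (UTC) through the weekend.
    # Simplified sketch of the behavior described in the interview.
    - |
      dow=$(date -u +%u)    # 1=Monday ... 7=Sunday
      hour=$(date -u +%H)
      if [ "$dow" -ge 6 ] || { [ "$dow" -eq 5 ] && [ "$hour" -ge 17 ]; }; then
        echo "Production change lock is active; halting the deployment."
        exit 1
      fi
```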
B
It
also
will
fail
if
the
last
deployment
didn't
pass
successfully
for
whatever
reason,
so
this
will
have
an
auto
fail
and
then
we
add
in
tracking.
So
this
is
where
we
report
on
the
the
issue,
and
then
we
can
see
here
the
pipeline
steps
excellent
failure,
so
we
do
assets
and
the
first
round
of
migrations,
and
then
we
do
gitly
prefect
the
production
fleet,
and
then
we
run
post
deployment
migrations
before
we
run
all
of
our
finished
jobs
right.
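As a rough outline, the stage ordering described here could be summarized like this; the stage names are paraphrased from the conversation, not the actual pipeline definition:

```yaml
stages:
  - checks       # change locks, active incidents, change requests, health
  - prepare      # production checks, no overlapping deploys
  - assets
  - migrations   # first round of database migrations
  - deploy       # Gitaly, Praefect, then the production fleet
  - post-deploy  # post-deployment migrations
  - finish       # reporting and cleanup jobs
```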
B: So we have quite a lot of additional tooling and steps, and things reporting out into the Slack channels, to make it as automated as possible. Ideally we're trying to get to continuous deployment: just have the whole flow go through, and have it so that if something goes wrong, or if something's unexpected, the process can halt, ideally roll back, and just alert the release managers. That's what we're working towards at the moment.
A: And right now you're mostly automated for the gitlab.com deployments; the only thing you do is promote, right?

B: That's correct! Yes, exactly, yeah.
B: And then the bit that's more complicated is if we have an incident. If the pipeline fails, say if the tests fail, or if we see some issue as we're deploying, then it gets a bit more hands-on, because it's packaged software. What we need to do then is pause, get the environment back into a good state, identify the failure, get a fix MR, and then manage that through back to the environment.
B: So it's a little bit more hands-on at that point, which is where we're trying to move next. We have rollbacks, we have a rollback pipeline; it's not fully automated yet, as there are a few uncertainties around it that we want to iron out before we just make it an auto-rollback. But yeah, handling a failure means there's a lot more coordination involved.
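Since the rollback pipeline isn't fully automated yet, its entry point is presumably a manually triggered job; a hypothetical sketch:

```yaml
# Manual rollback entry point. The helper script and variable are
# hypothetical stand-ins for GitLab's internal rollback tooling.
rollback:production:
  stage: rollback
  when: manual
  script:
    - ./bin/deploy-package "$LAST_GOOD_PACKAGE"   # redeploy the previous good package
```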
A: Right. And then, looking at the release project, I could see that if you go to the environments page, it says the environments are created via the API, and I understand that's because you're not working from within the actual gitlab repository: you're in an external repository. That more or less mimics how lots of our customers need to work. Can you talk a little bit to the limitations of having to work in a separate repository?
B: Yeah, so it's very much set up in this way for permissions; like you say, this gives a good way of controlling who has access to be able to deploy all the things. It also gives us a bit of a secondary buffer: we operate GitLab using GitLab, so if gitlab.com goes down, we want to make sure that we're still able to deploy a fix and use our tooling.
B: Having it as a separate ops instance gives us that separation, so it would be incredibly unlikely that they would both be down. They are different instances, so we're using CE as well as EE, so we get to see the different editions; they update on different cadences, and we're using the nightly build. So it gives us a little bit more protection to guarantee we'll still have our tooling.
B
I
think
we
have
really
is
developers,
don't
have
visibility
of
things.
So
we
hear
quite
a
common
one.
We
hear
is
post
deployment,
migrations,
failing
and,
unfortunately,
developers
don't
have
the
logs
that
go
with
those
jobs
because
they
sit
inside
the
ops
instance.
So
we
have
a
few
problems
like
that,
where
just
general
visibility
is,
it's
definitely
not
ideal,
because
we're
operating
from
within
a
separate
instance.
A
Right
yeah,
I
hadn't
considered
the
fact
that
indeed,
it
makes
sense
for
it
to
be
a
separate
instance
for
for
reliability
reasons,
but
I'm
not
sure
if
other
customers
would
need
that
as
well.
I
think.
B
That's
probably
reasonably
unusual:
we
have
a
few
things
that
are
a
little
bit
unusual
for
us.
We
rely
very
heavily
on
mirroring,
so
what
we'll
tend
to
do
is
mirror,
because
again
one
thing
that's
unique,
I
think
to
gitlab
is
the
fact
that
we
also
have
our
security
repositories
and
we
mirror
all
our
code
over
to
security.
So
we
can
do
security
fixes,
so
that's
very
uniquely
git
lab.
So
I
think
this
one's
fairly
unique
to
us,
I
think
permissions
and
having
a
good
way
of
actually
it's
almost
less.
B: With permissions, I suppose it's almost less about who can access things, and more the audit of who did access it and what actions they took; that's probably the thing we rely on quite heavily, which we do through the release issue.
B: That's right, yeah. So when we do an audit, we check who has access to things in ops and in these projects, and then we can map that against the actions that we can see have been taken and logged somewhere.
A: Right, and it's really interesting to see not only all of the complex pipelines, but also the ChatOps that I assume your team built on top of it. Usually, how do you work to try and dogfood these improvements into the actual GitLab project?
B: Yeah. So where there are features that already exist, we'll often try to switch things over piece by piece, like if something comes up in the API. For example, we rely very heavily on the API, and that's really just because our processes are reasonably complicated, so we tend to string together actions. But also, as we're trying to move towards continuous deployment, in an ideal world we'll actually just have an automated script that runs through all of these steps.
B: So we try to have everything as an automated step. What we'll typically do is, if we see a way of switching over... say, we just recently moved over to bridge jobs. What we used to do before was have waits in our pipeline, so we'd wait 40 minutes for a downstream pipeline to hopefully pass, and if it didn't, you know, we'd take action; whereas we just recently refactored the pipeline so we could have bridge jobs.
B
So
we
often
do
refactoring
of
our
tooling
to
try
and
take
advantage
of
get
our
features
or
in
more
rare
cases,
if
there's
something
bigger
where
we
are
finding
we're
kind
of
having
to
maintain
release
like
code
that
we
don't
like,
doesn't
really
make
sense,
we'll
try
and
get
it
into
the
product.
So
changelogs
are
there
good
example
of
that
one
we're
building
that
into
into
the
api
meant
we
could
delete
a
load
of
our
custom
coding,
but
also
it
puts
a
feature
back
into
it.
It's
a
little
bit
unusual.
B
We
do
it
that
way,
it's
more
normal
that
we'll
try
and
adopt
or
enhance
the
features
that
are
already
existing.
A: Right. And in terms of promoting the releases, you said you go directly to the pipeline page?
B: It works quite well for us, so let me explain why we tend to come in this way. Actually, we have one here that's happened: when baking time completes (the baking time is just an arbitrary time for us, an hour at the moment, that we have canary running), what happens at the end of the hour is that release-tools alerts the release manager.
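GitLab's baking timer lives in its release-tools, but GitLab CI can model a baking period natively with a delayed job, shown below purely as an analogy; the notification script is hypothetical:

```yaml
# 'when: delayed' holds the job for 'start_in' before it becomes runnable,
# which is one way to encode an hour of canary baking in the pipeline itself.
notify:baking-complete:
  stage: bake
  when: delayed
  start_in: 1 hour
  script:
    - ./bin/ping-release-manager   # hypothetical ChatOps notification
```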
B: This was quite a big enhancement for us recently, because it reduces waiting: release managers don't have to be keeping an eye on the clock, they just get a ping. And then what happens here is we get a kind of report of whether things are healthy and whether things are in action, and if they're passing, we provide a link straight into the pipeline and we can promote from there. So for us it works well, just because the information is coming to us at the time we need to take an action.
B: We get this kind of short summary of: here's all the stuff, here's what you need to do. So we find that quite an efficient way of working. It could be something in the product, though. One thing that we don't have great visibility on is what's inside that package, so we've talked a little bit before about whether we should be doing something else where, actually...
A: Do you mean what's in the package in terms of a changelog of all the commits and changes that were part of that release?
B: Yeah, like all the commits and changes. One of the reasons why we don't use that too much at the point of deployment is because we sort of treat it like a package that's ready to go. But if we see issues as we're deploying it, it's a little bit of work for us to actually go back and pull out: okay, what exactly is inside this package? Did we change something on this database table, for example, that might be contributing to this problem?
B: Unfortunately with a changelog, the changelogs come with the releases, but they don't get generated until the point where we create the release, so they're kind of a lagging piece of information. So yes, they would have it, but for the gitlab.com stuff, they don't have it at this point.
A: For the monthly releases, is the release created before the deployment is approved?
B
Or
after
it's
after
so
what
the
way
we
so
every
all
changes
go
to
getup.com
first,
so
we
try
and
deploy
as
frequently
as
we
can
in
throughout
like
every
day
but
throughout
the
month.
And
then
what
will
happen
before
the
22nd
is
release?
Managers
will
select
a
stable
package,
so
it's
usually
something
that's
been
on
gitlab.com,
for
you
know
some
hours
at
least
four
or
five
hours,
we'll
say:
okay,
it's
fully
deployed.
There
have
been
no
problems
reported
so
far,
things
are
performing
well,
we
will
package
that
point.
B: So it's a lagging action, and that's really just because it's so much harder to get fixes into a package, so we try to test things out as much as possible first.
B: A couple of things would be quite useful. One of the things that we don't have an easy way to do is overriding on environments.
B
There's
two
cases
where
we
override
environment
locks,
one
is,
there
is
fine
and
it's
a
little
bit
more
casual
like
if
we
don't
want
everything
to
auto,
deploy
to
canary
or
whatever,
for
whatever
reason,
we
have
a
chat,
ups
command
to
lock
and
unlock
the
command,
the
environment,
and
that's
totally
fine,
but
the
more
difficult
ones
is
if
we
actually
need
to
override
the
actual
production
change
locks
which
we
do
sometimes.
So
if
we
have
an
incident,
we'll
need
to
override
them.
B
That
would
be
super
helpful
to
have
a
way
of
unlocking
an
environment,
but
also
having
that
it's
the
approval
piece.
So
if
we
have
a
hard
production
change
lock
in
place,
we
actually
need
like
someone
at
like
vp
level,
to
approve
us
on
that
and
at
the
moment
we
don't
really
have
a
good
way
of
capturing
that
and
having
that
kind
of
baked
into
the
who
does
the
overwrite.
A
Right
so
these
overrides
today,
like
you,
create
them
on
you
can
create
them
on
a
chat,
ops
command,
but
once
it
goes
into
gitlab,
is
it
part
of
part
of
the
ammo
for
the
environment
like
where?
Where
is
this
register.
B
Yes,
so
we
are
we're
using
ci
variables
for
pretty
much
all
of
these
things,
so
it
the
pretty
much
all
of
our
chat.
Ups
commands
actually
are
changing
the
value
of
ci
variables
right.
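Gating jobs on CI variables works because GitLab evaluates `rules:` against the current variable values, and the project variables API (`PUT /projects/:id/variables/:key`) can flip them from a ChatOps handler. A sketch with a hypothetical `DEPLOY_LOCKED` variable and deploy script:

```yaml
deploy:canary:
  stage: canary
  rules:
    # Skip the deploy entirely while the lock variable is set.
    - if: '$DEPLOY_LOCKED == "true"'
      when: never
    - when: on_success
  script:
    - ./bin/deploy-canary   # hypothetical deploy script
```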
A: Right.

B: Exactly. And it's actually pretty much the same when we override a regular deployment. We have a pretty clunky process where we ask in Slack, you know: hey, is it going to be okay to override this? People say sure, we do the override, and then they have to go and add a comment on the issue.
B
So
that's
you
know
it's
it's
a
little
bit
of
working
around
like
having
something
like
a
job
or
a
way
of
actually
like
someone
being
able
to
just
press
a
button
when
it
being
logged
as
like
the
approval
of
who
did
that
action
would
be
incredibly
helpful.
A
That's
super
interesting
because
I
remember
when
I
think
you
a
while
back
on
that
deployment
approval
issue
that
we're
going
to
pick
up
soon
in
the
future,
and
one
thing
I
had
in
the
back
of
my
mind
is
that
that
kind
of
approval,
ui
doesn't
necessarily
have
to
be
limited
to
approving
or
rejecting
a
deploy,
because
there
are
many
different
actions
that
need
to
be
approved
in
this
context.
Right
this
one.
B: Yeah, that would be super helpful. And I think for us, I'm not sure about other users, but certainly for us, it's much more around recording who was involved and what approval was given, versus the "you absolutely can't press this button" type of thing. Just having that audit trail is more of our requirement.
A: All right, we're almost at time, so if you don't have anything else, I think we're good. Thank you so much for taking the time for this; it was super helpful.

B: Thank you, and yeah, I'll definitely be in touch for next week.

A: That's great, thank you so much. Take care, bye!