GitLab Delivery Team, 15 Sep 2020

From YouTube: Delivery: Hands-off production deployments

Description

Recorded on 2020-09-15

Slides: https://docs.google.com/presentation/d/1dfV5LDTAeLxIwpy5P4rIi3tCNTo1U2gFpiyMtkPHFyY/edit?usp=sharing
Main epic: https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/280

A

Hello, I am Alessio Caiazza from the Delivery team, and I'm going to talk about hands-off production deployments at GitLab.

A

So, first question: why don't we deploy automatically to production? Our deployment process is gated by a human decision, so release managers have to make several choices in order to promote a build to production.

A

First of all, we have manual checks to perform: we are looking for errors and new errors.

A

Also, we don't have a good strategy for rollbacks, which is very important in the human decision of a release manager, because they need to be around for an extended period of time to make sure that, if something goes wrong, they are available to help resolve the situation and provide all the contextual information about the deployment.

A

And finally, we rely on human intervention for aborting an ongoing deployment and restoring the system to a healthy state.

A

Why do we have manual checks? Until three months ago, only the engineer on call could authorize a deployment, so we already had a first human decision outside of the release manager.

A

Now we have a new process, and we are keeping track of active incidents and active change requests with GitLab issues. So we could evolve release tools to check that information and make the first decision on whether it is a good moment for starting a deployment. But we still have a manual check, which is error checking.

A

And we have a problem here with detecting new errors, because our Sentry is full of fake new errors with each release. This is not a reliable way of checking for errors, either as a human check or as a basis for automating it.
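
The first automated decision described above, checking for active incidents and change requests before starting a deployment, could look roughly like the sketch below. It is only an illustration: the `Issue` type, the label names in `BLOCKING_LABELS`, and both functions are hypothetical and are not taken from release tools or from the talk.

```python
from dataclasses import dataclass, field

# Hypothetical label names; the labels actually used by the team are not
# given in the talk.
BLOCKING_LABELS = {"incident::active", "change::in-progress"}


@dataclass
class Issue:
    title: str
    state: str  # "opened" or "closed"
    labels: set = field(default_factory=set)


def blocking_issues(issues):
    """Return the open issues that carry at least one blocking label."""
    return [i for i in issues if i.state == "opened" and i.labels & BLOCKING_LABELS]


def can_start_deployment(issues):
    """Allow a deployment to start only when no blocking issue is open."""
    return not blocking_issues(issues)
```

A caller would fetch the tracker issues however it normally does and pass them in, which keeps the decision itself easy to test without network access.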

A

So, looking at our production dashboards, we have other metrics that we can use for checking for errors. We are using multiple burn rate alerting windows. With this, we have a monthly error budget, and we are counting the number of errors over a period of one hour and of six hours. Basically, if in one hour we are burning more than two percent of our monthly budget, we will end up violating the budget by the end of the month.

A

The same goes for the six-hour window: if we are over five percent of the monthly budget, we will violate the budget by the end of the month as well. So we came up with the idea of having a new threshold, which is more sensitive than the one we are already using for production. Right now we have three thresholds. The first one is for the customer SLA.

A

Then we have another one, which is more sensitive, for paging the engineer on call. And then the third one, which was introduced recently, is only for automated deployments and has a higher level of sensitivity.
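
The arithmetic behind the one-hour and six-hour windows can be sketched as follows. This follows the general multi-window burn rate idea, not GitLab's actual alerting rules, which are not shown in the talk; the 30-day budget period, the function names, and the example SLO value are assumptions made for illustration.

```python
HOURS_PER_MONTH = 30 * 24  # assumption: the monthly budget covers 30 days


def burn_rate(budget_fraction, window_hours):
    """How many times faster than a sustainable pace the budget is spent
    when `budget_fraction` of it is consumed within `window_hours`."""
    return budget_fraction * HOURS_PER_MONTH / window_hours


def is_burning_too_fast(errors, requests, slo_error_ratio, budget_fraction, window_hours):
    """True when the error ratio observed in the window exceeds the ratio
    that would consume `budget_fraction` of the monthly budget."""
    if requests == 0:
        return False
    threshold = burn_rate(budget_fraction, window_hours) * slo_error_ratio
    return errors / requests > threshold
```

Under these assumptions, 2% of the budget in one hour corresponds to a burn rate of about 14.4, and 5% in six hours to a burn rate of about 6. A more sensitive threshold for automated deployments would simply use a smaller `budget_fraction` for the same windows.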

A

Then, why don't we have a good strategy for rollbacks? Database migrations are a blocker for rollbacks.

A

If we take a look at this schema here, there's a timeline of a deployment. Basically, we have a machine, the deploy box, that runs migrations. So we can think about this timeline: we start with the database running schema A, then the deploy box runs the migrations and the schema becomes schema B. As soon as we have the new schema, all the machines can start rolling out the new code, so we are switching from version N to version N+1.

A

Then we reach a point here where we start running post-deployment migrations.

A

We start running post-deployment migrations as soon as every machine is running the new code, version N+1. This means that when we are on schema C, it is no longer backward compatible with the old version of the code. So this is a point of no return: as soon as we have schema C, we can't go back to the old version of the code.

A

There is also another thing worth mentioning: if a migration requires no downtime, it doesn't mean that reversing it requires no downtime as well.

A

So, even removing post-deployment migrations from the equation, we can still think about rolling back code, but it is not feasible to also consider rolling back migrations at this point. So we may say that, in case of an outage, we can run version N, the previous one, with schema B, which is the new one.
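
The timeline above can be summarized as a small compatibility table. The sketch below only encodes what the talk states (schema A before migrations, schema B after regular migrations, schema C after post-deployment migrations); the names `COMPATIBLE` and `can_roll_back_code` are illustrative.

```python
# Code versions that each schema can serve, following the deployment timeline:
#   A = before any migration
#   B = after the regular migrations run by the deploy box
#   C = after the post-deployment migrations
COMPATIBLE = {
    "A": {"N"},
    "B": {"N", "N+1"},
    "C": {"N+1"},
}


def can_roll_back_code(schema):
    """Rolling back code means running version N on the current schema,
    without reversing any migration."""
    return "N" in COMPATIBLE[schema]
```

With this model, a code rollback is possible on schema B but not on schema C, which is the point of no return.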

A

And finally, why do we rely on human intervention? Prior to daily auto-deploy, we used to deploy a big change set, and with a big change set we had a high chance of having post-deployment migrations.

A

This means that every deployment had a point of no return because of those migrations. So we were not able, in any case, to consider a rollback, but only patching and rolling forward.

A

So, because rolling back was not an option, we needed humans to fix the system. And having humans already involved in the process made it easier to think about writing runbooks instead of writing automations, which were kind of a premature optimization.

A

So, state of the art in the industry. On the right side there is a list of reading and watching materials on what other companies are doing. Their mileage may vary.

A

Some of them are using microservices, others are using a completely different technological stack, but they are worth reading, and I'm trying to outline the common techniques that appear in all those reading and watching materials.

A

The first thing is keeping the change set as small as possible, and we are already doing this by increasing the frequency of new auto-deploy branches.

A

Then, it's clear to everyone that we need the ability to safely roll back, and we need to be able to prevent a deployment if the system is not healthy, as well as to promptly detect anomalies and commence an automated rollback. Then, phased rollouts with baking time are useful for spotting anomalies before completing the production deployment. This means that we are restricting the blast radius of a change that introduces new errors to the first machines in the fleet, instead of going all in, deploying to the whole infrastructure, and then having to roll back every single machine.

A

Another interesting tool is load testing in production. Basically, it's the ability to steadily increase the traffic to one specific node. We can think about canary, for instance, in our case, so that we can compare metrics between the new version and the rest of the fleet and make decisions based on that. So, hands-off deployment: why now?
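
The phased rollout with baking time mentioned above can be sketched as a loop that stops at the first unhealthy stage. This is a generic illustration, not the team's tooling: `phased_rollout` and its `deploy`, `is_healthy`, `rollback`, and `bake` callbacks are hypothetical names supplied by the caller.

```python
def phased_rollout(stages, deploy, is_healthy, rollback, bake=lambda stage: None):
    """Deploy stage by stage; on the first unhealthy stage, roll back every
    stage touched so far (most recent first) and stop.

    Returns a (success, stages_touched) tuple.
    """
    done = []
    for stage in stages:
        deploy(stage)
        done.append(stage)
        bake(stage)  # wait for the baking time before judging health
        if not is_healthy(stage):
            for touched in reversed(done):
                rollback(touched)
            return False, done
    return True, done
```

Because later stages are never deployed after a failure, the number of machines that need a rollback is limited to the stages already touched.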

A

In the last few months we reached our average MTTP goal of 24 hours. We could lower the goal and increase the auto-deploy branch creation frequency, but it's unrealistic to have more than two or three hands-on deployments each day. The span of attention required of an engineer, a release manager, during a deployment is really high, and it would just not be safe to do more than what we are doing now.

A

Also, increasing the frequency will increase the likelihood of skipping releases, basically because we may end up deploying some releases only on canary, and then, when we want to promote something, we aren't sure, because we already have something new ready. It also affects other regular performance indicators.

A

We can imagine that our deployment metrics will give us an edge on mean time to detection, because we start gathering more data on the health of the system, and having a rollback strategy will definitely work well for mean time to resolution. All of this also affects key performance indicators like GitLab.com availability and the mean time to production: with this new safety and automation around our deployments, we will be able to safely increase the frequency of deployments.

A

So this is the plan. We have a working epic, which is linked here, and it maps the work needed for the fiscal year 21 Q3 infra OKR, which is to automate manual deployment approval. Basically, this epic is broken down into four main epics.

A

Two of them are the assisted phase, and two of them are the automated phase. The assisted phase is broken down into assisted deployment and assisted rollback. The first one is about creating a consistent way to decide whether it's time to promote a build or not, and assisted rollback is about providing a manual rollback option in case of an incident. Once we have refined those assisted techniques, we will move on to the automation part, with automated deployment.

A

We will create a predetermined release window, and within that window, if the metrics are okay and the baking time has passed, the system will automatically promote a build to production.

A

Automated rollback is the end goal: the system should be able to understand if a new build is producing an outage, or degrading performance, or things like that, and commence a rollback automatically.
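
The automated promotion rule described above (release window, baking time, healthy metrics) can be sketched as a single predicate. The function name, its parameters, and the idea of measuring baking time from `deployed_at` are assumptions for illustration; the talk does not specify how the window or the baking time are defined.

```python
from datetime import datetime, time, timedelta


def should_promote(now, deployed_at, window_start, window_end, baking_time, metrics_ok):
    """Promote a build only inside the release window, after the baking
    time has passed, and while the metrics are healthy."""
    in_window = window_start <= now.time() < window_end
    baked = now - deployed_at >= baking_time
    return in_window and baked and metrics_ok
```

For example, with a window from 08:00 to 16:00 and a 30-minute baking time, a build deployed at 09:00 would be eligible at 10:00 as long as the metrics are healthy.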

A

So, thank you for listening. Each one of these four phases has its own epic, so feel free to check the status, leave comments, and let me know what you think about this. Thank you.