GitLab Release Group, 8 Dec 2020

From YouTube: Auto Rollback in GitLab CI/CD

Description

This video explains the new feature in GitLab 13.7 - Auto Rollback.

If you have questions, feedback or suggestions on this feature, please leave a comment in the issue https://gitlab.com/gitlab-org/gitlab/-/issues/35404 or create a new issue https://gitlab.com/gitlab-org/gitlab/-/issues/new.

A

Hi everyone, I am Shinya from the Release group. Today I want to talk about this cool feature called automatic rollback. This feature will be shipped in 13.7. The point of this feature is automating some of the CD workflow, for example:

A

Let's say CI/CD pipelines are already configured in your project, and every time you merge something (a commit, a feature change, or even a breaking change) into the master branch, that creates a new pipeline that deploys the commit to the production server.

A

In theory, all of the commits are safe because they were verified in merge requests by CI pipelines, probably with a bunch of testing jobs, for example RSpec jobs, and these jobs make sure that the code that lands on the production environment is safe. But sometimes a problematic commit can slip into the production environment. For example, the feature is logically correct and works.

A

So the tests pass, but since it's inefficient, it causes a production incident, such as performance degradation on an environment that has a large active user base. The problem surfaces only on a specific instance, so this type of problem is hard to catch at the merge request phase.

A

So this feature is about rolling back to a previous stable environment if the recent deployment had some trouble. If there's a problem, a new alert is raised, and on receiving that alert, GitLab automatically creates a new deployment targeting the previous stable commit, which automatically mitigates the production issue.
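The rollback behaviour described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not GitLab's implementation: the `Deployment` class, the `pick_rollback_target` and `handle_alert` functions, the status strings, and the commit SHA `0f3e9d1` are all hypothetical names and placeholder values.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Deployment:
    """Hypothetical record of one deployment to an environment."""
    id: int
    commit_sha: str
    status: str  # assumed values: "success" or "failed"


def pick_rollback_target(history: List[Deployment]) -> Optional[Deployment]:
    """Return the most recent successful deployment before the latest one.

    `history` is ordered oldest to newest. The newest entry is assumed to be
    the deployment that caused the alert.
    """
    if len(history) < 2:
        return None  # nothing earlier to roll back to
    for deployment in reversed(history[:-1]):
        if deployment.status == "success":
            return deployment
    return None


def handle_alert(severity: str, history: List[Deployment]) -> Optional[Deployment]:
    """On a critical alert, record a new deployment of the previous stable commit."""
    if severity != "critical":
        return None  # in this sketch only critical alerts trigger a rollback
    target = pick_rollback_target(history)
    if target is None:
        return None
    rollback = Deployment(
        id=history[-1].id + 1,
        commit_sha=target.commit_sha,
        status="success",
    )
    history.append(rollback)
    return rollback


# Placeholder history: deployment 19 carries the bad commit.
history = [
    Deployment(18, "0f3e9d1", "success"),
    Deployment(19, "a1fc020", "success"),
]
rollback = handle_alert("critical", history)
print(rollback)
```

Note that the rollback is modelled as a new deployment appended to the history rather than as a deletion of the bad one, which matches the description of GitLab creating a new deployment for the previous stable commit.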

A

So that's the point of this feature. Let's dive into the demo. Okay, here we are seeing this demo project. Let me briefly explain it: this is a Ruby on Rails application with Auto DevOps already configured.

A

You can learn more about Auto DevOps on the official documentation page, but basically you don't need to do anything to set up pipelines or CI jobs. Everything is configured automatically, and the application is deployed to a Kubernetes cluster.

A

So I already configured this. Let's check out the environment page. Here the production environment is already created. Let's check out the web page: since this is a very basic application, it just shows a simple page, but that's enough for a demonstration. We are also seeing the Kubernetes cluster status: there are two pods in the production environment. Next, let's take a look at the monitoring dashboard. Here are a couple of metrics showing how this environment performs.

A

Well, everything is normal: memory usage, CPU usage, etc. In this demo we are specifically looking at the HTTP error rate on NGINX Ingress.

A

So if something went wrong in our application code, we get a 500 error. A 500 error means internal server error: something went wrong while processing a user request. Ideally the rate of 500s should be zero percent or nearly zero percent, but sometimes you might see a spike in it if a bad, broken commit is deployed to the production environment.

A

So let's try to make a bad commit here, intentionally.
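The HTTP error rate watched in this demo is the share of responses with a 5xx status code. A minimal sketch of that calculation is shown below; the `error_rate_percent` function and the sample counts are hypothetical and are not taken from GitLab or the NGINX Ingress metrics themselves.

```python
from typing import Dict


def error_rate_percent(status_counts: Dict[int, int]) -> float:
    """Return the percentage of responses whose status code is 5xx.

    `status_counts` maps an HTTP status code to the number of responses
    observed with that code in some time window.
    """
    total = sum(status_counts.values())
    if total == 0:
        return 0.0  # no traffic, so no errors to report
    errors = sum(
        count for status, count in status_counts.items() if 500 <= status <= 599
    )
    return 100.0 * errors / total


# Made-up sample windows: a healthy one and one after a broken deployment.
healthy = {200: 95, 500: 5}
broken = {500: 40}
print(error_rate_percent(healthy))
print(error_rate_percent(broken))
```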

A

Okay, we visit here and then.

A

Let's, say, make an error.

A

Yeah, I made a change right now, and this change will be deployed to the production environment. Again, this shouldn't happen.

A

This should be caught at the testing phase, but here we are in the situation where the bad commit slipped through the testing phase and landed on production. Let's wait a bit until this gets to production. It doesn't take long, but let's resume after this pipeline has finished. Okay, so the deployment pipeline has just finished, and this problematic commit landed on the production environment. Let's take a look at the alerts.

A

Here we are seeing that a critical alert was created three minutes ago. It's due to the HTTP error rate. Let's take a look inside.

A

There is a bunch of information in this alert, but the point is that this alert is happening on production.

A

So, let's take a look at the metrics.

A

Okay, scrolling down to the NGINX Ingress.

A

We are seeing commit a1fc020. This was the bad commit we made, and right after the deployment there's a huge spike here. The error rate increased to 100 percent. In a typical situation, operators would start investigating what went wrong.

A

What caused this spike? If they figure out that this deployment is related to the incident, maybe they perform a rollback. But what's interesting here is that we see another deployment. This deployment was created by auto rollback, the newly introduced feature in 13.7. It was triggered by the alert: a critical alert was raised, and GitLab automatically tried to mitigate the problem by redeploying the previous stable deployment.

A

So, interestingly, we see that the spike is mitigated from 100 percent to zero right after this auto rollback.

A

So this feature frees operators from the duty of constantly watching the metrics and the alerts by mitigating the problem automatically. Let's take a look at the alert page again. Here the alert is gone, because the problem was resolved by the auto rollback.

A

Lastly, let's take a look at the environment page. Here's the deployment index page, where we see the history of deployments. The latest one, number 20, is the deployment created by auto rollback, so this is the safe one. The previous one, a1fc, was the deployment we made with the problematic commit that triggered the high spike.

A

You can also see in the deployment history on this page when the auto rollback happened. All right, that's everything about the auto rollback feature. It will be introduced in 13.7 and is available in GitLab Ultimate.

A

Please take a look at the documentation here. It has a more detailed explanation of this feature, and there are also a couple of limitations.

A

Please make sure that it meets your criteria before you actually enable the feature. To enable this feature, you need to visit the project configuration page. Here are the steps to enable it.

A

This feature is disabled by default, but it's worth considering. If you have any questions, feedback, or suggestions to improve this feature, please leave a comment in this issue or create a new issue in this GitLab project. Your valuable feedback is always welcome.

A

So thank you for watching this video, and see you in the next video. Bye.