From YouTube: GitLab.com Kubernetes Deployment Pipelines
Description
Rehab Hassanein and John Skarbek discuss the pipelines we use to deploy configuration changes and application deployments to GitLab.com
Timeline:
* 00:24 - Configuration Pipelines
* 09:39 - Why we use a different instance to deploy changes to GitLab.com
* 11:15 - Auto-Deploy Pipelines
References:
* GitLab.com Configuration Repository: https://gitlab.com/gitlab-com/gl-infra/k8s-workloads/gitlab-com
* GitLab Helm Chart: https://gitlab.com/gitlab-org/charts/gitlab/
* Reference Issue for this video and more to come: https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/13980 - Drop a comment for some ideas of what to cover!
A: My name is John and I'm part of the infrastructure team at GitLab, and today we're going to talk about…
B: All of those are cases where we have a specific style of pipeline that runs. It starts with our non-prod environments, and it moves on to our production environments over the course of the length of the pipeline, or however the person committing the changes chooses to roll the change out. Some changes are safe to roll across the entire pipeline; some are not. That all depends on what that specific change is, and all of that is to be evaluated by the engineer at that moment in time.
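[Editor's note: for readers following along outside the video, here is a minimal, hypothetical sketch of the pipeline shape being described; it is not the actual CI config of the gitlab-com repository, and the job names, stages, and deploy script are assumptions.]

```yaml
# Hypothetical .gitlab-ci.yml sketch of a staged rollout:
# non-prod environments first, production behind manual gates.
stages:
  - non-prod
  - production

deploy:pre:
  stage: non-prod
  script: ./bin/deploy pre          # hypothetical deploy entrypoint

deploy:gstg:
  stage: non-prod
  script: ./bin/deploy gstg         # then staging

deploy:gprd:
  stage: production
  when: manual                      # the engineer chooses when, or whether,
  script: ./bin/deploy gprd         # to roll the change to production
```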
B: This may seem like a deployment-style configuration change, but the container registry does not have the tooling necessary to auto-deploy itself, so this falls into the category of a configuration change. So if we look at the diff, we'll see that we have a file that's specific to staging, where we are bumping the version, and we have a file specific to the pre-prod environment, and we are bumping the version here.
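[Editor's note: as a rough illustration of what such a diff touches, consider two hypothetical per-environment values files; the file names and keys are assumptions, not the real layout of the gitlab-com repository.]

```yaml
# gstg.yaml -- staging values file
registry:
  image:
    tag: "v3.1.0"    # bumped from "v3.0.0"
---
# pre.yaml -- pre-prod values file: the same bump, applied per environment
registry:
  image:
    tag: "v3.1.0"
```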
B: In order to make that process a little easier for the person who is doing the reviews, we have links that come from Ops and provide us with the necessary details to link us to where those pipelines are. So if we click on this excellent link inside this comment, we get dropped into the pipeline that's run on the Ops instance, which gives us the details of what that merge request is attempting to do. So in this particular case, the reviewer is going to see…
B
B
Unfortunately,
we
were
running
this
across
all
environments,
future
improvement,
but
what
the
reviewer
would
do
is
they
would
click
on
the
necessary
pipelines.
I'll
look
at
three
very
quickly
and
we'll
see
that
we
have
a
change
which
I'm
not
going
to
talk
about
because
it's
out
of
the
scope
of
this
particular
discussion.
B: But what we have is a change to our configuration map, which contains our version of the application. So this is probably some configuration inside of the container registry that requires this. And then, because of that config map change, we also see that change happening in our deployment as well, where the SHA sum of that config map gets updated. But the most important thing that we see is that, for the registry container, the version is updated.
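[Editor's note: the behavior described here, where a ConfigMap edit also rolls the Deployment, is commonly produced by the standard Helm checksum-annotation pattern; a minimal sketch follows, with an illustrative template path that is not necessarily what the GitLab chart uses.]

```yaml
# Deployment template snippet: the pod template carries a checksum of the
# rendered ConfigMap, so any ConfigMap change updates the annotation and
# triggers a rolling update of the Deployment.
kind: Deployment
spec:
  template:
    metadata:
      annotations:
        checksum/config: '{{ include (print $.Template.BasePath "/configmap.yaml") . | sha256sum }}'
```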
B: So these dry-run jobs provide us a diff mechanism with which we can see what changes are being proposed by the person that is asking for the change. In environments where the container registry does not run, we do not see those diff jobs. For our staging environment, which does run the container registry, you can see that we do have the same precise version bump happening in all the necessary locations.
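[Editor's note: a dry-run diff job of this kind can be sketched as follows, assuming the helm-diff plugin as the diffing mechanism; the job name, release name, and values path are assumptions.]

```yaml
# Render the proposed release and diff it against what is live,
# without applying anything.
dryrun:gstg:
  stage: dryrun
  script:
    - helm diff upgrade gitlab ./gitlab-chart -f gstg.yaml
```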
B: We still have both our dry runs run for production, and we still have, later down in the pipeline, stuff that goes to our production environments as well; I'll discuss that in a second. But here, these dry runs are essentially a repeat of what the merge request was, but with the additional stage where we actually perform the deploy of those changes.
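[Editor's note: that extra post-merge stage could look like a plain apply job alongside the same dry-run jobs; again a hypothetical sketch, not the real config.]

```yaml
# Same diff as in the merge request, followed by the stage that applies it.
apply:gstg:
  stage: deploy
  resource_group: gstg    # see below: serializes concurrent deploys
  script:
    - helm upgrade --install gitlab ./gitlab-chart -f gstg.yaml
```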
B: There's like a five-minute wait period between each time it attempts to mirror. So if we're inside of that five-minute window, we'll see awkward things inside of the diff that may not necessarily match up. Now, hopefully the person who's doing the review is aware that there might be multiple merge requests happening, but currently there's no protection that states, "hey, let's wait for this to occur." What we do have in place in CI is that we're leveraging resource groups, I believe is what it's called, so that only one of these jobs can run at a time.
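[Editor's note: `resource_group` is a real GitLab CI keyword: jobs that share a resource group run one at a time, even across pipelines, which is what serializes deploys to a given environment. A minimal sketch with hypothetical names:]

```yaml
deploy:gstg:
  stage: deploy
  resource_group: gstg       # only one job in this group runs at a time,
  script: ./bin/deploy gstg  # so concurrent pipelines queue behind each other
```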
B: The other case in which we would see a differing diff show up is if a merge request is merged during an auto-deploy that is active. Now, that has to be timed at exactly the right moment in time and for the right environment, and hopefully, again, we mitigate that with the fact that we're using resource groups as well.
B: So, continuing forward: after we actually perform the deploy, we run a QA job, and this just reaches out to our Quality project to initiate smoke tests specifically, and as long as that passes, we'll continue forward with the rest of our pipeline, which includes production. This particular merge request only impacted staging, so these jobs were essentially no-ops, and if I go into one of these, we'll see that no upgrade actually occurred. We'll see that we did a comparison... well, this is canary, so this is called gitlab-canary.
B
We
did
a
comparison,
but
there
was
no
changes,
so
nothing
actually
happened
in
this
job
at
all
and
same
thing
for
the
rest
of
these
production
jobs.
One
thing
that
you'll
note
that
pre-prod,
I
actually
had
a
failed
qa
for
this
particular
merge
request.
Pre-Pride,
that's
a
very
specific
environment.
We
don't
care
it's
okay,
that
this
qa
pipeline
failed
for
pre-prod
staging
is
different.
If
this
fails,
these
this
pipeline
will
not
continue.
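[Editor's note: the QA step described here can be sketched as GitLab CI trigger jobs against the Quality project, with `allow_failure` capturing the difference between pre-prod (tolerated) and staging (blocking); the project path and job names are hypothetical.]

```yaml
qa:smoke:pre:
  stage: qa
  allow_failure: true      # a pre-prod QA failure does not block the rollout
  trigger:
    project: gitlab-org/quality-smoke-tests   # hypothetical Quality project path
    strategy: depend                           # wait for, and mirror, the downstream result

qa:smoke:gstg:
  stage: qa
  allow_failure: false     # a staging QA failure stops the rest of the pipeline
  trigger:
    project: gitlab-org/quality-smoke-tests   # hypothetical
    strategy: depend
```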
B: This is a signal to the person performing the rollout that we need to look into why the QA failed, to determine whether it was this merge request or whether something else needs to be looked into.
B: So GitLab.com is where we do everyday work. For our transparency value, we want to be as public as possible, or public-first, I guess. Therefore, all the work that we do, all the reviews, all our commits and merge requests, all of that is in the public forum. But GitLab.com cannot depend on itself, so we don't want to provide the CI runners that operate inside of .com access to the .com infrastructure; that would be unsafe if we had a security breach of some kind.
B: Ops is what has the necessary runners, which have the necessary access to talk to our clusters. That way, if .com is down, the Ops instance can then be used as a backup mechanism to submit the necessary merge request to make the necessary change, and those runners are the ones that have the necessary access to speak to our clusters and perform those changes. In cases where we may be down, we may be able to leverage a change inside this repository to bring GitLab back up, for example.
B: So auto-deploy is a whole topic in and of itself, so I don't want to delve into a lot of details here, but I just want to show how things get kicked off, prior to showing you what the deploy pipeline looks like. So what we are looking at right now is the release coordinated pipeline, which is initiated by release-tools. Again, keep in mind that we're on the Ops instance. So, just a very quick overview.
B: We then have a management task for release-tools itself, and then we start the auto-deploy process for staging. That auto-deploy process happens in a tooling we call deployer. Deployer is an Ansible tool; this was implemented because, prior to having Kubernetes in place, we deployed everything inside of virtual machines.
B: Ansible is a fantastic tool for orchestrating specific changes such as deployments, so this is what we utilized. So we have a couple of stages where we do some pre-steps that are required for deployments to occur, and then we start deploying to each of our fleets; in our particular case, Gitaly, followed by Praefect, followed by our entire fleet. This looks a little different today (this is an older pipeline), but Kubernetes and any of our virtual machine fleets will happen at the same time.
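[Editor's note: since deployer is Ansible-based, the orchestration described might look roughly like this playbook sketch; the host groups, role name, and serial setting are illustrative assumptions, not the real deployer code.]

```yaml
# Deploy Gitaly first, a slice of the fleet at a time,
# then Praefect, then everything else.
- name: Deploy Gitaly fleet
  hosts: gitaly
  serial: "10%"              # roll only a fraction of hosts at once
  roles: [gitlab-deploy]

- name: Deploy Praefect fleet
  hosts: praefect
  roles: [gitlab-deploy]

- name: Deploy the remaining fleet
  hosts: web:api:sidekiq
  roles: [gitlab-deploy]
```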
B: What we check here is to make sure that the changes inside of these diff jobs only contain changes to the Deployment objects for the resources being changed. We don't want to see any configuration changes occur at the same time that we do an auto-deploy, so objects such as Secrets or ConfigMaps. If those get changed, that's a red flag, and we will fail this job, which will send the corresponding flag to the release manager saying "please investigate as necessary."
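[Editor's note: a guard job like the one described could be sketched as follows, assuming the helm-diff plugin (whose output prints a "has changed" header per changed object) as the diff source; the job name, release name, and paths are assumptions.]

```yaml
check:deploy-only-diff:
  stage: check
  script:
    - helm diff upgrade gitlab ./gitlab-chart -f gprd.yaml > diff.out
    # Fail if anything other than a Deployment object changed, e.g. a
    # ConfigMap or Secret sneaking into an auto-deploy.
    - |
      if grep 'has changed' diff.out | grep -v 'Deployment'; then
        echo "Non-Deployment objects changed during auto-deploy; flagging the release manager."
        exit 1
      fi
```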
B: If we look at one of these jobs, it doesn't look terribly exciting, because other tooling tells us what happens, but we document what we want to change. So, for example, in this particular job we're looking for the Deployment object for sidekiq-urgent-other. That particular shard should change, but nothing else related to sidekiq-urgent-other should ever change. So we should never see a ConfigMap change, or a Secret related to this particular deploy change.
B: If that were to be seen, that means that we probably have a merge request that happened to get merged at the wrong time, or maybe a case where configuration that we pull from other infrastructure changed outside of the Kubernetes pipelines, and we need to be aware of that, such that we're not making two changes at the same time.
B: So if we pass our diff jobs, we then proceed with the auto-deploy, and if we go into here, the only thing that you're going to see is the actual image changes as necessary for those deployments (we don't compress these at all). But you can see that here we changed the image from a package that was built at six in the morning on the 18th to a package that was built at 11:00 in the morning on the 18th.
B: We need to make sure those happen at a specific step, so deployer is what manages that. So you'll see post-deploy migrations happen after our fleet deploy, we've got some tracking stuff that happens (whoops), followed by our QA jobs. So deployer manages everything related to upping the version of the GitLab that we are running.
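[Editor's note: the ordering being described could be summarized as a stage list like this; a hypothetical summary drawn from the transcript, not the real deployer pipeline definition.]

```yaml
stages:
  - prepare          # pre-steps required before any deploy
  - deploy:fleet     # Gitaly, Praefect, VMs, and Kubernetes
  - post-deploy      # post-deploy database migrations
  - tracking         # release-tracking housekeeping
  - qa               # smoke tests against the freshly deployed environment
```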
B: You can see it's not simple, but it's also not terribly difficult; you just need to know where to find things. If you have any further questions, feel free to hit up anyone on the Delivery team, or reach out to the k8s-workloads repository for more details and information.