From YouTube: 2022-07-20 GitLab.com k8s migration (EMEA/AMER)
A: Welcome to July. Goodness, July 20th already; the year is going by very quickly. I've only got one little item to discuss, or actually to demo, on our agenda today. I've been working on trying to figure out ways to help prevent auto deploys from being blocked in the case that a cluster is down for maintenance, whether that be planned or unplanned.
A: If it is planned maintenance, we could certainly perform what I'm about to demo ahead of time. If it's unplanned, we're obviously probably suffering some major incident, and we could perform this same style of remediation at that moment in time, but I would imagine the deploys would probably be blocked regardless because of other situations. Still, in the interest of potentially avoiding all the deploys being blocked, I would like to demo or showcase how we could potentially avoid such catastrophic events.
A: Normally this all happens via CI; I'm going to replicate locally, effectively, what CI does. So if I go into our favorite repo... okay, I don't care. This is the repository that we use to perform auto deploys. When a CI job wants to perform a deploy, it simply runs a k-ctl upgrade, so our diff jobs, for example. This is going to be targeting our pre cluster.
A: Our deployments contain two stages, where the first one is a dry run, but both use the upgrade command. So if we do a dry run real quick, we'll hopefully see that there are no changes in preprod; there's nothing exciting in the diff that I have for this cluster, so we shouldn't see any changes.
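(For context, a minimal sketch of that two-stage flow from a shell; the exact k-ctl invocation below is an assumption for illustration, not copied from the repo:)

    # Illustrative only: the auto-deploy jobs wrap the upgrade in a k-ctl script;
    # the environment variable and subcommand names here are assumed.
    export ENVIRONMENT=pre        # assumed way of targeting the pre (preprod) cluster
    ./bin/k-ctl dry_run           # stage 1: render the diff, change nothing
    ./bin/k-ctl upgrade           # stage 2: perform the actual upgrade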
A: There are more things that happen inside of CI, such as being able to pull the key that's used and then authenticate, but all of that happens as a step prior to us actually running k-ctl inside of CI. We have some hidden text on my screen because of my color choices in my shell. So now, assuming this is an auto deploy, the next step would be to run this precise command, which would actually perform the upgrade.
A: Given that we don't have any changes here, I'm not going to run this command, because it's quite useless. Now, if we are undergoing maintenance (I plan on creating a runbook for this; that's a work in progress at the moment), effectively what we'll do is just set cluster... whoops, CLUSTER_SKIP equals the name of the cluster; in this case pre is the name of the environment.
A: This does the exact same thing, except it adds a little notification saying: hey, we're skipping this cluster because you told me to, and we're going to exit cleanly. So if I echo the exit code, we'll see that it is zero. The reason we exit cleanly is because we target specific clusters for upgrades during auto deploys: we deploy to our regional cluster and cluster B at the same time, and then we deploy to cluster C and cluster D.
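(Roughly what that skip looks like on the command line; the CLUSTER_SKIP variable comes from the demo, while the rest of the invocation is illustrative:)

    # Skip the named cluster: the wrapper prints a notice and exits 0 instead of
    # deploying, so the rest of the auto-deploy pipeline is not blocked.
    CLUSTER_SKIP=pre ./bin/k-ctl upgrade

    # Confirm the clean exit code.
    echo $?    # expected: 0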
A: So I guess there are two follow-up items that I'm currently working on. One is the fact that this exits cleanly, so there's the chance that if someone leaves this variable hanging around, with a cluster still set in that variable configuration, we might accidentally skip deploying to a cluster without actually realizing it. To solve that problem, I'm going to try to figure out if there's a way that we could parse which GitLab version is running via metrics and create an alert for it.
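(One possible shape for that check, sketched against the Prometheus HTTP API; the metric name and the Prometheus URL here are placeholders, since, as noted below, it isn't clear yet where the cluster-side version information would come from:)

    # Hypothetical: list the GitLab versions reported per environment so a mismatch
    # (a skipped cluster left behind) could be alerted on. Names are placeholders.
    PROM="https://prometheus.example.internal"
    QUERY='count by (environment, version) (gitlab_version_info)'
    curl -sG "${PROM}/api/v1/query" --data-urlencode "query=${QUERY}" | jq '.data.result'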
A: I know we have this information on our Omnibus installation, so all of our GitLab nodes are exposing which version of GitLab is installed, but I don't think our clusters are, or at least I can't quickly find it, so I'm going to try to figure out where that could come from and see what we could do to add an alert. And then there's the CI job timeout: I think it's 60 minutes, or it might be two hours.
A: I can't remember off the top of my head. I'll create an alert that takes that timeout into account, such that if we do have a legitimate fire of some kind, we're not alerted unnecessarily; rather, it'll be a very fine-tuned alert instead. And lastly would be improvements to our runbooks, such that we have the actual procedure that I'm proposing is necessary.
A: I'll just share my screen really quickly. I'm effectively going to be documenting this, but in a nicer format, inside of a runbook, where the first step is to identify a cluster; on my shell here I locally tested preprod. We would input the CI variable (I set it on the command line, but obviously we would use our CI variables for the ops project), and then we would need to notify the release managers.
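(If we go the CI-variable route, a hedged sketch using the standard GitLab project-level variables API; the ops project ID and token are placeholders:)

    # Set CLUSTER_SKIP as a CI/CD variable on the ops project (placeholders below).
    curl --request POST --header "PRIVATE-TOKEN: <token>" \
      --form "key=CLUSTER_SKIP" --form "value=pre" \
      "https://ops.gitlab.net/api/v4/projects/<project-id>/variables"

    # Remove it again once the maintenance window is over.
    curl --request DELETE --header "PRIVATE-TOKEN: <token>" \
      "https://ops.gitlab.net/api/v4/projects/<project-id>/variables/CLUSTER_SKIP"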
A: It should be, you know, less than two million requests per second, for example; I think that'd be a good way to check for that, and we could probably script that out. Then we perform whatever maintenance we want. So, coming soon, I guess I'll be taking one of the staging clusters down, just to test that all of this is working properly, that we have a solid runbook, and that we've got the necessary pieces in place.
A: Part of that is the alerts: making sure we silence the appropriate alerting, so that when we're performing normal maintenance we're not unnecessarily paging the EOCs. Then afterwards, if necessary, we bring the cluster back to parity. So if, by chance, we had a deployment occur while a cluster was down, we want to make sure that cluster is running the correct version of everything prior to it taking traffic again; that way, we're not accidentally downgrading our services and inducing an outage because of that type of situation.
A: I need to figure out how to do that, and I've got ideas. Currently my thought is that we would simply replay a deploy job that was previously skipped: we would remove that variable, replay the job, and that brings the cluster back into parity. After confirmation that all is good, we could then run the set service data command, bring traffic back online, and then we're effectively complete.
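(The replay step could look something like this; just a sketch using the standard GitLab jobs API, with the project and job IDs as placeholders:)

    # Retry the deploy job that was skipped (or ran without this cluster) so the
    # cluster converges on the same version before taking traffic again.
    curl --request POST --header "PRIVATE-TOKEN: <token>" \
      "https://ops.gitlab.net/api/v4/projects/<project-id>/jobs/<job-id>/retry"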
B: I was thinking we could have a chat about the OKRs, but I think we're still missing a few people and so on. I think it's something that we should keep doing in the issue, and if there are more questions we can pick it up in a different session.
A: How about this: if we have more discussion items on the issue, maybe we can bring something to next week's agenda, if you'd like.