From YouTube: GMT20210113 Kubernetes fire drill
A: So I actually haven't been involved in any of these yet. What I was thinking we would do is walk through a fictitious scenario, to see what ideas people have. The first question I'd like to pose is: what metrics would you look at, or what would you do, in the first five minutes, or even the first minute, of the incident?
A: I think this varies depending on where you are in the incident, so I'd like to start off with just: alerts have fired. What is the first thing you do?
B: Sure, yeah. So we open the SLO dashboard and see that, you know, errors have started to increase and alerts are firing.
A: Okay, so the next question I'm going to pose here. Obviously what we would do after this is maybe look at logs as well to try to narrow in, but one thing I'd like to do immediately is rule out a cluster-specific issue. So how would you do that? You want to rule out whether this is infrastructure or application.
D: So this happened to me yesterday, although for a different service. You can go down to the Kubernetes overview tab on the generic dashboard and inspect the different clusters on various series. When I've had this happen in real life it's often been a bit indirect: one cluster has substantially higher or lower, say, memory or CPU, which can be used for drilling down further, usually in the GKE console.
D: So yeah, we've got a metrics catalog function to generate a couple of extra bits for dashboards marked as having stuff in Kubernetes, which will soon be almost everything. You've got some Kubernetes detail dashboards here, which I'm not going to go to yet, and you've got a Kubernetes overview fold-out.
D
These
are
typically
partitioned
by
a
cluster.
So
the
case
of
gear
we've
only
got
zonal
clusters,
but
you
might
also
see
the
regional
cluster
here
for
some
services
and
as
as
expected
for
a
healthy
service,
we
can
see
quite
tight
banding
for
resource
usage,
one
symptom
at
a
higher
level
that
might
lead
to
looking
for
cluster
level.
Problems
is
sudden
drops
of
like
a
third
right,
because
we
typically
have
like
the
three
zonal
clusters.
D
So
if
you
see,
for
example,
error
ratio,
abdex
go
to
two-thirds
or
one-third,
that's
often
a
bit
of
a
clue.
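A minimal sketch of the per-cluster split D describes, assuming Prometheus-style metrics with a `cluster` label; the metric name `http_requests_total` and its `status` label are illustrative stand-ins, not the actual metrics-catalog series:

```promql
# Error ratio split by cluster: with three zonal clusters, a single
# unhealthy cluster shows up as one diverging series here, and as a
# roughly one-third step in the aggregate error ratio or apdex.
sum by (cluster) (rate(http_requests_total{status=~"5.."}[5m]))
/
sum by (cluster) (rate(http_requests_total[5m]))
```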
D: If you already sort of suspected the cluster thing, you could inspect the workloads in the GKE viewer, but I would prefer to start with the logs. So let's do this for git HTTPS, like Puma. Am I still sharing? I think so?
D
Yes,
you
are
sorry
that
wasn't
like
yeah
I
got
paid
for
like
7
am
so
I'm
still
I'm
still
recovering.
What
I
meant
to
say
in
in
actual
english
was,
I
would
go
to
the.
What
I
usually
do
is
go
to
the
logs
and
start
splitting
error
time
series
charts
by
interesting
metrics.
So
right
here
I
could,
I
could
add,
a
split
series
for.
A
Actually,
on
the
on
the
agenda,
I
put
a
quick
link
if
you
just
want
to.
D: Yeah, Kubernetes region, nice. Yeah, so there's nothing super interesting here, but if one of the clusters was borked, you would perhaps see a substantially higher number of errors in one of them.
B: Right, and that's pretty much exactly what you're about to put together in your query as well.
D: Craig, mine was not going to be as pretty; mine was just going to be four lines. I don't even know how to draw charts like this. What is...
B: All right, so my guess is that this is going to be somewhere not in gitlab-helmfiles, because that's our infra stuff, but in the Kubernetes workloads repo. So let's start with that.
B
Values
and
gprod
in
a
tab,
okay,
so
this
is
including
a
bunch
of
stuff,
including
the
gprod
one.
So
let's
maybe
go
take
a
look
at.
B
This
and
the
g
prod
one
okay
g
prod.
This
looks
like
the
place
we
want
to
make
that
change.
So,
let's
see
if
we
have
a
a
sub
section
for
the
the
git
service.
A
So
what
we
have
here
are
this
is
like
the
base
value
for
the
minimum
max
number
of
replicas
and
then
underneath
web
service.
You
can
define
multiple
deployments.
This
is
the
traffic
splitting
that
was
recently
introduced
into
the
helm
chart
which
allows
you
to
have
different
deployments
for
different
https
traffic.
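A minimal sketch of the values shape being described, assuming the webservice chart's per-deployment map; the keys, paths, and numbers here are illustrative, not the real gprod values:

```yaml
gitlab:
  webservice:
    minReplicas: 2          # base values, inherited unless overridden
    maxReplicas: 10
    deployments:
      web:                  # catch-all deployment for non-git requests
        ingress:
          path: /
      git:                  # dedicated deployment for git-over-HTTPS
        ingress:
          path: /git        # illustrative; the real routing rules differ
        hpa:
          minReplicas: 50   # per-deployment override of the base values
          maxReplicas: 150
```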
E: I think that this is, well, not that straightforward. This is a detail that you need to know in order to fix something, so maybe just include a comment about: hey, if you need to increase the capacity, increase the webservice as well, or just increase this instead.
B: Yeah, that could also be in the runbook for the git service. I think that would be a useful thing to have. I'm...
E
Not
I
think
in
the
file
is
probably
a
better
place
to
have
this,
because
it's
more
direct,
because
when
I
find
a
file
where
I
can
tune
values,
I'm
less
inclined
to
look
for
a
run
book
to
see.
Okay,
I
can't
find
what
I
need
to
tune.
What
what
they
need
to
do
and
then
basically
just
having
a
pointer,
would
probably
be
a
good
thing
there.
So.
A: There it is. So this is what we're overriding; this is the base values file, and you can see...
A: We have deployments defined here, and right now we have two deployments, web and git. The web deployment is the catch-all: any request that's not a git request. Currently on Kubernetes that's only covering WebSockets for the interactive terminals, so it's not a lot of traffic. Eventually, or very soon, we're going to be creating a new websockets deployment for that specifically.
B
Got
it
cool
so
jumping
back
over
to
the
the
gprod
one?
I
guess
this
is
what
we
would
edit.
So
let
me
just
go
ahead
and
do
that
now.
B: For git HTTPS, was it? Yes, yes: webservice from 150 to 250. Yeah, I'll just go ahead and submit this, if that's okay.
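The change itself would be a one-line values override, along these lines; whether the 150-to-250 bump is the minimum or maximum replica count isn't spelled out on the call, so the key below is an assumption:

```diff
 gitlab:
   webservice:
     deployments:
       git:
         hpa:
-          maxReplicas: 150
+          maxReplicas: 250
```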
A
Right
now,
this
project
requires
approval
from
delivery,
but
I
guess
like,
depending
on
who's
online,
you
may
you
may
need
to
override
that
you
can
see
the
dry
run.
If
you
haven't
seen
this
before,
you
can
see
the
dry
runs
that
are
happening
in
the
deployment
pipeline
on
ops.
If
you
click
the
link
igor
for
ops
deployment
pipeline.
A
A
Then
you
would
need
to
override
the
approval
settings
to
merge
it
in
the
on
the
project,
because
I
think
everyone
has
the
has
edit
capabilities
on
the
project
itself.
Hi
dude.
We
should
probably
make
it
so.
At
least
managers
have
merge
approval,
access
and
probably
all
sres
as
well.
B
G
We
should
not
have
override
approvals.
We
should
have
everyone
actually
practice
more
in
this
repository,
submit
more
mrs
and
go
to
the
trainee
process
that
craig
muscle,
for
example,
is
going
through,
and
then
it's
going
to
be
trivial
for
us
not
to
have
a
specific
named
folks
in
there
like.
I
don't
want
delivery
to
be
the
gate
here.
I
just
want
to
ensure
that
every
sre
has
worked
in
this
repository
know
what
they're
doing
and
then
they
can
serve
each
other.
A: Cool, so this is your diff. If we were in an incident, we would look at this, we'd probably say okay, we'd merge it, and then it would get applied to the clusters.
D: Technically it does a helm diff, which is the diff between the manifest that helm has generated from your branch and what it generated last time. That's usually identical to what you described, Hendrick, but if someone has made a manual intervention, it doesn't show up in the diff. So that's a bit of a gotcha that usually doesn't matter.
H: I have one comment before I have to drop off in a couple of minutes here. I wanted to make sure, and I already pinged Alberto about this, that we're actually leaving this call with owners for some of these actions. I see some really great stuff here, like Hendrick's mention that we should comment about how traffic is split.
H
Know
I
love,
I
would
love
to
see
some
of
these
things
just
generate
issues
like
we
don't
need
to
necessarily
like
action
on
everything
but
like
if
we're
getting
things
in
our
in
the
reliability
backlog.
For
sure
I
would
love
to
see
that
as
the
outcome
of
this
and.
E
What
how
can
we
so
if
we,
if
we
see
that
the
we've
already
reached
the
capacity
limits
of
the
notes
that
are
in
the
they're
in
the
cluster?
How
do
we
go
about
this?
Is
this
auto
scaling
or
how
will
we
increase
it
just
in
terraform
and
bump
up
the
number
of
the
nodes,
or
how
would
that.
A
Yeah
this
this
would
be
in
terraform,
and
we
can
take
a
look
at
that
now.
If
we
have
time,
maybe
I
think
we
have
enough
time
or
does
anyone
else
want
to
take
a
crack
at
this.
A: The regional cluster has, you know, a bunch of node pools, and node pools, you can think of them as the groups of VMs that we deploy to for different services. Each node pool has a default min and max which we use; I don't remember off the top of my head what it is, but you can also override it. You can actually look at the zonal cluster config. So this is the zonal cluster config for us-east1-b.
B: And I think we had the git URL to that in the terraform file as well, just for discovery purposes.
A: So for the default node pool config, which is what is used whenever you add a new node pool, these are the defaults.
A: Ten. So you can see, then, that we've overridden it in some places. Let's take a look for git. We have not overridden it, but I think that was because we were running with much fewer than 30 VMs.
A: So these are the clusters: we have the regional cluster and the zonal clusters. We'll just go to us-east1...
A: No, because it wouldn't go up to 25 if... it looks like it's one to 50, so we already set this to 50. So...
F: Is there any way for people to stubbornly edit this here in the web UI, and then terraform comes back and stomps on them? Yeah.
E: I guess the next time someone applied a change via terraform it would show up as a diff, so you would at least see it.
A: So here's where we set the max node count to 50. We would just increase this in terraform to 100, or whatever we need.
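A minimal sketch of the Terraform shape being edited, assuming the standard google_container_node_pool resource from the Google provider; the resource and pool names are illustrative, not the real config:

```hcl
# Illustrative node pool for the git service in one zonal cluster.
resource "google_container_node_pool" "git" {
  name    = "git-pool"                          # assumed name
  cluster = google_container_cluster.zonal.name # assumed cluster resource

  autoscaling {
    min_node_count = 1
    max_node_count = 50 # the value to bump (e.g. to 100) when the pool is capped
  }
}
```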
A: If you've never seen this part of the Google console, Workloads is also interesting because it allows you to see deployments.
A: So we can look at a deployment; it gives you some basic metrics and monitoring, and it shows you all of the pods.
F: So this whole thing is sort of a Kubernetes web UI, where you can click instead of running CLI commands? Yeah.
A: Yeah, this was the first link I pasted, but I didn't really cover it: this link here is for log events on the cluster. We have an index in elasticsearch for GKE. You can use this to see what's going on on the cluster in general, if you want to rule out a cluster problem. We would see errors here; for example, if there was a problem with scaling the number of pods, we would see that here.
A
You
see
a
lot
of
scary
messages
if
you
look
at
it
right
now,
but
this
is
like
normal
stuff
that
happens
when
we
cycle
pods
and
go
through
a
deploy,
so
nothing
to
be
concerned
about
like
these
503s.
A
You
can
see
like
when,
when
containers
are
created,
another
problem
that
we
sometimes
see
is
like,
if
we're
unable
to
pull
an
image.
For
example,
we
deploy
to
the
cluster,
we
do
an
application
update
and
the
image
doesn't
exist
on
dev.
You
would
see
errors
here.
E
Would
we
have
metrics
for
these
container
pull
failures,
or
would
that
be
in
the
in
the
crash
loop
back
off
thing
in
kubernetes
or
where
we,
where,
basically,
where
would
we
see
these
these
kinds
of
errors.
A
Yeah,
that's
I'm
not
sure,
actually,
whether
we
would
alert
on
that
at
all.
We
would
probably
see.
A
A
Kubernetes
I'll
add.
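If kube-state-metrics is scraped into the monitoring stack, one place such failures would be visible is its container waiting-state series; a hedged sketch, since the call doesn't confirm which of these are collected or alerted on here:

```promql
# Containers stuck waiting on image pulls or crash loops, by cluster and namespace.
sum by (cluster, namespace, reason) (
  kube_pod_container_status_waiting_reason{
    reason=~"ImagePullBackOff|ErrImagePull|CrashLoopBackOff"
  }
)
```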