GitLab Delivery Team, 16 Nov 2020

From YouTube: SRE Team Introduction at GitLab - Incident Management

Description

If you haven't already, watch the first video in this series about the Infrastructure team structure at GitLab: https://youtu.be/SLTZzFT4mTs

Topics Covered:

- Definition of an Incident

- Who attends the incidents?

- What causes an incident?

- Who declares an incident?

- The Incident Room.

- The Incident Lifecycle.

Slack name to engage the SRE is: (at)sre-oncall

A

If you're watching this video, that means you liked the first video enough to watch this one, so thank you for that. In this video, I want to talk about incidents at GitLab. How do we handle incidents? What are the causes of incidents?

A

Who are the people involved in an incident? Who solves the incident? Who can trigger an incident? Everything about incidents. So what is an incident?

A

An incident is an event that affects our SLOs negatively enough that we are looking at it, or an outage in one of the services that are required for GitLab.com to function properly, or a situation where our deployments are blocked for one or more reasons.

A

So an incident is basically an event that requires the immediate attention of the engineer on call and perhaps some of the other team members. Let's talk about who is involved in incidents.
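The definition above lists three conditions, any one of which makes an event an incident. As a rough illustration only, it could be sketched in Python like this. The `Signal` class and its field names are hypothetical and are not taken from any GitLab tooling.

```python
from dataclasses import dataclass


@dataclass
class Signal:
    """Hypothetical summary of what someone has observed."""
    slo_degraded: bool             # SLOs affected badly enough that we are looking at it
    required_service_outage: bool  # a service GitLab.com needs is down
    deployments_blocked: bool      # deployments are blocked for one or more reasons


def is_incident(signal: Signal) -> bool:
    # Any single condition from the talk's definition is enough.
    return (
        signal.slo_degraded
        or signal.required_service_outage
        or signal.deployments_blocked
    )
```

For example, `is_incident(Signal(False, False, True))` is true because blocked deployments alone count as an incident.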

A

First, the engineer on call. We know for sure that this person is responsible for mitigating the cause of the incident, but that doesn't mean they're responsible for doing it on their own. If they think that the incident is bad enough, they can involve other people. As we said, the reliability team is the team that participates in this on-call rotation.

A

So, basically, this rotation covers 24/7 across the EMEA, APAC, and Americas time zones, so there's always someone on call.

A

If the engineer on call decides that the incident is severe enough, they can involve the incident manager on call, IMOC for short. The IMOC is sort of the second line after the engineer on call.

A

So if the engineer on call is able to resolve the incident by themselves, perfect. If they can't, they can call for help and the IMOC is going to come. The IMOC has their own role: they ensure that all the resources needed to mitigate an incident are present. For example, let's say the incident requires someone from the development team or from the security team.

A

What the IMOC is going to do is ping them through Slack, or maybe page them if they participate in an on-call rotation as well, for example the dev team.

A

The IMOC would also involve the communication manager on call, which is the CMOC. The CMOC is responsible for relaying messages about the incident to the end users. So if the end users are affected and we need to relay messages to them, we page the CMOC, and the people responsible for doing that at the moment are from the support team, more specifically the GitLab.com support team.
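The order in which these three roles get involved can be sketched as a small function. This is only an illustration of the description above. The enum, the function name, and the two boolean inputs are hypothetical, and the real decision is a human judgment call rather than a rule.

```python
from enum import Enum
from typing import List


class Role(Enum):
    ENGINEER_ON_CALL = "engineer on call"
    IMOC = "incident manager on call"
    CMOC = "communication manager on call"


def roles_to_engage(engineer_needs_help: bool, end_users_affected: bool) -> List[Role]:
    # The engineer on call is always the first line.
    roles = [Role.ENGINEER_ON_CALL]
    if engineer_needs_help:
        # The IMOC is the second line and gathers the people needed.
        roles.append(Role.IMOC)
        if end_users_affected:
            # The IMOC involves the CMOC to relay messages to end users.
            roles.append(Role.CMOC)
    return roles
```

In this sketch the CMOC is only reached through the IMOC, which mirrors how the talk describes the escalation.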

A

So these are the three main people who are involved in the incident. But of course, if the incident is severe enough, you will always find people from different teams. Let's say, for example, that deployments are blocked because of an incident. You might find someone from the delivery team in the incident room.

A

Or let's say the incident is affecting a part of the infrastructure that is rarely modified, and maybe someone who's been on the team for a long time knows more about it, and they happen to be from the delivery team. So we are going to call for help: we're going to ping them on Slack, saying, hi, please join the incident room. We're going to talk about what the incident room is, so keep watching.

A

The IMOC is basically responsible for calling all of these people and saying: hello, come, please help, share your expertise with us. So those are basically the people who are involved in an incident. Let's talk about what causes an incident. Any incident is caused by one of two kinds of factors: internal or external.

A

Let's talk about examples to narrow it down a little bit. Internal factors: when, for example, the delivery team starts a deployment to GitLab.com, they watch the services and look, for example, at the error rate for some of the services.

A

If the error rate is suddenly elevated after a deployment, then they know right away: oh, it might be our deployment. Now let's say the infrastructure team, other than delivery and deployments.

A

Let's say someone from the reliability team is making a change to the infrastructure itself, for example adding pods, removing pods, or changing some sort of configuration somewhere. If they notice an elevation in errors or a change in metrics after a change they made, then they know that they have caused that issue and they can revert it. So you've got, for example, deployments and internal changes by the infrastructure team.
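The "did our change cause this?" check described above amounts to comparing a metric before and after the change. Here is a minimal sketch, assuming the error rate is errors divided by requests and that "elevated" means more than a chosen multiple of the earlier rate. The function names and the default factor of 2.0 are assumptions made for illustration, not values from the talk.

```python
def error_rate(errors: int, requests: int) -> float:
    # Fraction of requests that failed; 0.0 when there was no traffic.
    return errors / requests if requests else 0.0


def elevated_after_change(before: float, after: float, factor: float = 2.0) -> bool:
    # Hypothetical rule: flag the change if the error rate after it
    # is more than `factor` times the error rate before it.
    return after > before * factor
```

If the check flags the change, the talk's advice applies: the team already knows what changed, so they can revert it.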

A

An external factor is, for example, a sudden surge in traffic. In this case it could be a normal surge, just an increase for some reason that's not malicious, or in other cases it's an attack that our automated systems didn't catch and mitigate automatically.

A

So the engineer on call in this case is going to look at the external factor and try to mitigate the problem.

A

These are the two causes of incidents.

A

Who can declare an incident? Anyone. Anyone at GitLab, from any team, can declare an incident once they notice something.

A

Let's say, for example, a service is not available. Say you work for GitLab and you notice that GitLab.com is not working. I mean, that's the biggest service.

A

Let's say it's not working well. Before you create an incident, you might want to check with the engineer on call, just to make sure that the issue is not local to your computer. Maybe it's something local, so you don't really need to page or create an incident for that. You can always reach the engineer on call by mentioning a specific Slack name, at something.

A

I'm going to put that in the description, because I don't remember it at the moment. But yeah, you can always relay a message to the engineer on call directly before creating an incident.

A

So let's talk about the incident room. The incident room is a very special room. It's a permanent room. You can find it in the incident management Slack channel: there's a permanent link to a Zoom room, and that room is called the incident room. The incident room is a very interesting place, because this is where all the fun happens. This is a room that is accessible to everyone at GitLab, so you can always attend any incident that is happening.

A

Of course, if the incident is not severe enough, you're not going to find anyone in the incident room. But if the incident is severe enough, is affecting enough users, and needs sync communication, you're going to find people in the room. Most of the people in the room are going to be the engineer on call, the incident manager on call, and sometimes the CMOC, the communication manager on call. A lot of times you also find other team members from different departments across the company, people whose specialization is needed.

A

So if you're someone who's just curious, you can still attend the incidents, see what's happening, and listen to the talk, just like a shadow on that call. That will give you a lot of information about how GitLab.com operates.

A

You see what an incident is really like in real life and how hard the infrastructure team works to mitigate incidents. You can see the amount of stress in the room, and it's going to manifest in different ways. You might not detect it right away, but everyone is of course stressed from time to time, depending on the incident and on how their day is going. So some people are going to be laughing. Some people are going to be really surprised: how could this happen?

A

How did we set up our infrastructure to handle something like that, or not handle something like that? So I find it very interesting. I find the incidents really, really informative, particularly for support, because in the self-managed department of support we also participate in a customer emergency rotation. It is very similar to the incident room, except we are not responsible for the customer's infrastructure, so the customers basically handle that part in the customer emergency calls with the support team.

A

So we sort of see the end results of what happens in the incident rooms. We think of infrastructure as a black box. We don't really know what's happening there, and all we care about is the product, GitLab, and how it's running on that infrastructure. Sometimes the setup is very simple, like a one-node setup, and we know how to navigate that easily.

A

Sometimes it's a more complicated setup, an HA setup, where the setup is scattered among different nodes, and that makes it a little bit more complex. But if you're in the incident room, you're going to get to see the infrastructure side of the incidents.

A

So you get to see the deployments, and you get to see some information. Sometimes the engineer on call shares their screen, and you get to see the complex setup we have for GitLab.com, which is very interesting for support, because we rarely see anything on that scale.

A

So I advise you, if you're in support, to perhaps subscribe to the incident management channel. If you see severity 1 (S1) or priority 1 (P1) incidents, perhaps join the incident room and have a look at how the team is handling the incident.

A

Compare that to how you handle a customer emergency. In a customer emergency, you're only responsible for making sure that GitLab is up and running. We are quite equipped to do that.

A

Sometimes it's harder, but most of the time we know how to get GitLab up and running. On GitLab.com, though, they are the ones responsible for the infrastructure, so they have to handle it. The black box, for them, is not black. It's very white, it's very transparent.

A

So that's the interesting part. They uncover a layer that support doesn't get to see in their own on-call rotation. So in the incident room, as I said, you get a lot of reactions. Some people are very surprised. Some people are stressed, so they laugh.

A

But, more importantly, you find that spirit where everyone is trying to help and everyone has the same goal, which is to get the service up. So that's the focus, and I'm actually going to compare what I saw in the incident room to something that I've learned recently as a parent.

A

I downloaded this application on my phone, and it gives you a sort of road map for understanding how to parent your child in a better way. At some point they talk about bonding and, to my surprise, bonding with your little child happens during the hard times: when your child is struggling to understand their feelings, when they're crying, when they're experiencing a negative emotion.

A

That's when real bonding happens between you and your child, and that was really interesting to me, because it did actually make sense. Looking back at all my relationships with everyone in my life, I'm like, that's right: it's during the hard times that you really get together, get close, and bond. You establish this shared experience where you trust each other more, like, I'm going to be there for you, you're going to be there for me. So I think it is an example.

A

I think what happens in the incident room is sort of similar, because it is a stressful time. Sometimes we have no idea what's happening and, because of that, we sometimes experience a lot of negative emotions. But with the help of everyone, you get through every incident. We have to, right?

A

So I think what happens is that it creates bonds. I think you see some sides of everyone that you don't really see outside of the incident room, and I think it's mainly because of the stress. So I find it a very interesting place, not just on the technical level but also on the team level, like how the team members interact with each other.

A

So yeah, that's the incident room. During its lifetime, an incident goes through a few phases. The first phase is when an incident is triggered.

A

After the incident is triggered, the engineer on call gives it immediate attention and tries to observe as much data as possible to understand the impact of this incident on our users or on different services. Once they establish what the impact is, they decide whether to involve other people in the call or to try to mitigate it by themselves. If it's severe enough, it gets the S1 label and has to be addressed immediately by a lot of people, who are going to get paged.

A

Once these things, such as the impact, are identified, the engineer on call is going to try to identify the root cause of the incident. Believe it or not, I think that's the hardest part of the incident. I think that's the part that takes most of the time.

A

So, if you're lucky enough, you're going to do that in under an hour. If it's an internal cause, it's going to be easy to identify, and we get to revert the change quickly because we're already aware of the change that caused the issue. But if it is an external factor, a lot of times it takes a long time just to identify the root cause or how we are going to mitigate the issue. Once we've identified the root cause, a lot of times it's easy and quick to mitigate the issue.

A

Sometimes it's not that easy and direct, but a lot of times it is, once you identify the root cause. After identifying the root cause, you mitigate the issue. You apply something, a fix, a hot patch, whatever, just to mitigate the issue. Once the issue is mitigated, it's marked as mitigated and it's not as urgent anymore, but we collect all the corrective actions that we can take to prevent this issue from happening again in the future. Most of the time, this work happens async.

A

There is a brief discussion about possible corrective actions in the incident room, here and there during the incident. Someone would say, oh, perhaps we can do that to permanently solve this issue in the future. These corrective ideas are collected and added to the issue at the end of the incident, and someone then takes care of them after the incident is over.

A

So the main goal is to mitigate the incident. Once it's mitigated, we look back and try to perform a root cause analysis for the incident, especially if it was severe enough to require one.

A

Sometimes the incidents are not severe enough, so we don't really have to spend that time on trying to find the root cause. But if it's severe enough, we will have to perform the root cause analysis, and the corrective actions are going to be implemented in the following weeks or months, or however long it takes to find the resources to do so.
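The phases described above can be pictured as a simple ordered state machine. This is only a sketch of the talk's description, not GitLab's actual tooling. The phase names, the `Incident` class, and its methods are hypothetical.

```python
from enum import Enum


class Phase(Enum):
    TRIGGERED = 1
    IMPACT_ASSESSED = 2
    ROOT_CAUSE_IDENTIFIED = 3
    MITIGATED = 4


# Enum members iterate in definition order, which is the lifecycle order.
ORDER = list(Phase)


class Incident:
    def __init__(self, title: str):
        self.title = title
        self.phase = Phase.TRIGGERED
        # Ideas raised during the call, to be followed up asynchronously.
        self.corrective_actions = []

    def advance(self) -> Phase:
        """Move to the next phase; fail if already mitigated."""
        index = ORDER.index(self.phase)
        if index + 1 >= len(ORDER):
            raise ValueError("incident is already mitigated")
        self.phase = ORDER[index + 1]
        return self.phase

    def suggest_corrective_action(self, idea: str) -> None:
        # Corrective ideas can be collected at any point in the incident.
        self.corrective_actions.append(idea)
```

Corrective actions are kept separate from the phases on purpose: in the talk they are collected during the incident and implemented after it, so they do not block mitigation.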

A

So that's the life cycle of an incident. I hope you found this video interesting and helpful. Please let me know if you have any questions. Reach out to me: I'm Rahab Hassanin from the support team. Feel free to leave comments below; I'm going to check them every now and then. Thank you.