GitLab Ops Section, 23 Jan 2020

From YouTube: Incident Management Walkthrough - Jan 2020

Description

Incident Management Walkthrough Jan 2020

https://gitlab.com/gitlab-com/Product/issues/717

A

Hi, my name is Sarah Waldner and I'm a senior product manager for the Monitor stage at GitLab. I lead the Health group, and today I'm going to be walking us through our product category, incident management. Incident management was added to GitLab last year. The product category is currently at the viable maturity level, and this set of features serves users who are responsible for monitoring and maintaining the availability and reliability of IT services.

A

They would leverage this set of features to respond to IT incidents that are triggered when there are outages or lapses in the availability or reliability of the services that they are responsible for. So today this walkthrough is going to cover the configuration of alerts, setting up issue templates to be used for the auto-generation of incidents, and enabling auto issue creation.

A

It also covers embedding metrics in those incidents, linking to Zoom calls, and then collaboration via slash commands in Slack. As we go through this walkthrough, I'll be pausing in between to take notes on improvements, bugs, or future ideas that come up. So, let's get started.

A

So this is a demo project that I have set up for the Monitor stage, and we're going to go check out the metrics dashboard.

A

So what I'm looking at here is a set of default NGINX metrics that are automatically added to a project when you deploy Ingress or Prometheus to a cluster that you're running your applications on. You'll get a set of NGINX Ingress metrics, as well as system metrics for Kubernetes, that automatically come out of the box. For the purposes of this demo, I've gone ahead and set an alert on my memory usage across my cluster.

A

So I did that by going to the drop-down in the upper right-hand corner here, clicking Alerts, and then setting it here: selecting my query and the value that I want to set the threshold at. If I wanted to remove this alert, I'd also have to... hmm, that's a good idea to note. It looks a little bit challenging to delete this: I have to select the one and then click Delete, so I'm going to go ahead and make a note on that.

A

All right, moving on. You can see, as a visual here on the chart, that it shows you my threshold and then it shows you the average value here. It's clearly above, and we've indicated that in a red color. I had an issue that was automatically generated for this, so let's go take a look at that.

A

There are a couple of different parts to this incident that I'm going to walk through in a minute. I just wanted to show the incident first. The way that you get issues to be created automatically on alerts that you're receiving from Prometheus is by navigating to Settings, Operations.

A

And then the Incidents section. I've got a really simple enable/disable checkbox here. When I set up alerts from the metrics dashboard, which I can do for instances of Prometheus that are deployed to GitLab-managed clusters (I can also integrate external instances of Prometheus), I can select to create GitLab issues automatically for each alert triggered. I also have the option of customizing what that incident looks like, so in this project I've created an issue template called incident.

A

These are saved in the repository, so they're very easy to add and save, just as you would add issue templates for any other project.

A

I've named this one incident for ease of identifying it for purposes of incident management, and within it I've added a couple of different sections that are important to my team when we're firefighting. I've embedded a metrics chart and a section where I'm going to populate the timeline, and I've also added the Zoom call that we always have open for firefighting. I've indicated the Slack channel that this project is integrated with, and then I have used quick actions to do things like auto-assignment and auto-labeling.

A

This is super useful for automating a lot of those manual tasks that you might have to do when an incident is created. You can save those, add them to the issue template, and have that done for you, so you don't have to do anything after it's paged.
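
As a rough illustration of the template described here, the sketch below assembles an incident issue template as a Markdown string. GitLab reads issue templates from the `.gitlab/issue_templates/` directory of the repository, and `/assign` and `/label` are GitLab quick actions. Everything else is an assumption made for the example: the section headings, the metrics URL, the Zoom link, the Slack channel name, the username, and the label are placeholders, not values from the demo project.

```python
from pathlib import PurePosixPath

# Placeholder values: none of these come from the demo project.
ASSIGNEE = "@on-call-engineer"
SERVICE_LABEL = "~service-example"
ZOOM_URL = "https://example.zoom.us/j/0000000000"
SLACK_CHANNEL = "#example-incidents"
METRICS_URL = "https://gitlab.example.com/group/project/-/environments/1/metrics"


def build_incident_template() -> str:
    """Return Markdown text for a hypothetical incident issue template."""
    lines = [
        "## Metrics",
        METRICS_URL,
        "",
        "## Timeline",
        "- ",
        "",
        "## Zoom",
        ZOOM_URL,
        "",
        "## Slack",
        SLACK_CHANNEL,
        "",
        # Quick actions sit on their own lines in the issue description.
        f"/assign {ASSIGNEE}",
        f"/label {SERVICE_LABEL}",
    ]
    return "\n".join(lines) + "\n"


# Directory where GitLab looks for issue templates in a repository.
template_path = PurePosixPath(".gitlab/issue_templates") / "incident.md"
template_text = build_incident_template()
print(template_path)
print(template_text)
```

Committing the printed text at the printed path would make a template named `incident` selectable in the project, which matches the naming used in the walkthrough.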

A

So when we go back to look at the parts of the issue for the alert that was triggered, I've got the environment (the environment in which the system that triggered the alert lives), the name of the metric, and then the threshold that triggered it. It was greater than 0.2 gigabytes for five minutes, and anything over that I wanted to receive an alert on. When the alert triggered, it automatically created this incident, and I have this section at the top, which is the summary, a set of auto-populated fields.
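
To make the "greater than 0.2 gigabytes for five minutes" condition concrete, here is a minimal sketch of that kind of check. It is a simplification written for this transcript, not how Prometheus or GitLab evaluates alerts: the `Sample` type, the function name, and the sample data are all hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Sample:
    timestamp: float  # seconds since an arbitrary reference point
    value_gb: float   # memory usage in gigabytes


def first_firing_time(
    samples: Iterable[Sample],
    threshold_gb: float = 0.2,
    hold_seconds: float = 300.0,
) -> Optional[float]:
    """Return the first timestamp at which usage has stayed above the
    threshold for at least hold_seconds, or None if that never happens."""
    breach_started: Optional[float] = None
    for sample in sorted(samples, key=lambda s: s.timestamp):
        if sample.value_gb > threshold_gb:
            if breach_started is None:
                breach_started = sample.timestamp
            if sample.timestamp - breach_started >= hold_seconds:
                return sample.timestamp
        else:
            # Any sample at or below the threshold resets the timer.
            breach_started = None
    return None


# Made-up data: one sample per minute, always above the threshold.
steady_breach = [Sample(t, 0.25) for t in range(0, 361, 60)]
print(first_firing_time(steady_breach))
```

A brief dip below the threshold resets the timer in this sketch, so a short spike alone would not trigger an alert.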

A

That gives me information that comes from the alert payload. There are actually five different fields that could show up here, but we only surface those that have values in the alert payload. Below the alert details is going to be the rest of the custom issue template that we just looked at. The embedded metrics is a link that I've generated from the metrics dashboard. So I can do that. Let's go look at that.
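
The "only surface fields that have values" behavior can be sketched as follows. The five field names and the sample payload are invented for illustration; the walkthrough does not name the actual fields, and this is not GitLab's implementation.

```python
from typing import Mapping, Optional

# Hypothetical field names, chosen for illustration only.
SUMMARY_FIELDS = ("starts_at", "full_query", "service", "monitoring_tool", "hosts")


def build_summary(payload: Mapping[str, Optional[str]]) -> str:
    """Render one Markdown line per summary field that has a value."""
    lines = []
    for field in SUMMARY_FIELDS:
        value = payload.get(field)
        if value:  # skips missing keys, None, and empty strings
            lines.append(f"**{field}:** {value}")
    return "\n".join(lines)


# Made-up payload: two fields populated, two empty, one absent.
payload = {
    "starts_at": "2020-01-23T10:00:00Z",
    "full_query": "",
    "service": None,
    "monitoring_tool": "Prometheus",
}
print(build_summary(payload))
```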

A

I can either copy and paste the link to the entire metrics dashboard into a Markdown field, or I can generate links to specific charts if I only want to show that chart in the issue. Both of those things are really helpful in the initial triage process: when you're trying to figure out what happened and why the alert was triggered, having that visual immediately available shortens your time to action. Below that, I've got my standing Zoom meeting.

A

Oh, that's because I have to add it via quick action. I remember how to do this.

A

So if I was paged and I started the initial investigation, and I realized it's going to take more people than just me to fix this, and I started the Zoom meeting, I've got a really easy way to link this to the incident. I get a system action that that was successful and, oh, I need to refresh the page there. So that's another thing that doesn't make a lot of sense, so I'm going to note: remove the requirement to refresh the incident to see the linked...

A

That's not a great experience if you're trying to move quickly and you've added something and you don't see it immediately. So now there is this button. Anyone else that I've sent this to, whether it's in Slack or anyone else who has been paged, has quick access to that conference bridge, where we can then collaborate synchronously. A couple of other pieces: it's assigned to me. I did that with a quick action in the issue template. I've got a couple of labels that were added automatically.

A

The incident label gets added for all issues that are created for triggered alerts; that doesn't need to be configured in the issue template. This makes it easy to have an issue board that you're using to triage, so you can just filter that for issues with the label incident and always have access to those. Then I've added a label for the service that's running in this GitLab project, so if I've got some sort of overview within a group, I can filter by those service labels as well.
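
The same label filter can be expressed against the GitLab REST API, whose issues endpoint accepts `labels` and `state` query parameters. The sketch below only builds the request URL and does not call the API; the host and project path are placeholders.

```python
from typing import Sequence
from urllib.parse import quote, urlencode


def incident_issues_url(base_url: str, project_path: str, labels: Sequence[str]) -> str:
    """Build a GitLab API URL listing open issues that carry all given labels."""
    # The project path must be URL-encoded, including the "/" separator.
    project_id = quote(project_path, safe="")
    query = urlencode({"labels": ",".join(labels), "state": "opened"})
    return f"{base_url}/api/v4/projects/{project_id}/issues?{query}"


# Placeholder host and project path.
url = incident_issues_url("https://gitlab.example.com", "my-group/my-project", ["incident"])
print(url)
```

Passing a service label alongside `incident` would narrow the list to incidents for a single service, mirroring the group-level overview mentioned in the walkthrough.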

A

So another way that people tend to collaborate during firefights is via ChatOps. We support slash commands for both Slack and Mattermost, and I'm going to demo the Slack slash commands. This project is integrated with Slack. We've got two different services that allow you to either send things to Slack or change GitLab issues from Slack. So if I tag Clemente and say...

A

Commenting on this issue is going to surface this in the Slack channel that this project is integrated with. Before we go to Slack, I'll just show what that looks like.

A

So I navigated from Settings to Integrations, going down through project services, and we've got both a Slack application and Slack notifications.

A

So, within the Slack notifications service, once I've set up the webhooks for the integration, I've got an entire list of actions that I can select to affect different things, either on issues or changes in the pipeline or delivery pipeline actions. These are all going to show up in Slack. So here, any events that happen to an issue (created, updated, or closed) are going to go to this Slack channel.
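
For a sense of what such a notification involves, the sketch below builds the JSON body that a Slack incoming webhook accepts (a `text` field, with links written as `<url|label>`). The function, the event names in `NOTIFY_ACTIONS`, and the sample values are assumptions for illustration; this is not GitLab's integration code, and nothing is sent.

```python
import json
from typing import Optional

# Issue events to forward, mirroring the ones named in the walkthrough.
NOTIFY_ACTIONS = {"created", "updated", "closed"}


def slack_payload(issue_title: str, issue_url: str, action: str, actor: str) -> Optional[str]:
    """Return a JSON body for a Slack incoming webhook, or None if the
    action is not one we forward."""
    if action not in NOTIFY_ACTIONS:
        return None
    text = f"{actor} {action} issue <{issue_url}|{issue_title}>"
    return json.dumps({"text": text})


# Made-up issue details.
body = slack_payload(
    "Memory usage above threshold",
    "https://gitlab.example.com/my-group/my-project/issues/1",
    "closed",
    "example-user",
)
print(body)
```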

A

So in Slack, we're looking at the tanuki ops Slack. You know I commented on that issue, and so you get the text from that in Slack.

A

As an example of affecting an issue from Slack, let's go ahead and close this issue, so...

A

One thing that we probably should improve is that these slash commands are really long. In the amount of time that it takes me to type this out, I could probably just navigate to GitLab. Okay, I've gone ahead and closed that from Slack, and then I'll be able to see in GitLab

A

that my issue was closed, and there is a system message indicating that it was me who closed it, because there is user mapping between Slack and GitLab. So one more item that I wanted to note.

A

We want to improve the Slack command.

A

It uses terms like...

A

Awesome. So that was an overview of incident management as it is today. As I said before, it's at viable, and we're looking to get this dogfooded by the internal infrastructure team, as well as recruiting externally to build a special interest group for incident management to help us determine improvements moving forward. Thank you.