From YouTube: Incident Management
Description
Kevin, Group Product Manager for the Monitor, Configure, and Release stages, presents on increasing operational efficiency and end-to-end insight and visibility with Incident Management.
Key Topics:
1. What is Incident Management?
2. Learn about GitLab Incident Management: what it can do and what is coming up next
3. Feel equipped to ask customers about what they are using today and tee up future conversations on Incident Management
A
I'm the group product manager for the Monitor, Release, and Configure stages, and today I am here to talk about incident management, which is a relatively new set of features we have in GitLab. We recently released on-call schedule management, so we thought it would be great for you all to learn a little bit about it.
A
And since there are so few of us here, feel free to interrupt me at any time with questions. If you want to dive into a particular area, I don't mind pausing and answering your questions as soon as you have them.
A
So what is incident management? Or rather, what's the problem we're trying to solve? At the end of the day, no one wants downtime, but it happens all the time, and downtime is expensive for companies. Companies that experience downtime obviously want to recover from it as quickly as possible.
A
So incident management is all about that: it helps coordinate the efforts to reduce downtime. As an aside, contrast that with monitoring: monitoring is about knowing when something is wrong, or helping diagnose what led to the problem. On the other side, responding to incidents is also hard and stressful. For SREs or infrastructure engineers who are constantly in the loop of responding to downtime, this leads to burnout.
A
We hear a lot from the folks we interview that a constant problem is alert fatigue: having to constantly respond to downtime. You see a lot of people in that role constantly churning when organizations get into that state, and it's something that's hard to recover from. Because of this, incident management has become a must-have tool for many organizations.
A
Looking at the competition, what we've learned is that a lot of companies still have homegrown solutions for incident management, but there are several well-known brands in this space. Large organizations typically buy one of these services, and small organizations typically have some homegrown tool, using Slack as a central way to communicate. One trend that we do see in this space is consolidation of incident management tools into other workflow tools or monitoring tools. The one exception is the market leader, PagerDuty, which still stands as a standalone tool.
A
All right, this is the slide I was showing; I had a nice graphic of a cloud losing money earlier that you didn't miss. Anyway, cool. So what does incident management actually do? There are a few jobs that incident management tools are hired for. First, it's a centralized place to collect alerts.
A
Alerts are typically raised in monitoring tools, and organizations can often have more than one monitoring tool in use at a time, so it's nice not to have to go to different places to see the alerts raised from different parts of your tech stack. Once an alert comes into an incident management tool, the tool typically tries to hydrate the alert with more information, so that incident responders can understand more about what it is, and if it's important enough, they will raise an incident.
A
Another thing the incident management tool does is gather all the people who are going to respond to the incident; this is when paging happens. Typically, the setup is that you have a schedule, or a plan for who's going to be on call, and once an alert comes in, you send off a page so that people will come take a look, even if it's outside normal working hours. Once the team is gathered, the incident management tool is responsible for automating some of the tasks to facilitate coordination among team members.
A
And, lastly, incident management tools are typically responsible for facilitating post-incident review. This is an important step, because by gathering what has happened in the past and what the team has learned from it, they can make gradual improvements and iterate over time. All right, so I'm going to move on to show you what we've built so far.
A
Okay, so I'll just give the schedule a name. I can add a description (it doesn't matter) and the time zone, and add the schedule. Once you add a schedule, you'll see that a schedule by itself is not enough. Another concept within our incident management tool is the idea of a rotation.
A
I set it to start a few days ago, at the beginning of the month, and I'll set it so that, in this example, each person is on rotation for seven days. There are other things you can set, such as an end date for this particular rotation, or restricting it to a specific time interval. Oftentimes incident response teams will have specific hours that they work. At GitLab, for example, we have people located around different parts of the world, depending on where they live.
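The rotation logic described above can be sketched as a tiny model: given an ordered list of participants, a start date, and a fixed shift length (seven days in the demo), work out who is on call on a given day. This is an illustrative sketch, not GitLab's actual implementation; the function and parameter names are made up for this example.

```python
from datetime import date

def on_call(participants, rotation_start, today, shift_days=7):
    """Return who is on call today in a fixed-length rotation.

    participants: ordered list of names; rotation_start: the date the
    first shift began; shift_days: length of each shift.
    """
    elapsed = (today - rotation_start).days
    if elapsed < 0:
        raise ValueError("rotation has not started yet")
    # Shifts cycle through the participant list in order.
    return participants[(elapsed // shift_days) % len(participants)]
```

For example, with a rotation started on May 1st, `on_call(["alice", "bob", "carol"], date(2021, 5, 1), date(2021, 5, 9))` falls in the second seven-day shift, so "bob" is on call.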
A
What we'll be adding in the next few milestones is the concept of an escalation policy. Whenever an alert comes in, the escalation policy is responsible for figuring out how to contact the people that need to respond to the alert. So you could set it up so that the person receives a page, receives a ping in GitLab, receives an email, or receives a phone call. That will be coming up in the next few milestones.
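An escalation policy of the kind described can be modelled as an ordered list of steps, each with a delay and a set of notification targets; as time passes without acknowledgement, more steps fire. A minimal sketch, assuming a simple step/delay shape (the data layout and names are hypothetical, not GitLab's):

```python
def escalation_targets(policy, minutes_since_alert):
    """Return every target that should have been contacted by now,
    given an ordered list of escalation steps."""
    notified = []
    for step in policy:
        if minutes_since_alert >= step["delay_minutes"]:
            notified.extend(step["notify"])
    return notified

# Example policy: email immediately, page after 5 minutes,
# phone the manager after 15 minutes with no acknowledgement.
policy = [
    {"delay_minutes": 0,  "notify": ["email:oncall"]},
    {"delay_minutes": 5,  "notify": ["page:oncall"]},
    {"delay_minutes": 15, "notify": ["phone:manager"]},
]
```

At minute 0 only the email target fires; by minute 15 all three channels have been tried.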
A
So once this is done, the next thing I'm going to do is trigger an alert. I'm just going to use our alert integration; there's a capability to send a test alert to set things up. So I just sent this alert, and what you'll see at some point here is that it notifies the responder that something has happened. Clicking through the email brings you directly to the alert details page, but before we show that, here is the alerts list page.
A
And you'll see the alert that was just triggered a minute ago. This is a place where an SRE team or infrastructure team can monitor throughout the working day, when they're not on call, to understand what is happening across their system. It's a single location where all the alerts are hopefully aggregated, and you can do certain filtering and sorting to facilitate the workflow for the SRE team. Clicking into the alert...
A
Actually, one other thing I want to show you real quick: it's important for alerts to be actionable. I just sent multiple alerts of the same type, which is very typical in a real situation, because when something goes wrong, the monitoring engine will keep sending additional alerts as long as the threshold is met. If you have to respond to the same alert over and over, that's not particularly useful.
A
What the team has built is the ability to aggregate alerts of the same type together, so you'll notice that there have actually been four events of the same type. Clicking into this alert, on the main details page you'll see all the information that was supplied by the originating source.
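The aggregation behaviour described above (four identical firings collapsing into one row with an event count) can be sketched by grouping raw alerts on a fingerprint key. This is an illustrative model of the idea, not GitLab's implementation; the `fingerprint` field name is an assumption.

```python
def aggregate_alerts(alerts):
    """Group raw alerts by fingerprint, keeping one entry per group
    with an event count, so repeated firings of the same alert show
    up as a single row."""
    grouped = {}
    for alert in alerts:
        key = alert["fingerprint"]
        if key not in grouped:
            # Copy the first occurrence and attach a counter.
            grouped[key] = dict(alert, events=0)
        grouped[key]["events"] += 1
    return list(grouped.values())
```

Four identical "CPU high" firings plus one "Disk full" firing would aggregate into two rows, with event counts 4 and 1.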
A
If you have a monitoring tool that's more deeply integrated (currently that's only Prometheus), the metrics will also show up here, and any actions you have taken will show up in the activity feed.
A
So, looking at this incident details page, what this is is actually a different issue type. If we go to Issues, something that's relatively new is that you can also create an incident manually, should your workflow dictate that that's how things are done. But in our example, we can create an incident directly.
A
Directly from our alerts, that is. The new incident issue type links to all the alert details, and you'll notice there's something specific about this issue type that's different.
A
Under this tab, under incident settings, you can activate a specific timer: a countdown timer to notify the team how much time has elapsed. Typically, teams aim to respond to an incident within a certain amount of time.
A
From within the incident page you can, for example, publish this information to a status page, or collect all the information as you work through the incident to resolve the issue. The comments automatically become a timeline that the team can review once the incident is resolved.
C
The first one is: I want to understand if the alerts are rolled up to the group level.
A
Eventually. Well, currently it's only at the project level. Okay.
C
And so that kind of leads to my next question, which is: what size customer is GitLab's incident management capability well suited for?
A
Cool, let me walk through what we have today and what's coming up next real quick. What we have today is integration; by integration, specifically, we have an HTTP endpoint that's able to receive webhook alerts from various monitoring tools, and you can map whatever the fields are for a specific alert to the way that GitLab displays information within our system. As you saw, responders can triage their alerts in a single location, and they can also triage incidents in a single location.
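The field mapping just described (arbitrary webhook payload in, GitLab's alert fields out) can be sketched as a simple per-integration dictionary translation. The mapping function below is illustrative, not GitLab's implementation; the target field names (`title`, `severity`, `description`) follow the generic alert shape, but treat the exact set as an assumption.

```python
def to_gitlab_alert(raw, field_map):
    """Map a monitoring tool's webhook payload onto the fields the
    alerts endpoint understands, using a per-integration field map
    of source key -> target key. Unmapped source keys are dropped."""
    return {target_key: raw[source_key]
            for source_key, target_key in field_map.items()
            if source_key in raw}

# The resulting dict would then be POSTed as JSON to the project's
# alert integration endpoint with its authorization token
# (endpoint path omitted here; see the integration settings).
```

For example, a payload with `alarm_name` and `level` keys can be mapped to `title` and `severity` with `{"alarm_name": "title", "level": "severity"}`.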
A
What we just released was on-call schedule management, which I showed you at the very beginning of the demo, and we're continuing to make more improvements to it; there's some more polishing needed for this specific feature. What we're working on next is paging and escalation.
A
What we built today, we believe, is a fit for customers who have homegrown incident management solutions, specifically because right now we haven't had a lot of feedback on the entire workflow.
A
The problem that we've heard over and over again is that they feel it's much too expensive, especially when they have to buy a seat for everyone that may, at some point, interact with an incident, which is not all the time. So they want to consolidate their toolset into something that they already have. So it's for customers that want to cut costs on incident management.
C
Sorry, thank you, thanks for answering that question. I was wondering if we have plans to dogfood incident management at GitLab.
A
We do plan to. Incident management has a pretty high bar before it can be put into production, and we at GitLab similarly have a high bar, because we want to be able to rely on the system 100% of the time.
A
The plan to dogfood is actually to start by running what we've been calling game days, to simulate incidents that are similar to what we see happening in reality, and gradually, over time, as we build up more confidence in the product, we're going to start using it ourselves. We're not quite there yet today.
A
Yeah, no problem, appreciate that. So, Mark, that's a great question too. I want to talk about two things. Number one is the workflow that I showed earlier when I was sharing my screen: this workflow is very similar across all the incident management tools.
A
The difference between modern incident management tools, like PagerDuty or Opsgenie, is typically in how they group things together. We, for example, call a plan a "schedule", a group of people who fit within a schedule a "rotation", and how we page people an "escalation policy".
A
I would say that one thing we are not doing today is building specific custom integrations with all the monitoring tools. This is one of the ways that tools like PagerDuty or Opsgenie have become really popular, but pursuing the route of building a specific integration with every single monitoring tool people could be using is really expensive, and it takes a huge investment of effort right now.
A
Our working theory is that it's possible that what we have today is good enough, especially considering the cost savings they would get by switching over to GitLab incident management, but that's also something we need to validate further. Any customers using it today? For the end-to-end flow, not yet, because we literally just released on-call schedule management, but we are seeing a ton of people already using incidents specifically as a separate issue type.
C
So, John, just from a format standpoint, I know this is a skills exchange. Is it also meant to be a discussion forum, or is it more delivering of the information and the updates?
C
So, just as somebody who speaks to customers on the front line representing GitLab's capabilities, I think this is really cool. I would probably position it, in the enterprise space, as something that can help a customer streamline their management of an application: an enterprise being able to monitor the application's performance through Prometheus (assuming they're using GitLab for deploy as well) and tying that all together.
A
And how would you define it?
C
So I think it's more a question of, you know, if I think about large enterprises, banks, utilities and such: they have a whole ops console with an incident management platform that probably has some AI or log-scraping combination to identify which incidents are actually an issue, and then correlates that with others to say, okay, there's enough correlation here to raise a ticket and create an issue that is going to call attention in ServiceNow or something like that, that drives a workflow, and so on and so forth.
C
It's quite elaborate. But these large enterprises may have an app they're deploying that's of particular importance, and there is a team that is going to be responsible for triaging that and wants closer, better observability of what's happening in that app they're responsible for. These are the people who will go out and buy their own copies, or their own use, of AppDynamics or New Relic or Datadog or something like that for their app. That's probably what I see, just imagining the fit for my customers.
C
I think this fits better with that, in terms of our current capabilities and how it would be useful from an end-to-end standpoint at a project level.
A
Yeah, that makes a lot of sense. I'd love to connect with you further and talk specifically about what customers might be a good fit. We're super open to this type of feedback in particular, because we do understand it is limited today, especially relative to the main tools their people are going to be using.
A
Cool. I see there are a few other questions from Samir and Adrian; Adrian, I think, is on the call, but let me finish the content I prepared. So actually, I won't cover this, because we were just talking about it. I hope you care about incident management because it adds value to the GitLab platform.
A
This additional value proposition should help with conversations, and eventually we see this as a way to expand to a completely different set of users than the ones the GitLab application is directly a fit for today. So as we grow this capability, I think it should get more and more interesting for most of your customers.
A
This information will be available to you later, and additional resources are mainly on our Direction page. I'd love to hear from you if you have additional thoughts or questions after this call. And I just realized I am not sharing my screen again, so I was talking without showing anything; I really apologize for that. But anyway, I'm going to turn it over to questions. So Samir says: a customer has an AWS instance and is trying to figure out how to use incident management.
A
Typically it's logs, but I wouldn't say it's typically tying logs to alerts. More often the case is that an alert is generated within a system. So for AWS, you can have alarms within CloudWatch, and you can have those alarms be sent to GitLab, so they will show up within GitLab incident management in the alert list, and the alerts then would trigger a page to an incident responder that is set up within a schedule.
A
The more established players within incident management focus a lot these days on using AI to become smarter with alert management. One of the main problems for a lot of teams is alert fatigue, where too many alerts come in all the time; at some point, you don't pay as much attention to them as you need to, which may prolong incidents when they do occur.
A
So a lot of times the solution to that is to monitor what has happened in the past and use that information to get smarter, having a computer figure out which alert is actually important. Currently we don't have specific plans for AIOps integration, but it is something that we're thinking about and hope to eventually introduce in the future.
D
I think I can probably revert it; it's sort of a pricing question really, and I think you more or less answered it. It was looking at: do we expect the users of incident management to be existing GitLab users? Because in many customers, I could see they may well not be. And then you mentioned the existing tools are expensive; I wondered, you know, what is expensive? Because the features in incident management look to be spread: some are in the Free tier, some are in Premium, some are in Ultimate.
D
So if we've got a user that wouldn't be on GitLab, that we had to put on Ultimate to get incident management, how would the price of that compare to, say, an existing enterprise-grade incident management tool?
A
Yeah, great question. So certain features are just available in Core, and other features are available in Premium today. Right now there is no incident management capability that's Ultimate, and I'll explain why that is. We build incident management based on some existing pieces within GitLab. For example, an incident is an issue, and an issue is obviously available at the Free tier. So if you're just using an incident issue, that's freely available, but that's not the incident management workflow.
A
We imagine the incident management workflow will be at the Premium tier. That would include things like schedule management, various or multiple alert integrations, and eventually alerts and incidents at the group level; things of that nature will all be at the Premium tier.
A
Our first iteration is likely that users will bring their own Twilio account to integrate with GitLab. Since they are responsible for their Twilio account and GitLab's not paying for that, they're free to use it as they wish.
A
But if we are offering a tranche of pages that GitLab is managing, thus far the latest thinking is that that should still be a Premium feature rather than an Ultimate feature. When we talk about Ultimate features in the future, it's more things like AIOps and things of that nature.
A
Yeah, and the status page: it's placed at Ultimate, but it's really an early-stage product that, frankly, doesn't do a whole lot today.
A
But I can see that eventually being an Ultimate feature, because of who we will be competing with in that space; typically people charge for it, and we're targeting the larger enterprises that really need that feature. Okay.
A
Yeah, so let's talk more about pricing in this space. Just taking PagerDuty as an example, PagerDuty charges 20 bucks per head; there are more variations of that, but that's like a typical on-paper price. And it gets expensive fast, because if you're a DevOps shop building microservices, all of a sudden all these people need to be available on call in some fashion.
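The "gets expensive fast" point is simple arithmetic worth making concrete. A back-of-envelope sketch (the ~$20/user/month figure is the on-paper price quoted above; the pool size is a made-up example):

```python
def annual_seat_cost(seat_price_per_month, responders):
    """Yearly cost of a per-seat incident management tool for a
    given number of people who need on-call access."""
    return seat_price_per_month * 12 * responders

# e.g. $20/user/month across a 25-person on-call pool:
# annual_seat_cost(20, 25) -> 6000 (dollars per year)
```

And that pool tends to grow with a microservices organization, since everyone who might ever touch an incident needs a seat.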
B
But I guess, before we wrap up: any famous last words?
B
Questions? Comments? All right, well, I hope you all enjoy the rest of your day; I want to give you the time back. Thanks so much for the presentation, Kevin, I enjoyed it. Thank you.