GitLab Plan Stage, 3 Nov 2022

From YouTube: Plan Error Budget Investigation Basics

A

This is a quick introduction to the Plan error budget dashboards. It's not a deep dive by any means, but hopefully it gives you a couple of tips and tricks you can use to identify where you're spending error budget. The dashboard I have open here is the stage group dashboard for Product Planning.

A

I think it's pretty important to remember that this is a rolling 28-day window, and the time picker won't apply to this.
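
As a rough illustration of what a rolling 28-day window means, here is a minimal sketch. It assumes a hypothetical mapping from each day to a pair of (good operations, total operations); the function name, the data layout and the sample numbers are all made up for illustration and are not how the dashboard itself is implemented.

```python
from datetime import date, timedelta

WINDOW_DAYS = 28  # rolling window length mentioned in the talk


def rolling_availability(daily_counts, end_day, window_days=WINDOW_DAYS):
    """Availability over the `window_days` days ending on `end_day` (inclusive).

    `daily_counts` maps a date to (good_operations, total_operations).
    This layout is a hypothetical one chosen for the sketch.
    """
    start_day = end_day - timedelta(days=window_days - 1)
    good = 0
    total = 0
    for day, (day_good, day_total) in daily_counts.items():
        if start_day <= day <= end_day:
            good += day_good
            total += day_total
    # No traffic in the window means there is nothing to measure.
    return good / total if total else None


# Made-up sample data: 40 days of traffic, 1000 operations per day.
today = date(2022, 11, 3)
example_counts = {today - timedelta(days=i): (1000, 1000) for i in range(40)}
example_counts[today - timedelta(days=2)] = (990, 1000)   # bad day inside the window
example_counts[today - timedelta(days=30)] = (500, 1000)  # bad day outside the window

print(rolling_availability(example_counts, today))
```

The bad day 30 days ago no longer counts against the figure, which is why a fixed window keeps moving regardless of what a separate time picker is set to.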

A

But, as you can see, the Product Planning error budget is in the red, and if I wanted to investigate this, probably the first thing I would do is go to this dashboard, which is the group error budget detail dashboard.

A

If you have a one-to-one with me, this is the dashboard that's linked from our one-to-one template. On this dashboard the time selector actually does work, but I'm going to leave it at 28 days for a minute. You can see it's the same number. There are a few things going on here that would be of interest to me. First of all, something that was causing errors prior to about the 10th appears to have been fixed.

A

There are also some point drops in Apdex, which are interesting, and it also seems like Apdex was okay-ish at times and then had fairly significant drops.

A

So first of all, I would probably want to know what the impact of this is on the overall error budget. I might just start by framing that out to see if there's any change. It's 99.88 currently, and it's still 99.88, so it's most likely that Apdex is the significant contributor here. The next thing might be to shorten the time window so we can see these two large drops, so I could just do this and take a closer look at this time period.
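
To make the reasoning concrete, here is a small sketch of how a combined availability figure can stay at 99.88 whether or not a small number of errors is counted. It assumes the figure combines an Apdex component and an error component as good operations over total operations; that combination, the function name and every count below are assumptions for illustration, not values read from the dashboard.

```python
def combined_availability(apdex_good, apdex_total, errors, request_total):
    """Combine an Apdex component and an error component into one ratio.

    Assumed formula for this sketch:
    (fast-enough requests + error-free requests) / (both totals).
    """
    good = apdex_good + (request_total - errors)
    total = apdex_total + request_total
    return good / total


# Hypothetical counts: many slow requests, very few errors.
with_errors = combined_availability(
    apdex_good=997_650, apdex_total=1_000_000, errors=50, request_total=1_000_000
)
without_errors = combined_availability(
    apdex_good=997_650, apdex_total=1_000_000, errors=0, request_total=1_000_000
)

print(f"{with_errors:.2%}")
print(f"{without_errors:.2%}")
```

With these made-up numbers both figures display as 99.88%, so removing the errors does not move the headline number and the slow requests account for almost all of the spend.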

A

So again, there are a few things going on here. We can see that Apdex appears to be fine at the beginning, and then there's some overall degradation. But it's important to remember that this is the weekend, so I would disregard this as well.

A

I would focus on these point drops instead, and maybe go in and really zone in on this one, just to see. You can now expand service level indicators to try to identify what's actually causing this Apdex drop. We're looking at a period that's around three o'clock on the third.

A

That's today. It seems to go on for about one hour, and then you can look at all the services that apply to your group. While we see some general ongoing degradation in GraphQL queries, which is definitely worth looking into, it would appear that the really contemporaneous degradation is in the rails requests Apdex, so I would probably find this the most interesting.

A

You can solo this chart, so you can see that the web Apdex is affected. The API Apdex seems to be affected as well. So the first question that might come to mind for me would be: is this something global? Is this something that other teams are experiencing, or perhaps the whole application? In order to figure that out, there are some links up here. I'm going to select web detail. I suppose you could look at API detail as well.

A

I'll look at the web overview, and what this will do is give me an overview of the whole GitLab application. You can see here there's immediately a contemporaneous drop in web service Apdex. It's at three o'clock on the third.
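
The "is this global?" check can be sketched as comparing when two series dip below a threshold. The hourly values, the threshold and the names below are all hypothetical; in practice you would read this off the charts rather than compute it.

```python
ASSUMED_THRESHOLD = 0.995  # hypothetical threshold, not taken from the dashboard


def hours_below(series, threshold=ASSUMED_THRESHOLD):
    """Return the set of hours in which the value dipped below the threshold."""
    return {hour for hour, value in series.items() if value < threshold}


# Made-up hourly Apdex values for one group and for the whole application.
group_apdex = {"13:00": 0.999, "14:00": 0.998, "15:00": 0.950, "16:00": 0.999}
whole_app_apdex = {"13:00": 0.999, "14:00": 0.999, "15:00": 0.960, "16:00": 0.999}

shared_dips = hours_below(group_apdex) & hours_below(whole_app_apdex)
group_only_dips = hours_below(group_apdex) - hours_below(whole_app_apdex)

print(sorted(shared_dips))
print(sorted(group_only_dips))
```

A dip that shows up in both series points toward something wider than one group, while a dip that appears only in the group's series is a stronger hint that the cause is local.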

A

So it's not that you're absolved from any further investigation, but it does give you some context about the impact of the problem. It could still be a change made by your group, or by a small number of groups, that drives the whole system availability down below SLO. But at this point I'd probably look in the incident management channel and try to figure out what was going on.

A

Is there something coinciding with this? Perhaps pressure on a downstream service, a database issue, Redis?

A

If something like that coincides with this, it might give you more information before you go switching off changes that have been made, or falsely attributing the problem to some change the team has made. If you feel that it is likely to be attributable to your group, or you would just like more information, there are further charts below that will be useful, like SLI detail. In this case, I think we identified that it was the rails requests service.

A

So there are further charts here that you might look into that might give you more information. This one's pretty interesting, mostly because it shows you the various endpoints that are spending your error budget. You can see the drops here. If you hover over, you can also see which endpoints are the least performant relative to their target, and you can of course solo things here to see them in isolation.
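
"Least performant relative to their target" can be read as ranking endpoints by how far their measured Apdex falls short of their own target. Here is a minimal sketch of that ranking; the endpoint names, the measured values and the targets are all hypothetical.

```python
def rank_by_shortfall(endpoints):
    """Sort endpoint names so the one furthest below its own target comes first.

    `endpoints` maps a name to (measured_apdex, target_apdex).
    """
    return sorted(
        endpoints,
        key=lambda name: endpoints[name][0] - endpoints[name][1],
    )


# Made-up endpoints with made-up measurements and targets.
example_endpoints = {
    "ExampleBoardsController#index": (0.970, 0.995),   # 0.025 below target
    "ExampleEpicsController#show": (0.990, 0.995),     # 0.005 below target
    "ExampleRoadmapController#show": (0.940, 0.950),   # 0.010 below target
}

for name in rank_by_shortfall(example_endpoints):
    print(name)
```

Ranking by the gap rather than by the raw value matters because an endpoint with a lower raw Apdex may still be closer to its own, more lenient target.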

A

So you may do that, but I think at this point, if I really thought that it was something to do with my team, I would go into Kibana. I'm not going to go into that now, because I think it's a lot more detailed. But if you expand the budget spend attribution section, you can see where the budget is going within this time period, and you can shorten or lengthen the time period to the one you're interested in.

A

You can see which services are spending the most error budget, and in this case web Apdex is the top one. There are links here to Kibana if you'd like to investigate further.

A

This is the one for slow Apdex requests, so that's the one I would go into, but as I said, I'm not going to go into it in this video. Hopefully this was helpful and gave some initial tips on how to use Grafana and the error budget dashboards to diagnose a problem.