Description
This talk covers GitLab's adoption of SLO monitoring: from our previous causal alerting strategy, which had outgrown its purpose as complexity and traffic volumes grew; through our early attempts at building and maintaining the configuration by hand, and the problems that brought about; to our current, declarative approach. The talk also covers the challenges of getting buy-in from engineering, operations and product stakeholders, the benefits of having a common language of availability across the organisation, and our future plans. This is a deep-dive, practical talk; all the code and configuration for GitLab.com's monitoring infrastructure is open-source, and the talk will include links to these resources.
My name is Andrew Newdigate. I'm an engineer at GitLab, where I work on the infrastructure team, helping to build GitLab.com. In this talk, I'm going to give you a very brief walkthrough of how we implemented SLOs for GitLab.com.
We don't have a lot of time, so I'm going to be moving pretty quickly. As a starting point, it's worth describing where we started this journey in 2018. It definitely felt like we had outgrown our monitoring strategy.
Our monitoring was ad hoc and our alerting was piecemeal. We had a cycle: when we had an incident, we would create a bunch of new alerts in an attempt to detect the cause of that incident, but these alerts would very often be flappy while simultaneously not alerting us to the next incident. So when that happened, we'd create some new alerts for that incident, and the cycle would continue.
Another problem was that we maintained our dashboards manually, so often during an incident we would discover that a really important dashboard was broken and we had to spend time trying to fix it instead of focusing on the incident itself. Around this time, I read "My Philosophy on Alerting" by Rob Ewaschuk, and so we began our journey to symptom-based alerting and eventually to SLO monitoring.
Our first step in that journey was to move from cause-based alerts to symptom-based alerts. But how would we alert on these signals? For the first iteration, we decided to use the simplest approach: we would fire an alert if, over a five-minute period, we generated errors at a greater rate than our SLO allowed. So if our SLO was 99%, and we had more than one percent errors in a five-minute period, we'd trigger an alert.
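To make that concrete, a single-window alert of this kind can be written directly as a Prometheus alerting rule. This is a minimal sketch rather than our actual configuration; the metric name, labels and selector below are hypothetical:

    groups:
      - name: slo-alerts-single-window
        rules:
          - alert: WebServiceErrorSLOViolation
            # Fire when the error ratio over the last 5 minutes exceeds the
            # 1% error budget implied by a 99% SLO.
            expr: |
              sum(rate(http_requests_total{job="web", status=~"5.."}[5m]))
                /
              sum(rate(http_requests_total{job="web"}[5m]))
              > 0.01
            for: 1m
            labels:
              severity: warning
            annotations:
              summary: "Web service error rate has exceeded its SLO over the last 5 minutes"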
The reason we selected this approach was twofold. First, the approach was simple to implement: it uses a single time window, so we could continue to manually define all our recording rules and alerts. But, what's more, at the time we didn't know any better; it was the only approach that we had considered. The problem was that the alerts were no less flappy than before, and we experienced a lot of false positives.
Luckily, around this time Google published the SRE Workbook, and we learned about multi-window, multi-burn-rate alerting, as described in the chapter "Alerting on SLOs". We wanted to move over to this approach, but it was unfeasible to do so while maintaining the recording rules manually. The solution we arrived at was to develop a simple DSL for describing our SLOs and our SLO monitoring.
We were already modeling our application as a series of services. The next step was to break each service down into a series of components. Each component has three signals: a latency SLI, for which we use an Apdex score; an error SLI; and a measurement of the number of requests per second that the component is processing.
We then go on to define a number of components. In this example, I'll walk through the shared runners component, which is our hosted runner fleet used by GitLab.com customers. For each component, we include some ownership information and a description; this information will be included in alerts and dashboards.
We then define the Apdex SLI. We have several ways of defining an Apdex, but the most common is based on Prometheus histograms, as shown here. We use an abstract description, giving the histogram metric name, the selectors we will need to filter the metric, and the latency threshold within which the SLI should complete. For the request rate, we most commonly use a rate over a counter metric, as shown here; again, we define the metric name and the selectors.
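To give a feel for the shape of these definitions, here is a simplified, hypothetical sketch of what a component entry in such a DSL might look like, rendered as YAML. The field names, metric names and thresholds are illustrative only, not GitLab's exact schema; the real definitions are in our open-source monitoring configuration:

    # Hypothetical, simplified component definition in a YAML-style DSL.
    components:
      shared_runners:
        team: runner
        description: >
          Hosted shared runner fleet used by GitLab.com customers to run CI jobs.
        apdex:
          # Latency SLI derived from a Prometheus histogram: the fraction of
          # operations completing within the satisfactory threshold.
          histogram: job_queue_duration_seconds_bucket
          selector: 'shared_runner="true"'
          satisfiedThreshold: 60  # seconds
        errorRate:
          # Error SLI: failed operations as a ratio of all operations.
          counter: ci_runner_job_failures_total
          selector: 'shared_runner="true"'
        requestRate:
          # Request-rate signal: operations per second handled by the component.
          counter: ci_runner_jobs_total
          selector: 'shared_runner="true"'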
From this definition, we generate our Prometheus recording rule configuration. We generate one recording rule per SLI per time window, and the recording rules also include appropriate aggregations. Here you can see that our histogram Apdex has generated recording rules for each time window, including 5 minutes, 30 minutes, 1 hour and so on.
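The generated output looks roughly like the following: one recording rule per SLI per burn-rate window, recorded at the component level. Again, this is a hypothetical sketch that reuses the illustrative metric names from the DSL example above, not our exact rules:

    groups:
      - name: component-apdex-recording-rules
        rules:
          # 5-minute Apdex ratio for the shared_runners component.
          - record: gitlab_component_apdex:ratio_5m
            labels:
              component: shared_runners
            expr: |
              sum(rate(job_queue_duration_seconds_bucket{shared_runner="true", le="60"}[5m]))
                /
              sum(rate(job_queue_duration_seconds_bucket{shared_runner="true", le="+Inf"}[5m]))
          # The same SLI recorded over a longer burn-rate window.
          - record: gitlab_component_apdex:ratio_30m
            labels:
              component: shared_runners
            expr: |
              sum(rate(job_queue_duration_seconds_bucket{shared_runner="true", le="60"}[30m]))
                /
              sum(rate(job_queue_duration_seconds_bucket{shared_runner="true", le="+Inf"}[30m]))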
We then also generate our SLO alerts from the DSL. Each SLI has several alerts, including, of course, SLO violations. The alerts consume the recording rules that we generated in the previous step and use the multi-window, multi-burn-rate alert expressions described in the SRE Workbook, using, of course, the SLO thresholds that we defined in the DSL.
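As a sketch of what one of these generated alerts might look like, here is the fast-burn case from the SRE Workbook pattern, written against the hypothetical recorded series above and an illustrative 99.95% Apdex SLO:

    groups:
      - name: component-slo-alerts
        rules:
          - alert: SharedRunnersApdexSLOViolation
            # Multi-window, multi-burn-rate: page only when the Apdex SLI is
            # burning error budget at more than 14.4x the sustainable rate over
            # both a long (1h) and a short (5m) window.
            expr: |
              (
                gitlab_component_apdex:ratio_1h{component="shared_runners"} < (1 - 14.4 * (1 - 0.9995))
              )
              and
              (
                gitlab_component_apdex:ratio_5m{component="shared_runners"} < (1 - 14.4 * (1 - 0.9995))
              )
            for: 2m
            labels:
              severity: critical
            annotations:
              summary: "shared_runners Apdex is violating its SLO (fast burn)"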
We also generate our dashboards from the same definitions. We use a five-minute time window for the main series and include two percent and five percent SLO thresholds on the dashboard. Another useful feature is the color-coded status panel, which will quickly draw your eye to any SLO violations on a busy dashboard. These panels are designed to display aggregated metrics, but sometimes you need to dive further into the detail; for that, we also include extra detail in collapsed rows, one for each component.
One of the biggest benefits has been a common language of availability across the organisation. This started off with better communication between infrastructure teams and product engineering teams, but has since expanded to include our product organization, and because of this we're now able to start introducing error budgeting with full buy-in from our product teams. Another advantage is that our dashboards are really consistent and reliable.
This consistency allows operators to navigate quickly during an incident to isolate the cause of a problem. The third advantage of using a DSL is that we have significantly reduced the barrier for engineering teams to maintain their own SLIs. And, of course, another advantage of SLO monitoring is that our alerts are precise and consistently specified across the application.