Cloud Native Computing Foundation KubeCon + CloudNativeCon Europe 2021, 14 May 2021

From YouTube: How We are Dealing with Metrics at Scale on GitLab.com - Andrew Newdigate, GitLab

Description

As GitLab.com has grown, the number of metrics generated by the application has grown exponentially. Ensuring our team has good quality dashboards and alerting rules was becoming an ever more challenging task. There’s no worse time than experiencing an outage that you expected to have been warned of, only to find out that the alert had been inoperable for months. As an engineer on the infrastructure team supporting GitLab.com, sometimes it felt, during an incident, that we were drowning in data while at the same time struggling to access the most pertinent indicators of the underlying issue. This talk discusses how we are addressing this problem by building up a catalog of key metrics for each component within our application, and then using this definition to automatically generate beautiful Grafana dashboards, rock-solid alerting rules and high-quality SLA indicators. This talk is primarily aimed at Prometheus users, but the fundamentals could be applied to any other metrics system.

A

Welcome. This session is about how we're dealing with metrics at scale on GitLab.com.

A

My name is Andrew Newdigate and I'm an engineer at GitLab, where I work in the infrastructure team and help build GitLab.com.

A

This talk is about how we've scaled our monitoring to support a site that has, over the past few years, grown rapidly in size and complexity. To illustrate that growth, here are some figures to show how things have changed since I joined.

A

Back in 2017, we'd only recently adopted Prometheus and were migrating off InfluxDB. We had a single Prometheus server, six infrastructure engineers, and a handful of alerting rules and recording rules. We only had 21 dashboards and we were processing about 100,000 samples per second. Roll forward four years, and we now run a Thanos federated cluster deployed into Kubernetes using Tanka. The infrastructure department is around 40 people, so six times bigger. We have over 2,600 recording rules and 400 Grafana dashboards, and we're ingesting about 2.8 million samples per second.

A

So it's important to state this. What worked for us then worked fine for us at that scale. It was the right solution at the time, but that approach wouldn't work for us now, and this talk is about some of the tools and techniques that we've used to go from that scale to where we are today.

A

So what prompted our efforts to improve our alerting? We were seeing numerous problems that indicated that our approach to monitoring was no longer working for us. One of these problems was low-precision alerting. By this we mean that the proportion of alerts that were actionable was low, and we were seeing a high number of false positives.

A

At any time, many of the alerting rules inadvertently generated low quality, unactionable flappy alerts. Very often, the engineer on call would determine that users were not being impacted, that everything seemed okay and they would acknowledge the alert and effectively ignore it.

A

Not only was the precision of our alerts very poor, but so was the recall. Recall refers to the proportion of user-impacting events that are detected by the alerting system. This means that, instead of finding out about incidents through an alert, we would sometimes be made aware of the incident by people rather than the software that we'd built to detect these incidents.
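
As a rough illustration of these two definitions, here is a minimal Python sketch. The function names and the counts are invented for the example and are not GitLab figures.

```python
def precision(actionable_alerts: int, false_alerts: int) -> float:
    """Proportion of fired alerts that pointed at a real, user-impacting event."""
    fired = actionable_alerts + false_alerts
    return actionable_alerts / fired if fired else 0.0


def recall(detected_incidents: int, missed_incidents: int) -> float:
    """Proportion of user-impacting events that the alerting system detected."""
    total = detected_incidents + missed_incidents
    return detected_incidents / total if total else 0.0


# Invented numbers: 200 alerts fired, of which 20 were actionable;
# 10 further incidents were reported by people rather than by an alert.
alert_precision = precision(actionable_alerts=20, false_alerts=180)
alert_recall = recall(detected_incidents=20, missed_incidents=10)
print(f"precision={alert_precision:.2f} recall={alert_recall:.2f}")
```

With these made-up counts, only one alert in ten is actionable and a third of the incidents are never alerted on, which is the combination described above.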

A

In other cases, the alert would fire, but too late. We already knew that there was a problem, and now it was just extra noise while we were trying to solve the issue. Yet another problem we found was that the dashboards were very often broken and not working as we expected. Since our dashboards were not managed alongside our other metrics, there was no way of validating that they were still working until we took a look at them, and this often happened during an incident.

A

So now, instead of having one problem, we had two, in that we had to fix the dashboard before we could fix the problem.

A

One of the things that we began to realize was that having three distinct configurations for our metrics stack was part of the problem. The source of metrics was independent from alerting and recording rules. Our alerting and recording rules were managed independently from our dashboards. And our dashboards weren't stored in Git, so they didn't have any form of change control and they weren't validated.

A

With this in mind, we set out to improve our stack with these goals. One: to develop a common monitoring strategy across all of our services, based on a set of key metrics and service level indicators.

A

Two: use those metrics to improve the precision, recall and detection time of our alerts. And three: unify our metrics, SLO alerting configuration, recording rules, dashboards, everything, into a single source to avoid inconsistencies between the definitions.

A

Let's look at how we tackle the first goal of building a set of key metrics for our application.

A

We based our metrics on Google's four golden signals, but made some changes to better fit our requirements. For latency, we measure Apdex as a ratio rather than a percentile duration measurement in seconds. Requests and errors are measured at a per-second rate, and saturation is measured as a percentage, lower being better.

A

Saturation is a pretty big topic on its own, so I'm not going to go into detail on it today. If you're interested in finding out more, here's a plug for a talk I did on saturation monitoring on GitLab.com. I've included a link to the slides.

A

With our key metrics decided on, the next step was to break the application down, first into a set of services, and then break each service down into a set of components. So, for example, we modeled web, Git and API services and then broke these down further into one or more components. For the Git service, for example, we have SSH and HTTPS components.

A

Each component has three key metrics: Apdex, errors and requests. For some components, it's not always possible to measure latency directly, so Apdex is optional in those cases.

A

From our three key metrics we're able to derive two service level indicators, or SLIs. An SLI is normally expressed as a percentage of requests that are bad. Apdex is really the inverse of that: it's the percentage of requests that have a satisfactory latency, or are good. Because our organization was already using the concept of Apdex, we decided it would be better to adapt our monitoring system to the organization rather than the other way around. Therefore, our Apdex SLI is an inverted SLI, with 100% being the best service level.

A

For errors, we use a conventional SLI definition, with zero percent being no errors and the best service level.
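
To make the two SLI directions concrete, here is a minimal Python sketch. The `KeyMetrics` class, its field names and the sample rates are assumptions for illustration. It also simplifies Apdex to "satisfactory requests over all requests", as described in the talk, without the half-weighted "tolerating" band of the formal Apdex score.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class KeyMetrics:
    """Per-component key metrics; names are hypothetical."""
    requests_per_second: float
    errors_per_second: float
    # Requests per second that met the latency threshold.
    # None when latency cannot be measured for the component.
    satisfactory_per_second: Optional[float] = None


def error_ratio_sli(m: KeyMetrics) -> float:
    """Conventional SLI: fraction of requests that failed (0.0 is best)."""
    if not m.requests_per_second:
        return 0.0
    return m.errors_per_second / m.requests_per_second


def apdex_sli(m: KeyMetrics) -> Optional[float]:
    """Inverted SLI: fraction of requests with satisfactory latency (1.0 is best)."""
    if m.satisfactory_per_second is None or not m.requests_per_second:
        return None
    return m.satisfactory_per_second / m.requests_per_second


with_latency = KeyMetrics(1000.0, 2.0, satisfactory_per_second=995.0)
without_latency = KeyMetrics(50.0, 0.0)

print(error_ratio_sli(with_latency), apdex_sli(with_latency))
print(error_ratio_sli(without_latency), apdex_sli(without_latency))
```

Returning `None` for the Apdex of a component without latency data mirrors the point that Apdex is optional, while the error ratio is always available.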

A

Once we had our approach to monitoring our key metrics in place, it was time to start thinking about our second goal: to improve the quality of our alerting.

A

As I mentioned, for each component we derive two SLIs: Apdex and errors. For each of these, we set a service level objective and trigger an alert if an SLI is violating its SLO target. If our Apdex SLI is below the SLO threshold, or our error ratio is above the SLO, we trigger an alert. For some services, we also trigger anomaly alerts for high request rate anomalies.

A

The original approach we took was to alert on any violation over a five-minute period. For example, say you have a thousand requests in a five-minute period and two of those requests result in an error. Two in a thousand gives you a 0.2% error rate, and if you have a 99.9% SLO, this 0.2% exceeds the 0.1% threshold, causing the alert to fire.

A

Unfortunately, this is a very naive approach and it has very poor precision, in that it generates a huge number of false positives. Taken to the extreme, the alert could fire hundreds of times a day, yet the SLI could still achieve its SLO.
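
The arithmetic behind that claim can be sketched in a few lines of Python. The traffic pattern below is invented purely to show the effect: many five-minute windows breach the 0.1% threshold, yet the error ratio over the whole day stays within the SLO.

```python
SLO = 0.999
ERROR_BUDGET = 1 - SLO  # allowed error ratio, roughly 0.1%

# One day is 288 five-minute windows of (requests, errors). Invented traffic:
# 100 windows see 2 errors in 1,000 requests, the other 188 see none.
windows = [(1000, 2)] * 100 + [(1000, 0)] * 188

# Naive rule: fire whenever a single five-minute window exceeds the budget.
alerts_fired = sum(
    1 for requests, errors in windows if errors / requests > ERROR_BUDGET
)

total_requests = sum(requests for requests, _ in windows)
total_errors = sum(errors for _, errors in windows)
daily_error_ratio = total_errors / total_requests

print(f"alerts fired: {alerts_fired}")
print(f"daily error ratio: {daily_error_ratio:.4%}")
print(f"SLO met over the day: {daily_error_ratio <= ERROR_BUDGET}")
```

In this made-up day the naive rule fires 100 times, while 200 errors in 288,000 requests is still under the 0.1% budget.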

A

In fact, our new alerting was no better than the old alerts that we were trying to improve on.

A

We went back to the drawing board and looked for better alternatives. We settled on using multi-window, multi-burn-rate alerts instead. I'm not going to go into the details in this talk, but if you're interested in knowing more, Google have published an excellent guide in their SRE Workbook. I have included a link on the slide.

A

This approach has provided us with high precision, low detection time and good recall on our alerts. The problem with this approach is the amount of complexity it brings. For each component that we monitor, we need 12 recording rules to be correctly configured. With dozens of components, you really need a configuration tool to help with this, as doing it manually would be very painful.
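
For readers who have not seen the technique, here is a minimal Python sketch of the multi-window, multi-burn-rate check. The window pairs and thresholds (14.4 over 1h/5m, 6 over 6h/30m) are the paging values suggested in the Google SRE Workbook for a 30-day SLO; whether GitLab uses exactly these values is an assumption, and the error ratios are invented.

```python
# (long window, short window, burn-rate threshold), following the paging
# conditions suggested in the SRE Workbook. Not necessarily GitLab's values.
PAGE_CONDITIONS = [("1h", "5m", 14.4), ("6h", "30m", 6.0)]


def burn_rate(error_ratio: float, slo: float) -> float:
    """How many times faster than allowed the error budget is being spent."""
    return error_ratio / (1.0 - slo)


def should_page(error_ratio_by_window: dict, slo: float) -> bool:
    """Page only if both the long and the short window exceed the threshold."""
    for long_window, short_window, threshold in PAGE_CONDITIONS:
        long_burn = burn_rate(error_ratio_by_window[long_window], slo)
        short_burn = burn_rate(error_ratio_by_window[short_window], slo)
        if long_burn > threshold and short_burn > threshold:
            return True
    return False


# A short blip: high in the last 5 minutes, negligible over longer windows.
brief_spike = {"5m": 0.02, "30m": 0.004, "1h": 0.002, "6h": 0.0005}
# A sustained problem: high across the 5-minute and 1-hour windows.
sustained = {"5m": 0.02, "30m": 0.02, "1h": 0.02, "6h": 0.004}

print(should_page(brief_spike, slo=0.999))
print(should_page(sustained, slo=0.999))
```

The long window keeps short blips from paging, and the short window makes the alert stop soon after the problem is resolved. Each window needs its own recorded rate per SLI, which is where the rule count per component grows.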

A

So before we could roll out our SLO alerting with multi-window, multi-burn-rate alerts, we needed to investigate better tooling to deal with all the repetitive configuration that was required. For each service, we may have several dozen similar, but slightly different, recording rules. In our dashboards, we might have other queries that are also similar but use different aggregations. Changing these queries was an error-prone process.

A

So we started thinking about what tools we could use to make this process easier.

A

The idea we had was to describe all of our metrics in what we call the metrics catalog. This is an abstract configuration. It's written in Jsonnet and is designed to be user-friendly, validatable and with as little repetition as possible. The configuration is stored in Git, and change is managed through merge requests. On commit, we use CI to validate the config and generate new Prometheus recording rules, alert configurations and Grafana dashboards, amongst other resources.

A

This is what a typical entry in the catalog looks like. This definition is from the web service and shows one component from that service, called Workhorse. We define our SLOs, as well as our Apdex, request rates and error rates that will be used to generate the SLIs.

A

These definitions are then used to generate Prometheus expressions in our Prometheus recording rule configuration, as well as dashboards and everything else.

A

As you can see, this generates lots of very similar, but slightly different, Prometheus configuration, depending on the burn rates that you're evaluating.
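
The slides are not part of this transcript, so the following Python sketch only illustrates the general idea of expanding one catalog entry into many near-identical recording rules. The real catalog is not written in Python, and every field name, metric name, selector and recording rule name here is hypothetical.

```python
import json

# Hypothetical catalog entry; field names and metric names are illustrative only.
CATALOG_ENTRY = {
    "service": "web",
    "component": "workhorse",
    "request_metric": "http_requests_total",
    "selector": 'job="workhorse"',
    "error_selector": 'code=~"5.."',
}

# One rate per burn-rate window (window list is an assumption).
BURN_RATE_WINDOWS = ["5m", "30m", "1h", "6h"]


def rate_expr(metric: str, selector: str, window: str) -> str:
    """Build a PromQL expression summing the per-second rate over a window."""
    return f"sum(rate({metric}{{{selector}}}[{window}]))"


def recording_rules(entry: dict, windows: list) -> list:
    """Expand one catalog entry into request-rate and error-rate rules per window."""
    labels = {"type": entry["service"], "component": entry["component"]}
    error_selector = entry["selector"] + "," + entry["error_selector"]
    rules = []
    for window in windows:
        rules.append({
            "record": f"component_ops:rate_{window}",
            "labels": labels,
            "expr": rate_expr(entry["request_metric"], entry["selector"], window),
        })
        rules.append({
            "record": f"component_errors:rate_{window}",
            "labels": labels,
            "expr": rate_expr(entry["request_metric"], error_selector, window),
        })
    return rules


rules = recording_rules(CATALOG_ENTRY, BURN_RATE_WINDOWS)
print(json.dumps(rules[:2], indent=2))
```

A single short entry expands into eight rules here. Changing a selector in one place updates all of them consistently, which is the repetition problem the catalog is meant to solve.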

A

The last part of our goal was to generate our dashboards too. Now, as it happens, the Grafana team have built a Jsonnet library called Grafonnet. We could use it to automatically generate our Grafana dashboards from the metrics catalog. This is a typical example of one of our generated dashboards.

A

This is from our web service dashboard, and what I really like about these dashboards is the consistency. We have dozens of different services, and for each service, the dashboard layout, the color scheme and the data presented are consistent.

A

The top row of each dashboard provides an aggregation of all the SLIs within that service, and this is followed by a row for each component of the service, with charts of key metrics and collapsed rows containing even more detail.

A

Once we had our SLO monitoring in place, the next challenge we faced was making SLO alerts easier for operators and engineers to understand and, in particular, reducing the time to diagnosis on our SLO alerts.

A

One of the big differences between the old way that we alerted, with causal alerts, and our SLO alerts is that when an SLO alert fires, it's not always immediately apparent what the problem is.

A

It's up to the operator to understand the SLI, then investigate the problem by digging through metrics, logs and other signals until the cause becomes apparent.

A

So our goal here is to give the operator the tools to navigate from an SLO violation signal back up the stack to the cause of the problem.

A

Here's an example to illustrate why our existing tooling was insufficient for our needs. Alertmanager has a feature that provides a link to the Prometheus UI, pre-populated with the query that caused the alert to fire. It's called the generator URL. Before we moved over to SLO violation alerts, we relied pretty heavily on this feature.

A

Each alert would include a link to the expression that caused it in the Prometheus UI. We would manipulate the expression by adding labels, changing selectors or changing aggregations until we could spot the problem. What we found with SLO alerts is that this approach doesn't work very well. The problem is that the recording rules that we use in the expression are highly aggregated, and it's likely that the labels which may have been useful in an investigation have been removed.

A

Unfortunately, there's no quick way to navigate back to the source expression from a recording rule. Arriving at a chart of an SLO burn rate expression like this often led to more confusion for engineers instead of clarity. We needed to create a better initial experience for the operator following an alert.

A

The way we addressed this was to take advantage of the metadata present in the metrics catalog. Since we're generating the SLO alerts and the dashboards from the same source, we can include deep links to the appropriate Grafana panel, and these can be embedded directly in the generated alert definition. By navigating to the dashboard from the alert, the operator is immediately provided with context around the alert signal, the thresholds, the status of other SLIs in the same service, and links for onward investigation.

A

In our generated dashboards, for each component we include a set of links to other observability tools that we use to assist in deeper investigation into problems for that component. Like everything else, these links are described in the metrics catalog. They include links into Stackdriver, Sentry, and Kibana searches and visualizations, amongst other things. They're presented directly alongside each component in the Grafana dashboard. So when arriving at a dashboard from an alert, we get an easy experience for the on-call engineer to continue their investigation through our other observability tools.
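
Because alerts and dashboards come from the same source, a generator can attach the panel link at the moment it emits the alert. The Python sketch below shows the idea only. The host, dashboard UID, slug, panel ID, alert name and annotation key are all hypothetical, and the `viewPanel` and `var-` query parameters are assumed to match the Grafana version in use.

```python
from urllib.parse import urlencode


def panel_link(base_url: str, uid: str, slug: str, panel_id: int, variables: dict) -> str:
    """Build a deep link to one panel of a generated dashboard."""
    query = {"viewPanel": panel_id}
    for name, value in variables.items():
        query[f"var-{name}"] = value  # dashboard template variables
    return f"{base_url}/d/{uid}/{slug}?{urlencode(query)}"


def slo_alert(service: str, component: str, sli: str, panel_id: int) -> dict:
    """Emit an alert definition that carries a link to its own dashboard panel."""
    return {
        "alert": f"{service}_{component}_{sli}_slo_violation",
        "labels": {"type": service, "component": component},
        "annotations": {
            "grafana_panel": panel_link(
                "https://grafana.example.com",
                uid=f"{service}-main",
                slug=f"{service}-overview",
                panel_id=panel_id,
                variables={"environment": "prod"},
            ),
        },
    }


alert = slo_alert("web", "workhorse", "error", panel_id=12)
print(alert["annotations"]["grafana_panel"])
```

Since the panel ID and the alert are derived from the same definition, the link cannot drift out of date the way a hand-maintained URL could.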

A

The final part of this talk describes the challenges we've experienced scaling SLO monitoring from a single Prometheus server up to the Thanos federated cluster that we use today. Let's start off by describing the simplest approach to SLO monitoring, and that is using a single Prometheus instance for monitoring the entire application.

A

With this approach, all work to collect metrics, aggregate them into SLIs and evaluate those SLIs against service level objectives is done in a single Prometheus server. This approach is very straightforward and easy to operate, but it's limited by how far we are able to vertically scale a single Prometheus instance.

A

Once we hit that scaling limit, the next logical step is to break the monitoring down into multiple siloed Prometheus instances. With this approach, the data for each SLI is fully processed within a single Prometheus instance, so we can continue to collect, aggregate and evaluate our SLIs in a similar manner to before, except that each Prometheus only contains a distinct subset of our SLIs.

A

In Grafana, we use multiple data sources to visualize data across different sources. The advantage of this approach is that it remains fairly simple to both deploy and understand, while allowing us to scale Prometheus horizontally.

A

One limitation of this approach, though, is that all the metrics required to evaluate an SLI must be contained within a single Prometheus instance, and unfortunately, this requirement became problematic for us.

A

This happened when our Kubernetes migration project kicked off. As workloads migrated to Kubernetes, some SLIs were split between Prometheus instances used for VMs and new Prometheus instances contained within our Kubernetes cluster. This was made worse by the fact that we decided to employ three zonal Kubernetes clusters, each with their own Prometheus instance.

A

So, instead of metrics being collected in a single Prometheus instance, some of our SLIs were now being split between up to four different instances. The problem with this is that there may be local SLO violations, but when aggregated across the entire application, the service level objective is not being violated.

A

This led to a series of low-precision alerts, which we nicknamed split-brain alerts, because they were only applicable to a single Prometheus instance, not the entire cluster.
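
A small worked example shows how a split-brain alert arises. The instance names and counts in this Python sketch are invented: one zone handles very little traffic for the SLI, so a handful of errors breaches the SLO locally even though the application as a whole is healthy.

```python
SLO_ERROR_BUDGET = 0.001  # allowed error ratio for a 99.9% SLO

# (errors, requests) for one SLI as seen by each Prometheus instance.
# Instance names and numbers are invented for illustration.
per_instance = {
    "vms": (5, 50_000),
    "zone-b": (2, 20_000),
    "zone-c": (1, 20_000),
    "zone-d": (3, 100),  # little traffic, so a few errors look dramatic locally
}

# Evaluating the SLO separately in each instance.
local_violations = [
    name
    for name, (errors, requests) in per_instance.items()
    if errors / requests > SLO_ERROR_BUDGET
]

# Evaluating the SLO once, over the sum of all instances.
global_errors = sum(errors for errors, _ in per_instance.values())
global_requests = sum(requests for _, requests in per_instance.values())
global_ratio = global_errors / global_requests

print(f"local violations: {local_violations}")
print(f"global error ratio: {global_ratio:.4%}")
print(f"global violation: {global_ratio > SLO_ERROR_BUDGET}")
```

Summing the numerators and denominators before dividing is what matters here. Averaging the per-instance ratios would still overweight the low-traffic zone.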

A

A second problem with having SLIs split between multiple Prometheus instances is that it becomes difficult to get a global view of an SLI, since we need to combine data from multiple sources in our visualizations.

A

The solution we used to address this problem was to deploy Thanos. Thanos is a CNCF incubating project. I'm sure many of you know of it. Thanos provides a single view across multiple Prometheus instances.

A

It also has a component called Thanos Rule, which can be used for evaluating recording rules against the single view. This provides us with a mechanism to aggregate across multiple Prometheus instances.

A

Thanos Rule will also evaluate alerts using the same approach as Prometheus, except that it evaluates using the single global view once again. To use Thanos Rule, we broke our SLO recording rules into two parts. Most of the metrics processing remains in Prometheus. Here we convert potentially higher-cardinality application metrics into low-cardinality key metric constituents.

A

Then, in Thanos Rule, we sum the key metric constituents across all instances before calculating global Apdex and error rate SLIs. These are evaluated against SLOs in Thanos Rule to provide alerting on globally aggregated values, doing away with the problem of the split-brain alerts.

A

This example configuration shows how we aggregate multiple Prometheus metrics in Thanos Rule using recording rules. For each of our key metrics, we aggregate all our Prometheus metrics, whilst being careful to exclude any previously evaluated Thanos metrics. The first recording rule aggregates the error rate SLI. The second shows operation rate, and the third recording rule uses the two previous values to create a global error ratio SLI. Note that we use monitor="global" as a Thanos selector to control whether to include or exclude globally aggregated metrics in these expressions.

A

Another important point is that the partial response strategy is set to warn instead of the default, which is abort. The reason for this is that, when partial response is set to warn, if a single Prometheus store is unavailable, the aggregation won't fail. Instead, the metrics from that Prometheus instance will temporarily not be included in the aggregation, but this is a better trade-off than losing all the metrics.

A

We work around this by monitoring for partial response warnings in our monitoring stack.
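
The slide with the configuration is not reproduced in this transcript, so here is a hedged Python sketch that builds a rule group of the shape described: two sums that exclude already-aggregated series, one ratio that uses only the aggregated series, and the partial response strategy set to warn. The recording rule names, the `type` and `component` labels, and the assumption that Thanos Rule attaches `monitor="global"` as an external label are all illustrative.

```python
import json


def global_aggregation_group(window: str) -> dict:
    """Build a Thanos Rule group that aggregates per-Prometheus key metrics.

    Assumes the ruler's output carries the external label monitor="global",
    so monitor!="global" selects only series recorded by Prometheus itself.
    """
    by = "sum by (type, component)"
    errors = f"global_component_errors:rate_{window}"
    ops = f"global_component_ops:rate_{window}"
    return {
        "name": f"global SLI aggregation {window}",
        # Keep evaluating with the stores that did respond if one is down.
        "partial_response_strategy": "warn",
        "rules": [
            {
                "record": errors,
                "expr": f'{by} (component_errors:rate_{window}{{monitor!="global"}})',
            },
            {
                "record": ops,
                "expr": f'{by} (component_ops:rate_{window}{{monitor!="global"}})',
            },
            {
                "record": f"global_component_errors:ratio_{window}",
                "expr": f'{errors}{{monitor="global"}} / {ops}{{monitor="global"}}',
            },
        ],
    }


group = global_aggregation_group("5m")
# JSON is printed for inspection; it is also valid YAML for a rule file.
print(json.dumps({"groups": [group]}, indent=2))
```

Excluding `monitor="global"` in the two sums avoids feeding the ruler's own output back into the aggregation, which would double count.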

A

In conclusion, here are some of the ways we've learned to deal with metrics at scale. Firstly, we define key metrics for each service component. We manage complexity and repetition by using an abstract definition in the metrics catalog as our single source of truth. We've migrated to multi-window, multi-burn-rate SLO alerts for improved alerting.

A

We generate our dashboards to ensure that they're kept up to date and validated.

A

We focus on improving the on-call engineer's experience, because SLO alerts are not always intuitive. And finally, we federated our service level monitoring using Thanos and Thanos Rule.

A

One last point: if you're interested in learning more, I highly recommend reading these fantastic resources on monitoring in general and SLO monitoring in particular.

A

Finally, all the code for our metrics catalog is available on GitLab.com in our runbooks project. I've included a link here. Thank you very much.