Description
At GitLab, we’ve built an extensive framework for defining service level indicators (SLIs) for our different services. This allows us to take a simple definition and turn it into dashboards and alerts. There are different owners involved: Infrastructure and stage groups. The SLIs we use to monitor GitLab.com are attributed to the groups building the features we run. Everyone is held to the same 99.95% SLO, and everyone can contribute to our observability.
Join this talk to learn about the challenges with SLOs and error budgets. Hear how we are aggregating our infrastructure SLIs by feature and by group, and how we are involving groups in improving our SLI definitions.
Transcript

A: The main stakeholders were SREs, and SREs also owned the definitions of the SLIs, the SLOs and the alerts. When we get an alert, it's handled by the SRE on call, and this approach has improved the coverage, the accuracy and the recall of alerts a great deal. I did a talk at SLOconf last year on this topic, and you can find it online if you look for it, or there's a link in the presentation. Just a quick explanation about how we structure our SLIs.
A: So we apply a consistent structure to all the SLIs that we monitor on GitLab.com. This structure has some quirks, but it works well for our purposes; it has evolved over time and has some features that predate our SLO monitoring stack. At the highest level, on the left, we have an environment, and this is composed of multiple services.
A: Moving to the middle, each service is composed of multiple components, and then, finally, on the right, each component has up to two SLIs, both of which are optional: a latency SLI, which we refer to as Apdex, and an error SLI. Ideally, a component has both SLIs, but it's not always possible to measure both, which is why we make them optional.
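As a rough illustration of that left-to-right structure, here is a minimal Python sketch; the class names and the example environment, service and component names are hypothetical, not our actual catalog format:

    from dataclasses import dataclass, field
    from typing import Optional

    @dataclass
    class Component:
        name: str
        # Up to two SLIs per component; both are optional, ideally both are present.
        apdex_sli: Optional[str] = None   # latency SLI, referred to as Apdex
        error_sli: Optional[str] = None

    @dataclass
    class Service:
        name: str
        components: list[Component] = field(default_factory=list)

    @dataclass
    class Environment:
        name: str
        services: list[Service] = field(default_factory=list)

    # Hypothetical example: an environment composed of services,
    # each composed of components, each with up to two SLIs.
    production = Environment(
        name="production",
        services=[
            Service(
                name="web",
                components=[
                    Component("rails_requests", apdex_sli="request_apdex", error_sli="request_errors"),
                    Component("imports", apdex_sli="import_apdex"),  # error SLI not measured here
                ],
            ),
        ],
    )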
A: Now, the main reason we separate latency and errors in our SLIs is historic, but we keep it because it is often a good signal for the cause of an alert. The types of issues that may lead to Apdex SLO violations are often different from those that lead to error SLO violations, and knowing this at alert time can give us a jump start in finding the cause of the problem. From an error budgeting perspective, we aggregate these SLIs together.
B: One of the difficulties we encountered when trying to identify ownership of features and metrics is the fact that GitLab is mostly a Rails monolith. We've got some separate services that are built by a single team, like the runner service or Gitaly, but most people contributing to GitLab do it inside our Rails code base.
B: So that's where most of our features live. Besides being easy to contribute to, it also makes installing and running GitLab easier for the most common cases. GitLab.com runs as standard as possible, just on a lot more hardware, so the monolith isn't going to go away during phase one of our SLO alerting rollout.
B: Not all requests are equal, and using this small API, developers and product managers can specify what makes up acceptable performance.
B: When we add a new metric to the catalog, we use the same DSL that Andrew described last year. As a result, we have SLIs defined by developers, with input from product managers, that will alert us SREs when features aren't performing well according to what product managers have defined. For phase two, we wanted to start aggregating these metrics using the feature label we added, so we could apply an SLO to teams in the future. We hope to be able to alert teams directly when an SLO isn't met.
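A rough sketch of the idea follows. This is hypothetical Python rather than GitLab's actual Ruby DSL, and the names record_apdex and rails_request_apdex are made up for illustration: the developer declares what counts as a good request, and the metric carries the feature label so it can later be aggregated per group.

    import time
    from collections import defaultdict

    # Hypothetical in-memory registry keyed by (sli_name, feature);
    # in reality these would be counters carrying a feature label.
    _counters = defaultdict(lambda: {"good": 0, "total": 0})

    def record_apdex(sli: str, feature: str, duration_s: float, urgency_s: float) -> None:
        """A request is 'good' if it completes within the urgency (acceptable
        duration) that developers and product managers picked for the endpoint."""
        counter = _counters[(sli, feature)]
        counter["total"] += 1
        if duration_s <= urgency_s:
            counter["good"] += 1

    # Usage in a request handler: the product manager decided one second is
    # acceptable for this endpoint, so that becomes its Apdex threshold.
    start = time.monotonic()
    # ... handle the request ...
    record_apdex("rails_request_apdex", feature="code_review",
                 duration_s=time.monotonic() - start, urgency_s=1.0)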
A: This is made much easier by the fact that all of our SLIs are occurrence-based, so we don't have to deal with the complexity of mixing time-based and occurrence-based SLIs. Internally, we store each SLI for each period with two counters: good and total. It's only at display time that we turn this into a percentage. The same is true for our aggregated SLIs: they're composed of two counters. This makes aggregating very easy: the good count is the sum of all the aggregated good counters, and the total is the same.
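A minimal sketch of that aggregation, with hypothetical names rather than our actual implementation: each SLI is just a good and a total counter, aggregating is summing, and the ratio is only derived at display time.

    from dataclasses import dataclass

    @dataclass
    class SliCounters:
        """Occurrence-based SLI for one component over one period."""
        good: int = 0
        total: int = 0

        def ratio(self) -> float:
            # Only computed at display time; the stored data stays as raw counters.
            return self.good / self.total if self.total else 1.0

    def aggregate(slis: list["SliCounters"]) -> "SliCounters":
        # Summing counters, never averaging percentages, keeps the math exact.
        return SliCounters(good=sum(s.good for s in slis),
                           total=sum(s.total for s in slis))

    # Example: two components rolled up into one aggregated SLI.
    web = SliCounters(good=99_950, total=100_000)
    api = SliCounters(good=49_990, total=50_000)
    print(f"aggregated: {aggregate([web, api]).ratio():.4%}")  # 99.9600%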
B: Because we've reused the framework for SLO alerting from phase one, we get dashboards for free. This is a dashboard for a single group that shows the same SLIs, but only for the features that they are working on. The dashboards are generated from the same code as the service dashboards, so they have the same panels: error ratio, Apdex success ratio and request rates. Because we also include this information in the logs, we can generate links that only show requests that did not meet the SLO.
A: One of the most dramatic changes to our error budgeting process came when our CTO decreed that our error budgets must have teeth. This was the moment at which our error budgeting process changed from being just another input into our planning process, purely used as informational guidance, to being something much more contractual, and it gave teams the ability to carve out time to work on technical debt that they needed to address. This slide shows our error budget report. It's reviewed weekly in our engineering allocation meeting, which incorporates elements of the SRE book's production meeting.
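For intuition, here is a small worked example with made-up numbers (the real report is driven by the aggregated counters): a 99.95% SLO turns a group's request volume into a concrete budget of requests that may fail or be slow in the reporting window.

    SLO = 0.9995                 # everyone is held to the same 99.95% SLO
    total_requests = 10_000_000  # requests attributed to one group this window (made up)
    bad_requests = 6_200         # errors plus requests that missed their Apdex target (made up)

    budget = (1 - SLO) * total_requests   # 5,000 requests may be bad
    remaining = budget - bad_requests
    print(f"budget={budget:.0f} spent={bad_requests} remaining={remaining:.0f}")
    # budget=5000 spent=6200 remaining=-1200 -> over budget; time to carve out reliability work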
B: So, to summarize, we went from SLOs being the business of SREs only towards involving everyone at GitLab. Developers and product managers have been included by putting the SLI definitions inside the application, where they can specify what is good and bad. These numbers can be aggregated by feature and by group, and used to inform management, so it's clear when we need to spend some more time working on performance and reliability.