From YouTube: How We are Dealing with Metrics at Scale on GitLab.com - Andrew Newdigate, GitLab

Description

As GitLab.com has grown, the number of metrics generated by the application has grown exponentially, and ensuring that our team had good-quality dashboards and alerting rules was becoming an ever more challenging task. There is nothing worse than experiencing an outage that you expected to be warned about, only to find that the alert had been inoperable for months. As an engineer on the infrastructure team supporting GitLab.com, I sometimes felt during an incident that we were drowning in data while struggling to reach the indicators most pertinent to the underlying issue. This talk discusses how we are addressing the problem: we build up a catalog of key metrics for each component within our application, then use that definition to automatically generate beautiful Grafana dashboards, rock-solid alerting rules, and high-quality SLA indicators. The talk is aimed primarily at Prometheus users, but the fundamentals could be applied to any other metrics system.
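
To make the catalog idea concrete, here is a minimal sketch in Python of how a single definition per component could drive both an alerting rule and a dashboard panel. It is not the implementation presented in the talk: the `ServiceIndicator` schema, the metric names, the threshold, and the simplified panel layout are all assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ServiceIndicator:
    """One catalog entry for a component (hypothetical schema)."""
    component: str          # component name, e.g. "web"
    requests_metric: str    # counter of all requests (assumed metric name)
    errors_metric: str      # counter of failed requests (assumed metric name)
    max_error_ratio: float  # alert when the error ratio exceeds this value


def error_ratio_expr(ind: ServiceIndicator, window: str = "5m") -> str:
    """Build a PromQL expression for the component's error ratio."""
    errors = f"sum(rate({ind.errors_metric}[{window}]))"
    total = f"sum(rate({ind.requests_metric}[{window}]))"
    return f"{errors} / {total}"


def alert_rule(ind: ServiceIndicator) -> dict:
    """Generate a Prometheus alerting rule as a dict, ready to dump as YAML."""
    return {
        "alert": f"{ind.component}_error_ratio_high",
        "expr": f"{error_ratio_expr(ind)} > {ind.max_error_ratio}",
        "for": "5m",
        "labels": {"severity": "page", "component": ind.component},
    }


def dashboard_panel(ind: ServiceIndicator) -> dict:
    """Generate a heavily simplified Grafana-style panel from the same entry."""
    return {
        "title": f"{ind.component} error ratio",
        "targets": [{"expr": error_ratio_expr(ind)}],
    }


# The catalog is the single source of truth; rules and panels are derived.
CATALOG = [
    ServiceIndicator("web", "http_requests_total", "http_request_errors_total", 0.01),
    ServiceIndicator("api", "api_requests_total", "api_request_errors_total", 0.005),
]

RULES = [alert_rule(ind) for ind in CATALOG]
PANELS = [dashboard_panel(ind) for ind in CATALOG]
```

Because the alert and the panel are both derived from the same `error_ratio_expr`, the dashboard an engineer looks at during an incident shows the same expression that triggered the page, and adding a component to the catalog yields both artifacts at once.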