From YouTube: An SLO-Driven Approach to Enhance Kubernetes Cluster Reliability - Qian Ding & Cong Chen

Description

An SLO-Driven Approach to Enhance Kubernetes Cluster Reliability - Qian Ding & Cong Chen, Ant Financial

How do you define the reliability of a Kubernetes cluster? What are the SLOs? How many 9s are enough to keep end users happy on a Kubernetes cluster with thousands of nodes? Service-level objectives (SLOs) are the key to running large-scale production clusters reliably. Defining SLOs for classic web services is simple, since web requests are served synchronously with distinct status codes. In contrast, defining SLOs for Kubernetes services is less clear-cut because of its intent-oriented design and declarative APIs. This talk first outlines the philosophy behind the SLO-driven approach to reliability engineering, followed by a deep dive into how SREs define SLOs for one of the world's largest Kubernetes clusters, at Ant Financial. Finally, the talk shares concrete cases and lessons learned from building an SLO framework from several perspectives, including monitoring, alerting, and tracing.

https://sched.co/ekCl