From YouTube: Production Cluster Monitoring and Remediation for High Reliability - Shijun Qian & YingKe Liu

Description

Production Cluster Monitoring and Remediation for High Reliability at eBay - Shijun Qian & YingKe Liu, eBay

eBay runs dozens of Kubernetes clusters in data centers across multiple regions worldwide. Tens of thousands of nodes support eBay's core services, such as search and big data. The complexity of these large, cross-regional production clusters, combined with the extremely high stability their workloads require, makes monitoring and remediation a huge challenge for us. Using Prometheus federation, component assertions, metric exporters, and our own monitoring tools, we built a series of clear dashboards, and then implemented a complete cross-cluster remediation flow, incident management, and monitoring automation. In this talk, we share our experience monitoring large-scale Kubernetes production clusters and our thoughts on the future.

To learn more, see https://sched.co/FuK6