From YouTube: Federated Prometheus Monitoring at Scale - Nandhakumar Venkatachalam & LungChih Tung

Description

Want to view more sessions and keep the conversations going? Join us for KubeCon + CloudNativeCon North America in Seattle, December 11 - 13, 2018 (http://bit.ly/KCCNCNA18) or in Shanghai, November 14-15 (http://bit.ly/kccncchina18).

Federated Prometheus Monitoring at Scale - Nandhakumar Venkatachalam & LungChih Tung, Oath Inc (Intermediate Skill Level)

At Media Build and Products under Oath, we run 12 production Kubernetes clusters across our data centers, on ~1200 machines with multi-tenant deployments. We monitor our clusters with Prometheus: each cluster runs its own Prometheus instance, and a single federated instance with persistent storage sits on top of them. The total number of time series is ~17 million (max 5 million per instance), with a sample ingestion rate of 300K (max 80K per instance). At the federated instance we have built mind-blowing dashboards for the Controller, Scheduler, API server, DNS, Kubelet, and Etcd, as well as utilization overall and per tenant namespace, deployment, and container, which gives high visibility. We leverage Alertmanager, which provides powerful alerting capabilities, to alert the on-call on cluster status, node availability, scrape status, fd usage, etc. We would like to share our experience of monitoring multiple Kubernetes clusters in a multi-tenant environment.
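
The abstract does not say how the federated instance pulls data from the per-cluster instances. As a minimal sketch, assuming the standard Prometheus `/federate` endpoint with `match[]` selectors is used, the scrape URLs for each cluster could be built as follows. The cluster names, hostnames, and job labels are hypothetical and are not taken from the talk.

```python
from urllib.parse import urlencode


def federate_url(base_url, matchers):
    """Build a /federate URL that selects series by the given matchers."""
    # Each selector is passed as a repeated, URL-encoded match[] parameter.
    query = urlencode([("match[]", m) for m in matchers])
    return f"{base_url.rstrip('/')}/federate?{query}"


# Hypothetical per-cluster Prometheus addresses (assumption, not from the talk).
CLUSTERS = {
    "cluster-a": "http://prometheus.cluster-a.example:9090",
    "cluster-b": "http://prometheus.cluster-b.example:9090",
}

# Hypothetical selectors for the series the federated instance would pull.
MATCHERS = ['{job="kubelet"}', '{job="apiserver"}']

urls = {name: federate_url(base, MATCHERS) for name, base in CLUSTERS.items()}
for name, url in urls.items():
    print(name, url)
```

Restricting the selectors to the series the dashboards and alerts actually need is one way to keep the federated instance's series count well below the sum of the per-cluster instances.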

About Nandhakumar
Nandhakumar Venkatachalam is a Principal Production Engineer and lead of the Kubernetes Infrastructure/Cluster Management team at Oath Media Build and Products. He is a subject matter expert and solution architect specializing in high availability. Nandha has been with Oath for 11 years and has been dealing with operations at scale for more than a decade. He has delivered and led high-profile product launches for Yahoo Fantasy and Yahoo Sports, and has contributed to a flawless Fantasy season year after year. Prior to Yahoo, he worked as a Linux System Administrator at IBM.

About LungChih
LungChih Tung is a software engineer on the core infrastructure team at Oath Media Build and Products. LungChih has been working on building core infrastructure with Kubernetes, monitoring systems, and automating cluster management operations.
Join us for KubeCon + CloudNativeCon in Barcelona May 20 - 23, Shanghai June 24 - 26, and San Diego November 18 - 21! Learn more at https://kubecon.io. The conference features presentations from developers and end users of Kubernetes, Prometheus, Envoy and all of the other CNCF-hosted projects.