From YouTube: Effective Disaster Recovery: The Day We Deleted Production - Rick Spencer & Wojciech Kocjan

Description

Don’t miss out! Join us at our upcoming hybrid event: KubeCon + CloudNativeCon North America 2022 from October 24-28 in Detroit (and online!). Learn more at https://kubecon.io. The conference features presentations from developers and end users of Kubernetes, Prometheus, Envoy, and all of the other CNCF-hosted projects.

Effective Disaster Recovery: The Day We Deleted Production - Rick Spencer & Wojciech Kocjan, InfluxData

Imagine waking up to an SMS: "We lost a cluster." On that day, with a one-line configuration change, we accidentally removed all of the compute from one of our busiest production clusters, causing a multi-hour outage. This presentation will cover the incident from the days leading up to it through our full recovery, our customers' response, and the changes we implemented based on what we learned. It will go into detail about the configuration of our CI/CD pipeline and the specific change that caused the outage. Thankfully, we had a disaster recovery plan in place. We will discuss which parts of that plan worked and, critically, the few parts that didn't. The session will combine technical and management content.