From YouTube: Lightning Talk: Honey, I Broke the Things: Debugging Gray Failures in Production! - Radha Kumari

Description

Lightning Talk: Honey, I Broke the Things: Debugging Gray Failures in Production! - Radha Kumari, Slack

Migrations are among the most challenging tasks we do as infrastructure engineers.
They are often long and tedious, and they come with many technical challenges of their own.
At Slack, we switched from HAProxy to Envoy Proxy for all ingress traffic. Overall, this migration was a success and did not cause any downtime. Even so, we ran into several interesting edge cases that caused minor problems, such as a small percentage of failed requests, increased request latency, or sometimes an unhappy bot.

Troubleshooting these sorts of 'gray' failures can be difficult, so this talk will discuss some of those facepalm moments: how they were detected, steps taken to investigate them, and how they were solved.

Takeaways from this talk include a specific set of approaches, learnt through these events, for debugging such problems with Envoy Proxy and other web proxies, along with some engineering practices that ease the stress of a large migration.