From YouTube: Help! Please Rescue Not-ready Nodes Immediately - Xiaoyu Zhang, Alibaba & Di Xu, Ant Financial

Description

For a Kubernetes cluster, nodes are crucial to keeping pods running properly, so it is indispensable to monitor node status and detect node problems. Node Problem Detector (NPD), an open source project in the Kubernetes community, addresses this need and is already widely accepted and used in production environments. However, identifying a problem is only the first step. The next step is to handle those problems and rescue the nodes. In this talk, we will list common problems and share how we establish rules to decide whether a node is ready, and how to fix the problems when they are recoverable. We will also introduce some usage scenarios showing how we provide a 99.9% uptime guarantee with ten thousand nodes in a single cluster, and share our experience recovering nodes within 10 minutes.
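
To put the 99.9% and 10-minute figures side by side, here is a small arithmetic sketch. It assumes uptime is measured per node over a 30-day window, which the abstract does not state; the function names are hypothetical and chosen only for illustration.

```python
def downtime_budget_minutes(uptime_target: float, window_days: int = 30) -> float:
    """Allowed downtime, in minutes, for a given uptime target.

    Assumption: the target is measured over a fixed window (30 days here).
    """
    window_minutes = window_days * 24 * 60
    return round(window_minutes * (1.0 - uptime_target), 6)


def max_recoveries(uptime_target: float, recovery_minutes: float,
                   window_days: int = 30) -> int:
    """How many full-length recoveries fit inside the downtime budget."""
    budget = downtime_budget_minutes(uptime_target, window_days)
    return int(budget // recovery_minutes)


if __name__ == "__main__":
    budget = downtime_budget_minutes(0.999)
    print(f"Downtime budget: {budget:.1f} minutes per 30 days")
    print(f"10-minute recoveries that fit: {max_recoveries(0.999, 10)}")
```

Under these assumptions, a 99.9% target leaves 43.2 minutes of downtime per 30 days, so a node could go through at most four 10-minute recoveries in that window before exceeding the budget.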

https://sched.co/Zek3