From YouTube: Monitoring GPUs at Scale for AI/ML and HPC Clusters - Bharti L Agrawal, NVIDIA

Description

At NVIDIA, we run several large GPU Kubernetes (K8s) clusters for deep learning training (AI/ML) workloads. Monitoring on these clusters has to serve a range of user personas. First, the end users (AI/ML researchers) want insight into how well their workloads used the GPUs and the system. Second, the operations team wants to monitor the general health of the cluster and be alerted to any issues in real time. Finally, the stakeholders want to see GPU utilization and saturation over time for capacity planning. A standard “out of the box” setup cannot satisfy all of these requirements. In this presentation, we show how we combined open source tools to address them. We also discuss the deployment, maintenance, security, and scale challenges we hit while monitoring GPU data, and how we resolved them.

https://sched.co/Zeoh