From YouTube: Production Multi-node Jobs with Gang Scheduling, K8s, GPUs... Madhukar Korupolu & Sanjay Chatterjee

Description

Production Multi-node Jobs with Gang Scheduling, K8s, GPUs and RDMA - Madhukar Korupolu & Sanjay Chatterjee, NVIDIA

As DL and ML applications grow in scale, distributed execution of jobs across multiple nodes becomes increasingly critical for solving bigger problems faster, as illustrated by the recent MLPerf results. However, running such workloads in a production K8s cluster shared by multiple jobs and users poses several challenges. In this talk, we’ll give an overview of this area, including distributed TensorFlow, PyTorch, Horovod, and MPI, and the use of GPU nodes with NCCL and RDMA for accelerated performance. We’ll describe our end-to-end flow for multi-node jobs in K8s, including gang scheduling, quotas, fairness, and backfilling implemented in our custom scheduler for GPUs. Our cluster includes high-speed networking through RoCE and SR-IOV / Multus CNI. We’ll share our design choices, lessons learned, and operational experience, including failure handling, performance, and telemetry.
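
For readers unfamiliar with the terms, gang scheduling and backfilling can be illustrated with a small sketch. This is not the scheduler described in the talk: the `Job` class, the `schedule` function, and the job names are hypothetical, nodes are treated as interchangeable, and quotas and fairness are left out. It only shows the all-or-nothing rule (a multi-node job starts only when every node it needs is free) and the simplest form of backfilling (a smaller job further back in the queue may use nodes that a blocked larger job cannot).

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    nodes_needed: int  # gang size: all nodes must be granted together


def schedule(queue, free_nodes):
    """Return (names of started jobs, remaining free nodes).

    Jobs are considered in queue order. A job starts only if its whole
    gang fits in the free nodes; otherwise it is skipped, and later,
    smaller jobs may backfill the nodes it could not use.
    """
    started = []
    for job in queue:
        if job.nodes_needed <= free_nodes:
            free_nodes -= job.nodes_needed
            started.append(job.name)
        # else: never start a partial gang; leave the job pending
    return started, free_nodes


# Hypothetical queue: train-b needs more nodes than are free, so it waits
# as a whole while train-c backfills the leftover nodes.
queue = [Job("train-a", 4), Job("train-b", 8), Job("train-c", 2)]
started, left = schedule(queue, free_nodes=6)
print(started, left)
```

A production scheduler would typically also reserve capacity for the blocked job so that backfilled work cannot starve it indefinitely; that refinement is omitted here.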

https://sched.co/ZejQ