From YouTube: High Performance Networking for Distributed DL Training in Production K8s - Nivedita Viswanath

Description

High Performance Networking for Distributed DL Training in Production K8s - Nivedita Viswanath & Vatsan Kasturi, NVIDIA

Distributed DL training requires high-performance networks connecting tens, hundreds, or, for certain natural language processing models, even thousands of GPUs. Running these workloads on Kubernetes clusters of GPU-enhanced servers requires careful engineering to avoid bottlenecks at the NICs and in the switching fabric that interconnects the nodes. In this presentation we will describe the design and architecture of an 800-GPU cluster interconnected over a RoCE fabric to achieve line-rate performance between communicating containers in a multi-node job. Some of the topics we will cover are a scalable, cookie-cutter POD design for the data center, a low-latency, one-hop network design that enables NCCL rings to avoid output port congestion, and K8s integration with a multi-homed network for optimal GPU utilization. We will share performance numbers for training workloads from our production clusters.

https://sched.co/ekBq