youtube image
From YouTube: Make Kubernetes Networking Ready for World Class AI and HPC ... Sunyanan Choochotkaew & Gaurav Singh

Description

Don’t miss out! Join us at our upcoming event: KubeCon + CloudNativeCon Europe 2023 in Amsterdam, The Netherlands from April 17-21. Learn more at https://kubecon.io​. The conference features presentations from developers and end users of Kubernetes, Prometheus, Envoy, and all of the other CNCF-hosted projects.

Make Kubernetes Networking Ready for World Class AI and HPC Workloads - Sunyanan Choochotkaew, IBM & Gaurav Singh, Red Hat

While use of Kubernetes for various services is growing rapidly, it is still behind in the world of HPC and AI clusters. Part of the reason is that the lack of support for advanced features like multiple 100G networks available in HPC/AI Systems. Vast majority of AI systems in hyperscalers such as IBM Cloud, AWS, Azure, and Oracle Cloud come with two to 8 100G network interfaces on the A100 GPU nodes. However, by default in Kubernetes, a pod has only one network interface, but attaching multiple interfaces is often a requirement in the scenarios. Multus unlocks the potential of multi-networking feature in Kubernetes, but there are still challenges in usability, manageability, and scalability. We present Multi-NIC CNI, a new open-source project, to democratize multiple interfaces capability for everyone. This CNI saves users from the concerns regarding environment heterogeneity and acquiring CNI specific knowledge. This talk will introduce the architecture, use cases, and performance of the CNI, then show how beneficial it is for HPC/AI. We will demonstrate the CNI on a large scale GPU Cluster consisting of over 1400 GPUs and two 100G network interfaces that we build in IBM Cloud.