From YouTube: Kubernetes For GPU Powered Machine Learning Workloads In... - Camille Rodriguez & John-Paul Robinson

Description

Kubernetes For GPU Powered Machine Learning Workloads In Academia - Camille Rodriguez, Canonical & John-Paul Robinson, University of Alabama at Birmingham

Speakers: John-Paul Robinson, Camille Rodriguez
This talk aims to inform architects and users of Kubernetes, as well as teams planning to transition to Kubernetes for research purposes, about how we designed a high-performing Kubernetes cluster specifically geared towards machine learning and AI workloads. On the architectural side, the use of NVIDIA DGX A100 machines provides unprecedented compute density and performance for those workloads. Those nodes are integrated into the cluster with open-source software. We will also cover our challenges and successes in integrating with other components, such as external Ceph storage, the GitLab registry and runners, and SAML authentication. The University of Alabama at Birmingham team will cover how they leverage container-enabled GPUs for their research and development workloads. Research workloads increasingly demand access to ad hoc, GPU-enabled compute capacity, with complex software environments to power cloud-native workflows. K8s helps address needs ranging from regular ML training runs to supporting software development via CI pipelines.