From YouTube: Scaling Kubeflow for Multi-tenancy at Spotify - Keshi Dai & Jonathan Jin, Spotify

Description

Spotify began offering a centralized Kubeflow Pipelines product to its machine learning teams around two years ago. Since then, adoption has skyrocketed, with more teams training more models and running increasingly complex experiments. This growth places more stringent demands on us, the Kubeflow team at Spotify, to ensure not just cluster reliability but also cluster equitability. Our job is to be not just cluster maintainers but cluster stewards: ensuring equitable and reliable access to cluster resources, and keeping users from stepping on each other’s toes. In this talk, we’ll discuss our streamlined tooling to maintain, deploy, and monitor Spotify’s distribution of Kubeflow. We’ll illustrate the challenges we face as we scale to increased user load and increasingly distinct and demanding pipelines, and outline our approach to addressing those challenges with “multi-cluster” Kubeflow. Finally, we’ll give a preview of our future plans for the platform.