From YouTube: Using Kubernetes to Offer Scalable Deep Learning on Alibaba Cloud - Kai Zhang & Yang Che, Alibaba

Description

Don't miss KubeCon + CloudNativeCon 2020 events in Amsterdam (March 30 - April 2), Shanghai (July 28-30), and Boston (November 17-20)! Learn more at kubecon.io. The conference features presentations from developers and end users of Kubernetes, Prometheus, Envoy, and all of the other CNCF-hosted projects.

Running deep learning (DL) jobs requires an end-to-end workflow that accelerates iterative model training. The workflow must scale to massive data and computational resources, and it must be framework-agnostic to relieve the pain of managing diverse dependencies. At Alibaba Cloud, we use Kubernetes to build an elastic DL platform for continuous model training and optimization. The platform manages heterogeneous clusters that include CPUs, GPUs, and FPGAs, and it automatically schedules jobs to the best-fit resources. Kubeflow, a machine learning scaffold on Kubernetes, is used to set up the training pipeline. Project Arena was created to manage and instrument jobs with a friendly user experience. In this talk, we will discuss how the platform is designed and how it lets users focus on DL tasks instead of managing the underlying complexity. A demo shows how to run distributed neural network training in a minute.

To Learn More: https://sched.co/GrVk