From YouTube: Building Distributed TensorFlow Using Both GPU and CPU on Kubernetes [I] - Zeyu Zheng

Description

Building Distributed TensorFlow Using Both GPU and CPU on Kubernetes [I] - Zeyu Zheng & Huizhi Zhao, Caicloud

Big Data and Machine Learning have become extremely hot topics in recent years. Google has announced its AI-centric strategy and released the deep learning toolkit TensorFlow, which soon became the most popular open source toolkit for deep learning applications. However, it may take years to train large deep learning models on a single machine without a GPU. To accelerate the training process, we built a distributed TensorFlow system on Kubernetes that supports both CPUs and GPUs.

In this presentation, I'd like to share our experience building this distributed TensorFlow system on Kubernetes. First, I'll briefly introduce TensorFlow and how it supports distributed model training. However, the original distribution mechanism lacks many of the components needed for production use, such as scheduling, monitoring, and life cycle management.

In the rest of the presentation, I'll focus on how to leverage Kubernetes to solve those problems. The solution involves three components. First, I'll introduce how to schedule TensorFlow jobs in a cluster with both CPUs and GPUs. Then I'll share our experience in managing the life cycle of a distributed TensorFlow job. Finally, I'll describe our efforts to lower the bar for using distributed TensorFlow.

About Huizhi Zhao
Software Engineer, Caicloud

About Zeyu Zheng
Zeyu is chief data scientist and co-founder at Caicloud, which provides Cloud and Big Data related services. He leads the efforts to build reliable and scalable data analysis and machine learning platforms like Hadoop, Spark and TensorFlow on Kubernetes. His team has developed Machine Learning applications, such as image classification and time series prediction, which have helped well-known Chinese enterprises to utilize machine learning based on Kubernetes in production. Before co-founding Caicloud, Zeyu worked for Google Shopping for almost three years. He proposed and led the efforts in building structured product cluster data, which is an essential part of triggering Knowledge Card product ads on google.com. Zeyu obtained his Master's degree from the School of Computer Science at Carnegie Mellon University (CMU). He was named a Siebel Scholar, an honor awarded to only 93 graduate students from the world's leading graduate schools for outstanding academic performance and leadership. During his internship at Microsoft Research Asia, he published several papers and delivered academic speeches at top Data Mining conferences like SIGIR and ICDM.