From YouTube: Lightning Talk: Running Cloud-native Spark Jobs with Argo Workflows - Caelan Urquhart & Darko Janjić

Description

Lightning Talk: Running Cloud-native Spark Jobs with Argo Workflows - Caelan Urquhart & Darko Janjić, Pipekit

Companies with large computational workloads often use Apache Spark combined with numerous Python packages such as PySpark, NumPy, MLlib, XGBoost, and more. Unfortunately, as teams increase the number of jobs running on a single Spark cluster, managing dependencies becomes a nightmare. Kubernetes makes it easy to use numerous packages for large data jobs in distributed environments, and Argo Workflows is the best way to run pipelines on Kubernetes. This talk demonstrates how to orchestrate common Spark jobs with Argo Workflows, from the architecture to the resource and workflow definitions. We'll show how to provision Spark and Argo Workflows on Kubernetes to process large data jobs. By running some example jobs, we'll also show how Argo Workflows and Kubernetes provide distinct scaling and stability advantages for Spark users. We hope that listeners will learn the pros and cons of orchestrating their Spark jobs on Kubernetes with Argo Workflows instead of in traditional local or cloud environments.