From YouTube: Tech Talk: Top Tuning Tips for Spark 3.0 and Delta Lake on Databricks

Description

Apache Spark™ has become the de facto open-source standard for big data processing due to its ease of use and performance. The open-source Delta Lake project extends Spark’s lead with new capabilities such as ACID transactions, schema enforcement, and time travel. These features help ensure that data lakes and data pipelines deliver high-quality, reliable data to downstream data teams for successful data analytics and machine learning projects.

In this tech talk, we will discuss the top tuning tips for Apache Spark 3.0 and Delta Lake on Databricks. Come prepared to ask your questions and join Joe Widen, Chris Hoshino-Fish, and Denny Lee to discuss when to use which join operations, how to pick your machine sizes, how to help speed up your merge operations, and how to make your jobs easier!

Link to slides and the notebooks used in this tutorial: https://github.com/databricks/tech-talks

Chapters
0:00 Welcome
02:52 Use the latest version of DBR
04:53 Picking the best join strategy
13:39 Use Apache Spark 3.0 and AQE
26:27 Partition Pruning
28:36 Data Skipping
31:24 Z-Ordering
39:34 Databricks Delta Lake and Stats
44:39 Optimizing Merges
47:24 Picking good instance types

Speakers:

Chris Hoshino-Fish is a Solutions Architect at Databricks. Chris is an active member of the Performance Subject Matter Expert group and a former Principal Consultant focused on Data Engineering, working with several Fortune 500 Databricks customers. Prior to Databricks, Chris worked for an adtech company as a data engineer managing pipelines using Apache Spark for 3.5 years. Chris has a B.A. in Computational Mathematics from University of California, Santa Cruz.

Denny Lee is a developer advocate at Databricks, where he works on Delta Lake, Apache Spark, Data Sciences, and Healthcare Life Sciences. He previously built enterprise DW/BI and big data systems at Microsoft, including Azure Cosmos DB, Project Isotope (HDInsight), and SQL Server, and served as Senior Director of Data Sciences Engineering at SAP Concur. Denny holds a Master's in Biomedical Informatics from Oregon Health Sciences University.

Joe Widen is a Solutions Architect at Databricks. Joe leads the Performance and Delta SME horizontal initiatives along with making customers successful with the Databricks Unified Analytics Platform. Joe has been working with Spark and more generally Hadoop for 5 years, with previous stops at Hortonworks and Capital One.

To join the Zoom live chat:
https://www.meetup.com/data-ai-online/events/274093223/