Delta Lake Delta Rust Open Meetings

19 Jul 2022

Rust guarantees freedom from memory access bugs once a program compiles. However, one can still introduce logic bugs in the implementation.

In this talk, I will first give a high-level overview of common formal verification methods used in distributed system design and implementation. Then I will talk about our experience using TLA+ and Stateright to formally model delta-rs' multi-writer S3 backend implementation. By combining Rust with formal verification, we end up with an efficient native Delta Lake implementation that is both memory safe and free of logic bugs!
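To give a feel for what model checking a multi-writer commit involves, here is a minimal sketch in plain Python. It is not TLA+ or Stateright, and it is not the model from the talk: the two-writer protocol, the step names (`acquire`, `check`, `write`, `release`), and the invariant are all assumptions made for illustration. The checker explores every interleaving of the two writers breadth-first and reports a state in which both believe they committed the same log version.

```python
from collections import deque

def replace(tup, i, value):
    """Return a copy of a tuple with position i set to value."""
    return tup[:i] + (value,) + tup[i + 1:]

def successors(state, use_lock):
    """Yield each state reachable by one atomic step of one writer."""
    pcs, saw_free, wins, obj, lock = state
    for w in (0, 1):
        pc = pcs[w]
        if pc == "acquire":
            if not use_lock:
                # No locking: go straight to the existence check.
                yield (replace(pcs, w, "check"), saw_free, wins, obj, lock)
            elif lock is None:
                # Locking: this step is only enabled while the lock is free.
                yield (replace(pcs, w, "check"), saw_free, wins, obj, w)
        elif pc == "check":
            # Read step: record whether the log entry is still absent.
            yield (replace(pcs, w, "write"),
                   replace(saw_free, w, obj is None), wins, obj, lock)
        elif pc == "write":
            if saw_free[w]:
                # Plain put: silently overwrites whatever is already there.
                yield (replace(pcs, w, "release"), saw_free,
                       replace(wins, w, True), w, lock)
            else:
                # Entry already existed, so this writer backs off.
                yield (replace(pcs, w, "release"), saw_free, wins, obj, lock)
        elif pc == "release":
            new_lock = None if lock == w else lock
            yield (replace(pcs, w, "done"), saw_free, wins, obj, new_lock)

def check(use_lock):
    """Search all interleavings; return a violating state, or None if safe."""
    init = (("acquire", "acquire"), (False, False), (False, False), None, None)
    seen, queue = {init}, deque([init])
    while queue:
        state = queue.popleft()
        if all(state[2]):
            # Invariant broken: both writers believe they committed.
            return state
        for nxt in successors(state, use_lock):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return None
```

Calling `check(False)` finds a counterexample for the unlocked check-then-write protocol, while `check(True)` exhausts the state space without finding one. Real tools do the same kind of search over far richer models, including crashes and lock expiry, which this sketch leaves out.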

Connect with us:
Website: https://databricks.com
Facebook: https://www.facebook.com/databricksinc
Twitter: https://twitter.com/databricks
LinkedIn: https://www.linkedin.com/company/data...
Instagram: https://www.instagram.com/databricksinc/

5 participants
34 minutes

19 Jul 2022

Scribd's data architecture was originally batch-oriented, but in the last couple of years, we introduced streaming data ingestion to provide near-real-time ad hoc query capability, mitigate the need for more batch processing tasks, and set the foundation for building real-time data applications.

Kafka and Delta Lake are the two key components of our streaming ingestion pipeline. Various applications and services write messages to Kafka as events are happening. We were tasked with getting these messages into Delta Lake quickly and efficiently.

Our first solution was to deploy Spark Structured Streaming jobs. This got us off the ground quickly, but had some downsides.

Since Delta Lake and the Delta transaction protocol are open source, we kicked off a project to implement our own Rust ingestion daemon. We were confident we could deliver a Rust implementation since our ingestion jobs are append only. Rust offers high performance with a focus on code safety and modern syntax.

In this talk I will describe Scribd's unique approach to ingesting messages from Kafka topics into Delta Lake tables. I will describe the architecture, deployment model, and performance of our solution, which leverages the kafka-delta-ingest Rust daemon and the delta-rs crate hosted in auto-scaling ECS services. I will discuss foundational design aspects for achieving data integrity, such as distributed locking with DynamoDB to overcome S3's lack of "PutIfAbsent" semantics, and avoiding duplicates or data loss when multiple concurrent tasks are handling the same stream. I'll highlight the reliability and performance characteristics we've observed so far. I'll also describe the Terraform deployment model we use to deliver our 70-and-growing production ingestion streams into AWS.
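The locking idea mentioned above can be sketched as follows. This is an in-memory illustration only, not the kafka-delta-ingest or delta-rs implementation: `InMemoryLock` stands in for a DynamoDB-backed lock, `ObjectStore` stands in for S3, and both names are hypothetical. No AWS calls are made. Because a plain put overwrites silently, the writer takes the lock, checks whether the log entry exists, and writes only if it is absent.

```python
import threading

class InMemoryLock:
    """Hypothetical stand-in for a DynamoDB-backed distributed lock."""
    def __init__(self):
        self._lock = threading.Lock()

    def __enter__(self):
        self._lock.acquire()
        return self

    def __exit__(self, *exc):
        self._lock.release()

class ObjectStore:
    """Hypothetical dict-backed stand-in for S3: put() overwrites silently."""
    def __init__(self):
        self.objects = {}

    def exists(self, key):
        return key in self.objects

    def put(self, key, body):
        self.objects[key] = body

def commit(store, lock, version, body):
    """Emulate put-if-absent: check and write while holding the lock."""
    # Delta log entries are named by zero-padded 20-digit version number.
    key = f"_delta_log/{version:020d}.json"
    with lock:
        if store.exists(key):
            return False  # another writer already committed this version
        store.put(key, body)
        return True

def run_writers(n_writers=8):
    """Race several writers for the same version; return store and outcomes."""
    store, lock = ObjectStore(), InMemoryLock()
    results = []

    def writer(i):
        results.append(commit(store, lock, 1, f"writer-{i}"))

    threads = [threading.Thread(target=writer, args=(i,)) for i in range(n_writers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return store, results
```

With `store, results = run_writers()`, exactly one entry in `results` is `True`, and the losing writers would retry against the next version. A real distributed lock also has to handle lease expiry and crashed holders, which this sketch does not attempt.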

3 participants
29 minutes

7 Oct 2021

Delta Lake committers Christian Williams and R. Tyler Croy from Scribd discuss with Denny Lee from Databricks the technical and business requirements around the Delta Rust API project: kafka-delta-ingest.

This project aims to build a highly efficient daemon for streaming data through Apache Kafka into Delta Lake and has been in production at Scribd for the last four weeks after six months of active development.

Come to learn about why they built it and how it's going.

Resource links:
https://github.com/delta-io/kafka-delta-ingest
https://kafka.apache.org/
https://delta.io/

Speakers:

R. Tyler Croy leads the Platform Engineering organization at Scribd and has been an open source developer for over 14 years. His open source work has been in the FreeBSD, Python, Ruby, Puppet, Jenkins, and now Delta Lake communities. The Platform Engineering team at Scribd has invested heavily in Delta and has been building new open source projects to expand the reach of Delta Lake across the organization. Tyler is also a Databricks Beacon.

Denny Lee is a developer advocate at Databricks, where he works on Delta Lake, Apache Spark, Data Sciences, and Healthcare Life Sciences. He previously built enterprise DW/BI and big data systems at Microsoft, including Azure Cosmos DB, Project Isotope (HDInsight), and SQL Server, and served as the Senior Director of Data Sciences Engineering at SAP Concur. Denny holds a Master's in Biomedical Informatics from Oregon Health Sciences University.

Christian Williams is a senior engineer on Scribd's Core Platform team. He has done application and data engineering for 15 years working with a wide range of languages and platforms, most recently working with Kafka, Delta Lake, Rust, and AWS to deliver streaming data ingestion. Before working in software Christian was also one of the fastest sandwich artists in the greater Jacksonville area.

Databricks is proud to announce that Gartner has named us a Leader in both the 2021 Magic Quadrant for Cloud Database Management Systems and the 2021 Magic Quadrant for Data Science and Machine Learning Platforms. Download the reports here. https://databricks.com/databricks-named-leader-by-gartner

4 participants
53 minutes

26 Aug 2021

Join us for part three of a three-part tech talk series: Upgrading from legacy to the cloud with Scribd. This is the final session, Moving Ad Hoc Users to the Cloud. Alexander Kushnir, R. Tyler Croy, and Hamilton Hord from Scribd discuss with Denny Lee from Databricks the technical and business issues around moving ad-hoc jobs to the cloud as part of Scribd’s migration from legacy environments to the cloud.

In this session, we dive into a variety of topics including exploratory non-dev use cases, how Scribd moved development into Databricks, model training use cases, and shared cluster resources. Listen to how Scribd engineers used Delta Lake to solve their production distributed cloud data issues.

Part One Recording: Replicating Data to the Cloud Recording - https://youtu.be/vGv6AcPp7Zs

Part Two Recording: Moving Batch Jobs over to the Cloud - https://youtu.be/siVvtalssrI

4 participants
50 minutes

25 Aug 2021

Join us for part two of a three-part tech talk series: Upgrading from legacy to the cloud with Scribd. Part Two: Moving Batch Jobs over to the Cloud.

Part One: Replicating Data to the Cloud Recording: https://youtu.be/vGv6AcPp7Zs

Abstract: Alexander Kushnir and Stas Bytsko from Scribd discuss with Denny Lee from Databricks the technical and business issues around moving batch jobs to the cloud as part of Scribd’s migration from legacy environments to the cloud. Instead of performing the migration as a big bang, there was a byte-by-byte migration performed incrementally to minimize any disruption to the business. Listen to how Scribd engineers used Delta Lake to solve their production distributed cloud data issues.

Speakers:

Stas Bytsko is the team lead of the Data Engineering team at Scribd, with a somewhat mysterious, and possibly exciting, past.

Alex Kushnir is a Data Architect and Tech Lead of the Data Engineering team at Scribd. Throughout his almost 20-year career, he has acquired experience in various Software Engineering domains: desktop applications, web development, mobile APIs, distributed computing, cloud architecture, big data. He has designed and implemented solutions utilizing various data stores: relational databases, document databases, key/value stores, object stores. For the past 5 years, he has focused on distributed computing in cloud environments utilizing various BigData tech stacks, and he’s a big fan of Apache Spark.
https://www.linkedin.com/in/alexander-kushnir-2b96114a/

Denny Lee is a developer advocate at Databricks, where he works on Delta Lake, Apache Spark, Data Sciences, and Healthcare Life Sciences. He previously built enterprise DW/BI and big data systems at Microsoft, including Azure Cosmos DB, Project Isotope (HDInsight), and SQL Server, and served as the Senior Director of Data Sciences Engineering at SAP Concur. Denny holds a Master's in Biomedical Informatics from Oregon Health Sciences University.

3 participants
46 minutes

24 Aug 2021

Join us for part one of a three-part tech talk series: Upgrading from legacy to the cloud with Scribd. This series will run August 24-26 at 9AM PT each day. Come join us!

Part One: Replicating Data to the Cloud

Alexander Kushnir (https://www.linkedin.com/in/alexander-kushnir-2b96114a/) and Maksym Dovhal from Scribd discuss with Denny Lee from Databricks the technical and business issues when migrating Scribd’s systems from legacy environments to the cloud. We discuss many technical issues ranging from S3 eventual consistency, cross-cloud consistency, metastore consolidation issues, utilizing multiple catalogs for the same Delta tables, data replication from on-premises to the cloud, and more. Listen to how Scribd engineers used Delta Lake to solve their production distributed cloud data issues.

3 participants
46 minutes

15 Jul 2021

In this session we will introduce the delta-rs project, which is helping bring the power of Delta Lake outside of the Spark ecosystem. By providing a foundational Delta Lake library in Rust, delta-rs can enable native bindings in Python, Ruby, Golang, and more. We will review what functionality delta-rs supports in its current Rust and Python APIs and the upcoming roadmap.

We will also give an overview of one of the first projects to use it in production: kafka-delta-ingest, which builds on delta-rs to provide a high throughput service to bring data from Kafka into Delta Lake.

1 participant
25 minutes

11 Mar 2021

Join us for this fun Delta Lake AMA session, where we talk with QP Hou, Christian Williams, and Alexander Kushnir from Scribd about growing the Delta Lake open-source ecosystem.

Scribd's current OSS projects aim to extend the power and flexibility of Delta Lake throughout the organization. Come ready with questions as this will be a fun interactive session hosted by R. Tyler Croy (Scribd) and Denny Lee (Databricks).

LINKS MENTIONED:
- https://github.com/delta-io/connectors
- https://github.com/delta-io/kafka-delta-ingest
- https://tech.scribd.com/blog/2021/introducing-sql-delta-import.html
- https://github.com/delta-io/delta-rs/discussions

Speakers:

QP Hou is a Senior Engineer at Scribd and an Airflow committer. QP manages the batch processing pipeline at Scribd. Prior to that, he worked on Machine Learning and Monitoring Infrastructure at Floyd Labs and LinkedIn respectively.

Alexander Kushnir is a Data Architect and Tech Lead of the Data Engineering team at Scribd. Throughout his almost 20-year career, he has acquired experience in various Software Engineering domains: desktop applications, web development, mobile APIs, distributed computing, cloud architecture, big data. He has designed and implemented solutions utilizing various data stores: relational databases, document databases, key/value stores, object stores. For the past 5 years, he has focused on distributed computing in cloud environments utilizing various BigData tech stacks. He is a big fan of Spark.

Christian Williams is a Software Engineer at Scribd.

Hosts:

R. Tyler Croy is the Director of Platform Engineering at Scribd, where he leads the efforts to empower data customers across the organization with higher quality and fresher data than had been previously possible. His background is in production data services, revolving largely around Apache Kafka and various stream processing tools. At Scribd, Tyler and his team work to bring data-driven insights closer to production applications with the “Real-time Data Platform”, built on Apache Kafka, Apache Spark, and Delta Lake.

Denny Lee is a developer advocate at Databricks, where he works on Delta Lake, Apache Spark, Data Sciences, and Healthcare Life Sciences. He previously built enterprise DW/BI and big data systems at Microsoft, including Azure Cosmos DB, Project Isotope (HDInsight), and SQL Server, and served as the Senior Director of Data Sciences Engineering at SAP Concur. Denny holds a Master's in Biomedical Informatics from Oregon Health Sciences University.

6 participants
52 minutes

8 Oct 2020

How Scribd Uses Delta Lake to Enable the World's Largest Digital Library

Discuss with Scribd Engineers on Delta Tables and the Transaction Log

Join us for the next Data Collab Lab with Franco and Denny, where we interview QP and Tyler from Scribd in a fun AMA session on How Scribd Uses Delta Lake to Enable the World's Largest Digital Library. In this session, we will talk with Scribd engineers about how they transitioned from legacy on-premises infrastructure to AWS, and how they use, implement, and optimize Delta tables and the Delta transaction log. Come ready with questions as this will be a fun interactive session.

4 participants
56 minutes
