Description
Data + AI Summit Keynote talk from Michael Armbrust
Connect with us:
Website: https://databricks.com
Facebook: https://www.facebook.com/databricksinc
Twitter: https://twitter.com/databricks
LinkedIn: https://www.linkedin.com/company/data...
Instagram: https://www.instagram.com/databricksinc/
In this talk, I'm going to cover three things. I'm going to explain what Delta Lake is and why it's the foundation of the lakehouse; I'm going to tell a story about Delta Lake's history, which is actually deeply entwined with this conference we're all at today; and then, finally, I have a really exciting announcement about Delta's future.
So, first of all, what is Delta, and why is everyone talking about it? Well, as Ali said this morning, there's always been this big divide between data lakes and data warehouses. Data warehouses were the traditional technology: they were really easy to use and really fast, but they were expensive and not very scalable.
Data lakes were the young upstart. They came in and let you store tons of data, but they were kind of slow and clunky and difficult to configure. Delta really was created to unify these two worlds. It brings ACID transactions to the data lake, it brings speed and indexing, and it doesn't sacrifice scalability or elasticity. It's what enables the lakehouse. When Delta first started, it was mostly a Spark technology, but today that couldn't be further from the truth.
Today we have connectors for everything from old-school technologies like Hive to fancy new technologies like dbt, which you'll hear about here in just a little bit. This year has been no different: we've added a ton of connectors, including support for Flink, Trino, and Presto, and we're working on support for Pulsar and Google's BigQuery. And as the ecosystem has expanded, so has our user base.
So this graph I'm showing here is actually a metric from the Linux Foundation that looks at the health of contributions in any given open source project: how many unique people are fixing bugs, responding to pull requests, and merging code. You can see just how much momentum there is behind the project; it's increased by over 600 percent in the last three years.
But now I want to rewind and take a trip down memory lane and talk about the history of Delta. So I'm going to go back to the year 2017, a much simpler time. I was working on Structured Streaming and Spark SQL, as Ali just said, and I was talking to a bunch of users at this conference about how they were processing tons of data from a variety of sources, all in parallel in the cloud.
I was constantly fielding bug reports from users saying Spark was broken, when what they were actually doing was corrupting their own tables, because there were no transactions. When their jobs failed because a machine was lost, the job didn't clean up after itself. Multiple people would write to a table and corrupt it. There was no schema enforcement, so if you dropped data with any schema into the folder, it would make the table impossible to read. And there was a bunch of added complexity in working with the cloud.
The Hadoop file system just wasn't really built for it. I'm sure people in this room remember setting the direct output committer, and if you got it wrong, things would break. Even just working with large tables was slow; just listing all the files could take up to an hour. And so it was here, at what used to be known as Spark Summit, that I talked to a bunch of people and thought:
There's got to be a better way. I actually always take a couple of days off after Spark Summit to decompress, and I was so excited that I wrote a design doc during that vacation. And so in 2018 we came back on this stage and announced Databricks Delta. It was one of the first fully transactional storage systems that preserved all of the best parts of the cloud, and even better, it was battle-tested by hundreds of our users at massive scale.
In fact, one of my friends, Dom, got up here and told us about his use case: he had been using Delta for the last year to process petabytes of data in real time, with hundreds of analysts around the globe, for a critical information security use case. If you haven't seen that video, I suggest you go to YouTube and check it out. But Delta was too good to keep just for Databricks, and so in 2019 we came back and announced the open source Delta Lake. And we didn't just open source the protocol.
That protocol is the description of how different clients can connect and make transactions in the system. We also open sourced our battle-tested Spark reference implementation and put all of that code up on GitHub. But we weren't done with Delta. We believed in it so much that Databricks committed its business to it, and so, in addition to all the exciting things happening in open source Delta Lake, we were busy building features to make Delta even better.
We built this really cool command to go alongside it, called OPTIMIZE ZORDER, which takes your data and maps it onto a multi-dimensional space-filling curve so that you can filter efficiently on multiple dimensions. That works really well with a cool trick called data skipping, based on statistics; it's basically like a coarse-grained index for the cloud.
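To make those two ideas concrete, here is a toy sketch in plain Python, not Delta's actual implementation: the Z-value is computed by interleaving the bits of two column values (a Morton curve), files store min/max statistics per column, and a query prunes any file whose statistics cannot match the predicate. All names here are illustrative.

```python
# Toy illustration of OPTIMIZE ZORDER + data skipping -- NOT Delta's code.

def z_value(x: int, y: int, bits: int = 16) -> int:
    """Interleave the bits of two column values to get a point on a
    Z-order (Morton) space-filling curve. Rows sorted by this value
    stay clustered in BOTH dimensions at once."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)      # bit i of x -> even position
        z |= ((y >> i) & 1) << (2 * i + 1)  # bit i of y -> odd position
    return z

def write_files(rows, rows_per_file=4):
    """Sort rows by Z-value, split them into 'files', and record per-file
    min/max statistics (the kind Delta keeps in its transaction log)."""
    rows = sorted(rows, key=lambda r: z_value(r[0], r[1]))
    files = [rows[i:i + rows_per_file] for i in range(0, len(rows), rows_per_file)]
    stats = [dict(min_x=min(r[0] for r in f), max_x=max(r[0] for r in f),
                  min_y=min(r[1] for r in f), max_y=max(r[1] for r in f))
             for f in files]
    return files, stats

def files_to_scan(stats, x, y):
    """Data skipping: prune files whose statistics rule out x == ? AND y == ?."""
    return [i for i, s in enumerate(stats)
            if s["min_x"] <= x <= s["max_x"] and s["min_y"] <= y <= s["max_y"]]

rows = [(x, y) for x in range(8) for y in range(8)]  # 64 rows -> 16 files
files, stats = write_files(rows)
print(len(files), "files; scan only files", files_to_scan(stats, 3, 5))
```

Because Z-ordering keeps each file to a small contiguous block in both dimensions, the point predicate above touches a single file out of sixteen; a plain sort on `x` alone would cluster `x` but leave `y` scattered across every file.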
And so, if you look at this feature matrix, you can see we're already on our way. If you've been following the project closely, you might have noticed something is up: we've been opening tickets on GitHub, and we are rapidly open sourcing all of these different features and bringing them out for the community. And what we've seen is that this is going to dramatically improve the performance of the open source project. So, as you can see, the baseline here is Delta 1.0.
With the OPTIMIZE command alone, performance improves a little bit, but when you add in Z-ordering and data skipping, performance gets really good, which is super exciting. This is the same TPC-DS query we've been showing all day today. And the other really exciting thing is that Delta is now one of the most featureful open source transactional storage systems in the world.
This is from some people over at Databeans, who are users of the open source project, and as you can see, they found that we are dramatically faster than Iceberg, not only at loading data but also at processing it. But we're not done yet; there's actually some really cool technology waiting in the wings, and I want to give you just a quick preview of one of those things. There's always been a problem with columnar formats like Parquet, and the problem is due to the encoding.
When you want to update even just a single value, you have to rewrite the entire file. This is called write amplification in databases, because it takes a single tiny write and turns it into a big, massive copy of all of the unchanged data. And so we're very excited to add to the Delta protocol a new technology that allows you to delete a single row: deletion vectors.
What this is going to let you do is mark that row as deleted, so you only write out the data that changed, which will dramatically speed up things like deletes, updates, and merges. And we've already gotten started on this effort.
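The idea can be sketched in a few lines of plain Python. This is only a conceptual toy, not the actual Delta protocol format: an immutable "file" stays untouched on disk, a small side structure records which row positions are deleted, and the read path filters them out, so a one-row delete no longer rewrites the whole file.

```python
# Toy sketch of the deletion-vector idea -- NOT the actual Delta protocol.

class DataFile:
    """Stands in for an immutable Parquet file: rows can't be changed."""
    def __init__(self, rows):
        self.rows = list(rows)

class DeletionVector:
    """Set of row positions marked deleted; tiny compared to the file."""
    def __init__(self):
        self.deleted = set()

    def delete(self, pos):
        self.deleted.add(pos)  # O(1): none of the file's data is rewritten

def scan(data_file, dv):
    """Read path: skip rows whose position is in the deletion vector."""
    return [row for pos, row in enumerate(data_file.rows)
            if pos not in dv.deleted]

data_file = DataFile([("a", 1), ("b", 2), ("c", 3), ("d", 4)])
dv = DeletionVector()
dv.delete(2)                     # delete the single row ("c", 3)

print(scan(data_file, dv))       # [('a', 1), ('b', 2), ('d', 4)]
print(len(data_file.rows), "rows still on disk; rewrote 0 of them")
```

Without the vector, deleting `("c", 3)` would mean copying the three unchanged rows into a new file, which is exactly the write amplification described above; with it, the delete is a tiny metadata write applied at scan time.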
So, as you can see, there are a couple of JIRAs, some of them already resolved, where we've been adding some of the groundwork to both Parquet and Apache Spark.
If you'd like to learn more, there are a ton of exciting things going on at this conference: come join us for our AMAs or our deep dives into various topics. And if you want to get involved in actually coding on the project, you can come to our meetup, talk to some committers, and figure out some good projects to get started on.