From YouTube: Building Reliable Lakehouses with Delta Lake
Welcome to Building Reliable Data Lakes with Delta Lake. My name is Denny Lee. I'm a Senior Staff Developer Advocate here at Databricks; I've been here since 2015 or so. I've been working with Apache Spark since 0.6 and with Delta Lake since its inception. Prior to that, I was the Senior Director of Data Sciences Engineering at SAP Concur and a Principal Program Manager at Microsoft for Azure Cosmos DB and Project Isotope, which is what is currently known as Azure HDInsight, as well as SQL Server.
So let's talk about the evolution of data management: why do we even need lakehouses in the first place? Well, first let's start with data warehouses. I'm actually from that era, by the way; I don't know if I look it, but I certainly feel it, okay? I was part of the SQL Server team, and we were boosting this idea that the data warehouse would allow us to do everything: we would build these OLTP transactional databases and do some ETL.
I apologize in advance for my involvement with SQL Server Integration Services and DTS, for any of the folks that ended up using them. We pushed all of this data into data warehouses, which was great, right? Because that meant I could throw my BI and my reporting at it. So again, if you want to blame me, or partially blame me, for SQL Server Reporting Services or SQL Server Analysis Services for your BI: yes, I was there during those days as well, helping to build those systems. They're great, right? Except... and that's the key thing.
A
There
wasn't
support
for
video,
audio
or
text,
there's
no
support
for
data
science
and
machine
learning.
Even
this
idea
that
we
could
potentially
use
cursors
or
functions
within
python
functions
or
r
functions
directly
within
the
database
was
problematic
at
best.
There's
limited
support
for
streaming.
Some
different
companies
actually
did
try
to
provide
streaming
capability
within
warehouses,
but
it
was
very,
very
complicated
and
the
most
important
aspect.
It
was
the
closed
and
proprietary
formats.
The
systems
themselves
were
not
open.
They were built as closed silos, closed houses that ensured you would only utilize that particular data warehouse, and it was really expensive to scale out. So then we switched to data lakes, with most data stored in data lakes and blob stores, because we wanted that flexibility. And I was actually part of that transition as well, because I was part of the crew that said, hey:
A
We
need
to
build
hadoop
hadoop
on
windows,
originally
first
and
then
afterwards,
azure
hadoop,
so
in
other
words
hadoop
on
linux,
but
in
azure
cloud
and
we
we
really
pushed
that
idea
of
like
schema
on
re.
Yes,
we
could
do
everything
with
schema
on
read.
We
could
wouldn't
care
about
the
structure
of
the
data.
We
didn't
need
a
scheme
anymore.
We
would
just
simply
go
ahead
and
determine
the
schema
at
that
point
of
read
time
right
and
no
problem
at
all
and,
of
course,
to
anybody.
That's
in
this
particular
conference.
A
You
know
that's
not
true
at
all,
okay,
so,
but
that
was
the
idea.
The
idea
that
data
links
could
handle
all
of
your
data
for
data
science
and
machine
learning,
but
really
poor,
bi
support
any
systems
I
want
to
buy.
So,
for
example,
if
I
wanted
trino
to
go
ahead
and
read
a
data
like
you
know,
if
I
didn't
have
a
schema,
that's
a
horrible
thing
to
actually
have
right.
It's
super
complex
to
set
up
at
times
more
often
than
not
extremely
poor
performance
and
unreliable
data
swamps.
Now
some
people
call
data
swamps.
A
Some
people
call
data,
salad
doesn't
matter
what
the
nomenclature
is.
What
it
meant
is
that
you
had
a
lot
of
data
stored
in
these
cloud
object,
stores
or
hdfs,
I.e
data
lakes
in
which
there
was
no
lineage
or
control
over
this
data.
Nobody
seemed
to
understand
what
was
going
on.
It
did
give
you
the
flexibility
to
store
structured,
semi-structured
and
unstructured
data.
A
So
the
idea
of
saying:
okay,
no,
no
we'll
have
a
data
warehouse
on
one
side
and
we'll
have
a
data
lake
on
the
other
side,
and
everything
will
be
perfectly
fine.
Well,
how
do
you
reconcile
all
that
data?
How
do
you
make
sure
that
data
is
actually
what
you're
reading
out
of
the
data
warehouse
and
what
you're
reading
out
data
lake
actually
made
sense
that
the
users
themselves
could
actually
reconcile
between
these
two
different
systems?
A
You
end
up
having
reporting
by
one
system
saying
one
number
or
another
system
say
another
number
and
technically
they
make
sense,
but
they
don't
and
the
reconciliation
process
ends
up
becoming
or
validation
process
becomes
more
complicated
than
the
actual
process
of
processing
the
data.
Now,
that's
very
undesirable,
to
say
the
least,
and
so
this
is
the
reason
why
we
really
want
to
bring
this
concept
of
a
lake
house
best
of
both
worlds.
A
It
gives
you
the
ability
to
have
the
flexibility
of
a
data
lake
here
on
the
right
side,
yeah
right
hand
here
in
which
you
can
store
structured
and
semi-structured,
unstructured
data
and
work
with
it
and
be
able
to
deal
with
multiple
paradigms
like
data
science,
machine
learning,
real-time
databases,
yet
at
the
same
time,
on
the
left
side
that
it
actually
has
some
of
the
most
important
aspects
of
a
data
warehouse.
I
have
a
schema.
I
have
structure.
A
So
how
do
you
reconcile
these
two
okay?
So
we
there
are
multiple
versions
like
house
and
obviously
today
we're
going
to
be
talking
about
delta
lake,
but
the
key
most
important
aspect
is
especially
here
in
this
conference
of
single
neutrinos
that
you
really
need
this
concept
of
somewhere
in
between,
where
you
have
a
delta
lake
for
a
data
lake.
Excuse
me,
data
lake,
for
all
your
data
and
one
platform
for
every
use.
A
Okay,
you
need
somewhere
in
between
okay
and
how
you
do
that
is
with
delta
lake,
okay,
scalable,
open
general
purpose,
transactional
data
format,
all
right,
that's
a
fundamental
concept
in
order
for
rios
to
be
able
to
do
it
and
then,
in
order
to
add
on
top
of
it
but
delta
lake,
obviously
we're
going
to
showcase
trino
today,
but
that's
the
call
out
the
high
performance
query
engine
that
talks
to
delta
lake
in
order
to
give
you
the
best
of
both
worlds.
You
need
these
two
concepts.
A
So
let's
go
backwards
a
little
bit
and
ask
the
question
of
in
delta
lake
or
the
architecture.
Why
is
it
scalable?
How
does
it
actually
give
you
that
context?
Well,
delta
lake
equals
this
concept
of
scalable
storage
for
data
as
well
as
scalable
transaction
log
from
edit.
You
actually
need
those
two
things,
because
azure
transaction
log
or
as
your
metadata
grows,
it
actually
has
to
be
able
to
scale
as
well,
and
so
how
do
we
do
it?
Well,
let's
talk
about
the
scalable
storage
concept.
A
First,
the
table
data
is
stored
as
part
k,
files
on
cloud
storage
all
right
so
in
other
words
your
standard,
open
source,
parque
files,
that's
exactly
what's
inside
your
delta
lake
table
itself,
okay,
so
in
other
words,
for
example,
you
have
your
past
a
table.
You
have
your
zero
part,
zero
zeros
there
are
ones
there
are
two
parquet
files.
These
are
the
part
k
files
that
make
up
your
start.
Your
scalable
storage,
they're
sitting
on
cloud
object,
storage.
So
that
way
they
can
scale
appropriately.
So how do I speed things up? Well, I speed things up by having a scalable transaction log. Inside that same folder, there's an _delta_log directory, and inside that is a bunch of JSON files: 000.json, 001.json, and so on. It's basically a sequence of metadata files; these JSON files track the operations made on the files in the table.
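To make that concrete, here is a sketch of what a Delta table directory typically looks like (names abbreviated; real commit files use zero-padded 20-digit version numbers):

    mytable/
        _delta_log/
            00000000000000000000.json
            00000000000000000001.json
            ...
            00000000000000000010.checkpoint.parquet
        part-00000-....snappy.parquet
        part-00001-....snappy.parquet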
A
Now
these
are
like,
I
said,
they're
stored
in
the
cloud
storage
along
with
the
table,
and
they
allow
you
to
read
and
process
the
metadata
in
parallel.
That's
an
important
aspect.
So
when
you
look
at
the
trino
connector,
for
example,
when
it's
accessing
a
delta
lake
table,
it's
able
to
go
ahead
and
read
that
those
json
files
and
by
the
way
there's
every
tenth
is
a
checkpoint
which
is
a
parquet
file.
A
So
it'll
just
read
the
parquet
files,
but
the
idea
is
that
it's
actually
able
to
read
those
files
quickly
understand
what
it
inside
those
json
parquet
files
in
the
delta
log.
That
is,
it
includes
a
list
of
every
single
parquet
file
that
makes
up
the
table.
Not
every
single
par
k
file.
That's
inside
the
cloud
object
store
actually
makes
up
the
table,
at
least
for
that
particular
snapshot.
I'll
talk
about
that
concept
in
a
second
a
little
bit
more
in
a
second.
The idea is that every single Parquet file inside the cloud object store does not necessarily translate to what the table is at a given time, because Delta Lake also includes time travel: the ability to go back in time and look at an earlier version of the table, based on the metadata that you're reading. For example, if I'm looking at the most recent snapshot of the data, it'll contain all the Parquet files that make up the current table.
A
If
I
was
to
go
back
in
time,
it
would
actually
tell
me
all
the
parquet
files
that
actually
made
up
that
version
of
the
table
at
that
time.
So,
instead
of
actually
doing
that
file
listing
from
cloud
object
stores,
it's
just
simply
reading
a
single
parquet
or
json
json
files
to
determine
what
makes
up
that
table,
then
trino's
able
to
go
ahead
and
quickly
send
its
worker
notes
to
go
ahead
and
query
the
data
directly
all
right.
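As a hedged sketch of that time-travel capability using the open-source Delta Lake Spark connector (the table path here is hypothetical):

    from delta import configure_spark_with_delta_pip
    from pyspark.sql import SparkSession

    # Spark session with the open-source Delta Lake extensions enabled
    builder = (SparkSession.builder
        .config("spark.sql.extensions",
                "io.delta.sql.DeltaSparkSessionExtension")
        .config("spark.sql.catalog.spark_catalog",
                "org.apache.spark.sql.delta.catalog.DeltaCatalog"))
    spark = configure_spark_with_delta_pip(builder).getOrCreate()

    # Latest snapshot: the log lists exactly the Parquet files that are live
    current = spark.read.format("delta").load("/data/events")

    # Earlier snapshot: the log reconstructs the file list as of version 0
    v0 = (spark.read.format("delta")
            .option("versionAsOf", 0)
            .load("/data/events"))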
That way it knows exactly which Parquet files it's supposed to hit in order to read that data, so it's a really powerful way for you to scale both the storage and the transaction log itself. Okay, so let's talk about that: why is this transaction log so important? Well, going back to your database days, it's about transaction log commits: changes to the table are ordered, and they are atomic commits. This is very important; in other words, it allows you to have ACID transactions.
A
That's
so
more,
so
much
so
important
that
versus
if
you
were
to
go
ahead
and
read
and
write
directly
to
a
parque
file
directly
to
a
data
lake
in
general,
you
would
not
have
that,
and
what
do
I
mean
by
that?
What
I
mean
is
that
there's
this
you're
running
a
job
doesn't
matter
which
distributed
system
could
be
trino.
It
could
be
spark.
It
doesn't
really
matter
just
you're
running
some
job
itself.
It's
writing
files
in
mid-flight,
okay,
but
the
job
fails.
A
What
happens?
Well,
a
bun
if
there
were
let's
say
20
tasks
executed
at
that
time
to
actually
write
all
those
files.
What
happens
is
that
let's
say
the
19th
task
fails.
Well,
that
means
18
files
were
already
most
likely
were
already
written
to.
The
cloud.
Object
store
all
right.
Well
then,
what
happens
to
that
19th
file?
Okay,
well,
the
19
file
failed.
A
So
now
what
you
have
is
a
bunch
of
orphaned
files,
a
bunch
of
files
that
are
in
the
cloud
object
store
in
the
same
location
that
you
don't
know
if
they're
supposed
to
be
there
or
not
right.
Well,
that's
what
the
transaction
log
does
and
transaction
log
allows
us
to
know
are
those
files
that
were
written
supposed
to
be
there
or
not.
So
just
in
case
there
was
a
failure,
the
transaction
log
itself
would
have
failed
okay,
which
means
that
it
would
not
have
been
committed
transaction
log.
Okay, so here's what this allows us to do: when you're trying to process lots of data and write lots of data, whether batch or streaming, to the cloud object store, at any one point in time it's writing these JSON files, and that is the final commit that lets us know what's going on. So, for example, say I'm doing an insert action: I'm inserting data into my Delta Lake table.
A
Well,
for
example,
using
this
green
here,
I
added
001
and
zeros
are
due
parquet
files
in
zero,
zero,
zero
dot,
json
the
zeroth
json,
okay,
perfect!
So
now
we
know
that
if
I
was
to
read
the
table
at
the
point
in
time
where
you're
saying
okay,
because
it's
a
couple
milliseconds
before
01
json
got
written
right,
I
don't
want
any
dirty
reads:
okay,
what
is
the
snapshot
at
that
point
in
time
that
I
went
ahead
and
ran
the
query?
A
I'm
only
supposed
to
read
zero
one
and
zero
two
parquet
files.
A few seconds later, I do an update. That update goes ahead and generates a 003 Parquet file and ends up removing the 001 and 002 Parquet files. Okay, no problem: any query, whether it's a read, a write, an update, or another modification, that happens after the update action will instead see the 001.json file, and because it sees that, the listing of files will only include 003.
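A simplified sketch of what those two commits would contain (action fields abbreviated; real entries carry more metadata such as file sizes and stats):

    000.json  (the insert)
      {"add": {"path": "001.parquet", "dataChange": true, ...}}
      {"add": {"path": "002.parquet", "dataChange": true, ...}}

    001.json  (the update)
      {"add": {"path": "003.parquet", "dataChange": true, ...}}
      {"remove": {"path": "001.parquet", "dataChange": true, ...}}
      {"remove": {"path": "002.parquet", "dataChange": true, ...}}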
A
Okay,
so
that's
pretty
cool,
because
that
way,
if
you
had
a
query
that
ran
right
after
xero
dot
json
and
you
had
another
query
that
ran
right
after
zero1.json,
you
know
exactly
which
files
and
you're
not
risking
any
dirty
reads
at
that
point
in
time,
because
this
is
actually
a
very
for
short
running
queries.
That
probably,
is
not
that
big
of
a
deal
honestly,
because
you'll
probably
just
generate
right
right
off
the
zero
one,
but
for
long
running
queries.
A
This
actually
is
a
big
deal,
because
you
want
to
know
what
the
status
of
the
data
was
at
the
time
that
you
queried
it.
Okay,
and
so
that's
why
it's
important
that
every
single
atomic
commit
goes
as
a
json
file
within
that
delta
log,
folder,
okay,
a
set
of
actions,
as
you
can
see
here
all
right,
but
in
order
to
keep
that
consistent,
snaps,
just
like
I
said
this.
A
Actually,
I
just
described
it
for
you,
so
that
can
the
readers,
whenever
they're
reading
they're
going
to
read
those
atomic
units,
so
they
actually
will
see
those
zero
dot,
json
or
01.json.
So
that
example
that
I
just
gave
you
right
if,
if
it
happened,
let's
just
say
at
one
o'clock,
it
was
zero.json
at
two
o'clock
at
zero.
One
json,
just
for
the
simplicity
of
this
example
that
I'm
speaking
to
you,
what
will
happen
is
basically
any
query
that
happens
between
one
o'clock
and
159
59.99
seconds.
A
It'll
actually
ensure
it
reads:
the
0102
parquet
files
right
from
the
insert
action,
but
at
2
o'clock,
when
the
zero
one
json
gets
committed,
it'll
only
read
the
zero
three
parquet
files,
and
so
the
readers
any
of
the
readers
to
ensure
consistency.
It'll
only
read
either
zero
one
and
zero
two
per
k
or
zero,
three
par
k,
but
nothing
in
between
and
nothing
in
between.
Excuse
me:
okay,
because
that
way
you
have
consistency
when
you're
reading
your
data,
okay,
and
so
when
we
talk
about
acid
transactions
by
mutual
exclusion
on
those
law
commits.
A
What
has
to
actually
happen
is
that
the
concurrent
writers
have
to
agree
on
the
order
of
changes,
so
we
we
actually
follow
the
optimistic
and
currency
control
that
we
basically
are
choosing
to
say
or
if
you're
able
to
go
ahead
and
write
to
the
the
the
to
your
delta,
like
forsake
argument
to
different
partitions
of
data,
so
one's
doing
an
update
to
partition
zero
while
one's
doing
an
insert
to
partition
two,
that's
fine,
that's
optimistic
and
currency
control
will
allow
that
right
to
happen
because
we're
saying
it's
they're
actually
not
interfering
with
each
other,
but
if
you're
actually
trying
to
update
the
same
partition
of
data
as
you're
trying
to
insert
okay,
there
may
in
fact
be
problems
because
you
need
to
know
exactly
which
order.
A
Did
you
update
this
first
and
then
you
insert
the
data,
or
did
you
insert
the
data
which
the
update
itself
might
actually
impact?
Okay,
so
these
new
commit
files
will
be
created
mutually
exclusively
from
each
other,
using
storage,
specific
api
guarantees?
Okay,
so,
for
example,
writer
one
is
going
ahead
and
trying
to
write
to
zero.json.
Okay,
meanwhile
zer
excuse
me
writer,
two
is
actually
going
ahead
and
try
to
write
to
zero
one.json.
But then writer one and writer two both try to write 002.json at the exact same time. Only one of them is going to succeed; in this case it's writer two, but you never know, maybe writer one would have won. Either way, it doesn't matter: only one of them succeeds, and then writer one will try again. This is an important fact of atomicity: we have to make sure that whatever's being written is consistent at that point in time, and only one writer is allowed to do the commit at that point in time. So, even with all of our discussions about parallelism, at some point there is a serial point where, basically, we have to make a decision about which one commits, which one succeeds, and only one gets to win in this case. Okay, oops, sorry, all right.
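A minimal sketch of that commit loop (the log_store object and its put_if_absent method are hypothetical stand-ins for the storage-specific primitives Delta actually uses):

    import json

    def commit(log_store, version, actions):
        """Publish a commit file; on a collision, retry at the next version."""
        while True:
            path = f"_delta_log/{version:020d}.json"
            try:
                # put_if_absent must fail if the file already exists;
                # that mutual exclusion is what makes the commit atomic.
                log_store.put_if_absent(
                    path, "\n".join(json.dumps(a) for a in actions))
                return version
            except FileExistsError:
                # Another writer won this version. After re-checking for
                # logical conflicts (e.g. the same partition touched), retry.
                version += 1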
So when you look at the storage system support: Delta relies on the scalable cloud storage infrastructure for its ACID guarantees.
So remember, it's relying on S3 itself, or it's running on Azure Blob Storage, Azure Data Lake Storage Gen2, or Google Cloud Storage, so that there is no single point of failure and it's production-ready. On the storage system support, though, there is one thing I want to call out with S3:
A
There
is
one
call
out:
okay
that
actually
in
order
to
ensure,
because
s3
itself
lacks
put
if
absent,
consistency,
guarantees
what
happens
that
we
actually
need
a
lock
store,
which
one's
locking
first
in
order
for
us
to
write
to
the
log
store
in
this
case
with
s3
we're
also
using
dynamodb
as
well.
It's
commonly
used
for
other
storage
formats
as
well,
but
so
we're
we're
following
the
same
thing.
It
was
just
released
actually
in
delta
1.2,
with
our
friends
over
at
samba
tv
to
go
ahead
and
do
that
contribution.
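As a rough sketch of how that multi-cluster S3 setup is wired (configuration keys as documented for Delta 1.2; the DynamoDB table name and region here are placeholders):

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
        # Route commits on s3a:// paths through the DynamoDB-backed LogStore
        .config("spark.delta.logStore.s3a.impl",
                "io.delta.storage.S3DynamoDBLogStore")
        .config("spark.io.delta.storage.S3DynamoDBLogStore.ddb.tableName",
                "delta_log")        # placeholder DynamoDB table
        .config("spark.io.delta.storage.S3DynamoDBLogStore.ddb.region",
                "us-west-2")        # placeholder region
        .getOrCreate())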
A
So
it's
pretty
cool
all
right.
So
what
does
it
make
it?
What
does
delta
make
it
delta
lake?
What
features
make
delta
lake
unique?
Well,
in
addition
to
the
fact
that
we're
talking
about
acid
transactions-
and
we
talked
about
things
like
scalable
metadata-
we
also
have
the
ability
to
time
travel
and
audit
history.
All of this together allows this idea of unified batch and streaming: the semantics of reading from and writing to the cloud object store or HDFS are the same whether it's a batch process or a streaming process doing the reading or writing. In fact, not only is it being utilized for Structured Streaming, but we actually have a PR for Pulsar, and the Flink sink actually just came out as part of Delta Connectors 0.4.
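To illustrate that unification (hypothetical path, building on the Spark session sketched earlier), the same Delta table is both a batch source and a streaming source:

    # Batch read of the table
    batch_df = spark.read.format("delta").load("/data/events")

    # Streaming read of the exact same table; new commits arrive
    # as micro-batches with the same transactional semantics
    stream_df = spark.readStream.format("delta").load("/data/events")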
There are also things like constraints and generated columns, a pretty cool concept to ensure that the data meets your semantic requirements: actual constraints, you know, from our database days, and generated columns. Generated columns are basically this idea (some people compare it to partitioning, but we think it's a little bit better) that you can generate new columns based on the existing data, and then you can build your structures or your constraints based off of that. And, just as important, of course, your DML operations, data manipulation language. So you can do your updates,
A
Your
merges
your
deletes,
whatever
you
want
in
scala
sql
java
python,
whatever
language
you
prefer?
Okay,
so
really
cool
okay.
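As a hedged sketch of those features through Spark SQL (table and column names are made up; syntax per the open-source Delta docs):

    # A generated column derived from existing data, usable for partitioning
    spark.sql("""
        CREATE TABLE events (
            ts  TIMESTAMP,
            id  BIGINT,
            day DATE GENERATED ALWAYS AS (CAST(ts AS DATE))
        ) USING DELTA
    """)

    # A constraint enforced on every future write
    spark.sql("ALTER TABLE events ADD CONSTRAINT valid_id CHECK (id > 0)")

    # DML works directly against the table
    spark.sql("UPDATE events SET id = id + 1 WHERE day = '2022-04-01'")
    spark.sql("DELETE FROM events WHERE day < '2021-01-01'")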
So, with the few minutes I have left, I want to talk a little bit about the roadmap. Delta Lake has a very fast pace of innovation, so this is a massive screen which I'm not going to go through, but it shows a lot of innovations that have happened in the two years since we open-sourced the project.
A
Okay,
so
we're
actually
coming
up
to
april,
actually
we're
already
april
2022
as
part
of
summit,
we're
going
to
be
having
a
big
splash
talking
about
our
three
year
anniversary
of
open
sourcing
delta
lake,
which
is
great
so
there's
a
lot
of
innovations,
as
you
can
see
from
here,
so
as
part
of
delta
lake,
one
two
which
was
just
released,
we're
also
including
things
like
data
skipping
with
column
stats.
We
already
talked
about
the
s3
multi-cluster
writes,
which
was
great
okay
with
the
data
skipping.
A
I
do
want
to
call
things
out
right.
It's
the
column
and
max
values
are
automatically
collected
and
when
you're
writing
files
and
committed
to
the
logs.
So
that
way,
when
you
write
you're
running
your
read,
queries
like
from
trino,
you
can
go
ahead
and
skip
those
files.
You
don't
have
to
read
those
files
directly
pretty
sweet.
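For instance (reusing the hypothetical events table from above), a selective filter lets the engine prune files using those logged stats:

    # Only files whose recorded [min, max] range for `id` can contain 42
    # are scanned; every other file is skipped without being opened.
    spark.sql("SELECT count(*) FROM events WHERE id = 42").show()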
Okay, there are also things like compaction, via the OPTIMIZE command. Now, some of you are going to ask me the question: hey, that's great that I've got OPTIMIZE, but where's OPTIMIZE ZORDER?
A
That's
actually
part
of
our
delta
2.0
push
as
well.
So
if
you
take
a
look
at
the
delta
lake,
github
issues
or
delta,
I
o
dot
slash
roadmap.
You
notice
it's
right
there.
Okay,
you
want
to
build
to
restore
or
roll
back
to
previous
table
versions.
You
could
do
before
with
some
funky
queries,
but
now
we're
just
including
the
restore
table
function.
Okay,
so
that
that
way
makes
it
a
lot
easier
and
you
can
also
rename
columns.
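A short sketch of those two operations (again via Spark SQL on the hypothetical table; RENAME COLUMN assumes column mapping is enabled on the table):

    # Roll the table back to an earlier version
    spark.sql("RESTORE TABLE events TO VERSION AS OF 0")

    # Rename a column without rewriting the underlying data files
    spark.sql("ALTER TABLE events RENAME COLUMN id TO event_id")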
A
Oh
that's
right.
I
forgot
I
put
this
in
the
slide
already
the
optimize.
The
order
is
actually
important
included
for
the
the
remaining
of
the
h1
roadmap
for
of
this
year.
Okay,
so
that's
being
included
as
well,
so
with
the
optimize
z
order.
The
whole
context
is
this
ideas.
A
Multi-Column
data
clustering-
that
is
better
than
just
simply
than
multi-column
sorting,
so
we
actually
are
clustering,
the
data,
so
that
way
we
know
how
to
skip
or
read
which
file
so
that
way:
you're
you're
scanning,
less
files
overall,
there's
less
false
positives
for
use
to
go
through.
Okay,
now
with
call
stats.
This
enables
also
the
better
data
skipping,
which
also
leads
to
faster
queries.
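As a sketch (syntax as it appears on the roadmap and in existing Databricks docs; the column choices are illustrative):

    # Compact small files and co-cluster rows by the columns you filter
    # on most, so that stats-based skipping prunes more files per query
    spark.sql("OPTIMIZE events ZORDER BY (event_id, day)")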
A
So
all
this
together
allows
us
this
much
faster
performance,
okay
and
then
oh
other
things
that
we
definitely
want
to
be
adding
things
like
generate
it
generate
the
change
data,
feed
change.
Data
feed
is
a
popular
feature,
that's
also
being
included
as
part
of
the
roadmap
as
well
as
draw
columns
like
I
said
these.
This,
these
slides
by
the
way
will
be
available
to
everybody.
A
So
if
you
want
to
see
the
full
road
map
magnus,
just
look
at
the
github
at
you
or
go
to
delta
io
roadmap
and
then
in
terms
of
the
connector
ecosystem,
with
the
few
minutes
that
I've
got
left,
as
you
can
tell,
delta
lake
has
multiple
api
languages,
scala,
ruby,
python,
multiple
cloud
platforms,
multiple
c
collisions
engines
and
etl
and
streaming
engines.
Okay,
so
lots
of
different
systems
that
are
all
working
together
that
all
work
well,
extremely
well
with
delta
lake
and
there's
even
more
integrations,
some
calls
outs.
A
We
we're
calling
out
the
presto,
trino,
flink
ones
that
were
just
recently
released.
So
that's
why
I
want
to
call
those
out:
let's
sell,
oh
communion
adoption,
oops,
sorry
there
we
go
so
to
give
you
some
context
of
the
incredible
scale
of
delta,
like
we're
talking
about
450
petabytes
of
data
processed
each
day
and
within
the
databricks
environment.
A
We're
talking
about
75
of
the
data
scanned
is
all
in
delta
lake
and
more
than
5
000
customers
in
production
and
actually,
when
we
go
ahead
and
talk
about
more
in
delta
2.0
for
the
data
and
ai
summit
yeah
we're
actually
going
to
be
telling
you
even
larger
numbers,
because
these
are
last
year's
numbers,
which
is
pretty
cool.
But
how
do
you
want
to
engage
with
us?
Well,
there's
multiple
ways
to
engage
with
us.
There's
the
delta
io
website.
There's
the
slack
google
group
youtube
channel
github
issues
linkedin
the
meetups
right.
A
There
are
multiple
ways
to
engage
with
us
and
don't
forget.
We
have
community
office
hours
every
two
weeks,
okay,
so,
for
example,
this
one's
from
february
17th
included
apple's
dominique
berzinski,
who
was
actually
talking
about
how
he
had
committed,
contributed
to
delta
lake
and
then
finally,
most
important.
How
do
I
use
delta
lake
well
for
us,
cinco
de
trino
folks
just
go
to
go
dot,
dot,
dot,
delta,
dot,
io,
slash,
trino,
okay,
that
actually
goes
directly
to
the
trino
delta
lake
page.
A
To
give
you
all
the
information
you
need
to
get
trino
up
and
running
in
delta
lake
and
as
well,
like
I
said
before,
there's
multiple
ways
to
engage
with
us.
You
can
start
with
the
delta
io
page
and
all
the
different
ways
to
engage
us
with
this
there,
and
with
that
I
appreciate
your
time.
Thank
you
very
much
I'll
go
ahead
and
probably
answer
some
questions
now,
so
that's
it
for
now.