The Pace of Innovation in Delta Lake - Vini Jaiswal, Delta Lake
A: I had a keynote this morning where I talked about lakehouses and how they emerged as the new modern data architecture, because they add reliability, performance, and quality features on top of existing data lakes, so that you can make sense of your data and use it for critical decision making. So thank you to all who attended. One of the most important parts of that was Delta Lake, which is the foundation of lakehouses. Before diving into the project, let's look at why we care in the first place.
A: Companies have already adopted this system called data lakes, and the promise of data lakes is that you can take all your data, whether it's unstructured or structured, and dump it into a file system on S3, Google Cloud Storage, or Azure Blob Storage. This is a really powerful concept when you compare it to traditional databases, because with a traditional database you have to come up with a schema and do a lot of pre-processing and cleaning.
A: What a data lake allows you to do is forego that whole process and just start collecting everything, because sometimes you don't know that data is valuable until much later, and if you don't store it, then you have lost it. Think about the many powerful use cases you could have improved, or the innovations you could have brought to your business, if you had access to that data. Unfortunately, what happens when you collect all that data in a data lake is that the data at the beginning of the pipeline is of bad quality, which in turn means the bad quality flows through the entire pipeline to the more advanced processes that rely on it. Machine learning models and AI built on top of that data now become unreliable, and as a consequence so does the work of the data scientists and business leaders who are trying to extract meaningful information out of their data.
A: Through this talk I will cover data engineering challenges, the solutions that modern data technologies and open source projects have to offer, and how and where Delta Lake fits into the picture. Most importantly, if you find the technology useful, I will share where and how you can become a part of it.
A: Maybe the cluster failed because you relied on an EC2 spot instance and that spot instance is now lost. When a job fails halfway through, you have to think about the corrupted data, because it fell over in the middle and now you have half-written, partial data and so on. That needs to be cleaned up.
A: Data validation is vital for any data engineering pipeline, because machine learning and AI applications depend on it. If there is no way to gauge whether something about the data is broken or inaccurate, the worst case is that you cannot identify data errors at the beginning of the pipeline, so you can corrupt the whole pipeline. Plain data lakes don't offer any kind of schema enforcement or data quality guarantees.
A
This
is
a
screenshot
from
one
of
my
projects
where
I
have
a
parking
table
with
over
14
000
rows
and
four
columns,
and
now
I
have
a
streaming
job
that
appends
the
data
into
this
table.
Let's
see
what
happens
after
my
stream,
query
is
run
so
looks
like
my
streaming.
Job
went
through,
however,
my
table
now
has
51
records
and
it
has
two
extra
columns
that
I
didn't
expect.
So
what
really
happened
there?
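Roughly what that scenario looks like in code: a minimal sketch (the paths, column names, and toy source are made up) of a streaming append landing columns the table never declared, with nothing in a plain Parquet data lake to stop it.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("schema-drift-demo").getOrCreate()

# Toy stream (the rate source emits `timestamp` and `value`), plus two columns
# the original four-column table never had.
drifting_stream = (spark.readStream.format("rate").option("rowsPerSecond", 5).load()
                   .withColumn("extra_col_1", F.lit("surprise"))
                   .withColumn("extra_col_2", F.rand()))

(drifting_stream.writeStream
    .format("parquet")                                    # plain data lake sink: no schema checks
    .option("checkpointLocation", "/tmp/checkpoints/loans")
    .option("path", "/tmp/tables/loans_parquet")           # pre-existing table directory
    .start())
```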
A: With the increasing amount of data that is collected in real time, companies also need ways to reliably perform updates, merges, and deletes so that their data can remain up to date at all times. With traditional data lakes it can be incredibly difficult to perform simple operations like these and to confirm that they occurred successfully.
A
This
is
how
it
happens
in
legacy
data
pipelines
to
insert
or
update
a
table.
A
data
engineer
has
to
find
new
rows
to
be
inserted.
Then
they
have
to
identify
what
rows
they
should
be
replaced
by
or
updated
identify
all
the
rows
that
are
not
impacted
by
the
insert
or
update,
create
a
new
temp
table
based
on
all
these
insert
statements.
Now
that
delete
the
original
table
with
the
wrong
records,
and
then
you
have
to
rename
the
temp
table
drop
the
tem
table.
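For contrast, here is roughly what that whole upsert procedure becomes once the table is stored in Delta Lake, which the rest of the talk introduces. This is a sketch using the delta-spark Python API with made-up table paths and columns, not the exact pipeline from the slide.

```python
from delta import DeltaTable, configure_spark_with_delta_pip
from pyspark.sql import SparkSession

builder = (SparkSession.builder.appName("delta-merge-demo")
           .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
           .config("spark.sql.catalog.spark_catalog",
                   "org.apache.spark.sql.delta.catalog.DeltaCatalog"))
spark = configure_spark_with_delta_pip(builder).getOrCreate()

# New and changed rows arriving from upstream (illustrative values).
updates = spark.createDataFrame([(1, "active"), (42, "new")], ["id", "status"])

target = DeltaTable.forPath(spark, "/tmp/tables/loans_delta")
(target.alias("t")
       .merge(updates.alias("s"), "t.id = s.id")
       .whenMatchedUpdateAll()        # rows that already exist are updated in place
       .whenNotMatchedInsertAll()     # new rows are inserted
       .execute())                    # one atomic commit; no temp tables or manual renames
```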
A: Those are some of the problems that I have seen. I don't know if that resonates with you; hopefully it does. So how are we solving those reliability and quality problems? We need something so that data can be reliable and used for production applications, right?
A
Delta
lake
file
structure
actually
consists
of
two
main
components.
The
first
component
is
the
data
objects.
The
data
objects
are
stored
as
parquet
files
in
the
scalable
storage,
like
s3
azure,
google
cloud
storage
second
component
is
the
scalable
transaction
log.
One
of
the
really
cool
properties
of
delta
lake
is
that
it's
as
highly
available
as
cloud
data
lakes,
so
you
automatically
get
all
the
same
availability,
scalability
and
flexibility
that
cloud
providers
have
already
baked
into
their
flagship
service.
A: These files carry meaningful transaction information, like indexes and statistics, to ensure that you have all the metadata about every commit and every change that happened in the table. Delta Lake also periodically takes checkpoints in the same log folder; you can see that there are some checkpoints added, and a checkpoint is sort of like a shortcut to fully reproduce a table state. This is useful because it allows a query engine to avoid reprocessing the entire log.
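A rough sketch of what this looks like on disk, with made-up paths: the data objects are top-level Parquet files, every commit is a numbered JSON file under _delta_log/, and checkpoints summarize the log so engines can avoid replaying it from the start.

```python
# Illustrative layout of a Delta table directory:
#
#   /tmp/tables/loans_delta/
#   ├── part-00000-....snappy.parquet                 # data objects
#   ├── part-00001-....snappy.parquet
#   └── _delta_log/
#       ├── 00000000000000000000.json                 # commit 0
#       ├── 00000000000000000001.json                 # commit 1
#       ├── ...
#       └── 00000000000000000010.checkpoint.parquet   # periodic checkpoint
import os

for root, _, files in os.walk("/tmp/tables/loans_delta"):
    for name in sorted(files):
        print(os.path.join(root, name))
```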
A: The actions of each write are then recorded in the transaction log that we saw earlier, as ordered atomic units known as commits. Atomic means that in a Delta table, either a data file is stored in its entirety or it isn't stored at all; the write will simply fail. Through these commits you can access any historical version of the data, and I will show you how that is useful later on in a slide. So now we have an understanding of how the logs are stored.
A
Delta
lake
allows
the
same
logs
to
be
accessed
by
multiple
users
and
also
allows
readers
and
writers
to
perform
actions
at
the
same
time-
and
you
know
you
might
see
like
how,
how
does
that
happen,
and
this
is
fine,
because
there
no
one
is
going
to
see
any
changes
until
you
have
successfully
committed
a
change
and
log
says
that
the
table
is
now
this
version
of
the
table.
So
basically,
this
is
made
possible
by
snapshot,
isolation,
property
of
delta
lake.
A
In
this
figure,
for
example,
one
of
the
query
is
actually
doing
an
insert
and
another
one
is
updating
the
table
with
the
file:
zero,
zero,
three
dot
parking,
but
only
one
of
them.
Those
actions
will
succeed.
Either
a
reader
will
be
able
to
see
zero
zero,
one
plus
zero,
zero,
two
dot
par
k
or
only
zero,
zero.
Three
dot
parking.
A
Now,
what
happens
when
multiple
writers
want
to
update
the
same
table?
We
talked
about
readers
and
writers.
At
the
same
time
now,
let's
talk
about
multiple
writers,
because
delta
lake
has
optimistic
concurrency
control.
Multiple
writers
can
concurrently
modify
a
delta
table
by
agreeing
on
the
order
of
changes.
A
For
example,
when
you
have
a
query
like
this
one,
where
there
are
two
writers
trying
to
modify
a
delta
delta
table
with
002.json,
only
one
of
their
changes
will
succeed
and
other
change
will
fail
in
this
case
writer,
zero,
zero,
zero,
zero
writer,
two
with
zero
zero,
two
dot
json
will
be
successful
because
of
the
agreement
of
order
of
changes.
A: Schema enforcement, also known as schema validation, is a safeguard in Delta Lake that ensures data quality by rejecting writes to a table that do not match the table's existing schema. It's like the front desk manager at a busy restaurant who only seats guests who have a reservation and turns you away if you don't have one.
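A minimal sketch of that safeguard in action, assuming the Delta-configured `spark` session from the earlier merge sketch and the same made-up table path: an append whose columns don't match is rejected instead of silently landing.

```python
from pyspark.sql.utils import AnalysisException

bad_rows = spark.createDataFrame([(1, "active", "surprise")],
                                 ["id", "status", "unexpected_col"])
try:
    bad_rows.write.format("delta").mode("append").save("/tmp/tables/loans_delta")
except AnalysisException as e:
    # Delta refuses the append because of the extra column that isn't in the table schema.
    print("write rejected:", e)
```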
A
Moreover,
the
schema
evolution
feature
of
delta
lake
allows
users
to
easily
change
the
table's
current
schema
so
that
you
can
now
accommodate
changing
data
over
time,
and
you
know
there
are
two
modes:
it
supports
append
and
overwrite,
so
you
can
just
give
dot
options
schema,
merge
schema,
and
it
will
understand
that
this
is
a
merge
schema
and
it
won't
actually
replace
the
whole
file.
It
will
add
separate
records
as
well
as
separate
two
columns.
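A short sketch of the same append succeeding once schema evolution is requested, re-using the assumed `spark` session and table path from the previous sketches.

```python
evolving_rows = spark.createDataFrame([(2, "active", "new info")],
                                      ["id", "status", "unexpected_col"])
(evolving_rows.write.format("delta")
    .mode("append")
    .option("mergeSchema", "true")     # evolve the table schema instead of rejecting the write
    .save("/tmp/tables/loans_delta"))
# Existing rows show NULL for the new column; new rows carry both the old and new columns.
```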
A
Now
that's
about
how
delta
lake
works
behind
the
scenes.
If
you
remember,
I
talked
about
versioning
and
how
it
solves
some
of
the
use
cases.
So,
let's
look
at
those
so
delta
data
versioning
is
one
of
the
most
helpful
feature
in
lot
of
different
ways.
We
have
seen
you
know
with
regulations
as
well
as
audit
auditing.
If
there
is
data,
of
course,
somebody
wants
to
audit
it
right.
A: How that information can be used here is that, since Delta Lake records every action that has been performed, it also captures in the metadata the schema, the files that were impacted, and so on. So you can use DESCRIBE HISTORY to see all the commits and look at the history of table changes; that's how you can address auditing. The second use case is reproducing experiments. During model training, data scientists run various experiments with different parameters on different data, and after a period of time they revisit those experiments to reproduce the models.
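A quick sketch of both pieces, again assuming the Delta-configured `spark` and the made-up table path: inspecting the commit history for an audit, and time-traveling to an earlier version or timestamp of the table.

```python
# Every commit, who/what produced it, and which operation it was:
spark.sql("DESCRIBE HISTORY delta.`/tmp/tables/loans_delta`").show(truncate=False)

# Read the table as of an earlier version or an earlier point in time.
v0 = (spark.read.format("delta")
      .option("versionAsOf", 0)
      .load("/tmp/tables/loans_delta"))

as_of_june = (spark.read.format("delta")
              .option("timestampAsOf", "2022-06-01 00:00:00")   # illustrative timestamp
              .load("/tmp/tables/loans_delta"))
```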
A
Typically,
the
source
data
has
been
modified
by
somebody
else
right
a
lot
of
change.
At
times,
though,
the
changes
were
caught,
unaware
of
and
upstream
data
teams
can
actually
sometimes
modify
it
without
even
telling
you
so
delta
lake
time.
Travel
capabilities
works
well
in
conjunction
with
another
popular
linux
foundation,
project
called
ml
flow.
A
So
for
reproducible,
machine
learning,
training,
you
can
simply
log
a
timestamp
url
to
that
path
as
an
ml
flow
parameter
so
that
you
can
track
all
the
changes
versions
of
the
data
which
was
used
for
training
and
so
on,
so
that
capability
of
time
travel
allows
you
to
go
back
in
earlier
stages
and
settings
for
those
data
sets
and
reproduce
earlier
models,
and
to
do
that,
you
also
have
to
make
sure
that
that
data
historically
has
has
been
retained
somewhere.
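A minimal sketch of that pairing, with illustrative names and values: the run records exactly which snapshot of the Delta table it trained on, so the same snapshot can be re-read later.

```python
import mlflow

data_path = "/tmp/tables/loans_delta"        # assumed table path from earlier sketches
as_of = "2022-06-01 00:00:00"                # snapshot used for this training run

with mlflow.start_run():
    mlflow.log_param("data_path", data_path)
    mlflow.log_param("data_timestamp", as_of)    # enough to re-read the exact same snapshot
    train_df = (spark.read.format("delta")
                .option("timestampAsOf", as_of)
                .load(data_path))
    # ... train a model on train_df and log it with mlflow ...
```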
A
A
A
So
this
can
be
very
useful
during
reading
of
a
delta
table
where
you
can
skip
reading,
keep
reading
the
files
that
are
not
matching
specific,
specific
condition.
For
example,
if
you
have
like
different
set
of
groupings,
it
will
skip
the
groups
and
only
look
for
the
groups
where
that
data
is
could
be
present.
A
So
in
the
example
here,
our
query
is
actually
looking
for
events
triggered
by
user
id
24
000,
and
you
can
see
those
groupings
here,
file,
one
dot
parking
file,
two
dot
parking
file,
three
dot
parque
in
those
first
two
groupings-
user
id
24
000-
will
not
be
found.
So
it
will
skip
those
two
groupings
and
look
for
file
three
dot
parque.
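A conceptual sketch of the same idea (file names, stats, and paths are made up): Delta keeps per-file min/max statistics in the transaction log, so a filtered read only opens files whose value range can contain the filter value.

```python
# Per-file statistics recorded in the log (illustrative):
#   file1.parquet  ->  user_id: min=1,      max=10000
#   file2.parquet  ->  user_id: min=10001,  max=20000
#   file3.parquet  ->  user_id: min=20001,  max=30000
#
# Only file3.parquet can contain user_id 24000, so the other files are skipped.
events = spark.read.format("delta").load("/tmp/tables/events")
events.where("user_id = 24000").show()
```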
A
So
it's
easy
for
a
delta
to
skip
the
records
that
don't
match
the
query
condition
and
you
save
a
lot
of
bucks
now.
Another
cool
feature
is
generated
columns.
Let's
say
you
have
a
table
that
have
a
timestamp.
We
deal
with
a
lot
of
timestamps
right
because
we
have
to
find
records
from
like
specific
dates
and
years,
so
you
don't
have
to
write
partition
by
timestamp.
That
would
result
in
way
too
many
partitions.
A
Instead,
you
want
to
particularly
partition.
By
date
you
can
just
do
a
column
adjustment
that
takes
the
time
stem
and
converts
it
into
a
date,
but
you
need
to
do
that
manually.
So
remember.
I
remember
my
sequel
days
when
I
had
to
deal
with
a
lot
of
time.
Stamps
and
like
different
tools
did
not
match
this
query
formats,
so
you
have
to
play
around
with
time
stamps
but
delta
lake
has.
This
feature
called
generated,
columns
which
automatically
calculates
those
dates
and
times
times
and
values
for
you.
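A sketch of a generated column, issued as SQL DDL through spark.sql with made-up table and column names. Depending on your Delta Lake version, you may need to create the table through the DeltaTable builder API instead of SQL.

```python
spark.sql("""
  CREATE TABLE IF NOT EXISTS events (
    event_id    BIGINT,
    event_time  TIMESTAMP,
    event_date  DATE GENERATED ALWAYS AS (CAST(event_time AS DATE))
  )
  USING DELTA
  PARTITIONED BY (event_date)
""")
# Writers only supply event_time; Delta fills in event_date, so you can partition
# and filter by date without maintaining that column by hand.
```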
A: This co-locality is automatically used by the Delta Lake data skipping algorithm that we saw earlier to dramatically reduce the amount of data that needs to be scanned. To Z-order data, you simply specify the columns that you want to Z-order by, and queries on the most common columns can then find the data in the same files rather than having to look at many files.
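A one-line sketch of what that looks like in practice, assuming the made-up events table from the earlier sketches; OPTIMIZE with ZORDER BY is available in recent Delta Lake releases.

```python
# Rewrite the table's files so rows with nearby user_id values are co-located,
# which makes the min/max statistics much more selective for user_id filters.
spark.sql("OPTIMIZE delta.`/tmp/tables/events` ZORDER BY (user_id)")
```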
A
A
You
want
to
use
it
for
and
that's
because
we
we
released
the
delta
standalone
a
few
months
ago,
so
it
can
integrate
well
with
other
ecosystem
projects
and
also
it's
available
from
a
wide
variety
of
languages,
a
wide
variety
of
services-
and
there
are
some
popular
connector
tools
for
data
engineers
that
it
integrates
with.
So
you
can
query
many
different
databases
and
also
delta
lake
is
multi-cloud,
so
it
operates
on
aws,
google
cloud
and
azure.
A: It's been a long, very exciting, and thrilling journey to get here, ever since Delta Lake was open sourced in April of 2019 at Spark Summit, which is now the Data + AI Summit. You can go to the delta.io website and check out our blogs and release notes, and you will find each of these features and how they are used. And we are not done: the community is always looking at and working on bringing more innovation to data engineering, and these are some of the features the community is working on.
A
You
can
stay
tuned
with
the
updates
on
our
roadmap
through
github,
where
we
discuss
all
the
features
and
track
progress
there
and
open
source
projects
cannot
be
here
without
thriving
community
of
users
and
developers,
so
lots
of
organizations
have
adopted
and
are
contributing
to
delta
lake.
This
is
just
a
subset
of
many
organizations
that
are
running
that
workloads
on
delta
lake
and
the
work
that
they
are
doing
is
very
transformational
collectively.
A
More
than
an
exabyte
of
data
gets
processed
per
day
on
delta
lake
and
we
have
an
engaged
very
active
community
of
slack
users,
which
is
more
than
6
000,
now
very
exciting,
and
more
than
50
or
50
companies
have
contributed
to
data
lake.
A
So
while
we
have
exciting
momentum
going
on
in
the
community,
I
want
to
encourage
you
to
get
involved.
There
are
a
bunch
of
channels
how
you
can
get
involved
with
delta
lake
check
us
out
on
slack
check
us
out
on
youtube
channel
where
you
will
find
tech
talks,
live
q,
a's
demos
there's
a
mailing
list.
If
you
want
to
get
involved
with
the
actual
code,
you
can
always
join
on
github.
A: We have the Data + AI Summit next week, where there are over 17 sessions just on Delta Lake, and of course, since it's a data and AI summit, you get to hear a lot of use cases from different companies as well as different projects the community is working on. Specific to Delta Lake, Michael Armbrust is also going to give a keynote, there are some cool AMAs with the committers, and we are celebrating Delta Lake's third birthday party.
A
So
you
can
join
in
person
in
san
francisco
or
you
can
totally
join
it
from
the
comfort
of
your
couch
online
for
free
thanks
a
lot
for
joining
me
today
and
yeah.
I
can
take
any
questions
you
may
have.
B: Yes, yeah, I'm curious. I think the way the Parquet data files are used to manage the data, and the ability to go back in time and branch off later, makes sense. When you go ahead and change the schema of your table, when you update the schema of your table, is that also stored in a similar format? Something comparable?
A
Yeah,
that's
a
good
point,
so
every
change
that
you
do
either
it's
a
schema
change
or
a
data
change.
It
will
add
a
new
version
of
commit
to
that
delta
lake,
folder
delta
log
folder,
and
you
see
that
you
remember.
I
showed
you
the
two
file,
the
delta
log
log
directory
as
well
as
well
as
data
files.
So
it
will
keep
all
those
changes
in
that
data
delta
log
directory.
C: Thank you for presenting. My question: if I heard you correctly, there is a retention policy on the Delta log, then?
A
Yeah,
so
typically,
there
is
a
retention
policy
as
well
as
vacuum,
so
retention
policy
is
30
days.
So
anytime
you
have
a
changes
to
data,
it
will
store
it
for
30
days.
Of
course
you
can
play
around
with
retention.
There's
a
specific
command
you
can
give
either
to
retain
only
for
seven
days,
because
maybe
your
organization
doesn't
want
to
keep
historical
data,
so
you
can
always
play
around
with
that.
Yeah.
A
There
is
another
retention,
another
thing
I
would
say:
let's
say
if
you
have
deleted
the
data,
so
actually
it
doesn't
take
all
the
data
away.
It
will
store
that
deleted
data
for
seven
days,
so
it
will
still
allow
you
to
you
know
to
retain
that
data
in
case
somebody
accidentally
deleted
it.
So
after
seven
days
it
will
automatically
get
rid
of
that
deleted
data.
But
if
you
do
want
to
not
keep
that
seven
day
deletion
policy,
you
can
always
vacuum
at
zero
retention,
so
it
will
completely
remove
that
data.
So.
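A hedged sketch of how those knobs look in practice, with an assumed table path and illustrative values; the 30-day and 7-day figures above roughly correspond to the delta.logRetentionDuration and delta.deletedFileRetentionDuration table properties.

```python
# Keep less history than the defaults:
spark.sql("""
  ALTER TABLE delta.`/tmp/tables/loans_delta` SET TBLPROPERTIES (
    'delta.logRetentionDuration' = 'interval 7 days',
    'delta.deletedFileRetentionDuration' = 'interval 7 days'
  )
""")

# Remove files no longer referenced by the table and older than the retention window:
spark.sql("VACUUM delta.`/tmp/tables/loans_delta`")

# Vacuuming with zero retention requires disabling a safety check first; use with care,
# since it permanently removes the history needed for time travel.
spark.conf.set("spark.databricks.delta.retentionDurationCheck.enabled", "false")
spark.sql("VACUUM delta.`/tmp/tables/loans_delta` RETAIN 0 HOURS")
```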
A
Good
question
any
other
questions
awesome.
Well,
hopefully
I'll
see
you
in
like
some
of
the
delta
sessions
or
maybe
like
if
you
get
involved
in
the
community.
Thank
you
all
for
attending
appreciate
it.