From YouTube: Deep-Dive into Delta Lake
Description
Delta Lake is becoming a de-facto standard for storing big amounts of data for analytical purposes in a data lake. But what is behind it? How does it work under the hood? In this session we will dive deep into the internals of Delta Lake by unpacking the transaction log and also highlight some common pitfalls when working with Delta Lake (and show how to avoid them).
Connect with us:
Website: https://databricks.com
Facebook: https://www.facebook.com/databricksinc
Twitter: https://twitter.com/databricks
LinkedIn: https://www.linkedin.com/company/data...
Instagram: https://www.instagram.com/databricksinc/
Hello and welcome, everyone, to my session at this 2022 summit about a deep dive into Delta Lake. Before we get started, let me also quickly introduce myself so that you know who's speaking and who's giving you all this information.

Okay, so let's get started. I prepared a little agenda of what we're going to cover in the next hour, or the next 40 minutes: we'll have a quick look at what Delta Lake actually is, and then go into some details on the delta log, which is obviously very important for the Delta Lake architecture. I'll also show you some brief examples of how this actually works and what happens under the hood if you perform some of the most common transactions on a Delta Lake table. We will cover file and storage management, as this is very crucial for Delta Lake, and we'll also have a look at streaming use cases and what you need to watch out for when you use Delta Lake for streaming. The last technical thing that we'll cover are table properties, properties that you can set in Spark or on the table itself to control how Delta works internally. And at the end, I've prepared some conclusions and lessons learned.
So what is Delta Lake? Delta Lake is basically an open-source storage format that you can use to store your data. It's compatible with most of the common processing engines for big data, foremost Spark, but also others like Hive, Trino, Flink and so on. What is also very important, and makes it very usable across different people and skill sets, is that there are APIs for most of the common languages: Python, Scala, and also Ruby and Rust, for example.

One of the key features of Delta Lake is that it is ACID compliant in terms of transactions, meaning that whenever you run a transaction on your Delta Lake table, be it an insert, an update, a delete, whatever, Delta Lake, or the Delta Lake protocol, basically ensures that this transaction is either done completely or not at all. So you basically never end up in a state where your transaction is only executed halfway and then aborted; this can never happen, and the data is always in a consistent state.
Some other features, which are actually very important because they make Delta Lake very usable, are the support for UPDATE and MERGE. If you used Hive in the very beginning, you probably know that updating data row by row is actually not that simple. Delta Lake really helps here, especially with the MERGE statement, which provides a lot of cool features that you can use to easily manage your data on top of the transaction protocol.

There are also features built in, foremost time travel: as each transaction is basically recorded in the delta log, which we will cover later, you can always jump back to a previous version of your table and of the data in your table. There are also features for schema evolution and schema enforcement, so it's actually not possible to add data with a wrong schema to an existing Delta table; that transaction would just fail if you try to, I don't know, insert a string value into an integer column, for example.

Also, Delta Lake supports both batch and streaming, which is very important because you can maintain the same code for your batch processing and for your stream processing, and it is also 100% compatible with Apache Spark.
When it now comes to the actual implementation: what you get when you create a Delta table is basically a folder that contains the metadata, the transaction log (the delta log, which we will cover later), and obviously also the data itself. This folder is more or less self-contained, so you could just take that folder, put it on a USB key and open it from there if you wanted. It's not actually tied to any file system or storage system; you can place it anywhere. Obviously it makes the most sense in a data lake, but as I said, that's not mandatory, as the Delta Lake table contains everything it needs to operate in itself, in that folder. Anyone who wants to work with the Delta table just needs to know the location.

So, for example, if you have a Delta Lake table in your data lake and you have a Spark cluster that needs to process it, you can just tell Spark the path of the Delta table and then read it as a regular data frame. You don't need to specify the schema, you don't need to specify anything else; it's all handled automatically for you. So that's pretty convenient.
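As a rough sketch of what that looks like in PySpark (the path here is just a made-up example):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Point Spark at the Delta table's folder; schema and file list come
    # from the _delta_log, so nothing else needs to be specified.
    df = spark.read.format("delta").load("/mnt/datalake/sales/orders")
    df.show()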
So how is this all possible? As I said, the key concept of Delta Lake is actually a transaction layer on top of the actual data, and this transaction layer is called the delta log. This delta log is stored as part of the Delta Lake table in a special folder called _delta_log, and it basically contains everything that's important to manage the transactions and keep the table data and metadata up to date. The delta log foremost contains the table schema, obviously, and all the changes to it; so if, say, a new column is added, that's also recorded there. It also contains the references to all the actual files that contain the data, and for every operation and transaction that you run on the Delta table it also stores some additional metadata and metrics, as you can see on the right.
Those transactions in the delta log are actually stored as JSON files, and after 10 transactions a so-called checkpoint file is generated, which basically takes all the previous transactions and aggregates them into one big Parquet file. The reason for that is that, as you can imagine, if you have a table that's out there for a couple of years and you have a lot of transactions on it, that would result in, I don't know, millions of JSON files, and reading those JSON files would be a big hassle, because you would have to touch, let's say, a million JSON files. To avoid this, those checkpoints were introduced, and basically all you need to do is read the latest checkpoint and all the JSON transactions from there on, so it's a maximum of 11 files, actually.
The delta log is also what enables the ACID-compliant features for concurrency control. If multiple processes write to the same data at the same time, the output would not be the same, or not what you expect, so one of those transactions has to fail; that's the so-called optimistic concurrency control. Basically, before the process actually commits the transaction, it checks whether the data has changed since the transaction was started. If that's the case, the transaction will fail. And the delta log is also used for time travel.
A
So
you
see
all
those
different
versions
on
the
right
side.
You
can
basically
say:
okay,
please
show
me
the
table
as
it
was
in
version:
zero,
zero,
seven
for
example,
and
it
will
return
the
data
as
it
was
back
then
also
for
streaming
as
every
transaction
is,
is
locked
in
the
delta
log.
We
can
also
use
this
information
for
streaming
and
process
the
data
as
it
was
changed
or
added
in
order
for
for
our
streaming
pipelines.
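In PySpark, reading an older version with time travel looks roughly like the sketch below (path, version number and timestamp are made-up examples):

    # Read the table as it was at an older version recorded in the delta log.
    df_v7 = (
        spark.read.format("delta")
        .option("versionAsOf", 7)
        .load("/mnt/datalake/sales/orders")
    )

    # Time travel by timestamp is also possible.
    df_old = (
        spark.read.format("delta")
        .option("timestampAsOf", "2022-06-01")
        .load("/mnt/datalake/sales/orders")
    )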
There is a neat command called DESCRIBE HISTORY, which you can run on your Delta table, and it will basically give you all the information that currently exists in this delta log. You see the version, the timestamp when it was changed, the user ID and username, the actual operation, and there are a lot of other columns that provide even more information; just have a look. I just wanted to show you that you can actually find all this information in an easy-to-use way. That also helps a lot when you're debugging some ETL pipelines and wondering what actually happened to your Delta tables: just have a look at the history and you will very likely get a lot of information from there.
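A minimal sketch of that command, addressing the table by a hypothetical path:

    # Show the transaction history kept in the delta log.
    history = spark.sql("DESCRIBE HISTORY delta.`/mnt/datalake/sales/orders`")
    history.select("version", "timestamp", "userName", "operation").show(truncate=False)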
So how is the delta log actually read? Once you read the checkpoint file, you already have most of the transactions in there, and if there are any transactions after the checkpoint file, the processor also needs to read all the following JSON files. As we create a checkpoint file by default after every 10 transactions, it only has to read a maximum of 10 JSON files in addition to the checkpoint file. Once the delta log is read, the engine basically receives all the files that belong to the latest state of the Delta table, which are basically Parquet files, reads them, and returns the results to the caller. So that's how a query is processed on top of a Delta Lake table.
So how does Delta Lake work in practice, and how are the transactions, the delta log and the actual data files persisted in the data lake? Let's take this very simple example: we have a table of three rows. It was created as a Delta table, so we have one transaction in the delta log folder, and in that transaction we basically added one file, the part-0001 Parquet file, which contains the three rows.
Okay, so that's what it looks like after the update. Now what happens if I delete a row? If I run a DELETE FROM the product table WHERE product equals 'PC', it might be a bit counter-intuitive, because we're actually deleting some rows, but what happens is that it still creates a new Parquet file. It only has two rows, though. So it still adds files, even though I'm running a delete, which is a bit counter-intuitive, right? And again, the delta log entry, the JSON file, contains a remove entry for the previous Parquet file and a new add entry for the new Parquet file that now only contains those two rows.

For an insert it's slightly different, because for an insert we don't actually have to remove anything: the Spark engine that reads the Delta table, or the consumer in general, basically always has to read all the Parquet files that are in the current Delta table and belong to the latest version, so we can simply add a new file. In this case we add one row, and we also add one new file.
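Sketched in SQL run through PySpark, the two statements discussed above would look roughly like this (the products table, its path and its columns are hypothetical):

    # DELETE rewrites the touched file: the delta log gets a "remove" for the
    # old Parquet file and an "add" for a new file with the remaining rows.
    spark.sql("DELETE FROM delta.`/mnt/datalake/products` WHERE product = 'PC'")

    # INSERT just adds a new Parquet file and a corresponding "add" entry;
    # nothing has to be removed.
    spark.sql("INSERT INTO delta.`/mnt/datalake/products` VALUES (4, 'Monitor')")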
So that's pretty straightforward and should give you a quick overview of what each transaction causes in the delta log and also on the storage. As you have seen, even if we delete data, the files still reside on the storage, so basically each transaction can potentially create a new file. An insert or an update obviously always creates new files, and even a delete can create new files. And as you can imagine, if you do a lot of transactions and touch a lot of different files, this can create a lot of files in your storage; it can be millions. Well, it very much depends on your use case and your table, but I guess you get the point. So how do we deal with this issue, with this amount of files? That leads us to the next big part, which is file and storage management.
So if I run this specific VACUUM command, it's actually not changing the data itself. As you can see, the table at the top on the left side is the very same as the table on the right side. But what happens is that it's actually cleaning up our storage: as you can see at the very bottom, the first and the second Parquet file are physically removed. Before, they were only logically removed due to the delete statement or the update statement, and now, once we run the VACUUM command, those files are actually physically deleted, which also frees up the storage, so we don't have to pay for it anymore.

The vacuum transaction is also logged in the delta log: there are specific operations called VACUUM START and VACUUM END, which were added just recently, like a couple of months ago, so that you can also keep track of that and know from the delta log history when the table was actually vacuumed.
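A sketch of that command (the path is hypothetical; 168 hours corresponds to the default retention of 7 days):

    # Physically delete files that were logically removed more than
    # 168 hours (7 days) ago; the current table version is never touched.
    spark.sql("VACUUM delta.`/mnt/datalake/products` RETAIN 168 HOURS")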
Okay, the second command that I want to mention here is the so-called OPTIMIZE. OPTIMIZE basically consolidates smaller files into larger files. A pain point for most big data processing engines is that if you have a very large number of very small files, the metadata overhead of reading those single files can be a potential bottleneck. So Delta Lake introduced the OPTIMIZE command, which basically consolidates small files, like the one with two rows and the one with one row, into a new file that contains all the rows. It's a very simple example here, but you get the point of OPTIMIZE. Again, the data stays the same; as you can see in the transaction, we logically remove two files and add a new one, but on the storage we now have the data twice again, even though basically nothing changed, right?
So what does VACUUM do? As I said, it basically physically removes files from the actual storage. It's a bit more complicated than what I've shown in the example, because it's not removing all files, but only files that have been deleted at least x days or hours ago, so files that have been outdated for longer than a given retention period. You can basically run VACUUM whenever you want; it will never have an effect on the most recent version of the Delta table. What you need to keep in mind, though, is that once you remove the physical files, you obviously cannot use time travel anymore and go back to, let's say, the table as it was seven or ten days ago if you ran VACUUM with a retention period of five days, because then everything that was deleted more than five days ago has also been physically removed. You would get an error message saying that the actual file doesn't exist anymore.
For OPTIMIZE, as I mentioned, it basically collapses small files into bigger files, but there are also some other features, mainly Z-ordering, which is a kind of clustering and ordering of the data that allows for better file skipping, similar to partition pruning. OPTIMIZE, as the name implies, is mainly used to optimize query performance, so that you don't have the overhead of the small files but can ideally read files that are already around one gigabyte, or a size in that range.
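Roughly, assuming a Delta Lake version (or Databricks) where the OPTIMIZE and ZORDER BY SQL commands are available, and with a hypothetical path and column:

    # Compact small files into larger ones.
    spark.sql("OPTIMIZE delta.`/mnt/datalake/products`")

    # Optionally cluster the data by a column so that file-level statistics
    # allow better data skipping on that column.
    spark.sql("OPTIMIZE delta.`/mnt/datalake/products` ZORDER BY (product)")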
Okay, some additional information for VACUUM and OPTIMIZE. VACUUM also has a DRY RUN parameter, which just tells you which files it would remove. The issue with the current implementation of VACUUM is that it would show you, like, a thousand files that would potentially be deleted if you ran the actual VACUUM command, but you don't know how many files there are in total. So it could be a thousand and one files, but it could also be a hundred thousand files; you just don't know. A little trick that you can use: if you execute the VACUUM command in a Scala notebook, or in a Scala cell, it actually prints out the number of files that it would delete, which is not the case if you run the very same statement in Python or in SQL, for example.
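The dry run mentioned here would look roughly like this (path and retention are again made up):

    # List (but do not delete) the files that a real VACUUM would remove.
    spark.sql(
        "VACUUM delta.`/mnt/datalake/products` RETAIN 168 HOURS DRY RUN"
    ).show(truncate=False)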
One thing to keep in mind when you run a VACUUM is that it can take a very long time. We had cases at customers where we were basically deleting a couple of million files, I think 30 million, and the job ran for, I don't know, like a week. The main reason for that was that the deletion was actually a single-threaded operation running on the driver only; it didn't really leverage the whole cluster and didn't distribute the load.
Another useful command is RESTORE. This can be very useful if you accidentally deleted data, or ran some ETL or data pipelines that modified the data and you're actually not happy with the output: you can just restore the previous version. It's super convenient, and what's also nice about it is that it doesn't really copy any data; it's just a metadata-only operation that makes your Delta Lake table point to a different version of the files. The restore operation also creates a new version, so it will also be logged in the delta log, which is quite good actually.
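A sketch of a restore, assuming a hypothetical table registered in the metastore as sales.orders:

    # Roll the table back to an earlier version; only metadata changes,
    # and the restore itself is logged as a new version in the delta log.
    spark.sql("RESTORE TABLE sales.orders TO VERSION AS OF 7")

    # Restoring by timestamp is also possible.
    spark.sql("RESTORE TABLE sales.orders TO TIMESTAMP AS OF '2022-06-01'")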
The second function, CLONE, is also very important, especially for testing. CLONE basically allows you to clone an existing Delta Lake table into a different path. There are two options: you can use a shallow clone or a deep clone. A shallow clone basically just copies, or forks, the delta log, and every operation that you run on top of the shallow clone will not touch the original table anymore; it will only touch your shallow clone and do all the changes there. So if you want to test some data pipeline, the safest way is probably to create a shallow clone up front and then run the pipeline on that shallow clone.

The other option is the so-called deep clone: in addition to the delta log that is copied, it also copies all the data files. So you need to be aware that if you have a big table with a couple of terabytes, it will actually copy the whole table, and again, that can take some time.
Okay, some additional information about RESTORE and CLONE. You can run a restore as often as you want. As I mentioned, a restore also creates a new version in the delta log, so if you are, for example, not happy with your restore and you want to go back to the original version, you can basically restore the restored version. That's just nice to know. And as I mentioned before, it doesn't create any new data files, it's just a metadata operation, so you cannot really break anything there.

Regarding clones, one very important feature, especially about deep clones, is that they can be incremental. If you run a CREATE OR REPLACE ... DEEP CLONE command, it will actually do incremental updates to your clone, which is super convenient and can be used, for example, for backups. So if you want to back up a Delta Lake table, using deep clones, or incremental deep clones, is actually a very good and very efficient solution, because you don't need to scan the storage; everything can basically be done using the delta log only. That's really convenient.
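Sketched in SQL (table names are hypothetical; CLONE needs a Delta Lake or Databricks version that supports it, and deep clones in particular are a Databricks feature):

    # Shallow clone: fork only the delta log; changes to the clone never
    # touch the original table's data files.
    spark.sql("""
        CREATE OR REPLACE TABLE sales.orders_test
        SHALLOW CLONE sales.orders
    """)

    # Incremental deep clone: copies the delta log and the data files;
    # re-running it only transfers what changed, which makes it a cheap backup.
    spark.sql("""
        CREATE OR REPLACE TABLE backup.orders
        DEEP CLONE sales.orders
    """)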
As you have probably been working with big data already, I guess you are familiar with partitioning anyway, but I want to briefly cover what we want to achieve when we create partitions for our Delta tables. Basically, we partition our data so that either our file management is easier or our query performance is better. For the lower layers, bronze and silver, the tables are usually partitioned for ETL performance. So, ideally, if you load data into bronze, the data that you load matches exactly one partition. For example, if you get a daily export from your source system, you would usually create one partition per day for each export that you receive. For silver it's usually similar, and ideally you can just use the same partitioning concept as on bronze, because then you can copy or reload silver very easily by just replacing partitions.
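As a small sketch, a daily bronze load partitioned by its load date (the input data frame, columns and path are hypothetical):

    from pyspark.sql import functions as F

    # Append today's export into bronze, partitioned by the arrival date,
    # so one daily load maps to exactly one partition.
    (
        daily_export_df.withColumn("load_date", F.lit("2022-06-27"))
        .write.format("delta")
        .mode("append")
        .partitionBy("load_date")
        .save("/mnt/datalake/bronze/orders")
    )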
But this again very much depends on your use case and whether that's actually possible. For gold, which is usually designed towards query performance, because end users query gold, performance should be your main priority, which means that your partitioning can vary. For example, say you receive data on a daily basis in bronze and silver and process it there, and the data contains a start date that is not aligned with that partitioning: you can receive data today which has a start date from, I don't know, last year, for example. If users usually query by this start date column, it's a good idea to change the partitioning in gold to use that start date column, because then every query that uses the start date can already be filtered down to the partitions that actually contain the data, without having to read and scan the whole table, which obviously is very time-consuming in general.
In this case, you need to change the partitioning, as I said, between silver and gold. That's just something that you need to keep in mind: changing partitioning can be very complicated and resource-intensive, because potentially every new batch that you load from silver touches many gold partitions. We had cases where loading one batch basically rewrote the whole gold table, and as you can imagine, that's not really efficient in terms of processing time, but also not in terms of storage, because we know that whenever we run an update or a merge it's actually copying the data, and if we touch the old partitions of the gold table we basically copy all that data again, thereby consuming twice the amount of storage that the actual table would need.
To make your queries more efficient, and also your ETL, it's advisable to always explicitly specify the partitioning columns when you, for example, merge into a table, delete data, or also select data. As you can already see from the example that I've given you, a good candidate for partitioning is usually time, but again, it very much depends on your use case. Let's say 95 percent of the Delta Lake tables that I've seen were at some point partitioned by time, whether it's the arrival time of the data in bronze and silver or some event time or start time in gold; it's usually related to time. Depending on your requirements and your data, you can add additional columns to your partitioning. One thing to keep in mind: if you have too many partitions, you again potentially run into trouble with a large overhead of reading all the metadata, which is something that you would like to avoid.
As a rule of thumb, you should have a few thousand partitions at a maximum, and ideally a single partition should be one gigabyte or bigger; it doesn't make sense to have a partition that's only one megabyte. However, it very much depends on your data, on the distribution of your data, on your query patterns, on your ETL patterns and so on.
Another thing that I need to mention is so-called generated columns. For some months now, Delta Lake has had a feature called generated columns, which basically allows you to derive data from existing columns and persist it in new columns. A very common use case is if you have an event timestamp column and you want to partition the table by the date, or the year, of that event timestamp: you can use a generated column that basically extracts this information from the timestamp column and populates it automatically. The cool thing about this is that if you follow that approach, the Delta engine will also try to push the filters that you have on a query down to the partitions.
So if you have a setup like this, where you have an event date column generated based on the event timestamp, and you run a query like SELECT * FROM my_table WHERE event_date = '2022-06-27', then it will basically do that push-down automatically for you. And even if you create more specific filters on the original event timestamp column, those will also be pushed down to the actual partitions, which makes your queries much more efficient. There are some functions that can be pushed down and some others that can't; I included the link here showing which ones those are, so if you're not sure, just have a look there.
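A sketch of such a setup using the delta-spark Python builder API (table and column names are hypothetical):

    from delta.tables import DeltaTable

    # event_date is derived from event_ts and used as the partition column;
    # filters on either column can then be pruned down to partitions.
    (
        DeltaTable.createOrReplace(spark)
        .tableName("events")
        .addColumn("event_id", "BIGINT")
        .addColumn("event_ts", "TIMESTAMP")
        .addColumn("event_date", "DATE", generatedAlwaysAs="CAST(event_ts AS DATE)")
        .partitionedBy("event_date")
        .execute()
    )

    # Both of these queries can be pruned to the matching partition(s).
    spark.sql("SELECT * FROM events WHERE event_date = '2022-06-27'")
    spark.sql(
        "SELECT * FROM events "
        "WHERE event_ts >= '2022-06-27 08:00:00' AND event_ts < '2022-06-27 09:00:00'"
    )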
Delta Lake supports transactions and also supports multiple updates at the same time, as long as each of the updates only touches a specific partition. So you can have 10 concurrent updates if each of those updates only touches one single partition and those partitions are not overlapping; then that's just fine, because Delta will manage it. So make sure that you always specify those partitions when you, for example, run a merge. This can also speed up the merge process itself, because it then knows in advance which partitions to scan for changes; otherwise, if you do not specify the partitions on the target, it will actually scan the whole target table.
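A sketch of a merge where the partition column shows up explicitly in the join condition (the schema is hypothetical, and an updates view with matching columns is assumed to exist):

    # Including the partition predicate in the ON clause lets Delta restrict
    # both the scan and the conflict detection to those partitions.
    spark.sql("""
        MERGE INTO gold.sales AS t
        USING updates AS s
          ON  t.sale_date = s.sale_date
          AND t.sale_date >= '2022-06-01'   -- explicit partition filter
          AND t.sale_id   = s.sale_id
        WHEN MATCHED THEN UPDATE SET *
        WHEN NOT MATCHED THEN INSERT *
    """)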
If you want to know whether your partition filters have actually been used in, for example, a merge statement, you can check the delta log history and search for the query predicates that were used when running the merge statement; ideally you will see your partition filters there.
Some more information on partitioning: on the right side we basically see the raw delta log entry for when a new file is added as part of a transaction. As you can see, in lines three, four and five there is the partitionValues object, which contains all the partitioning values that this file belongs to. Usually, if you have a look at a Delta table, you probably think that the partitions are resolved based on the path, the folders and the subfolders, but that's actually not the case; the path could point anywhere. The only thing that's actually important are those partition values.
For example, if you create a clone, the path points somewhere else and does not necessarily have to contain those folders and subfolders, but by default that's the way your data files are laid out, actually just for historical reasons, as it was the same for Hive in the past. What you will also notice is that the actual physical Parquet files do not contain the values for those partitioning columns. It doesn't really make sense, if you have Parquet files with one million rows, to store the same sales territory key, or the same date, one million times with the same value. So it is just omitted there and is only available from the delta log, and obviously the Delta engine, or the reader, has to pick up the value from there.
Another thing that's slightly different from other processing engines is that you don't have to specify all the partitioning columns sequentially. If you have, let's say, five partitioning columns, you don't need to specify them all; you can also specify only the last one, as it is resolved in the delta log only: it just uses your partition filter predicates to filter the delta log and then retrieves the files that still match your filters.
When it comes to streaming, Delta Lake streams are also processed as micro-batches, and the lowest granularity that we can actually stream is a file, or, well, a part of a file, for example a Parquet file. When you're reading from a Delta Lake table in a stream, the files are actually processed in order: first in order of the version, obviously, and within a version, if one transaction creates multiple files, you will see the physical files named part-00000, part-00001, part-00002, and that's the order in which they are processed in the stream from that table. To make streaming possible, and to basically save the state of what has already been processed, so-called checkpoints need to be created, not to be confused with the checkpoint files in the delta log.
You always need one checkpoint per source, and you can technically also stream from the same source multiple times if you use different checkpoints.

When you do streaming and you need to use MERGE: a merge is only available inside the foreachBatch command. You cannot, or, well, it doesn't make sense to, stream each individual row and run a merge statement for each of them, so what you actually need to do is run the merge as part of this foreachBatch function.
What you need to watch out for is the size of your trigger and of your batch, so how much data you actually read in one batch. This can be controlled with mainly two properties when you start the stream: maxBytesPerTrigger and maxFilesPerTrigger. There are different triggers that you can use. One of them is trigger once, which you should actually avoid, because it processes everything in one big batch, which is definitely not what you want; it is very likely to cause an out-of-memory error if you stream initially from a big table. There is another one, called trigger availableNow, which creates batches in the size that you specify, whereas trigger once ignores maxFilesPerTrigger and maxBytesPerTrigger. And yeah, you can basically stop and resume a stream at any time.
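Putting those pieces together, a rough sketch (paths, table names and batch sizes are made up, and trigger availableNow requires a Spark version that supports it):

    from delta.tables import DeltaTable

    # Read the Delta table as a stream, capping the micro-batch size by file count.
    source = (
        spark.readStream.format("delta")
        .option("maxFilesPerTrigger", 100)
        .load("/mnt/datalake/silver/orders")
    )

    def upsert_batch(batch_df, batch_id):
        # MERGE can only run per micro-batch, inside foreachBatch.
        target = DeltaTable.forName(spark, "gold.orders")
        (
            target.alias("t")
            .merge(batch_df.alias("s"), "t.order_id = s.order_id")
            .whenMatchedUpdateAll()
            .whenNotMatchedInsertAll()
            .execute()
        )

    (
        source.writeStream
        .foreachBatch(upsert_batch)
        .option("checkpointLocation", "/mnt/checkpoints/gold_orders")
        .trigger(availableNow=True)   # drain the backlog in bounded batches, then stop
        .start()
    )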
We have a scenario at one of our customers where we basically implemented a streaming architecture, but it's not running 24/7: we just start the pipeline every now and then, like once a day, it processes everything that has been accumulated since the previous execution, and then we stop the pipeline again, because it's just more cost-efficient and we don't need to have real-time data.
Now to Delta Lake table properties: you can use properties on a Delta Lake table directly, or also in the Spark context, to control how Delta works. There are some more important ones, and some that I'm not mentioning; you'll see them on the next slide. What's important, though, since you can specify them on different levels, is to know which ones are actually used: if you have specified a property on the Delta table itself and you have also configured the corresponding Spark setting, the setting from your execution context will always override what you have specified as the table property.
These are just some important table properties that you should know. If you want to know the details, you can simply google them or look them up in the references that I'll show you afterwards. And if you want to make your data management and your Delta Lake tables very efficient, I usually recommend always running commands like OPTIMIZE or VACUUM with the defaults and specifying the exceptions on the table level. So, for example, if I want one table to delete files older than three days instead of the default of seven days, I can just specify that exception on that table and then run a plain VACUUM command on it; it will then use the setting from the table, and if I run the VACUUM on another table which doesn't have the table property defined, it will just use the default of seven days.
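A sketch of that pattern (the table names are hypothetical; the property is the standard Delta retention property for deleted files):

    # Exception: this one table only keeps 3 days of deleted files.
    spark.sql("""
        ALTER TABLE silver.events
        SET TBLPROPERTIES ('delta.deletedFileRetentionDuration' = 'interval 3 days')
    """)

    # A plain VACUUM now picks up the table property (3 days) here ...
    spark.sql("VACUUM silver.events")

    # ... and falls back to the default retention (7 days) on tables
    # without the property.
    spark.sql("VACUUM silver.customers")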
Okay, so let's come to the conclusion of my session and the takeaways. Delta Lake can obviously solve a lot of problems, especially when it comes to data management, updating and merging data, and doing ETL and data pipelines. However, due to the nature of the delta log and how it handles versions and data files, file management is super crucial, because, as you have seen, each transaction can potentially create new files, and if you run, for example, an OPTIMIZE on a whole table, it will actually, at least in the first run, duplicate the whole table physically.
So data maintenance jobs are absolutely mandatory: you should have a VACUUM job, and an OPTIMIZE job, though the more important one is the VACUUM job I guess, scheduled on a regular basis, ideally daily, or maybe even weekly or monthly if that also works for you; you just need to check. But you should definitely have some in place. And, as I just mentioned before, use table properties to manage the different settings of your Delta Lake tables.
If you want to have a look at the internals and more details on the Delta Lake transaction log protocol and on Delta Lake itself, there are some very good links that basically describe what's happening underneath and, for example, what table properties you can use and what they actually do. These are just two very important resources and references. And yeah, that's it. Thank you from my side for having you in my session. If you have any questions, please feel free to reach out to me; you have all my contact details on the first slides, so just drop me an email or ping me on Twitter if you need anything. Thank you.