From YouTube: Delta Lake Community Office Hours (2022-04-19)
Description
Join us for the next live Delta Lake Community Office Hours on April 19 at 9 AM PST / 12 PM EST and ask us your Delta Lake questions. Featuring: Tathagata Das, Allison Portis, Ivana Pejeva, and Vini Jaiswal. The session will be hosted live and the recordings are available on the Delta Lake YouTube channel.
A
We had to postpone this session due to some technical difficulties, but we finally got to it. Here we have Ivana and TD from our Delta Lake ecosystem teams. We are here to answer questions about Delta Lake releases, Delta Lake features, anything else that's going on, and even features you don't see on our roadmap; you can ask about those too.
A
A quick recap from the last session: we had Scott Benke and Denny as panelists, and they provided amazing insights on the connector ecosystem and the Presto Delta and Trino Delta integrations.
A
We also covered the file compaction feature and the improvements for restoring a Delta table to earlier snapshots, and we will talk a little more about those in this session as well. There were also a few highlights on future releases: improving file compaction to capture partial progress, and adding Z-ordering so that queries will be faster.
A
So that's the recap. Now I will hand it over to the panelists so they can do a quick introduction. TD, why don't you say hi and a few words to our community?
B
Hey everyone, I'm universally called TD; those are my initials! I am a staff software engineer at Databricks. I work with the folks that focus on the Delta Lake OSS project. My background: I've been involved with the Spark community for the last 12 years, and I've been working on stream processing in Spark for close to 10 years now.
C
Thanks. Hey everyone, I'm Ivana. I'm a data engineer working as a consultant based in Belgium. I'm part of the Delta ecosystem, and I do a lot of implementations using Delta at different customers, in production systems.
A
That's awesome. You're working on large-scale products, Ivana, so glad to have you here, along with our OG TD. That's exciting. So TD, you mentioned that we were working on the Delta 1.2 release. What are some of the key features the community should be excited about? I know we talked about performance and about user experience improvements, so what are a few things you would like to shed light on?
B
Yeah, so let's see where we can start. As you said, there is performance and there is user experience. Performance-wise, one of the features that was most awaited for a long time, and that we finally have in open source, is column stats. The short explanation of the feature is: when we write out files in the Delta Lake format, we automatically collect stats for the columns in each file, for the first 32 columns, which is configurable.
B
Within each file, we keep track of the max and min values of each column. That is the column stats, and we save them as part of the Delta log, which contains all the metadata about what files are present in the Delta table. The advantage is that when you actually go to read the table, it doesn't matter whether it's a SQL query or a DataFrame query, or where you're querying from.
B
If your query has filters, like 'read only rows where this column equals five', we can use these min/max stats to pretty aggressively eliminate files that cannot possibly contain the rows you're looking for, even before reading them.
B
We completely skip reading files whose min/max range does not include five. That is data skipping, and depending on the type of query, if the query is expected to filter out a lot of files and focus on a few rows, this can eliminate a lot of scans out of the box.
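To make that concrete, here is a minimal PySpark sketch (not from the session) of the kind of selective query that benefits from this file-level min/max skipping; the table path and column name are placeholders:

```python
from pyspark.sql import SparkSession

# Minimal sketch: a selective filter that Delta can answer while skipping files
# whose per-file min/max column stats exclude the requested value.
spark = (
    SparkSession.builder.appName("data-skipping-sketch")
    # Assumes the delta-spark package is on the classpath.
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

# Hypothetical Delta table path and column name.
events = spark.read.format("delta").load("/tmp/delta/events")

# Files whose min/max range for `device_id` cannot contain 5 are never read.
events.where("device_id = 5").show()
```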
B
Now, tied to this is a future feature, OPTIMIZE ZORDER. Without going into the details, Z-ordering is basically a multi-column space-filling curve. What it does, to not use fancy terms, is cluster the data in your table such that the min/max range of each column in each file is small, so that when you are looking for a particular value of a column, only a few files can possibly contain that value.
B
Most of the files can be eliminated just by doing the basic data skipping on the min/max stats, without even touching or reading those files. So, to make this column-stats data skipping even more effective, we are working on OPTIMIZE ZORDER data clustering, which will hopefully make it into the next release of Delta.
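For reference, the Z-ordering command that Delta Lake later shipped (it was still an upcoming feature at the time of this session) takes roughly this shape; the table path and columns are placeholders:

```python
from pyspark.sql import SparkSession

# Assumes a SparkSession already configured with the Delta Lake extensions.
spark = SparkSession.builder.getOrCreate()

# Recluster the table so that per-file min/max ranges are tight on these columns,
# which makes the column-stats data skipping described above more effective.
spark.sql("OPTIMIZE delta.`/tmp/delta/events` ZORDER BY (device_id, event_date)")
```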
B
So that's the performance track. The user experience track is actually tied to this OPTIMIZE command. I was just saying that we are working on OPTIMIZE ZORDER, which is data layout reorganization for better clustering, but what we released in this version is OPTIMIZE, which is just file compaction. Earlier, there wasn't any built-in command to do file compaction.
B
If you are writing from a streaming query, you often end up with many, many small files, which can cause read performance to be lower. This command can very easily compact them into roughly 1 GB files. 1 GB is a sweet spot: large enough to avoid small-file reading overheads, but not so large that you sacrifice parallelism when reading those files. So OPTIMIZE comes out of the box now.
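A minimal sketch of invoking the new compaction command from PySpark (Delta Lake 1.2 exposes it as SQL; the table path is a placeholder):

```python
from pyspark.sql import SparkSession

# Assumes a SparkSession already configured with the Delta Lake extensions.
spark = SparkSession.builder.getOrCreate()

# Compact the many small files produced by streaming writes into larger files
# (targeting roughly 1 GB per file by default).
spark.sql("OPTIMIZE delta.`/tmp/delta/events`")
```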
B
This is again a very useful feature when you have to fix something that went wrong, so having all of this out of the box definitely improves the user experience.
A
Thank you, TD, for giving a highlight of each of the features. I think the first feature you mentioned was data skipping, for which we collect column stats. Can you go a little bit deeper into how it collects information about those column statistics? Are we limited to only 32 columns? Is there any limitation on the column names or things like that?
B
Very good question. Let me break it into two parts: how it collects the stats, and then the limitations on column names.
B
How it collects them: Delta uses Spark's Parquet writing code path, so Delta just lets Spark do the Parquet writing.
B
Spark has hooks in its Parquet writing code path where you can add your own stats-collection framework, so as every row is being written out to a Parquet file, that framework can inspect the row and check the value of each column. Delta plugs into that framework to collect statistics for the first 32 columns, specifically min, max, null counts, and so on. So it hooks nicely into existing APIs already provided by Spark. Now, why the first 32 columns? This is a limit as of now, and it is not a hard rule. All these additional column stats need to be stored in the Delta log, along with the file names and the rest of the metadata.
B
This can add quite a lot of overhead to reading the log, and we don't want the log to be unnecessarily huge. That's why, as of now, we are limiting it to the first 32 columns. There are two things to note: one, 32 is a configurable number.
B
You can expand it if you want. And two, if you don't want to expand it but you want a particular column to be covered because you are going to filter on it frequently, you can use ALTER TABLE to change the column order and put that column within the first 32 columns. Those are the two things you can do to make it work.
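As a sketch of those two knobs (the property name and DDL below follow the Delta Lake documentation rather than anything quoted in the session, and the table and column names are placeholders):

```python
from pyspark.sql import SparkSession

# Assumes a SparkSession already configured with the Delta Lake extensions.
spark = SparkSession.builder.getOrCreate()

# Option 1: raise the number of leading columns for which stats are collected.
spark.sql("""
    ALTER TABLE events
    SET TBLPROPERTIES ('delta.dataSkippingNumIndexedCols' = '48')
""")

# Option 2: move a frequently filtered column earlier in the schema so it
# falls inside the indexed range.
spark.sql("ALTER TABLE events ALTER COLUMN device_id FIRST")
```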
B
We also had the limitation that column names could not contain spaces, Unicode characters, and that kind of thing, because the Parquet file format did not allow column names like that. With 1.2.0, the new column mapping mode lifts that limitation, so such column names are now supported.
A
That's awesome. With that, there was also one more feature: Delta offers generated columns, right? Can you talk a little bit about that as well?
B
Yes, another feature; man, we would need a lot more time to talk about all the features we released. Along with the column-stats-based data skipping, another performance improvement we made builds on generated columns, which Delta already supported. For those who are not familiar with generated columns, it's a feature where you can say that the value of column one will always be generated by an expression over column two. For example, here is where that is useful.
B
Let's say you have a timestamp in your data, but you want to partition by date instead. One way to do it is to extract the date from the timestamp yourself, but if you do it through a generated column, then the Delta log is actually aware of the relationship between the two columns: that the date is always a function of the timestamp.
B
What the new feature allows us to do is that whenever you have filters on the timestamp column, say you want timestamps within a certain date range, we can automatically generate additional filters on the dependent generated partition column for the date. That means in many places we can take full advantage, out of the box, of partition pruning, and things like dynamic partition pruning can work with this automatically generated filter on partition columns that are also generated columns.
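A minimal sketch of declaring such a table with the delta-spark builder API (the table and column names are placeholders):

```python
from delta.tables import DeltaTable
from pyspark.sql import SparkSession

# Assumes a SparkSession already configured with the Delta Lake extensions.
spark = SparkSession.builder.getOrCreate()

# `event_date` is a generated column derived from `event_time`, and the table
# is partitioned by it. Filters on `event_time` can then be translated into
# partition filters on `event_date` automatically.
(
    DeltaTable.createIfNotExists(spark)
    .tableName("events")
    .addColumn("event_time", "TIMESTAMP")
    .addColumn("event_date", "DATE", generatedAlwaysAs="CAST(event_time AS DATE)")
    .partitionedBy("event_date")
    .execute()
)
```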
A
So TD, about what you just mentioned, which is awesome by the way: a generated column is a column that is automatically populated by Delta, if I understand correctly. What is it that happens in Delta Lake that allows it to generate that column? Can you give a little bit of the mechanics there?
B
If you just give it the timestamp column, the date column gets generated automatically. It is more foolproof, because otherwise multiple applications writing to the same table could compute the date differently, and so on. That can't happen here. And if, by any chance, some application does misbehave by computing the date incorrectly itself, the generated-column expression will also validate whether the data being written satisfies that expression, and if it doesn't, because some writer is misbehaving, the write is blocked.
B
The way these features are designed, they work very well with each other using the same underlying concepts, so that ultimately, from the user's point of view, things just work better out of the box. You cannot corrupt the data, and if you give the table more information, like additional constraints or generated columns, the features coming along the way will try to take advantage of it as much as possible to give better performance.
C
Yeah, actually this release has very exciting features, things people were really waiting for, especially on the performance side with OPTIMIZE. That is something that was very much needed, especially if you're working with small files, and especially in streaming cases where you generate a lot of small files.
C
There are a lot of features here. The one he mentioned about column names: I've hit that problem a lot of times at clients, where you have columns with strange characters that people still wanted to persist in Delta, and it was not possible before. So I think that is also one of the nice features. This release really has many features that bring better performance and a lot of user experience improvements.
C
I think the Z-ordering will be one which is going to be really nice. I got to use it before, and I think it can bring a lot of extra performance benefits, so it's one of the features I'm looking forward to.
A
Awesome. So TD, one more question around the features. There is a feature where we have concurrency controls for reads and writes: Delta makes sure that whenever there are multiple reads and writes, it ensures mutual exclusion and decides on the order of the write operations. There is a feature related to this that we were working on in the Delta 1.2 release. Would you like to add what's going on there?
B
So, since the beginning of the Delta Lake open source project, we have had the problem that, because S3 does not provide the necessary mutual exclusion guarantees, we could not support multi-cluster writes on S3. At any point in time, multiple clusters, say EMR clusters, could not write to the same Delta table on S3 while still satisfying the ACID guarantees.
B
Finally, we have a solution, and this was a huge collaborative effort between us and folks from Samba TV, who conceptualized the idea first. We co-designed it together to make it more rock solid, and then they developed it, and now it is in Delta 1.2: true S3 multi-cluster write support, using DynamoDB as the mutual exclusion solution.
B
The crux of the matter is that for the ACID guarantees to work, you need to write to the Delta log, basically create a log file in the Delta log directory, atomically and mutually exclusively; that is, only one writer should be able to create the log file for, say, version 5. S3 did not provide any API to get that mutual exclusion, so two writers both attempting to write 5.json in the log directory would both succeed, which means they would overwrite each other's changes, effectively losing data.
B
With this DynamoDB solution, instead of writing the commit directly to S3, we write it to DynamoDB, using DynamoDB's mutual exclusion so that only one of the writers can commit version 5 to DynamoDB. Only one writer wins; the other one backs off and retries as version 6. Then, from DynamoDB, we sync lazily.
B
Once the serialized order of the commits has been decided by committing to DynamoDB, we sync it back into S3, so that all writers can get the full correct view just by looking at S3 as well. That means any writers, as long as they are correctly configured to use DynamoDB, will get the full ACID guarantees without any chance of data loss. This is an experimental feature; we released it for the first time.
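As a rough sketch of how this experimental LogStore is wired up (the configuration keys below paraphrase the delta-storage documentation rather than the session itself, so check the 1.2.0 docs for the exact names; the table name and region are placeholders):

```python
from pyspark.sql import SparkSession

# Sketch: route Delta commit writes on S3 through the DynamoDB-backed LogStore
# so that concurrent writers from multiple clusters stay mutually exclusive.
# Requires the separate Delta S3/DynamoDB storage artifact on the classpath.
spark = (
    SparkSession.builder.appName("s3-multi-cluster-writes")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    # Use the DynamoDB LogStore for s3a:// paths.
    .config("spark.delta.logStore.s3a.impl", "io.delta.storage.S3DynamoDBLogStore")
    # DynamoDB table that serializes the commits (placeholder name and region).
    .config("spark.io.delta.storage.S3DynamoDBLogStore.ddb.tableName", "delta_log")
    .config("spark.io.delta.storage.S3DynamoDBLogStore.ddb.region", "us-west-2")
    .getOrCreate()
)
```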
A
That's helpful, TD. I have one question related to that. You said that because of DynamoDB we were able to get this S3 multi-cluster write feature. When you were deciding how to go about achieving it, why did you use DynamoDB? Was this something that Samba TV was already using? A little bit about that.
B
That's a good question. Conceptually speaking, any data storage, any database or key-value store that provides this single-row mutual exclusion guarantee, which pretty much all key-value stores and databases provide, can be used in this way. We chose DynamoDB because this is primarily a problem only with S3, and therefore most of the affected users would be in the AWS ecosystem, trying to write to S3. So the most natural product in AWS to use seemed to be its key-value store, which takes very little to maintain.
B
There is almost no infrastructure to maintain: just create a DynamoDB table, and you don't need to give it a lot of get/put requests per second or a large quota, because it is only used for the per-version reads and writes. It only contains metadata, not data, so it seemed the right, natural thing for all the AWS Delta users.
B
So that's why we chose it. But in the future, if other use cases arise where we want to provide multi-cluster writes on other file systems that do not provide those guarantees out of the box, we could follow the same approach using other key-value stores or databases, the same technique.
A
Got it, okay, that's helpful to know and learn, TD. There are a lot of people who might be working with other kinds of design considerations, so if you are, please reach out to us either on GitHub or the Slack channel, and we can look at those requests as well. So, talking about the ecosystem: there was a feature which the community has been excited about, along the lines of streaming, which is the Flink sink connector. Any light on that?
B
Yeah, that itself could take a dedicated office hours, because people were so excited about it. So much happened in the last month: we released the Presto and Trino integrations, we released the Flink sink, we released Delta 1.2; it's a very exciting time for us. So yes, we finally have a Flink sink with which you can write to Delta tables from Flink, both from a streaming job and from a batch job. Unfortunately, it uses the Flink DataStream sink API, which is a lower-level API, so it doesn't support the SQL and Table APIs yet.
B
We are working towards that; it is in progress. But for people who want to start using Delta tables in the Flink ecosystem, this is a great unblocking feature. We designed it in such a way that it supports Flink 1.12, which is a pretty old Flink version, and the connector will work on 1.13 and 1.14 as well.
B
You can write streaming queries using the Table API, but there is a way in Flink to get down to the DataStream API and use the DataStream sink as well, so a lot of things are unblocked by this. In parallel, we're currently working on the Flink source, which will support both streaming and batch queries, and also on SQL and Table API support, so that you can define Delta tables in the metastore and run SQL queries on them purely using Flink.
A
Yeah, so TD, we talked about a lot of features, and there are a lot of features we haven't been able to get to in this half-hour session. So it looks like we are done, right? We released so many features, and now we are done, right?
B
We were doing a trial run of it, and now it is finally robust enough that we can safely open source it: change data feed. The idea is that you have DML queries, like insert, update, delete, and merge, writing into a Delta table.
B
A Delta table right now uses copy-on-write to rewrite entire files, so it is harder to identify and read only the rows that were changed by the delete, update, and merge queries. Change data feed exposes only the changed rows, for both batch and streaming reads.
B
So that is another thing we are very excited about, this change data feed feature. What it helps you build is end-to-end, fully incremental pipelines. One way to think of pipelines is the medallion architecture: bronze table, silver table, gold table. You can have fully end-to-end incremental pipelines that append to a bronze table, then from bronze, say, merge changes into the silver table, and so on from silver.
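A minimal sketch of what consuming the change data feed looks like from PySpark once the feature lands (the property and option names follow the later Delta Lake documentation; the table name is a placeholder):

```python
from pyspark.sql import SparkSession

# Assumes a SparkSession already configured with the Delta Lake extensions.
spark = SparkSession.builder.getOrCreate()

# Enable the change data feed on an existing (hypothetical) silver table.
spark.sql("""
    ALTER TABLE silver_events
    SET TBLPROPERTIES ('delta.enableChangeDataFeed' = 'true')
""")

# Read only the rows changed by inserts, updates, deletes, and merges since
# version 5, instead of re-reading every rewritten file.
changes = (
    spark.read.format("delta")
    .option("readChangeFeed", "true")
    .option("startingVersion", 5)
    .table("silver_events")
)
changes.show()
```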
A
Yeah, that's exciting, TD. I know we are always working hard, and our community has given us a huge amount of support so that we can make these features happen, and their feedback keeps helping us improve our roadmap. So we are never done, and we keep releasing features.
B
And just to highlight one more thing: please engage with us on GitHub, on the roadmap posted there. That really helps us decide and prioritize what features we want to build. For example, all of these features that we open sourced were the result of the feedback we got from the community, from actual practitioners like Ivana here. So this is the culmination of all the feedback the community has given us.
C
Actually, the change data feed is something that I've already tried, and it's really nice for creating incremental pipelines. It's really nice to see that that's coming next; I didn't know.
C
It's really exciting. Yeah, I've used it, and it makes your life easier when you're dealing with incremental data: with all of the updates and inserts that you get, you don't have to propagate all the rows or all the files that were changed to your next job. So that's indeed very nice to hear; I'm excited for that.
A
Yeah, it's always helpful to hear from users like you; it's a kind of testament to the work that the community is doing. It was great having both of you here on the call today. For the community: if you would like to get involved in any of these projects, or even if you just have feedback, please share it with us on Slack, GitHub, or the Google Group. Our community works on Slack.
A
They will be there to answer your questions. So hopefully these sessions are meaningful to you, and we'll see you again next week. Thank you all.