From YouTube: Delta Lake Community Office Hours (2022-06-09)
Description
Join us on June 9, 2022 at 9:00 AM PDT for the Delta Lake Community Office Hours! Ask your Delta Lake questions live and join our guest speakers, Dominique Brezinski, Tathagata Das, and Gerhard Brueckl alongside Vini Jaiswal from Delta Lake!
Ask us your #DeltaLake questions. These sessions allow our community to ask questions about Delta Lake OSS and to learn what we are building, what we plan to build, and what was recently released.
Quick links:
https://delta.io/
https://github.com/badal-io/datastream-deltalake-connector
https://groups.google.com/g/delta-users
A
…our YouTube channels. And please keep your questions coming so we can get to them quickly. For those who are new to the session: these sessions are live and occur every two weeks on Thursdays at 9:00 a.m. Pacific (12 p.m. Eastern), and we try to bring awesome people, awesome contributors, to the panel so that they can answer your questions live, for all the Delta questions you have. You can ask questions about Delta Lake open source software and how to use it.
A
What features are coming, and, in general, any questions about contributing as well. We had a lot of sessions in the past where we discussed the connector community, how we could work with other ecosystems, as well as the features released in 1.2.
A
So without further ado, let me quickly hand it back to the panel so that they can introduce themselves. They are big names already, but if you don't know them, it's good to have a recap of who they are. So let's start with you, Dom, who was our special guest yesterday.
B
Yeah, thank you. Dominique Brzezinski, I'm a Distinguished Engineer at Apple. I work in the information security space, predominantly on our big data pipelines for security telemetry, for doing detection and response, and I've been a Databricks customer for four-plus years now, I think. Our use case was sort of the genesis for Delta Lake, and then things like Auto Loader and a few other core features in Databricks.
B
So I have had the pleasure of being able to work very closely with the team over time to refine and extend that stuff, and only recently got to a point where I could actually start to make real contributions to the open source project as well, given normal corporate politics and intellectual property stuff, but there you go.
A
That's really exciting. We're looking forward to it, Dom, and thank you for the introduction and for sharing your journey with Databricks and Delta Lake so far. TD, how about you?
C
Hey everyone, I'm TD. I am a software engineer at Databricks. I have been involved in the Spark project for over 11 years now, involved with Delta since its inception, and building Delta for four-plus years. By the way, Dom is selling himself short.
C
His requirement of handling data in a transactionally correct way, with high quality, at petabyte scale is what led to building Delta, and he is behind the fundamental idea that started this now four-plus-year journey. Along the way, I have had the privilege to build some of the stuff in Delta, and right now I continue to work specifically on the Delta open source side of things: features, connectors, and so on.
A
Yeah, that's right, TD. Dom came up with a lot of requirements, Michael and Dom talked, and that's where the idea was seeded, and it's been an amazing journey with you being our primary contributor as well. With that, Gerhard, why don't you introduce yourself too?
D
Well, my focus is mainly in the Microsoft area. I've been working with Databricks since it was first available in the Azure cloud, and the same goes for Delta Lake, and I'm contributing the Power BI connector for Delta Lake. Basically, it allows you to read a Delta Lake table natively in Power BI without needing a Databricks cluster running.
A
That's great. With that, just to the panelists, I want to ask a question: whenever you're working on a new feature, what are some of the things that motivate or lead to building that specific feature? Any of you can answer that question.
D
Yeah, let me start. I guess most of my inspiration, let's say, comes from customer demands that I most of the time get directly from them, and then I say: okay, well, that makes sense to add to the connector. Sometimes it's also that if there are new features that get into Delta, I try to add them as soon as possible.
A
Yeah, that's great to know.
C
Customer demand as well, but not just customers, the broader user community's demand; customers are only a small part, and Databricks is only a small part of the much larger community. Community demand led to all the new features we've been adding in the last few releases and in the next release that's going to happen. So yeah, it's that simple.
A
That's awesome. Also, talking about connectors, Gerhard, you were working on some of the Power BI enhancements. Would you like to tell us what new things you are working on right now? What is your current focus?
D
Yeah, so I just recently created a new PR, which is mainly about complex data types being handled correctly. In the old version you had these static, predefined data types in Power BI, one of them being "any", which could obviously be anything, but now those actually show up properly, especially when it comes to nested fields and complex objects like arrays, structs, and maps.
D
Those work fine now. There's also a new feature which allows you to use the file statistics that Delta generates to do some better file pruning, and I think these are the two main things.
A
That's awesome. And related to that, are there any other connectors that we are working on right now?
C
Yes, so I think one of the main connectors that is currently our focus is Flink. About a couple of months ago we released the first version of the Flink sink for writing to Delta tables. Now we are very close to code complete, and hopefully we're going to release something before the Data + AI Summit for the Flink source, for reading from Delta tables. With that, we'll have end-to-end Flink reading and writing from Delta tables complete.
C
I think the next step after this is SQL support and Table API support, so there's a whole set of things we're very actively working on to bring the entire Flink support up to speed, similar to the current Spark support.
A
From your perspective, in 1.2 we added a lot of features: OPTIMIZE, as well as column mapping, generated columns, and support for merge. What is your take on implementing those features in Delta data engineering pipelines? I would love to understand a little bit more about how you are using these features, or how you think this is revolutionizing things.
B
I mean, all these features are huge. Getting OPTIMIZE to the community: it's probably been one of the most sought-after features we've had on Databricks since the beginning, and getting it out to the open source community is huge, especially releasing things like Z-ordering and being able to provide the advantages that the stats provide for pruning.
B
For us, all the other stuff is super quality-of-life: being able to remap columns and column names, and having these types of features when you're having to do data engineering across hundreds or thousands of input data sets, with all the schema complexity that comes with that. Any feature that lets you make things better or correct things matters, because no matter how honest you try to be, all those types of changes really can make a difference at some point or another, without having to just recreate everything. Generated columns, that is super interesting for us.
B
We have a lot of time series data; we ingest about five petabytes a day, most of it time series data, and so most of our tables have a date partition that's derived from the timestamp, which we've done explicitly. Being able to start using generated columns, where users don't have to specify the predicates but can be more natural around time ranges and things like that, and have that, underneath the covers, turn into actual partition predicates on date, or even narrower on a time basis like hourly, is super valuable, and we're really excited about being able to use those features going forward as we get into our second generation of data models.
B
So we're watching generated columns advance; the basics have been laid out there, but obviously there's more to come in that area as well, so that's exciting.
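For reference, here is a minimal sketch of the pattern Dom describes: a date partition column generated from a timestamp. It assumes an existing SparkSession named spark with Delta Lake configured and uses the Python DeltaTableBuilder API; the table and column names are purely illustrative.

    from delta.tables import DeltaTable

    # Create a table whose event_date partition column is always derived from event_ts.
    (DeltaTable.createIfNotExists(spark)
        .tableName("security_telemetry")
        .addColumn("event_ts", "TIMESTAMP")
        .addColumn("source_ip", "STRING")
        .addColumn("event_date", "DATE", generatedAlwaysAs="CAST(event_ts AS DATE)")
        .partitionedBy("event_date")
        .execute())

    # Users filter on the natural timestamp column; for expressions like this,
    # Delta can derive the matching partition predicate on event_date underneath.
    spark.table("security_telemetry") \
        .where("event_ts >= '2022-06-01 00:00:00'") \
        .count()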
A
Laid out perfectly, Dom. These are the features people have been getting excited about, and actually, as a data engineer myself back then, working with a lot of timestamps, having those columns automatically generated for data engineers, I think that's a huge thing as well, so just adding my two cents.
A
Somebody asked about index management and maintenance. Dewalker, can you elaborate if you had a specific question around that?
A
If not, we can take another question, which is around Z-ordering. Can anybody from the panel help us understand what the Z-ordering we are talking about for the next release is?
C
Okay, Dom, should I take this? Okay. Searching through a single sorted column of data is easy. The moment you have to do multi-column searches, simple sorting, like sorting by column x first and then by column y, doesn't give you the best possible results when you want to search by column x or by column y.
C
Z-order is one of the algorithms in this class of algorithms called space-filling curves, which do a much fancier type of sorting, or, in the general sense, clustering of data, such that you can cluster the data by any number of columns, and searches using those columns will be faster.
C
Historically, this was another one of those features that was built specifically to handle the scale that Dom needed: having queries run on a terabyte- or petabyte-scale table, a single table holding terabytes-plus of data, and still return within seconds to find the few rows that satisfy a filter across multiple columns. It was built specifically for Dom, and now, finally, after this many years, we are making the transition of putting it in open source. It has been merged, and the next release will have Z-order.
C
From the user point of view, you will be able to run this command, OPTIMIZE with ZORDER BY, and that will rearrange the data in the table based on the columns you want. It will cluster the data based on those columns, and after that, all the filter queries on those columns will be much, much faster, because you'll be able to eliminate files completely, without having to scan them, for that particular value you want in the filter.
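As a concrete sketch of what TD describes, using the SQL form that shipped with Delta Lake 2.0 (the table and column names are only illustrative, and spark is an existing SparkSession):

    # Rewrite the table so its files are clustered on the two search columns.
    spark.sql("OPTIMIZE events ZORDER BY (source_ip, dest_ip)")

    # Selective filters on either column can now skip most files
    # using the per-file min/max statistics.
    spark.sql("SELECT * FROM events WHERE source_ip = '10.0.0.1'").show()

Later Delta releases also expose an equivalent Python builder, DeltaTable.forName(spark, "events").optimize().executeZOrderBy(...), so check your version's API docs if you prefer the API over SQL.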
A
And yeah, please go ahead, Dom.
B
Yeah, it's so funny. I think the genesis story actually speaks volumes about TD's team, Michael, and a bunch of the other people at Databricks. When we were originally talking about this kind of selective search use case, Michael and I were in a room, and it was like:
B
"Oh wait, there's min/max stats." I'm like, okay, so we could sort. Oh, except that we often need to search by source IP or destination IP, for instance, in network telemetry stuff, and I see Michael kind of go, right, and then we leave the room. So I'm expecting to just have to deal with normal sorting, like primary/secondary kind of sort stuff, and see how it goes, and about two weeks later or something Michael's like, hey:
B
Can you run a couple of pieces of code against this data? I need to generate some statistics and look at some stuff. So I do this experiment, and not long after, he comes back and says, here, try and run this, and it was basically the alpha version of Z-order. We run it on a bunch of data, and then we try some searches by one column, the other column, values from both columns, and it was outstanding right away.
B
We were able to just exclude like 90 to 99 percent of the data within very large tables, just based on the fact that multiple columns were now sorted in a way that the min/max stats would be fairly tight around files, and it was likely, if we had values that were very sparse within our data set, that they would be nicely isolated into one or a small number of files in the table. And then the Delta reader would do its thing and evaluate the metadata.
B
It would look at the query plan against the column stats and be able to generate a physical plan that targeted just the Parquet files that could possibly have the answer. Being able to do this kind of Z-ordering, or clustering sort, has allowed us to pick up to a few columns that would be the primary search columns we use in data sets, and then we can Z-order by those and effectively take advantage of that.
B
The nice thing about it is that it's one of those things you can choose to use or not. You can choose to spend the compute time to Z-order a table if you get an advantage out of it, or you can not, if you don't do selective queries, and you're not paying a write-time penalty to always maintain an index.
B
If
that's
not
what
you
need,
but
if
you
do,
you
have
the
flexibility
to
kind
of
move
this
in
right
and
and
optimize
your
data
in
this
way,
and
you
also
can
change
it,
which
is
super
important
right.
You
can
take
the
same
table.
You
can
z
order
it
by
you
know
two
columns
and
decide
those
aren't
the
best
columns
and
change
it
to
a
different
two
or
three
columns,
and
that
works
just
fine.
A
And to do that, Dom, as a best practice, do you recommend running that only on a specific test data set first, and not on the full data set?
B
Yeah, I mean, if you have huge tables and you're going to try to reorder them, you're fundamentally reading all the data and rewriting it. So it's best to take an evaluation data set that's less expensive and figure things out first.
B
You want to know where you're going to get the biggest bang for the buck before you reprocess a petabyte-plus table in order to optimize it. Another great thing to know is that Z-order is incremental to a large degree, so once you have a big table and you're adding data to it at a less frequent rate, subsequent Z-orders will be less expensive, because they'll predominantly operate on just the new data.
A
That's wonderful! Thanks for going into a little bit of detail, Dom, that's always helpful. Gerhard, you had one point there; do you want to add it?
D
I think I just wanted to mention that you can also do it at a partition level, so you don't have to do it for the whole table. You can just do it for, say, the most recent partitions that are probably queried most often, which to some degree limits or decreases the impact on your overall table, and the storage and processing time that you need to actually create the indexes.
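A minimal sketch of the partition-scoped variant Gerhard mentions, using the WHERE clause of OPTIMIZE on a partition column (same illustrative table as above; the predicate may only reference partition columns):

    # Z-order only the most recent date partitions instead of rewriting the whole table.
    spark.sql("""
        OPTIMIZE events
        WHERE event_date >= '2022-06-01'
        ZORDER BY (source_ip, dest_ip)
    """)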
B
Yeah, that's a great point, and something that's really interesting about the Delta Lake model: the fundamentals are there to be able to do a lot of operations on only a partition, and to be partition-independent in the way you do certain things. You can choose to just bin-pack one partition or a set of partitions, just Z-order them, or Z-order them differently.
B
And I think, over time, we'll see even additional flexibility come about as we build things on top of the Delta protocol that can take advantage of that and get that kind of partition isolation or reduction of work.
A
Got it. And when we do the different styles of partitioning, does it go into its own specific folder? Does it generate different folders for specific partitions?
C
Let me take that from the protocol point of view: the Delta protocol does not require different logical partitions to be present in different subdirectories, but by default, the way Delta evolved over time, we do write them to actual subdirectories, because we were inspired by Hive-style partitioning. But we don't have to do that.
B
Yeah, and indeed, in the early days of Delta, the default, or the only way, was the Hive-style, directory-based partition tracking, and we had such high I/O rates against S3 that it actually didn't work.
B
We would get throttled by S3 because we were hitting hot spots in their front-end partition map and needed to use random prefixes. So there was an option on Delta Lake so that, instead of writing the traditional Hive-style mapping, it would basically just generate random prefixes for the file set and write them that way, and that gave us a much better distribution across S3 and much higher performance. Even though S3 has since increased that performance threshold, so that they no longer really recommend this, our I/O rates are still so high that they require it from us, and we have to use random prefixes on all our large tables. But that simple flexibility, where the partition is actually a tag/value pair represented in the metadata while the file path can be entirely disjoint from it, means the readers all do the right thing and still prune by partition correctly.
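The behavior Dom describes is controlled by Delta table properties; a minimal sketch, with property names as documented in the Delta configuration reference (verify them against your Delta version; the table name is illustrative):

    # Write data files under short random prefixes instead of Hive-style
    # partition directories, to spread request load across S3 key ranges.
    # Partition values stay in the transaction-log metadata, so readers
    # still prune by partition correctly.
    spark.sql("""
        ALTER TABLE events SET TBLPROPERTIES (
            'delta.randomizeFilePrefixes' = 'true',
            'delta.randomPrefixLength' = '4'
        )
    """)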
C
Yes, so let me reinterpret the question, and hopefully my interpretation is correct: can you query row-level changes from a Delta table? The quick answer is yes, now you can. In the next release, another big feature we're going to release is Change Data Feed. You set a table property on the Delta table, and with that, all write operations that do row-level modifications, like merge, update, or delete, operations that modify a single row in a file and have to rewrite files because of that, also record those changes.
C
They will save the row-level changes as a separate set of files, so that you can query just those row changes, what got updated, what got deleted, what got inserted, very efficiently, both in batch queries and in streaming queries. The exciting thing about that is that, with it, you can really build end-to-end incremental pipelines. You can have changes coming in from an external system merged onto a Delta table as your first level of tables, the bronze tables; then, from that table, you can propagate the row-level changes to the second level of tables, with more cleanup and so on, the silver tables, and so on and so forth.
C
Before this feature, it was not very efficient to do that, because you could only read files that were entirely rewritten when only one of the rows had changed. So you would have to read the entire file, which is 99 percent unmodified data, just to process that single row. With this feature, we keep the changes separately, so that you can read them efficiently. It's an opt-in feature, so if you don't care about reading row-level changes, you don't have to enable it; it's completely opt-in, and otherwise there is no cost associated with it.
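To make that concrete, a minimal sketch of enabling and reading Change Data Feed, using the syntax from Delta Lake 2.0 (the table name and version number are only illustrative):

    # Opt in on a table: subsequent writes also record row-level changes.
    spark.sql("""
        ALTER TABLE bronze_events
        SET TBLPROPERTIES ('delta.enableChangeDataFeed' = 'true')
    """)

    # Batch read of the row-level changes between table versions.
    changes = (spark.read.format("delta")
        .option("readChangeFeed", "true")
        .option("startingVersion", 5)      # illustrative version number
        .table("bronze_events"))
    changes.select("_change_type", "_commit_version", "_commit_timestamp").show()

    # Streaming variant for the incremental bronze-to-silver pipelines TD mentions.
    silver_updates = (spark.readStream.format("delta")
        .option("readChangeFeed", "true")
        .table("bronze_events"))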
A
Yeah, that's awesome, TD. CDF is the feature that is coming up, and we discussed it last time. I think we only have three more minutes; we'll take one question, and then we'll just give a recap of what to expect in three weeks. So there's a question: is it possible to maintain the indexes as an online activity during business hours, or is it a downtime activity? So, maintaining indexes.
A
I'm not really sure about this question, I think it lacks context, but any thoughts?
C
Yeah, I can take a crack at it. At this point in time, the indices that Delta supports are more like the min/max values of each column in each file. That's not a traditional index in the academic sense, but it is very good for eliminating files: if you're looking for a particular value, say 5, and a file only has data for that column between 10 and 15, you don't need to read that file. That is how the column stats, as they're called, are used right now. As for business hours versus non-business hours:
C
It's all about how you organize the data in the table. If you are separating business and non-business hours into separate partitions, say partitioned by hour or something, then you can potentially have different data organization: one partition just compacted, another Z-ordered, another organized in a different way. You could do that to take best advantage of the kinds of queries that you get on business-hour data versus non-business-hour data.
A
Got it, yeah, I think that makes sense; now I get the question too. Awesome. Any other thoughts, Gerhard and Dom, that you would like users or the community to know about, specific to the Delta Lake roadmap?
C
Yeah, with just 30 seconds left, I want to highlight something to drum up excitement in the community. We are very excited that, as part of the next release, which we're hoping to get out before the Data + AI Summit in the last week of June, in the next few days we're going to announce a preview of that release so that the community can start testing it.
C
All these cool features we've been talking about, Change Data Feed, OPTIMIZE with Z-order, and a whole bunch of other stuff, the community can start testing them out even before the release. We're going to make a preview, so stay tuned: join the Slack channel, join the email group, we're going to announce it in all of them, so that you can play around with it even before it goes GA and give us feedback.
A
Yeah, that's super important; thanks for calling that out, TD. We will be announcing that through our Google Group and Slack channel, and I'll put the links in this YouTube recording. Please do take a crack at it if you are actively building on Delta Lake; that would be really appreciated. And just as TD called out, we have the Data + AI Summit coming up in the last week of this month, and on June 29th you will get to see all these faces, plus others in the community, so please come join us.
A
I can send the schedule for the event, where you can meet other contributors to Delta Lake that you may or may not have met, and the community. So, looking forward to it. Thank you all. Thank you, everyone.
D
I'm doing two remote sessions, but for personal reasons I wasn't able to join in person.