From YouTube: Delta Lake Community Office Hours
Description
Join us for the next Delta Lake Community Office Hours and ask us your #DeltaLake questions. Thanks!
A: That was one heck of an awesome echo, wasn't it? Okay, so I think we are live now. Welcome, everybody, to today's Delta Lake Community Office Hours. I am currently trying to set everything up on YouTube as well as on LinkedIn, so give me a minute to get that in order, and then we will rock it. Perfect, we are on LinkedIn.
A: Excellent. Thank you very much, everybody, for joining today's community office hours. Right now we've got Christian from Scribd, Scott from Databricks, and myself from Databricks, and I think we're going to have TD joining us a little bit late. We are live on LinkedIn and YouTube, so if you have any questions when it comes to Delta Lake, you are currently in the right set of office hours. We're going to run for about 20 minutes today, so we'll end at about 9:20 a.m. Pacific time, or if you have no questions we'll end a little bit earlier. But I did want to let Christian and Scott introduce themselves real quick. To start off, Christian, why don't you provide a little background on who you are and, as a Delta Lake committer, what projects you are currently focused on.
B: Sure, yeah. I'm involved in Delta insofar as I'm a committer on delta-rs, which is the Rust implementation of the Delta Lake protocol, and on kafka-delta-ingest, which is a Rust daemon that streams JSON messages from Kafka topics to Delta Lake tables. Scribd is using it in production right now for about 60-ish topics.
C: Thanks, Denny. Good morning, everyone. Hi, I'm Scott, a software engineer on the Delta Lake ecosystem team here at Databricks. Recently I've been working on a variety of open source projects that we contribute to, such as the Delta Standalone library, which is the main low-level library for new connectors to connect to the Delta Lake protocol. Also, last week I open sourced the latest version, Delta Lake 1.1, which is really exciting, and now I'm continuing to work on some open source features and performance improvements.
A: Perfect, thank you very much, Scott. And hi, everybody, my name is Denny Lee. I'm a developer advocate here at Databricks, long-time Spark guy, long-time Delta Lake guy, so hopefully I'll also be able to answer your questions here. I did want to start off with, I believe, the more uber-detailed part of it, which is: Christian, in yesterday's Delta Rust API meeting we had talked about basically trying to figure out how to read the metadata faster.
B: Yeah, actually, rather than faster, the concern I have is more around memory. Currently, when we load the transaction log for a table in delta-rs, we're using a row-based format; we aren't using Arrow or any kind of columnar structure. And Denny, I think, had mentioned that you were facing some memory issues in the Delta Standalone reader, and I guess you may have already started to take steps to resolve them. So I'm just interested to hear any gotchas you hit along the way, or just anything about the problem as you're facing it in the Delta Standalone reader.
C: The main issue here is that you need to get the correct, latest snapshot of a table, which requires reading all the actions. You need to know which add files were added to the table, and you also need to know the remove files, the files that were removed, because sometimes those will cancel out. That's why you have to read all of them to get to the latest point, and typically you do that all at once: you load it all into memory. That's where the memory concerns come from.

A better way to do it is with an iterator method, but you can't just iterate forwards in time through the transaction log, because there could be remove files that come later. So the trick that we did, and the code is open source, feel free to look at it, in fact you can email me and I can send you the exact links, is that you actually iterate through the transaction log backwards. As you're iterating backwards, you keep track of the remove files you've seen and the add files you've seen, and you use that to decide which files to return to the reader. So that's how it works when you're updating and actually reading all the files.

And just a quick hint: when you instantiate a new Delta log, as I'm sure you probably do in your Rust API, there are two things a Delta log requires to be instantiated, the correct protocol and the correct metadata of the table. You can do that too using the iterator API: you just read backwards until you find those two and then you're done; you don't need to load all the actions into memory. Let me know if that made enough sense for you.
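As a rough illustration of the backward replay Scott describes, here is a minimal Python sketch. It is not the Delta Standalone or delta-rs implementation, and it ignores checkpoints: it walks the `_delta_log` JSON commits from newest to oldest, remembers the remove actions it has already seen, and yields only the add files that are still live. The directory layout and action keys follow the Delta transaction log protocol; the function and variable names are illustrative.

```python
import json
import os

def live_files(delta_table_path):
    """Replay the Delta transaction log newest-to-oldest, yielding paths of
    add files that were never removed by a later commit (checkpoints ignored)."""
    log_dir = os.path.join(delta_table_path, "_delta_log")
    commits = sorted(
        (f for f in os.listdir(log_dir) if f.endswith(".json")),
        reverse=True,  # zero-padded version numbers, so lexical sort = newest first
    )
    removed = set()  # paths removed by a commit we have already seen (later in time)
    yielded = set()  # paths already returned, to avoid duplicates
    for commit in commits:
        with open(os.path.join(log_dir, commit)) as fh:
            for line in fh:
                action = json.loads(line)
                if "remove" in action:
                    removed.add(action["remove"]["path"])
                elif "add" in action:
                    path = action["add"]["path"]
                    if path not in removed and path not in yielded:
                        yielded.add(path)
                        yield path
```

The same backward walk can also stop early once it has seen the latest protocol and metaData actions, which is the hint about instantiating the Delta log without loading every action into memory.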
A: Perfect, perfect. All right, so that was committer-to-committer questions, which is pretty sweet. All right, without further ado, TD, want to do a quick introduction of yourself? We already did ours. We have TD now joining us, which is awesome, because it's good timing: we have a bunch of questions from LinkedIn that we've got to go answer, and we're going to throw them all your way, TD. I'm joking, but still, introduce yourself.
E: All right, first of all, apologies, everyone, I was a little late; my previous meeting ran over. My name is, well, TD, everyone calls me TD. I am a staff software engineer at this company, on the same team as Scott; we work together very closely. I've been working on the Delta Lake project for the last four years, since its inception, and we're having way too much fun working on this Delta Lake stuff, so I'm still not bored of it after four years. So yep, ready to answer all your questions, throw all the hardballs at me.
A: No problem. Well, in fact, Shiv from LinkedIn has a bunch of questions, so I'm going to go backwards through them, and of course, Christian and Scott, definitely chime in too, but I'm going to direct some of these to TD because they seem to be very much in TD's wheelhouse. So let's start with the most recent one. Shiv, please continue asking your questions, this is awesome. The question is: can I run parallel queries to Delta Lake?
E: Yes, you definitely can, and the coolest thing is that you can run these parallel queries within the same cluster or across any number of clusters, with full consistency guarantees, because Delta internally maintains versioned snapshots of the table. Each of these queries will see a consistent version of the table. They may not see the same version of the table, because the table can be getting new versions as people are writing into it, but every query will see a consistent version of the entire table. You'll never get a case where a query sees half of what somebody is writing but not the other half; it will see entirely version one or entirely version two, and nothing in between. So yes, you can run things in parallel and things will be consistent. Whether you can actually run them in parallel is just a matter of your cluster configuration, whether there is enough capacity in your cluster for Spark to run those tasks. But from the data format point of view, you can absolutely run everything in parallel and you get a full consistency guarantee.
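A minimal PySpark sketch of what TD describes, assuming a Delta table at a hypothetical path with hypothetical column names: queries defined side by side each resolve against a consistent snapshot even while other writers append, and time travel can pin an explicit version if several queries must agree on exactly the same snapshot.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
table_path = "/tmp/delta/events"  # hypothetical table location

# Two concurrent queries: each reads a consistent snapshot of the table,
# even if another job is appending new versions at the same time.
totals = spark.read.format("delta").load(table_path).groupBy("date").count()
recent = spark.read.format("delta").load(table_path).where("date >= '2021-12-01'")

# Time travel pins a query to an explicit version when several queries
# need to see exactly the same snapshot.
v5 = spark.read.format("delta").option("versionAsOf", 5).load(table_path)
```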
A: Excellent, thanks very much, TD. I'm just going to go to the next question, though Christian and Scott, if you have anything that you'd like to add, by all means. The next question is also from Shiv, and this one is specifically about Z-ordering, the how and the why. The question is: can you please explain how Z-ordering works and why it improves performance?

Now, before I hand that off to the gentlemen here, I do want to call out that Z-ordering is currently specific to Databricks, to the Delta version that's in Databricks, as opposed to the one that's in OSS Delta. We definitely want to answer the question, but we wanted to call that out. So, saying that, who would like to tackle the question concerning Z-ordering?
E: I can, but then I won't let others have an opportunity. It's hard to explain this without visuals and slides, but say you have two columns and you want to cluster the data such that, when filtering by either of the columns, you can narrow down which files to read, through some sort of clustering on both those columns. The naive way would be to have a primary sorting column that you sort by, then a secondary sorting column within that, and then split that data into files.
E: The thing is, if you really do the math, that if you have queries that filter by one of the columns, you can filter down very well, but if you filter by the other column, it doesn't filter down very well, because one is the primary sort and the other is the secondary sort. What Z-order, or the generic term, space-filling curves, does is balance it out across the multiple columns. For example, to put some sample numbers on it: with simple primary/secondary sorting, if you filter by column one, it will filter down from 100 files to two files, but if you filter by column two, it will only filter down from 100 files to 50 files, which is not much better. Whereas if you use a space-filling curve, then for both columns it will filter down to, let's say, ten files. So it's not the best for any individual column, but it's more evenly balanced across whichever column you want to filter by.
E: There are other space-filling curves, like the Hilbert curve, and Z-ordering is just one particular type of space-filling curve. So that's the basic idea of Z-ordering: how do you cluster the data across files such that, when you want to search for a particular row with particular values in those multiple columns, you can very quickly narrow it down without reading the data in every file? You can say: these are the only files that can contain those particular values, because in these files the min/max values of those two columns cover a small range that matches the value, and the other files cannot possibly contain it. Hopefully that answers the question. It's hard to explain without slides and visuals, but hopefully that conveys the point.
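For reference, this is roughly what the command looks like on Databricks (as noted above, Z-ordering is not in OSS Delta at the time of this session); the table and column names here are hypothetical.

```python
# Rewrites the table's files so event_date and user_id values are clustered
# together; data skipping then prunes files using per-file min/max statistics.
spark.sql("OPTIMIZE events ZORDER BY (event_date, user_id)")

# A filter on either column can now skip most files.
spark.sql("SELECT count(*) FROM events WHERE user_id = 42").show()
```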
A: I certainly think it does. By the same token, folks, if you do have more questions concerning Z-ordering, by all means do ping us. While we answer the next question, I'm going to grab the blog post that we wrote on Z-ordering and paste it into both the YouTube and LinkedIn links, so that it provides the additional details. But I figured at least giving us a chance to answer first would be really good.
A: The next question I'll answer, or at least I'll start, and we'll go from there. It's also from Shiv and it's a great question: how can I replicate a Snowflake database as a Delta Lake table? I'm going to answer the question not so much targeting any particular database, just more the general idea of going from a database to Delta Lake.
A: The approach is to basically utilize the database's bulk export mechanism to get the data into blob storage, whether that's initially in CSV, or Parquet, since some databases support that and some don't, so I don't want to automatically assume it. The context is that since you're talking about databases, it's typically structured data to begin with, which means CSV or tab-delimited or that type of format is perfectly fine.
A
When
you
export
the
data
out,
then
having
delta
lake
rated
is
pretty
much
a
straight
forward,
you
know
literally
just
a
straightforward.
Okay
in
spark
read
bam.
You
create
the
delta
lake
and
you're
good
to
go
the
the
the
one
little
except
I
won't
say
exception,
but
one
little
call
out
would
be
that
if
it's,
if
you
can't
explore
in
parque,
there's
an
in-place
conversion
of
a
parquet
table
directly
into
delta
lake
as
well,
so
you
could
always
utilize
that
now
the
real
concern
isn't
so
much
that
part
of
the
replication.
A
Honestly,
though,
even
though
I've
went
on
a
little
already,
the
real
concern
is
more
more
like
when
you're
online
and
you're
continuously
processing
data.
How
to
get
that
data
in
and
typically
that
means
you're
going
to
have
to
do
some
form
of
streaming
mechanism
or
an
etl,
a
regular
periodic
etl
process,
and
so
you
will
always
need
to
separate
that
bulk
export
portion
from
that
regular
etl,
batch
or
stream
process.
What folks typically do is keep those as two separate processes. One is the bulk export from the database, and I've seen it happen multiple times with Oracle, SQL Server, whatever else. Then, upstream, they work on a regular ETL process where they multicast the data out, one stream to the database and one directly to Delta Lake. They do a reconciliation between what's in the Delta Lake versus what's in the database over a period of time, let's just say two weeks or a month, to make sure the numbers are matching, and once they validate that, they can shut off the flow into the database and it goes directly to Delta Lake. So I provided a whole bunch of context, and I recognize that if we want to dive deeper, we probably want to have a much longer conversation in the Delta Lake users Slack. So please go to delta.io.
Near the bottom there is a Slack channel link; you can just join us there. My name is Denny, you can just hunt me down. Okay, so I just wanted to call that out, but hopefully that answers your question or at least provides enough context. Anybody else like to add anything that I might have missed?
A: I'm seeing nods, so I think that'll be it.
A: Right, I'm going to switch to the next question. It's from Sharon Givi, and I apologize if I said your name incorrectly. This one I'm definitely going to target at TD, just because it is about merge. The question is: does merge, in Spark or Databricks, rewrite the files again even if there are no inserts or updates, when utilizing the Delta Lake MERGE statement?
E: Okay, good question. The way merge works is in two passes on the data. In the first pass, it scans the data to identify which files contain data that needs to be updated or deleted, and then, in the second pass, it rewrites only those files. So in your table, if a file doesn't contain any data that needs to be updated or deleted, it will not be rewritten at all. Merge is very selective in that way.

Now, how many files out of your table will actually need to be rewritten is a separate question, and it totally depends on the distribution of your data, the data layout, and the changes you are trying to merge in: what is the distribution of those changes, and is there any spatial locality in them or not? Those are the things that make a difference in merge performance. Things like Z-ordering, basically data clustering in the data layout, can make a difference and make merge faster, because clustering the key space limits the number of files that need to be rewritten: if you're touching only a small range of the key space, that is, if there is data locality, fewer files need to be rewritten.
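A small sketch of such a merge using the delta-spark Python API; the table paths and the join key are hypothetical, and `spark` is assumed to be an active SparkSession.

```python
from delta.tables import DeltaTable

# Target Delta table and a DataFrame of incoming changes (paths are hypothetical).
target = DeltaTable.forPath(spark, "s3://my-bucket/delta/orders/")
updates = spark.read.format("delta").load("s3://my-bucket/delta/orders_changes/")

# Pass 1 (under the hood): find the files whose rows match the condition.
# Pass 2: rewrite only those files; untouched files are left as-is.
(target.alias("t")
 .merge(updates.alias("s"), "t.order_id = s.order_id")
 .whenMatchedUpdateAll()
 .whenNotMatchedInsertAll()
 .execute())
```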
A: It does, actually. So I think that's probably all the time we have for today's community office hours, so apologies, but we had some great questions. Saying that, I did want to call out a couple of things. Number one, I was going to share my screen, but it looks like I can't share my screen today, so my bad on that one. I did want to call out that there is a Data + AI Summit CFP, a call for presentations, and the deadline currently is January 2022.
So please go ahead: if you've got a great data session, it does not have to be about Delta Lake or Spark, by the way, just any session on data and AI, we'd love to have you present there, or at least submit to the call for presentations. If you're looking for this video, it's going to be both on LinkedIn, where it is right now, as well as on YouTube; there is a Delta Lake YouTube channel, so just go to delta.io and look at the bottom.
There are links to both the YouTube channel and also the Slack channel. Since we didn't answer all of the questions that popped up, again, apologies for that, please join us on the Delta users Slack. We are there and we will go ahead and answer your questions there as well. So seeing that, I think that's it for today. I wonder, any last words from Christian or TD or Scott? Okay, I'm getting nods, so I think that's good to go.
E: So again, just one thing: keep these questions coming. These are absolutely great questions, and we want to educate the community on the internals of Delta Lake. So please keep asking these questions in all the different venues you heard about from Denny, Slack, etc. Please keep giving us these questions; we're happy to explain and help, and to make everyone in the community aware of the Delta internals and how they help in different scenarios.
A: And especially on LinkedIn, you're asking some awesome questions, so again, apologies for not being able to answer all of them. Please join us on the Delta users Slack; we'd love to answer them, and it'll allow us to be a little more long-form in our responses too, so that we can get into the details as well. Okay, so again, thank you very much.
Everybody, really appreciate your time, and we'll see you in two weeks. We'll still have one more session, and then we'll take the winter break. Okay, so again, thank you very much, everybody. Thank you, Christian, Scott, and TD for attending today's session and providing answers, and thank you to everybody on LinkedIn and YouTube for asking some amazing questions. All right, have a good one, everybody. Thank you. Bye.