From YouTube: Delta Lake Community Office Hours (2022-08-18)
Description
Join us on August 18, 2022 at 9:00 AM PDT for the Delta Lake Community Office Hours! Ask your Delta Lake questions live and join our guest speakers Gerhard Brueckl and Scott Sandre, alongside Vini Jaiswal from Delta Lake!
Ask us your Delta Lake questions. These sessions allow our community to ask questions about Delta Lake OSS and get to learn what we are building, planning to build and know about recently released features. These sessions are live and the recordings are available on the Delta Lake YouTube channel.
Quick links:
https://delta.io/
https://go.delta.io/slack
https://groups.google.com/g/delta-users
https://go.delta.io/github
A
I'm waiting for my link to pop up on LinkedIn, one second... awesome. Long Island, whoa. We have San Francisco, Canada. So for those who are new to the session, these sessions are live and occur every two weeks on Thursdays.
A
Last
week
last
month
we
took
a
break
because
we
did
a
lot
of
work
during
the
summit
and,
as
a
result,
you
have
awesome
features
to
work
with
on
delta
lake.
So
if
you
missed
any
previous
amas,
no
sweat,
the
recordings
are
available
on
our
linkedin
and
youtube
channel.
So
please,
please
subscribe
to
our
channels
and
also
there
is
an
official
webinar.
A
You know, that is brought to you every two weeks. Also, since this is the official webinar of Delta Lake, we want to make sure that we are fostering an open and welcoming environment for everybody. So please make sure that you are abiding by our code of conduct, which means please do not add anything to the Q&A or ask questions that would be in violation of our code.
A
So
please
be
respectful
to
your
fellow
participants
and
presenters.
I
will
drop
in
a
link
as
well.
So
you
can
review
the
code
of
conduct
and
please
have
your
questions
coming.
We
want
to
make
sure
that
you
know
we
get
your
questions
answered
on
any
open
source
delta
lake.
A
If
you
know
if
you
are
doing
a
project
from
from
the
beginning
or
if
you
are
already
working
on
something
cool,
we
would
like
to
know
if
you
have
any
questions,
if
you
are
hitting
any
roadblocks
or
if
you
want
to
know
about
any
future
roadmap
or
features
that
we
are
building
so
without
further
ado,
let's
do
a
quick
round
of
introduction
of
the
panel.
So
scott,
you
want
to
go
first.
B
For sure, thanks Vinnie, thanks for having me, super excited to be here. Hi everyone, I'm Scott, I'm a software engineer on the Delta Lake team at Databricks. I'm super excited to be here, happy to talk about a lot of the new features we have coming out with Delta Lake 2.1, or any questions you have about our big recent release, Delta Lake 2.0, as well as a lot of our other connectors. We have the Flink connector that we've been working on, so happy to answer any questions there.
C
Hi everybody, Matthew Powers. I am a longtime Spark blogger and Spark open source nerd. I just recently joined the Delta Lake team, working as a developer advocate, so looking forward to writing more Delta Lake content.
A
Awesome
great
to
have
you
matt,
you
have
made
a
huge
impact
in
the
spark
community,
and
now
we
have
you
on
delta
lake
as
well,
so
welcome
welcome.
Would
you
like
to
you
know,
tell
about
any
exciting
features
that
you
have
seen
in
delta
lakers
recently,
so
the
community
will
know
from
your
perspective
as
well.
C
Yeah,
well,
I
think
I'll
give
just
like
my
high
level
delta
lake
experience.
I
was
using
playing
vanilla,
parquet
lakes
and
suffered
from
all
of
those
bugs.
You
know
not
having
transactions
schema
mismatches,
trying
to
compact
small
files,
so
I
mean
the
new
features
are
amazing,
but
just
the
old
features
too,
like
I've
faced
all
of
those
plainville
parque
lake
problems
and
just
having
them
all
magically
solved
by
delta
lake
has
made
me
love
delta
lake
lots.
A
Awesome
so
we
have
a
lot
of
people
joining
in
mariella,
punet
hammond
welcome
everyone!
Please
post
your
questions
on
youtube
and
linkedin.
You
can
find
our
linkedin
live
link
from
delta
lake
channel
awesome.
So
I
have
a
question
here.
What
are
some
of
the
recently
released
features
from
delta
2.0
scott?
Do
you
want
to
answer
that.
B
Yeah
for
sure,
well,
yeah,
the
biggest
news
with
delta
2.0
was
just
the
fact
that
delta
lake
is
now
fully
open
sourced
and
every
feature
that
you
see
in
some
sort
of
like
in
databricks
has
now
come
to
delta
lake,
which
is
just
really
good
news.
You
know
we're
happy
to
give
the
best
features
to
all
of
our
open
source
users
and
customers,
so
that
was
super
exciting
one
such
feature
was
the
optimize
the
order
command.
B
So
this
is
a
really
useful
utility
to
help
solve
the
small
files
problem.
This
is
a
problem
that
you
can
experience
when
you
have
constant
streams
coming
into
your
delta
table,
writing
lots
of
small
parquet
files
and,
after
a
certain
amount
of
time,
all
these
small
files.
They
will
increase.
Your
list
calls
to
your
various
cloud
providers.
So
what
we're
able
to
do
in
a
really
really
smart
way
is
let
you
basically
sort
your
table
along
and
arbitrary
dimensions
and
compact.
B
These
files,
together
through
what's
called
like
a
space
filling
curve,
and
we
open
sourced
that
in
2.0
and
got
like
a
lot
of
really
positive
feedback
to
that.
People
were
really
excited
about
that.
So
that's
probably
my
favorite
feature
recently,
but
delta
2.1
is
coming
up
soon
as
well.
So
we
can.
We
can
chat
about
that
later
too.
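As a minimal sketch of how that looks from PySpark on Delta Lake 2.0, here is compaction and Z-ordering on a hypothetical table; the path and the Z-order column are placeholders for illustration:

```python
from delta.tables import DeltaTable

# Assumes a SparkSession with the Delta Lake extensions already configured.
table = DeltaTable.forPath(spark, "/tmp/events")  # hypothetical table path

# Plain compaction: coalesce many small Parquet files into fewer, larger ones.
table.optimize().executeCompaction()

# Z-order compaction: cluster the data along a chosen column so that queries
# filtering on it can skip more files.
table.optimize().executeZOrderBy("event_date")

# Equivalently, as SQL:
spark.sql("OPTIMIZE delta.`/tmp/events` ZORDER BY (event_date)")
```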
B
Yeah, for sure. So your Delta table will exist in some cloud storage, like S3, Azure, GCS, et cetera, and Delta keeps track of all the metadata files for your table.
B
So
you
first
read
from
the
delta
log,
which
is
all
the
metadata
that
tells
you
which
data
files,
parquet
files
to
actually
read
and
over
time,
as
you
have
more
and
more
of
these
metadata
files,
just
performing
the
list
call
to
your
cloud
provider
say
hey:
what
are
the
files
on
my
table
that
can
start
adding
up
as
there's
more
files,
there's
longer
lists
so
by
compacting
your
files
together.
B
That
actually
just
reduces
the
list
overhead,
but
another
cool
thing
to
add
on
actually
is
what
we
have
a
really
exciting
pr
right
now
from
the
delta
community.
That's
actually
aiming
to
optimize.
Our
list
calls
on
s3
to
make
them
not
be
like
a
function
of
the
length
of
the
table,
but
just
like
constant
list
calls,
which
would
be
like
a
huge
speed
improvement
on
s3.
So
that's
just
one
little
tidbit.
I
wanted
to
share.
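To make that layout concrete, here is a small sketch of what sits under a Delta table and what a reader has to list before it knows which Parquet files to load; the local path and file names are illustrative and assume a table already exists there:

```python
import os, json

table_path = "/tmp/events"                      # hypothetical local Delta table
log_path = os.path.join(table_path, "_delta_log")

# The table directory holds the data (Parquet files); _delta_log holds the
# metadata: one JSON commit per table version, plus periodic checkpoints.
print(sorted(os.listdir(table_path)))           # part-*.snappy.parquet files and _delta_log
print(sorted(os.listdir(log_path)))             # 00000000000000000000.json, 00000000000000000001.json, ...

# Each commit file is a list of actions (protocol, metaData, add, remove, ...)
# that together define that version of the table.
with open(os.path.join(log_path, "00000000000000000000.json")) as f:
    for line in f:
        print(list(json.loads(line).keys()))
```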
A
That's
awesome
so,
adding
on
to
that
there
was
another
feature
that
I
think
was
in
the
works,
which
was
change.
Data
feed
was
was
that
release
card.
I
think
so
right.
B
Yes,
it
was
yeah
that
was
also
released
in
delta
2.0,
so
change
data
cdf.
That's
our
solution
on
delta
lake,
for
the
capture,
data
change,
problem
cdc
and
what
we
basically
let
users
do
is
now
capture
row
level
changes
as
opposed
to
file
granularity
changes
with
like
very
very
little
to
almost
no
performance
overhead.
B
So
now,
when
you
are
writing
your
data,
doing
upserts
updates
merges
deletes
whatever
we're
able
to
capture
just
the
row
level
changes
in
like
separate
parquet
files
such
that
when
you
read
you're
able
to
know
if
a
row
was
removed
or
updated,
etc,
and
that's
there's
a
lot
like
downstream
use
cases
of
that
feature
and
again,
we've
gotten
like
really
great
feedback
to
that.
So
I'm
curious
to
see
how
people
are
using
it
and
what
exciting
things
they're
doing
with
that.
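A brief sketch of how Change Data Feed is switched on and read back in PySpark; the table name and the starting version are placeholders:

```python
# Enable CDF on an existing table (it can also be set at CREATE TABLE time).
spark.sql("""
    ALTER TABLE events
    SET TBLPROPERTIES (delta.enableChangeDataFeed = true)
""")

# Read the row-level changes recorded since a given table version.
changes = (
    spark.read.format("delta")
    .option("readChangeFeed", "true")
    .option("startingVersion", 1)
    .table("events")
)

# Each row carries _change_type (insert / update_preimage / update_postimage /
# delete), plus _commit_version and _commit_timestamp.
changes.show()
```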
A
Awesome,
that's
that's
wonderful!
You
know.
Cdf
is
one
of
the
very
popular
features
that
have
been
asked
by
a
lot
of
people
who
want
to
make
sure
that
you
know
what
do
they
do
for
the
upcoming
tables
or
upcoming
records.
So
that's
really
helpful.
I
know
for
those
who
are
just
tuning
in,
we
also
have
gerard
in
the
background.
He
was
unfortunately
not
feeling
well
this
morning,
but
he
has
volunteered
to
answer
your
questions
through
slack.
He
worked
on
amazing
power,
bi
integration.
A
So
if
you
do
have
any
questions
related
to
that,
he
will
answer
in
the
chat
cool.
So
we
have
more
questions
on
our
linkedin
channel,
which
is
will
delta
lake
version.
2.0
support,
spark
3.1.1.
B
3.1.1? So Delta Lake 2.0 supports Spark 3.2, and currently we are not supporting cross-Spark-version support. A few people have asked for it, but there's a lot of work overhead for that sort of support, and right now we're prioritizing the latest Spark version and adding the best and latest Delta features. So if there's growing support and demand for that feature, that's a discussion we can have later on for cross-version Spark support, but right now, no, it won't work with 3.1.1.
A
Yeah,
so
sama,
if
you
want
to
you,
know,
give
this
feedback
in
our
github
repo,
where
we
have
like
roadmap
discussion,
we
can
possibly
you
know,
you
know,
make
that
into
consideration
in
the
future.
Awesome
yeah.
B
Go
ahead
I'll
share
a
link
in
the
in
our
zoom
chat
and,
if
you
could
share
it
to
linkedin
or
wherever
that
shows
the
the
delta
and
spark
cross
cross
version,
support.
A
Awesome,
let
me
look
at
the
youtube
all
right,
so
there's
a
question
around
support
for
dropping
columns.
A
Is
somebody
read
through
it
and
is
that
supported
now
in
2.0
version.
B
Yes,
it
is
matthew
if
you
ever
want
to
jump
in
for
these.
I'm
not
sure
how
familiar
you
are
with
the
latest
features,
but
feel
free
to
to
chime
in
as
well
but
yeah
drop
column
was,
is
support
added
in
delta
2.0.
A
That's awesome. So Matt, from your perspective, what is some of the momentum you are seeing in the Delta Lake community?
C
Well,
before
I
jump
into
that,
I
just
wanted
to
add
a
little
thing
for
the
drop
column,
which
was
so
basically
parquet.
Files
are
immutable
and
I'm
actually
kind
of
confirming
my
knowledge
here
with
scott,
so
parquet
files
are
immutable
so
prior
to
2o.
C
In
order
to
drop
a
column,
you
would
basically
need
to
read
in
all
of
the
data
and
then
write
it
out
less
that
column
so,
like
dropping
a
column,
was
a
big
data
processing
exercise
versus
what
we
do
now
is
we're
just
in
the
metadata
in
the
transaction
log
saying
we'll
just
ignore
that
going
forward
type
of
type
of
situation
and
without
the
column,
mapping
that
was
not
possible.
Pretty
much
and
column
mapping
was
added
in
one
two
is
that
correct.
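As a sketch of what that metadata-only operation looks like in practice (the table name and column are placeholders), column mapping is enabled first and the column is then dropped without rewriting any Parquet files:

```python
# Column mapping is what makes a metadata-only DROP COLUMN possible.
spark.sql("""
    ALTER TABLE events SET TBLPROPERTIES (
        'delta.columnMapping.mode' = 'name',
        'delta.minReaderVersion'   = '2',
        'delta.minWriterVersion'   = '5'
    )
""")

# With column mapping enabled, dropping a column only records a change in the
# transaction log; the underlying Parquet files are left untouched.
spark.sql("ALTER TABLE events DROP COLUMN obsolete_field")
```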
C
Yeah,
so
so
you,
I
think
to
your
other
question
vinnie.
So
what
are
the
kind
of
big
trends?
I
think
one
of
the
big
trends
I'm
seeing
is
just
more
adoption,
more
connectors
for
delta
lake.
We
have
so
many
connectors.
Everybody
seems
to
be
supporting
us
now,
which
is
so
important
because
you
know
when
you're
building
a
big
data
etl
pipeline,
you
always
want
to
be
able
to
go
from
one
system
to
to
the
next,
like,
oh
yeah,
we're
building
this
big
data
processing
pipeline,
and
then
we
want
to
make
a
delta
lake.
A
That's
awesome,
I
think,
releasing
delta
standalone
has
opened
a
lot
of
doors,
for
you
know
other
integrators,
other
integration
to
come
in
the
picture,
and
I
think
I
yeah
that's
a
very
critical,
or
that
was
a
very
critical
release
that
we
made
back
in
december
yeah.
A
So
in
terms
of
connectors,
I
think
some
of
the
things
we
were
working
on
around
flink
right,
something
that
that
was
missing
was
source.
Is
that
something
that
we
are
working
on?
Scott.
B
It's
something
that
we
worked
on
and
solved
yeah,
we
with
the
release
of
delta
connectors
0.5.
Two
weeks
ago.
We
added
support
for
the
flink
source.
So
now
we
have
a
flink
connector
that
suppose
that
supports
both
streaming
reads
and
writes
with
exactly
one's
guarantees.
Yeah,
it's
great
again.
We've
gotten
I've
gotten
a
couple
dms
actually
about
that
people
saying
thank
you.
B
We've
been
waiting
for
this,
and
so
now
that
we
have
both
the
sync
and
source
working
with
the
data
stream
api,
we're
now
working
on
adding
catalog
and
sql
support
for
our
flink
connector.
So
that's
our
next
major
initiative,
we're
still
in
like
the
design
phase
right
now,
but
we're
actually
we've
actually
gotten
a
few
issues
about
this
on
our
repo
and
we're
so
we're
excited
to
in
a
couple
weeks,
share
the
design,
dock
and
and
get
people's
feedback
on
our
approach
and
our
api
decisions,
etc.
A
You
know
it's
very
interesting.
A
lot
of
people
give
feedback
that
we
just
think
about
a
future
and
that's
our
community
releases
it
it's
it's
just
so
fast.
So
that's
a
very
good
feedback
cool.
I
think
another
question
is,
I
am
a
data
engineer
and
new
to
delta
lake.
May
I
know
from
where
I
should
start
to
learn
delta,
like
I
think
we
have
a
lot
of
resources,
but
I
would
let
panelists
answer
this
question.
C
I
personally
started
at
the
quick
start.
I
fired
up
a
jupiter
notebook.
I
went
through
the
examples
and
I
just
studied
what
was
happening
in
the
transaction
log.
That's
that's
how
I
personally
I
was
able
to
grok.
Definitely
because
at
the
first
it
seemed
a
little
mysterious,
but
then
once
I
was
like
performing
operations
and
seeing
which
transaction
log
entries
were
made
that
that's
what
made
me
understand
it,
but
luckily
it's
a
beautiful
abstraction.
So
you
actually
don't
need
to
do
that.
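For anyone starting the same way, a minimal quick-start sketch in PySpark; the path is illustrative and this assumes the pyspark and delta-spark packages are installed:

```python
from pyspark.sql import SparkSession
from delta import configure_spark_with_delta_pip

# Wire the Delta Lake extensions into a local SparkSession.
builder = (
    SparkSession.builder.appName("delta-quickstart")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
)
spark = configure_spark_with_delta_pip(builder).getOrCreate()

# Write a tiny Delta table, read it back, and inspect its history.
spark.range(5).write.format("delta").mode("overwrite").save("/tmp/delta-quickstart")
spark.read.format("delta").load("/tmp/delta-quickstart").show()
spark.sql("DESCRIBE HISTORY delta.`/tmp/delta-quickstart`").show(truncate=False)
```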
B
Yeah,
like
matt,
said,
I
don't
think
I
you
don't
necessarily
need
to
know
exactly
all
the
details
about
the
delta
protocol.
That's
all
abstracted
away
for
you,
we're
just
here
to
help
you
solve
your
problems
and
meet
your
business
needs.
What
does
get
exciting
from
my
perspective
is
those
little
details
about
the
delta
protocol,
because
one
feature
that's
on
our
h2
roadmap
for
this
year.
B
That,
I
think,
is
really
exciting,
is
called
deletion
vectors
and
what's
what
makes
it
so
exciting
for
me,
is
exactly
those
little
details
so,
for
example,
the
problem
that
it's
trying
to
solve
is
that
whenever
you
perform
an
update
on,
let's
say
a
single
file
in
your
delta
table,
because
we
support
multi-version
concurrency
control,
we
never
modify
that
parquet
file
in
place.
Of
course,
we
just
rewrite
a
new
one.
B
That
means
you
can
time
travel
back
in
time
and
see
historical
versions
of
your
table,
but
sometimes
you're
only
updating
a
few
rows
in
that
part
k
file.
Yet
here
we
are
going
and
rewriting
the
entire
par
k
file
and
what
deletion
vectors
help
us
do
is
within
a
certain
threshold.
We
will
just
be
writing
the
changes
to
a
separate
file,
which
means
you're,
not
rewriting
the
entire
4k
file.
B
So
little
detail
little
details
like
that
once
released
like
once
fully
developed
that
those
will
significantly
speed
up
your
reads
and
rights,
your
rights,
sorry
only
but
yes,
little,
details
like
that
are
really
fun.
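Since time travel came up, a quick sketch of reading an older version of a table in PySpark; the path and version number are placeholders:

```python
# Every commit produces a new table version, so older versions stay queryable.
v0 = (
    spark.read.format("delta")
    .option("versionAsOf", 0)             # or .option("timestampAsOf", "2022-08-01")
    .load("/tmp/delta-quickstart")
)
v0.show()

# DESCRIBE HISTORY shows which versions exist and what operation created each one.
spark.sql("DESCRIBE HISTORY delta.`/tmp/delta-quickstart`").select(
    "version", "timestamp", "operation"
).show(truncate=False)
```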
A
Yeah,
I
think
that's
a
good
point,
scott,
because
you
know,
as
a
data
engineer,
you
you
like
to
know
how
those
mini
details
come
into
the
picture.
What
what?
What
can
make
your
data
engineering
pipeline
robust
so
sometimes
yeah?
It's
it's
really
helpful
to
know.
Like
you
know
what
is
the
protocol?
What
is
it
doing?
So
I
think
we
have
a
lot
of
resources
if
you
do
want
to
go
to
delta
dot
io.
A
We
have
like
learn
page
where
you
can
start
with
getting
started,
guide
spin
up
your
local
instance
or
you
know
in
the
cloud
we
are
also
working
on
making
more
tutorials
accessible
for
the
community
so
that
they
can
get
up,
started
and
running
yeah.
So
please
check
out
delta
dot,
io
and
github.
I
pasted
the
link
in
linkedin.
B
Yes,
that
is
a
great
question.
I
yeah
you're
testing
my
knowledge
of
spark
sql
right
now.
I
don't
think
we
support
it
right
now
and
I
also
don't
see
it
on
our
h2
roadmap.
So
this
is
something
where,
if
this
is
something
that
this
user
wants,
I
think
they
should
message
us
on
slack.
Join
our
community,
create
an
issue
and
start
a
conversation
about
that,
because
currently
there
hasn't
been
many
asks
for
that.
So
that's
not
something.
B
That's
not
a
feature
that
we've
actually
prioritized,
but
if
that's
something
that
the
community
wants,
then
we're
happily
happy
to
work
on
it.
A
Another question is about clusters and the clusters API. I think that is more specific to Databricks; I think you are using Databricks, so I will paste some resources to help you, Sharon. Yeah, this session is only for Delta Lake OSS. Cool. So I think we are coming close to the hour, close to our 30 minutes, but I do want to make sure that we call out some specific features that the community is working on in the second half of the roadmap, Scott.
A
What
did
we
miss
like?
What
are
some
of
the
features
that
our
prioritized
based
on
our
feed
feedback
from
the
community.
B
Yeah
great
question:
the
biggest
feature
was
that
was
really
really
demanded.
The
past
couple
weeks
has
been
support
for
spark
3.3
and
I'm
happy
to
announce
that
with
delta
lake
2.1,
which
a
preview
was
just
released
yesterday,
for
that
we
we
have
added
support
for
spark
3.3.
So
with
that
comes
like
a
lot
of
all
the
any
kind
of
improvement
you
get
with
the
latest
version
of
spark.
That's
now
brought
into
delta
lake
and
there's
also
some
extra
sql
syntax
support.
That's
pretty
exciting
too.
B
Preview
was
yesterday
and
then
oh
preview,
so
the
goal
of
our
previews
is
to
get
community
feedback.
Are
there
any
bugs
any
noticeable
performance
decreases?
Even
though
we
do
our
own
internal
benchmarks,
get
early
feedback
from
the
community
work
up
some
kinks
and
then
release
the
final
version
in
a
couple
weeks.
A
Awesome
awesome
ravi
is
asking
how
shall
I
research
on
integration
with
delta
lake
yeah,
so
I'm
gonna
drop
links
here.
We
have
worked
on
a
lot
of
integration,
so
I'm
gonna
drop
links
for
you
for
our
roadmap.
You
can
check
it
out.
There.
A
Marcos
is
saying
very
excited
about
2.1
release,
especially
time
travel
and
show
column
support.
Oh
thank
you.
A
Awesome
there
are
more
people
joining
in
now
any
closing
thoughts
from
you,
matt.
C
I
think
closing
thoughts
is
we
just
have
a
really
friendly
nice
community
and
we
encourage
everybody
to
join.
Our
slack
ask
questions,
ask
questions
on
stack,
overflow
and
we're
always
happy
to
help
and
we're
very
friendly.
B
If
I
could
add
on
to
that,
not
only
do
I
want
you
to
join
our
community,
I
want
you
to
make
prs,
because
we've
had
a
lot
of
really
good
features
added
in
the
past
couple
weeks
after
the
launch
of
delta
2.0,
when
people
got
excited
about
delta
lake
and
people
have
been
adding
great
features,
so
a
couple
of
those
actually
have
been
our
dml
commands
like
delete,
update,
merge,
etc.
B
The
sql
commands
are
now
are
actually
returning,
some
really
useful
metrics
from
that
operation,
which
was
an
api
that
wasn't
there
before.
So
this
is
something
where
users
wanted.
This
wanted
to
see
this
result
and
they
made
the
pr
themselves.
So
I'm
glad
that
people
are
are
contributing.
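As an illustration of what that looks like (the table path is a placeholder, and the exact metric columns returned depend on the command and the Delta version), the DataFrame returned by a SQL DELETE now carries operation metrics instead of being empty:

```python
# Run a DML command through SQL; the returned DataFrame now carries operation
# metrics (e.g. how many rows were affected) rather than being empty.
result = spark.sql("DELETE FROM delta.`/tmp/delta-quickstart` WHERE id > 3")
result.show()

# The same metrics are also recorded in the table history.
spark.sql("DESCRIBE HISTORY delta.`/tmp/delta-quickstart`").select(
    "version", "operation", "operationMetrics"
).show(truncate=False)
```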