From YouTube: Delta Lake Community Office Hours (2022-09-08)
Description
Join us on September 8, 2022 at 9:00 AM PDT for the Delta Lake Community Office Hours! Ask your Delta Lake questions live and join our guest speakers, alongside Vini Jaiswal from Delta Lake!
Ask us your #DeltaLake questions. These sessions allow our community to ask questions about Delta Lake OSS and get to learn what we are building, planning to build and know about recently released features. These sessions are live and the recordings are available on the Delta Lake YouTube channel.
Quick links:
https://delta.io/
https://go.delta.io/slack
https://github.com/delta-io/delta/releases
https://groups.google.com/g/delta-users
A: If you missed any of the previous AMA office hours, no sweat: there are recordings available on our LinkedIn and YouTube channels. Just a quick reminder as well, this is an official Delta Lake webinar and is therefore subject to a code of conduct. Please don't do or post anything in the Q&A, or ask questions, that would be in violation of that code. I will paste the link to the code of conduct.
A: That was a thing I had to mention, and I hadn't mentioned it yet. Without further ado, we would love to know where you are dialing in from. We have a panel from across the world: California, France. Cool. In the meantime, while you are joining our channels, we will go ahead and kick off the introductions of our panelists. I will start with the person on my right. Florian, why don't you give a quick intro?
B: Thank you, Vini. Hello, everyone. I'm Florian and I work at Back Market, a marketplace where we sell refurbished devices, and I'm a contributor to delta-rs, a native Rust library for low-level access to Delta tables. We provide Python bindings as well.
C: Hi, my name is Nick Karpov. I'm a developer advocate here at Databricks. I've been working on the Delta project for several years in my capacity as a field engineer, and I recently joined the dev advocacy team. So I'm excited to be here and talk about Delta.
D: Hi, I'm Allison, and I'm a software engineer here at Databricks. I work on the Delta Lake project, and this includes things like managing the 2.1 release, so I'm happy to talk about things related to that. I've also worked a little bit on the connectors repo and the Delta Standalone project.
A: Pretty cool. Last one, we have Ryan.

E: [inaudible]
A: That's awesome; we have an exciting range of contributors here. I'm Vini Jaiswal, developer advocate at Databricks, and I have been associated with the Delta Lake project for several years now, so I'm happy to answer your questions. Please post your questions here. We recently made three major announcements for our project, so we will cover what is available in Delta Lake 2.1 with Spark 3.3.
A: We also have a release of the Python/Rust bindings, so you can ask questions around that. We also shipped a fix for a bug in a recent release, so that's how quickly we are releasing features. Pretty exciting.
A: So what are some of the things that were fixed in the Delta Python/Rust bindings? Florian, would you be able to take that question?
B: Yeah, thank you. Basically, we released a breaking change with the latest version of the Python bindings, 0.6.0. We changed the way we read the storage, so we allow more configuration options for the different cloud providers that we support, and this we can change using a new Rust crate.
B: We would like to evolve with the rest of the Python ecosystem, and there are also major improvements to the way we read and parse the statistics of Delta tables using the library. We also provide better schema support, especially for dates and decimals.
A: That's awesome. For those who don't know what the Delta Rust bindings are, can anybody on the panel give a quick overview of what that project is?
E: Basically, we used Rust to write a reader and writer based on the Delta protocol. We already have the Delta protocol, which defines how to read and write the Delta transaction log, and delta-rs builds that from scratch using Rust itself, so you don't need Spark.
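(Editor's note: a minimal sketch of what that looks like from Python, assuming the `deltalake` package, the delta-rs Python bindings, is installed and `./my_table` is an existing Delta table; no Spark cluster is involved.)

```python
from deltalake import DeltaTable

# Open the table by reading its transaction log directly
dt = DeltaTable("./my_table")

print(dt.version())   # current table version
print(dt.files())     # Parquet files that make up this version
df = dt.to_pandas()   # load the data into a pandas DataFrame
```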
A: Yep, that's what it is. Also, if you have any projects you are working on and you would like to build a connector with Delta Lake, we welcome contributions from everyone. Because we released the Standalone reader and writer back in December, Delta Lake opens a whole new room for connecting with other ecosystem projects, so definitely check out our GitHub repo and put your PR there.
A: We have a very good reviewing schedule, so we will be happy to connect with you. There is a question around whether Delta Lake works with EMR. I'm pretty sure it does, and you can configure it on EMR; if you need the binaries, they are available on our GitHub repo. Would anybody like to add anything?
E: Basically, EMR ships Apache Spark on YARN, and as long as you are using Spark, it's pretty easy to use Delta Lake. We have a quick start document that tells you how to start using Delta Lake with Spark.
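(Editor's note: the quick start boils down to launching Spark with the Delta package and extensions; a sketch assuming Spark 3.3 and Delta Lake 2.1, using the coordinates published to Maven Central.)

```shell
pyspark --packages io.delta:delta-core_2.12:2.1.0 \
  --conf "spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension" \
  --conf "spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog"
```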
A: And we are working on more tutorials for those quick starts, so we are also improving our documentation. Like Allison mentioned, please definitely let us know if we are lacking anywhere in the documentation, and our team will be happy to update it. Awesome. There is another question on anything related to governance, data governance, that we have added in Delta Lake, or data quality, something like Great Expectations.
E: Yeah, basically there is support for, for example, constraints. You can define requirements for a column; it's pretty much SQL standard. You can just add a constraint on a column and set whatever expectation you would like for that data, and then with this check constraint we will verify the data quality first before writing to the Delta table.
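(Editor's note: a minimal sketch of a check constraint in Delta's SQL syntax; the table and column names are made up for illustration.)

```sql
-- writes that violate the constraint are rejected
ALTER TABLE events ADD CONSTRAINT valid_score CHECK (score >= 0 AND score <= 100);
```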
A: Yep, exactly. Also, if you have any PII data, or some kind of compliance that should be enforced, or you need separation of tables for your users, you can also apply a salt key or some kind of pseudonymization to your tables. We also recommend the bronze/silver/gold level architecture, so you don't have to give all the users in your organization access to all tables. You can basically apply ACLs.
A: Sorry, ACLs, meaning access-level controls at the table level for your users, to govern your data in that regard as well. So there are different ways and approaches you can take. Would anybody else like to add any other recommendation?
C: You can also, and this is a little bit more custom, perhaps leverage the user commit metadata on a per-commit basis to write additional metadata for a given committer, and then leverage that as the basis of additional information to improve governance. You can use that in your readers to do checks, allow-lists, things like that. It would require more work than just using the metadata, but the metadata could be the basis of that.
C: Well, the technical ability to add metadata as part of a commit is kind of foundational. As an example, within Databricks specifically there is metadata regarding which user actually committed this data, from what cluster, things like that. In the open-source Delta project, you could achieve similar things using the user commit metadata.
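(Editor's note: in open-source Delta this is the `userMetadata` field of the commit info; a sketch of setting it from SQL, with a made-up table and an arbitrary tag string.)

```sql
-- tag subsequent commits in this session with custom metadata
SET spark.databricks.delta.commitInfo.userMetadata = owner=data-eng;purpose=backfill;

INSERT INTO events VALUES (1, 'click');

-- the tag appears in the userMetadata column of the table history
DESCRIBE HISTORY events;
```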
A: Also, I think I read somewhere in our release notes that we recently added some more operational metrics to the Delta Lake project. Is that right?
D: I think it's less that we added new operation metrics, and more that we exposed them directly after performing the operation. Previously, you'd have to go to DESCRIBE HISTORY and retrieve the metrics for previous versions and previous commits, but we just added it to the SQL commands now, so that you get those operation metrics directly back from your command.
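(Editor's note: a sketch of the difference, with a made-up table name; in Delta Lake 2.1 the DML commands return their operation metrics directly, where previously you had to dig them out of the history.)

```sql
-- 2.1: the command itself returns metrics such as num_affected_rows
DELETE FROM events WHERE score < 0;

-- previously: look up the metrics of the last commit afterwards
DESCRIBE HISTORY events LIMIT 1;
```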
A: Yeah, and I think this is a very powerful feature, having that exposure to operation metrics. I have seen a lot of use cases in the field where it could be used, for backtracking your data as well as a lot of other different use cases, so having that functionality is pretty helpful. Awesome. There's another question: do you need to reshuffle before merging in Delta Lake, or is it enabled by default?
C: I think the merge consists of two joins, which will perform a shuffle, and there is a repartition-if-needed flag that you can add so that your files do get shuffled and repartitioned into larger files per partition.
C: But if the question is whether you need to repartition before you run MERGE, then probably not.
A: Like merge repartitionBeforeWrite equals true, or something; I think we have documentation around that.
C: Particularly on the source side, I guess, that could have an impact, but on the writer side you don't really have control, except for the flag, which you can disable because it's enabled by default.
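(Editor's note: the flag being discussed is exposed as a Spark conf in open-source Delta; a sketch of toggling it.)

```sql
SET spark.databricks.delta.merge.repartitionBeforeWrite.enabled = true;
```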
B: We can think about Databricks as a way of decentralizing the data sets produced, so in a way it's a more decentralized way to fetch everything and make it available for every use case that we can find. So in a way, yes.
A: Awesome. I would definitely like to dive into some of the 2.1 features. It would be helpful to know why we decided to release some of the features in 2.1, like what were the most-asked features, if our panelists can give a quick answer on that.
D: Well, I think probably the biggest thing was really the upgrade to Spark 3.3. This has been something that we had been meaning to do for a while, and it was pushed back a little bit because of 2.0, and I think it's something that people have requested a lot. So that was the major thing pushing the 2.1 release, but I also think that a lot of the other main improvements that came along with it had to do with SQL syntax.
D: So yeah, a lot of different changes on the SQL side. Other than that, there have been a lot of other improvements, as well as bug fixes and things like that, a lot of which were contributed by community members, which is awesome. I'm happy to talk about anything in more detail, if anyone wants.
A: Thanks for the overview, Allison. Time travel is definitely a very good highlight and, of course, Spark 3.3. Any other features from your perspective, Nick, that will help unblock any use cases?
C: Sorry, I was on mute earlier when you asked me. I don't know if it was mentioned, but maybe the available-now trigger that was introduced in Spark 3.3; that could certainly help. It's not exactly a Delta use case, but as far as the Delta Spark connector goes, it's definitely a notable improvement that can help people implement their streaming workloads.
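(Editor's note: a minimal sketch of the available-now trigger in PySpark 3.3; the Delta paths are made up for illustration. It processes everything available in batches and then stops, rather than running continuously.)

```python
(spark.readStream.format("delta").load("/data/events")
    .writeStream.format("delta")
    .option("checkpointLocation", "/data/_ckpt")
    .trigger(availableNow=True)   # new in Spark 3.3
    .start("/data/events_out"))
```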
A: Yeah. Nick, quick question: doesn't the latest version always get retained?
A: And also, I think vacuum is a soft delete for seven days if a retention duration is not specified. If the retention duration is zero, then it will delete everything, but if you don't specify a retention duration, the data is still there, softly, for seven days, in case you want the ability to read those table versions back within seven days.
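(Editor's note: a sketch of the retention behavior in SQL, with a made-up table name.)

```sql
-- default: only files older than the 7-day retention period are removed
VACUUM events;

-- removing everything immediately requires an explicit retention
-- and disabling the safety check
SET spark.databricks.delta.retentionDurationCheck.enabled = false;
VACUUM events RETAIN 0 HOURS;
```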
A: Awesome. All right, let me see if we have any more questions.
A: I think one question is when foreign-key/primary-key relationship constraints are coming. I think it's in the way you define the table, right?
A: Yep, good call-out. All right, so I think that's all the questions we have. Thank you, Nick, for sending the link. Cool. Any other closing remarks from anybody on the panel?
C: If there are none from anybody else, I'm just really excited to hear from the rest of the community what new features you'd like to see, so please don't hesitate to reach out.
A: Awesome. And if you do want to associate yourself with the project, if you want to contribute something, we have a list of good first issues, so I will send the link in the chat. You can start with the good first issues. Don't hesitate just because it's your first contribution or something; just get started, and we will be happy to help you get there. Awesome. Thank you all.