From YouTube: Delta Lake Community Office Hours (2022-03-31)
Description
Join us for the next Delta Lake Community Office Hours and ask us your #DeltaLake questions. The Delta Lake community AMAs occur bi-weekly on Thursdays at 9AM PST. These sessions allow our community to ask questions about Delta Lake OSS and learn about what we are building, what we are planning to build, and recently released features.
A: Yeah, why don't you check LinkedIn, Danny? I think I'm still waiting for the URL, but...
A: Awesome, everybody who is joining, we are just getting started. Hi everyone.
A: From Italy! I am from San Francisco. We are joining from all different places, so why don't you put down where you are joining from, and then we will start with introducing our panelists. Great. And for any questions related to Delta: these are the Delta Lake Community Office Hours, so we encourage you to ask questions on open source Delta Lake. It can be questions about the feature releases that we did recently or what is coming up on the roadmap.

A: So with that, I will kick it off with introductions to the panelists. We have Scott, Venki, and Denny, contributors to Delta Lake OSS. So why don't we start with Scott. Scott, why don't you introduce yourself?
C: Sure, thanks Vini. Good morning, everyone, or good afternoon, wherever you're from. I'm Scott, I'm on the Delta Lake ecosystem team here at Databricks, and what I've been working on recently is a variety of open source features for Delta Lake, as well as expanding the connector ecosystem. For the former, one exciting feature I worked on recently was open-sourcing column stats generation and data skipping.

C: So that's a big boost to read performance, and I'm sure that in the next release, 1.2, people are really going to enjoy that. And for the connector ecosystem, we are working on releasing the Delta Flink sink, and we're actively working on developing the Delta Flink source, and that's all really exciting. Thanks.
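A minimal sketch of what the newly open-sourced data skipping looks like from the reader's side, assuming Delta Lake 1.2 on Spark; the table path and column name are made up for illustration. Per-file min/max column stats recorded in the transaction log let the planner prune files that cannot match a selective filter.

```python
from pyspark.sql import SparkSession

# delta-spark must be on the classpath, e.g. started with
#   --packages io.delta:delta-core_2.12:1.2.0
spark = (
    SparkSession.builder
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

# Files whose recorded min/max for `event_date` fall outside the predicate
# can be skipped entirely, without ever being read.
spark.read.format("delta").load("/tmp/events") \
    .where("event_date = '2022-03-31'") \
    .show()
```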
D: Sure, thanks Vini. I'm Venki, and I'm also part of the Delta ecosystem team at Databricks. I've been working on some parts of the connector-related work with Delta and also improving the Delta Lake project. Recently I worked on a file compaction feature, some improvements to restoring a Delta table to earlier snapshots, and some bug fixes here and there. On the connector ecosystem side,

D: I worked on the Presto and Delta Lake integration, and also the Trino and Delta Lake integration, so those are available now in the respective Trino and Presto projects. In the future, we're planning to work on improving the file compaction to capture partial progress and also adding Z-ordering so that queries will be faster.

D: Yep, and I'm looking forward to reviewing any pull requests or any issues that you post on the Delta Lake projects. Thanks.
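Since Venki mentions the restore work, here is a minimal sketch of what that looks like as it later shipped in Delta Lake 1.2; the table path, version, and timestamp are made up, and `spark` is a Delta-enabled SparkSession as in the earlier snippet.

```python
# Roll a table back to an earlier snapshot recorded in the transaction log.
spark.sql("RESTORE TABLE delta.`/tmp/events` TO VERSION AS OF 5")

# Restoring by timestamp is also supported:
spark.sql("RESTORE TABLE delta.`/tmp/events` TO TIMESTAMP AS OF '2022-03-01'")
```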
A: That's exciting, thank you, Venki. Already questions are coming through for you, great to hear, so we will get to those. But let's get Denny's introduction. Denny, why don't you introduce yourself?
B: Thanks very much, Vini. Hi everybody, my name is Denny. I'm a developer advocate here at Databricks, long time Spark and Delta Lake guy, so I'm here to answer any of those Delta Lake questions. But I figure with Scott and Venki here, I probably won't be needed that much today, so hopefully that stays true.
A: I think we need a little bit of everything from the panelists, so thank you. Awesome. So yeah, hi everybody, I am Vini Jaiswal, developer advocate at Databricks, and I'm excited to get started with our questions.
A: So let me start with the question on Trino. The question is: Trino launched a Delta connector; will Trino be able to leverage optimizations done on Delta, such as Z-order and bloom filters, while reading?
D: So the optimization has two components: file compaction and Z-ordering. For file compaction, Trino should be able to make use of it, because it's independent; you're just compacting the small files into bigger files, so the planner should be able to read less metadata and prune more, depending on how the data is laid out.

D: For Z-ordering, I am still investigating how much work is needed and how to implement it in Delta Lake. When we implement Z-ordering in Delta Lake, we are also planning to look at how Trino and other projects can make use of it. So I don't have an exact answer now, but stay tuned for this one; we'll soon have an issue in the project with the details.
A: Yeah, your work slate seems so busy with all these requests. Thank you; it's exciting, and it's getting the community more excited about it. All right, so let's take a question from YouTube: are we planning file skipping without Z-ordering at the time being? I'll repeat the question: are we planning file skipping without Z-ordering for the time being?
C: That's what's already been added to the master branch of Delta Lake, and that's what will be included in 1.2. Of course, adding Z-ordering will make file skipping that much better, but the first version is up right now.
A: Awesome, thanks Scott. And for those who want a link to the roadmap: we just released data skipping, as Scott mentioned, and in Q2 we will be working on Z-ordering, so I'll put the link in the chat. All right, the next question is: will the upcoming DynamoDB log store work with DynamoDB-compatible alternatives like...
C: I can... I think I can take that. Oh okay.

C: I'm the one working with an open source contributor on that, so it's a good question for me. That is a great question. So currently it won't; currently we are hard-coding this to work just with DynamoDB, because this is our first attempt at solving this problem. But one relevant implementation detail of our solution is that there's a specific class that interacts with all of our metadata and interacts with all of our cloud stores.

C: It's called a LogStore, and what we've done is we've abstracted away the whole problem of interacting with a cloud store that doesn't give you mutual exclusion; one example is S3. We've implemented one version which does use DynamoDB, and that's our first version, but it's completely open-ended, and there can definitely be other implementations that use any other key-value store or any other external locking store as a solution. So if this is something you're interested in, file an issue, and we can...
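To make the LogStore abstraction Scott describes concrete, here is a hedged sketch of how an alternative implementation gets plugged in: Delta resolves transaction-log I/O for a path scheme through Spark configuration, so a community-built class backed by a different key-value store could be swapped in the same way. The class name below is the DynamoDB-backed one that was being developed at the time; treat the exact names as assumptions, not a released API.

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    # Route transaction-log reads/writes for s3a:// paths through a
    # LogStore implementation that uses DynamoDB for mutual exclusion.
    .config("spark.delta.logStore.s3a.impl",
            "io.delta.storage.S3DynamoDBLogStore")
    .getOrCreate()
)
```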
B: Yeah, one thing I'd like to add to Scott's point: don't forget, the reason why the DynamoDB log store is required is because we need a lock store, as opposed to a log store, in this particular case, due to the fact that S3 is missing put-if-absent and has consistency issues. So this is not applicable to, excuse me, not applicable to Azure or Google Cloud Storage.

B: There are rumors that this will eventually be fixed within S3 as well. But the community member that Scott's currently working with has actually been running this type of system in production for a while now, so it's pretty cool. Basically, Scott's been working very closely with that team to make sure that it's applicable and more generic for many other use cases that are also using S3. So I just wanted to add a little tip in.
B: That's actually already in production for the last three or four months as part of the kafka-delta-ingest project, and they're bringing it back into delta-rs. kafka-delta-ingest, as the name sounds, is for Kafka to write directly to Delta Lake, and it happens to be utilizing the Rust API to do that; some set of classes are part of kafka-delta-ingest, and now they're moving those classes, well, actually more like crates and modules, up to delta-rs itself,
B: in order to be able to go do that. So right now the primary focus is to get that part done first, and there are actually a couple of PRs already open for that. Now, related to that, because you did ask specifically about Python bindings: the Python bindings would actually come shortly after that part is done. Right now there are already Python bindings that work with delta-rs for the purpose of reads, so as soon as the write path is done, the Python bindings will be working on top of that afterwards.
B: I believe the target was around the Q2 time frame, but the best thing to do, specifically, would be to go ahead and join the Delta Users Slack. There's actually a delta-rs channel where you can go ahead and chime in and ask your questions, and we can probably provide a lot more details. That actually leads to the second question you just asked, which is about the partitions and things of that nature.

B: Honestly, I think that's still up in the air in terms of those specific details, and I would advise joining the Delta Users Slack again, the delta-rs channel; it's literally #delta-rs, and all of us are there to answer your questions specifically on that. So hopefully that does answer your questions.
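As a pointer to the read support Denny mentions, here is a minimal sketch using the existing delta-rs Python bindings (`pip install deltalake`); the table path is made up.

```python
from deltalake import DeltaTable

dt = DeltaTable("/tmp/events")
print(dt.version())            # current snapshot version
print(dt.files())              # data files in the current snapshot
table = dt.to_pyarrow_table()  # materialize the table via PyArrow
```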
A: It does, thank you, Denny. There's a question about: is OPTIMIZE available with the OSS now? I think a part of it is; I would say file skipping is available, but...
D: So the OPTIMIZE file compaction is available on master, but it's not yet released. It will be released soon, in the next couple of weeks.
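For reference, a minimal sketch of the OPTIMIZE file compaction Venki describes, as it landed on master for the Delta Lake 1.2 line; the table path and predicate are made up, and `spark` is a Delta-enabled SparkSession as in the earlier snippet.

```python
# Compact small files across the whole table...
spark.sql("OPTIMIZE delta.`/tmp/events`")

# ...or only within selected partitions.
spark.sql("OPTIMIZE delta.`/tmp/events` WHERE event_date >= '2022-03-01'")
```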
A: Awesome, thank you. And then...
B: Sorry, I did want to add: I think he's downplaying all the amazing work that he and the other team members have worked on. It's actually really exciting that we have OPTIMIZE being brought into Delta OSS. I do want to call out that there are going to be a bunch of blogs and, for that matter, community AMAs like the one we have here right now, specifically for us to go into some of the details. But I think Venki's being a little humble and modest about how cool of a project that particular feature is, so hats off to the team for getting this out already with Delta 1.2.
A: Thank you, thank you, yeah. That's an amazing feature; I think it was requested for over a year, so thank you for finally releasing it soon. All right, the next question is: how do you see the focus on, and the future of, using manifest files with the connectors that you're releasing?
D: Yeah, so I can talk about that. It's mainly from a maintenance point of view: you have to keep the manifest for each partition in sync whenever you make any updates to the table. That's the reason why we developed the PrestoDB and Trino native connectors, so that you can just rely on the Delta log as the source of truth, and the user, or the data engineer, doesn't have to worry about doing the extra work of generating the manifest and everything.

D: I don't know about the future; there may still be some use cases that need the manifest-based approach. But in cases where the maintenance is a burden, especially the Presto and Trino cases, we are trying to get around that and use the Delta log directly.
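For contrast with the native connectors, here is a minimal sketch of the manifest-based approach Venki describes; the symlink-format manifest has to be regenerated after every table update, which is exactly the maintenance burden the native connectors remove. The table path is made up, and `spark` is a Delta-enabled SparkSession as in the earlier snippets.

```python
from delta.tables import DeltaTable

deltaTable = DeltaTable.forPath(spark, "/tmp/events")

# Writes _symlink_format_manifest files that manifest-based engines
# (e.g. Presto/Athena) read. Must be re-run after every update to the
# table, or the manifest falls out of sync with the Delta log.
deltaTable.generate("symlink_format_manifest")
```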
A: Awesome, thanks Venki, yeah, that answers this question. The next question is: any plans on enabling tagging of the objects on Delta Lake? Can we use an S3 lifecycle policy with tags to clean up inactive files or objects?
B: I'm going to take a stab, unless you want to take a stab at it first, Scott? Oh yeah, all right, cool. So the concern with using the S3 lifecycle policy is that it actually would not be interacting with the Delta Lake transaction log, so you wouldn't actually be able to ensure consistency of the data, especially from a transaction perspective. If you think about what an S3 lifecycle tag, or lifecycle policy, does: you can automatically delete files based on whether they're too old.

B: But the problem is that it's not actually interacting with the Delta Lake transaction log to determine whether those files are maybe old but still valid for the purpose of the data that you're trying to query, say, because you've decided to keep a really long history of the data, or, for that matter, it was all inserts and there were no updates. So it's perfectly fine that you've got data that's two years old but still valid, because you want to query your table from two years ago, right?

B: So there are discussions, nothing concrete yet, in terms of: is there a way to intermix the two? Could you tag with the S3 lifecycle policy but then also interact with the transaction log somehow? But remember, when you talk about the S3 lifecycle policies, it's just simply a tag that says go delete the files that have been tagged. So it's going to be a lot more complicated than just simply saying, okay, let's tag anything that's old. Long story short, it's not that simple. Maybe we can make use of lifecycle policies with metadata, but even then, this is not a simple thing to solve. Long story short.
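The transaction-log-aware way to clean up stale files, in contrast to an S3 lifecycle rule, is VACUUM, which consults the Delta log and only deletes files no longer referenced by the retained history. A minimal sketch; the table path and retention window are made up, and `spark` is a Delta-enabled SparkSession as in the earlier snippets.

```python
# Delete files that are no longer referenced by any table version
# within the last 7 days (168 hours) of history.
spark.sql("VACUUM delta.`/tmp/events` RETAIN 168 HOURS")
```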
C: I would always just like to know what people's use cases are. You know, if this is a feature someone wants, I'd be curious to learn about why they want it. What's missing from the Delta protocol and Delta Lake that makes you want this? It'd be great just to talk and learn more about the problem you're trying to solve.
A: Yeah, I agree. And you know, for those joining and interested: it's always helpful for our team, when we are brainstorming on what to build, to understand the use cases, and there are different industries where you're trying to solve different problems. So thank you. Another question is...
B: I'm a little confused on this one, honestly, because if you're trying to get an explain plan in terms of Spark, there would already be one: the logical plan would be generated within Spark, and you can go read it. Is the context within maybe some other system? Because Delta, from the standpoint of what it is, is a file system, right, I mean a storage layer. So it has the metadata to do that, but it's whatever's actually running the queries themselves...
D: There was recently one issue on the open source side, I think, where the user wants to build a MERGE command and then call explain so that they can see the physical plan. Currently that is not available, so I believe that's what the question is about. Regarding plans for when that will be available, I'm not sure, but if that is something you want, feel free to put your comments on the JIRA, sorry, the GitHub issue, so we can prioritize it.
A: Yeah, so Rajendra, if you can provide more context, Venki and Denny are here to help and support, so we'll look forward to it. Thank you. The next question is: auto optimize is known best for streaming loads, but what would be the bottlenecks when used with batch workloads?
D: I think it's basically one of the benefits of auto optimize: if you are inserting small files, which is the case in streaming, it helps to compact them as soon as they come in, so in those cases it helps. In the batch case, it depends on what size of files the batch job is generating; if you have a similar situation, then auto optimize may help.
A: Thank you. I think there is kind of a feature request question. The question is: will there be a direct Delta table or Parquet to Grafana integration in the future? I do understand that it may be up to the Grafana team to develop the plugin, but are there any considerations in the future for this type of connector?
B: I mean, honestly, I would create a GitHub issue for something like this and get votes from the community, because that would basically tell both the Delta Lake community and, for that matter, the Grafana community that this is something we should all work on together. So, to just provide a little context:

B: all the issues that you see in the roadmap that we pasted into both the LinkedIn and YouTube channels, for the Delta Lake roadmap, are based off of Slack messages, GitHub issues and, oh sorry, of course the Delta Lake Google Group. We're getting tons of feedback from the community on all of these features, and that's actually how we prioritized what we have here. So it's not to say that we're not interested in Grafana; it's more a matter of, candidly,

B: at least personally, I don't think we've been asked that question until now. So I would definitely chime in and add an issue in GitHub, or for that matter in the Delta Users Slack, and just see if you can get other people involved with it. That would definitely allow both the Grafana teams and the Delta Lake teams to say: yes, maybe we should go work on this together. So...
A: Yeah, and we can provide you different channels to get started with our team and with our open source community to build the connector. All right. Are there plans to move... okay, I'm not sure about this question, but I'm going to read it anyway. Are there any plans to make vacuum have no effect on CDF (change data feed) logs?
B: Yeah, I'm probably going to fumble this a little bit, but I'm going to give it a try anyway. Okay, so first things first: change data feed has been added to the Delta Lake roadmap, as we've had a lot of feedback, and we're targeting around the Q2/Q3 time frame for this one. So we haven't really dug in deep in terms of how we're going to be implementing it, but we have some pretty good ideas and designs. I just want to call that out first.

B: Now, related to that: the change data feed itself would contain pretty much any changes. So basically there's a Delta table that contains the deltas, pun intended, of your table, and every single insert, update, whatever you're doing, is recorded in this table. So if you're running a vacuum on the change data feed, in essence what you're doing is removing any reference to the fact that the change occurred. And so I guess I sort of need to understand a little bit better:

B: is there maybe a context that I'm missing here? Because in the end, running a vacuum on the table itself, in essence the snapshot table or the static table, makes sense, because you're saying I don't want the history anymore. But the change data feed itself would actually contain every single change, and that's sort of the point of CDF. So unless I'm missing something, and it's possible, by the way, so I apologize if I'm missing the context here.
A
But
I
think
that
you
provided
a
little
bit
more
context
today,
so
that's
that's
helpful
yeah
on
how
how
they
can
go
about
doing
auditing
and
working
with
cdf.
So
if
you
have
any
follow-up
for
whoever
asked
the
question,
please
use
slack
and
we
can
provide
more
details.
So,
thank
you
all.
You
know
I
think
before
we
leave.
A
There
are
a
lot
of
questions,
so
it's
amazing
to
see
how
much
engagement
you
you
provided
through
our
live
event,
and
we
will
get
through
all
the
questions
we
will
try
to
do
it.
I
have
one
more
announcement,
so
all
right
ready
for
this.
We
now
have
a
delta
lake
linkedin
home
now,
so
please
follow
and
share
with
your
network.
It's
this
is
the
url
linkedin.com
company
delta
lake.
A
We
really
appreciate
your
support
and
you
know
being
here
with
us,
always
showing
the
support
by
as
a
ques
question,
by
asking
questions
and
thank
you
to
the
panelists
who
answered
all
the
questions
in
like
a
very
detailed
manner.
So
thank
you
all
any
closing
thoughts.
B: Just want to say thank you, everybody, for asking questions. Sorry if we haven't chimed in on everything, but join us on exactly what Vini called out: the Delta Lake LinkedIn and, of course, the delta-users Slack channel, where we're all super active as well. So...