Delta Lake Community Office Hours, 25 Jan 2022

From YouTube: Presto / Delta Lake Community Office Hours

Description

Join us for the next Presto DB / Delta Lake Community Office Hours and ask us your #PrestoDB and #DeltaLake questions. Thanks!

A

Okay, welcome everybody. We're currently live, I believe, on YouTube and on LinkedIn, so we're going to check the feeds to make sure everything is working as expected. Give us a couple of minutes to get organized and we will start rocking.

A

Give me a second here.

A

All right. I've verified that LinkedIn is working, so I'm just checking on YouTube right now, to see whether we've got the YouTube video going or not. Interesting.

A

All right, give me a couple more minutes to try to get YouTube figured out. Meanwhile, for those of you that are currently on LinkedIn, glad to have you here, and we'll be starting shortly. Remember, this is the Presto / Delta Lake Community Office Hours, so we will get ourselves running in a couple of minutes. If you have questions about the PrestoDB Delta Lake connector, by all means get your questions ready, drop them onto LinkedIn or YouTube, and we will start the process shortly.

A

Hey Ritika, how's it going? We are currently live, by the way. Just to let you know, I'm trying to set everything up because it looks like the connection to our YouTube isn't quite working right now, so I'm trying to figure out what is going on there. As soon as I get that fixed, we can start rocking. So give me a couple more minutes to figure this out and then we'll be on our merry way.

A

Okay, it looks like we are live on YouTube, but for whatever reason it shifted to a different YouTube link. So my apologies for that.

A

Sorry, there we go; that was the stream repeating. Give me one more minute to inform the folks on the other channel that we've shifted over to this link, and then we will be on our way. We will start with introductions and things of that nature.

A

Okay, perfect. Well, let's get this show on the road. Let's start with quick introductions and then we'll go from there. So Ritika, why don't we start with you? I'd like you to introduce yourself to the audience here.

B

Sure, thanks Denny, thanks for having me here. Hi guys, I'm Ritika. I'm working as a software engineer at Ahana, and I've been working on Presto for around three years. We are currently working on some of the security integration there, and I've now also started working on this Presto Delta Lake integration. I have been trying out a few of the cool features and have been connecting with Venky about it. Thanks, Denny and Venky, for this office hours.

B

I'm very much looking forward to this interaction with all of you.

A

Perfect, thanks very much, Ritika. And then Venky, do you want to do a quick introduction for yourself, please?

C

Sure. Hi everyone, my name is Venky and I'm a software engineer at Databricks on the Delta team. Currently I'm working on the PrestoDB Delta connector as well as the Trino Delta connector. Before Databricks, I worked at Uber on the team that provided PrestoDB as a service to internal customers. I'm very much looking forward to answering any questions you have about the Delta connector.

A

Awesome, thanks very much. Thank you.

A

Excuse me, sorry about that. Hopefully I was able to mute my cough. This is the issue when you're trying to do live sessions. So anyway, hi everybody. My name is Denny Lee. This is the first of what we're hoping will be multiple Presto Delta Lake connector sessions. We're going to dive into some of the background and details around it and then allow folks to ask any questions.

A

If there are no questions, not a big deal; we'll just wrap up in 15 minutes. By the same token, if you do have questions, this is exactly where you want to ask. Oh, and I forgot to introduce myself. My name is Denny Lee. I'm a developer advocate here at Databricks, a long-time Brickster and a long-time Spark and Delta Lake guy.

A

So hopefully I can answer some questions too. And we have a great question from Naveen on LinkedIn right from the get-go, which is: what's the relation between PrestoDB and the Delta connector? Venky, since you did a sizeable chunk of the work, why don't we start with that.

A

What is the relation between PrestoDB and the Delta connector?

C

Yeah, so PrestoDB is an MPP engine where you can write connectors to query data in different sources. The sources can be as diverse as MongoDB, Elasticsearch, or even files and tables that are registered in Hive but whose data sits in AWS, Azure, or any of those stores. Delta Lake is its own table format: you basically have a set of files, and on top of that you have metadata for transactional support.

C

So the relationship between PrestoDB and Delta is that this is a new connector where, from Presto, you can query tables that are in the Delta Lake format. Earlier, when we tried to query tables in Delta Lake, you couldn't query them directly, because there were no APIs that could directly read the Delta Lake metadata.

C

Recently the Delta Lake community added those APIs, and we have written a new connector in PrestoDB that uses them to read the Delta Lake metadata and efficiently query tables in the Delta Lake format. So at a high level, we are adding a new connector that can query data in the Delta Lake format.

A

Excellent, I think you did a great job, Venky. Let's switch to the next question. It's from Denis, also on LinkedIn. He's asking: would you expect the connector to be available in PrestoDB version 0.269?

A

So Ritika, I'm going to ask you the question first. Right now we've been working on PrestoDB, and Venky, correct me if I'm wrong, 0.266, if I recall correctly, and PrestoDB is now on 0.269. Are there any major differences between PrestoDB 0.269 and the one the connector is currently working with, which is 0.266?

B

So, in terms of the Delta Lake connector, I've been trying it out with 0.269. When I started looking into the connector, there were a few initial issues that any new user could face if they're using the connector and Presto for the first time.

B

But I feel like 0.269 should have all the fixes. If I'm not wrong, there was one small bug fix that went into 0.269. I have used 0.269 and it works fine for me, and I have tried it out with multiple tables and multiple combinations. So yeah, I'm definitely looking forward to using it more. It works fine with 0.269.

A

Oh, that's great! That's really good to know. Venky, did you want to add anything to it?

C

Regarding the release: currently 0.269 is in development. Hopefully next month we will have a release branch cut, and then by the middle of next month we should have a release that contains the Delta connector.

A

Excellent, thank you very much. And hey, Denis and anybody else who's on LinkedIn or YouTube right now, or who's going to watch it a little bit later: don't forget, we run these every two weeks. If you're interested in participating to help us accelerate some of the schedules, by all means join us on the Delta Users Slack. We have a Presto connector channel there which, for whatever reason, we call connector-presto.

A

Not sure why, but it's the connector-presto Slack channel. You can talk to us right there from the get-go, join us, and contribute. We're always looking for contributors, just because it's a pretty exciting thing. Okay, so I do have another question from Shobham. We're going to answer that question in a second, but I did want to bring up an important aspect of this.

A

This is going to go back to you, Venky, and I can probably chime in a little bit too, but I did want to call out the importance of this PrestoDB connector. We had some questions about PrestoDB, and a couple of questions at least, I'm assuming, on Delta Lake itself.

A

One thing I want to provide some context on: don't forget, before this connector, going from PrestoDB to Delta required us to use a manifest file. The manifest was a file that listed all the files that made up a given version of the Delta Lake table, which Presto itself would then be able to read. But this is not what we're talking about.

A

We are not talking about this because the PrestoDB connector is a much-improved version of that. So now I'm going to segue right to you, Venky. What did you do, what did we do as a community, to make the PrestoDB Delta connector significantly better than just using the manifest file?

C

Yeah, sure. I'll first go through the main disadvantages of the manifest-based approach. The main one is that it is hard to manage the manifest. Whenever you update the Delta table, you have to run a separate command to generate the manifest, and when you have a partitioned table it gets more complicated, because you need to have one manifest per partition. Those are the things that complicate it from a management point of view.

C

Another thing is from the performance and scalability point of view. You are basically loading, all at once, the entire file that lists all the files belonging to a particular version of the table. So if you have several million files, you could run into out-of-memory errors. These are the issues that make the manifest-based approach impractical in production scenarios. So here is what we did.

C

First, the Delta Lake community introduced a standalone reader library. It provides APIs so that other engines, such as Presto, Flink, or any other engine that wants to read the Delta Lake metadata, can use those APIs to read the metadata.

C

It's not just about reading the metadata but reading it efficiently. Suppose you are reading a particular version of a table that has a lot of files: you don't want to load all those files at once. You basically iterate over them one batch at a time, so that you don't run into out-of-memory situations.

C

This is especially useful for Presto, which has a pipelined execution model where you generate a set of splits. That means you read the file status for a set of files, and then the splits are sent to the workers for execution.

C

While the execution is happening, you ask the connector to generate the next batch of splits. This works very well with the iterator-based model that the Delta Standalone reader provides. We make use of that, and we don't get into any out-of-memory situations or scalability issues. So that's one.

C

Another thing is that with the manifest-based approach you can't go back in time. Suppose you want to read a particular version from two days back, or some 10 versions before the current version: with the manifest you can't do that. But with the PrestoDB connector you can add a suffix to the table name that says which version you want to read, or which version as of a particular timestamp, and query the data that way.

C

So basically you can do time travel queries efficiently from Presto. Presto is an interactive query engine, and you want to have the ability to go back to a particular version and read the data as it was, for ad hoc analysis. So this is very useful. Also, by having a separate connector, in the future we could add more features like DML: insert, update, and merge. We don't have those features yet, but we have a plan to add them in the future.

C

So having a separate connector makes it an efficient reader today and lets us add write support in the future. Those are the advantages of the new PrestoDB Delta connector.
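
The table-name suffix for time travel described above can be sketched as a small helper that builds the quoted table name for a query. This is a minimal sketch, not the connector's API: the `@v<version>` and `@t<timestamp>` suffix forms, and the `delta`, `sales`, and `orders` names, are assumptions for illustration, and the connector docs in the PrestoDB repo are the authority on the real syntax.

```python
from typing import Optional


def time_travel_table(table: str,
                      version: Optional[int] = None,
                      timestamp: Optional[str] = None) -> str:
    """Return a double-quoted table name with an optional snapshot suffix.

    ASSUMPTION: the '@v<version>' and '@t<timestamp>' suffix forms are
    illustrative; check the connector documentation for the exact syntax.
    """
    if version is not None and timestamp is not None:
        raise ValueError("pass either version or timestamp, not both")
    if version is not None:
        return f'"{table}@v{version}"'
    if timestamp is not None:
        return f'"{table}@t{timestamp}"'
    return f'"{table}"'  # no suffix: read the latest version


def select_all(catalog: str, schema: str, quoted_table: str) -> str:
    """Build a simple SELECT against catalog.schema."table"."""
    return f"SELECT * FROM {catalog}.{schema}.{quoted_table}"


# Hypothetical catalog, schema, and table names.
latest_sql = select_all("delta", "sales", time_travel_table("orders"))
by_version_sql = select_all("delta", "sales",
                            time_travel_table("orders", version=3))
by_time_sql = select_all("delta", "sales",
                         time_travel_table("orders",
                                           timestamp="2022-01-23 00:00:00"))
```

The quoting matters because the suffix contains characters that are not valid in a bare SQL identifier, so the whole name, suffix included, has to be passed as one quoted identifier.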

A

Thank you very much. That's a lot of context, so I just want to summarize some of those key points: basically, super-fast performance because of the Delta Standalone reader. In fact, during the development of the PrestoDB connector, an iterator was added to the Delta Standalone reader to improve memory usage and performance.
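
The iterator-based model mentioned here can be illustrated with a small sketch that pulls file entries lazily and hands them out in fixed-size batches, so only one batch is in memory at a time. The `list_files` generator is a hypothetical stand-in for a metadata reader, not the Delta Standalone API, and the file names are made up.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def list_files(n: int) -> Iterator[str]:
    """Hypothetical stand-in for a metadata reader: yields paths lazily."""
    for i in range(n):
        yield f"part-{i:05d}.parquet"


def split_batches(files: Iterable[str], batch_size: int) -> Iterator[List[str]]:
    """Yield lists of at most batch_size files without materializing them all."""
    it = iter(files)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch


# Each batch could be handed to workers while the next one is still unread.
batch_sizes = [len(batch) for batch in split_batches(list_files(10), 4)]
```

Contrast this with the manifest approach, where the equivalent of `list(list_files(n))` has to be built before any work can start.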

A

Okay, and so these are some of the things that we're doing as a community to improve things. And I'm sure Ritika will jump in when I say that, as a team and as a community, we're working together to add other functionality as well. Right now it's a reader, but the idea is that over time we'll be adding writes: inserts, CTAS operations, deletes, updates, and so forth.

A

Okay, so again, we would really love you all to join us. On delta.io, at the bottom of the web page, there is a link to the Slack; join us there in the connector-presto channel. I'm going to add it to both the LinkedIn and YouTube channels shortly, but I just wanted to call that out again as a way to participate.

A

Now, having said this, there are a bunch of questions here, so I'm going to try to go through them. I'll tackle only some of them, because some are a little further afield and I want to keep this focused on PrestoDB, but I did want to at least answer some of the questions we have. For starters, Shobham has asked: does schema evolution work with an Athena view, and do we have to write another DDL to choose the new column?

A

I think you already have the answer from that standpoint, though. I'm sure you have more questions, but I did want to call out that we are working with the Athena community to see where this is going to go. The Athena community is very excited about the PrestoDB Delta Lake code base that we've started off with, so hopefully in the next month or so we're going to have some more updates for you.

A

To provide a little context, we're going to be publishing, either next week or the week after, the Delta Lake 2022 H1 roadmap, or the proposed Delta Lake 2022 H1 roadmap, excuse me. It will include call-outs for some of the things that we as a community, together with the Athena community, are thinking about for handling that.

A

So I just wanted to call that out for some context. There's another question from Vaibhav; I believe it was: do we need to go ahead and pay Starburst? Now, we're not going to do any compete scenarios here, so I did want to call that out; that's not the purpose of these sessions. By the same token, we did want to call out that we are in conversations with the Trino community as well to see what makes sense. But that's the most I'm going to get into it.

A

I want to have these as separate conversations. This session is very much about PrestoDB and we're going to keep it all PrestoDB. Okay, so now we've got a great question from a buddy of ours, Franco Patano: does the Presto reader take advantage of statistics in the transaction log if the Delta table was written by Databricks Delta as opposed to OSS Delta? So Venky, I'm sure you have part of the answer and I have the other part.

A

So why don't you tackle the first part and I'll tackle the second part.

C

Yeah, that's a very good question. We currently don't support file pruning based on the file stats, but we do have partition pruning and also data skipping at the Parquet reader level. Basically, the filter that you have in the query gets passed all the way to the Parquet reader on the worker, which looks at the footer to see the min and max for the column and then does the pruning.

C

But if you support file pruning at the planning level, the task that reads the file doesn't even need to be shipped to the executor, and the executor doesn't even need to read the footer. So we have some sort of data skipping, but not the efficient data skipping that can make use of the statistics generated by Databricks Delta Lake.

C

But we have a plan to add that in the future in the PrestoDB Delta connector.
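
The min/max pruning idea in this answer can be shown with a short sketch: given per-file column statistics, a filter such as `value > threshold` can rule out any file whose maximum is at or below the threshold before a read task is created for it. The `FileStats` type, the file names, and the numbers are hypothetical; this illustrates the general technique, not the connector's planner.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class FileStats:
    """Hypothetical per-file statistics for a single integer column."""
    path: str
    min_value: int
    max_value: int


def prune_greater_than(files: List[FileStats], threshold: int) -> List[str]:
    """Keep only files that could contain rows with value > threshold."""
    return [f.path for f in files if f.max_value > threshold]


files = [
    FileStats("a.parquet", 0, 99),
    FileStats("b.parquet", 100, 199),
    FileStats("c.parquet", 200, 299),
]

# a.parquet is skipped: its max (99) cannot satisfy value > 150.
kept = prune_greater_than(files, 150)
```

Done at planning time from statistics in the transaction log, this avoids scheduling a task for `a.parquet` at all; done at the Parquet reader level, the task still runs and reads the footer before it can skip the data.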

A

Perfect, thank you, Venky. The one thing I want to add: as I called out, we will be publishing the proposed 2022 H1 roadmap, and the reason I'm calling that out is that some of the statistics features for Delta that you previously saw in Databricks Delta are going to be showing up in OSS Delta. There was a massive ask when we published our proposed 2021 H2 roadmap. I can't say this quickly enough, so my apologies: last half-year's proposed roadmap.

A

The community made it very clear that they wanted features like OPTIMIZE, column stats, and things of that nature to be placed into OSS Delta. We've listened, we've heard, and we've also worked with various members of the community to build this up. So again, keep on providing your feedback, because that's exactly how we as a community are going to get more of these features out. And so, back to PrestoDB and stats specifically.

A

Exactly to your point: number one, this PrestoDB connector was an ask by the community, and that's why the community went ahead and built it. Number two, to your point, Franco: statistics and things of that nature, not yet, but we are pushing them all out into open source. We will show you the proposed timeline next week or the week after, and subsequently, of course, the PrestoDB connector itself will be able to take advantage of that. Okay, so hopefully that answers your question.

A

Let me jump to the next set of questions. Let's see here. Okay, no, actually, I think we answered Franco's questions as well. All right, perfect. And we did answer the question about whether it will work with Athena, so I think that covers that part, and I just chimed in about joining us on the Delta Lake Users Slack. So I think that covers all of our questions today.

A

So if you have any other questions, please chime in; we've only got about five minutes left anyway. By the same token, if we're done with today's questions, not a big deal; we'll shut it down for today, and in two weeks' time we'll be back at it again. And again, join us on the Delta Users Slack in the connector-presto channel. That's where we're all going to be, and you can continue asking your questions there.

A

Okay, from what I'm seeing here, it looks like we're good to go on questions. Any final words, Ritika or Venky, that you'd like to add before we go?

C

Yeah, so the PrestoDB PR has landed, and the docs are currently within the PrestoDB repo. I think it would be good to post them, and if anybody is interested, maybe they can try it out. That's it from me.

A

Thanks, Venky. Perfect. Anything else you'd like to add?

B

Yeah, I would just say that, as Venky mentioned, the PR is already out, and I have tried it. As a user, I would say the instructions are pretty clear, so whoever wants to try it out, go ahead and try it out. In terms of features, Venky has already explained them in detail, and those features are pretty much all working well. I definitely look forward to contributing more and enriching the connector in the future.

A

Awesome, thank you very much, Ritika. And then, like I said, I'm going to be a broken record. I'm not sure if any of you know that reference, because maybe I'm too old now to be using those references, but join us on the Delta Users Slack, in the connector-presto channel. If you want to chime in, ask questions, help us with documentation, help us with testing, or contribute in whatever other way, we'd love to have a chat with you. So that's it for today. We'll see you in a couple of weeks, and hopefully we'll see you on the Delta Users Slack.

A

Thanks very much, everybody. Bye.

C

Thank you, everyone.