From YouTube: Delta Lake Community Office Hours
Description
Join us for the next Delta Lake Community Office Hours and ask us your #DeltaLake questions. Thanks!
A
Then let me go ahead and go to YouTube. We'll start that in a second, and then we should be good to go.
A
Okay, cool. I think we're also live on YouTube now, so perfect, all right. I think we're good to go in terms of being live. I'm going to add Tyler as a panelist as well. Hey, Tyler; hopefully you should be able to join us in a second.
A
Okay, I'm not sure what I'm hearing, but that's okay; we're on LinkedIn now too. Perfect, all right. Tyler, you can unmute yourself when you get a chance. So, hi everybody. My name is Denny Lee, and I'm a developer advocate here at Databricks. We've got a fun next 20 minutes or so of Delta Lake Community Office Hours. With me are Scott, Tyler, and TD. I'd like you to introduce yourselves in the order I just said. Scott, want to introduce yourself real quick?
D
Okay, hi, I'm TD, a staff engineer at Databricks. I've been involved with the Spark Structured Streaming and Delta Lake projects since the beginning, and I'm currently also on the Delta ecosystem team working on connectors. Scott is the one who does all the work; I'm just the pretty face, helping people around.
A
No problem. Tyler, can you hear us now, or can you speak now? You were muted before, at least.
A
All right, I'm going to assume Tyler will join at some point. Just in case: that would be Tyler Croy, a data engineering manager over at Scribd.
A
We can't actually hear you, buddy. I don't know why; you're unmuted too, which is the brutal thing. Okay, well, you're back on mute, so whenever you can come back online, that'd be great. We'll try to toss questions over to you if anybody asks them. So, just to get the ball rolling: we did have a question from Vineeth about explaining the new features. Vineeth, I'm going to have to turn it around and ask you: which new features are you referring to?
A
All good, boom, all right. Meanwhile, as your host (because I don't actually know anything at all anyway), I'm going to be watching both LinkedIn and YouTube to see if we have any questions from everybody. It looks like we are live on both YouTube and LinkedIn. So first, like I said before, the first question is from Vineeth.
A
Vineeth, if you can get back to us on which set of features you're referring to, Delta Lake 1.1 or Delta Lake 1.0 or anything else, that would help. That said, one thing I'd definitely love to do is have us go around the horn on the current set of features and what's coming: things like the Delta Standalone writer, the Flink Delta sink, and then, Tyler, anything from yesterday's, sorry, two days ago's Delta Rust session, any updates you want to call out there. So why don't we do those things first? Meanwhile, anybody here on LinkedIn or YouTube with questions, please chime in on exactly what type of features you mean, or just be a little more specific.
A
That way we know how to answer the questions. So, Scott, for you first, I want to ask a little more: what's the current state of the Delta Standalone writer? I saw a whole bunch of pull requests and a lot of code, so can you tell me a bit about what's going on there?
C
Yeah, we're basically at code completion, so today or tomorrow we're going to merge it into the master branch, which is awesome. People can go and see the GitHub issue and the GitHub PR if they want to comment on any of it; the issue also points to the public design doc, if they want to see the overall design of the standalone writer. So it's being merged tomorrow, which is great, really excited. Over the next couple of weeks we'll be doing QA just to validate it and make sure it works with our existing connectors, like Hive, for example.
A
Excellent, excellent. And Scott, one thing I definitely want to call out: this isn't just a standalone writer by itself for the Java and Scala clients, and eventually for the upcoming Hive 3. Don't forget, any system that wants to write to Delta Lake can in fact leverage it. There are things, for example, like what was recently published in AWS Labs: the Athena connector, which actually uses the Delta Standalone reader. And a natural segue from there would be the Flink Delta sink. TD, want to talk a little bit about that?
D
Yeah, this is very exciting stuff. In collaboration with the Ververica folks from Flink, we're building a Flink Delta sink on top of this Delta Standalone writer, so that any Flink pipeline can write out to Delta tables. Exactly the same way you can use Structured Streaming to write to Delta tables, you'll be able to use Flink to write to Delta tables, which expands the ecosystem around Delta Lake a lot: the two major streaming engines out there will both be able to write out to Delta tables. And this is just the beginning. Going forward, over the next quite a few months, we're going to slowly build the Flink Delta source using the Delta Standalone reader, and so on. We're really focused on growing this ecosystem so that every possible engine is able to read from and write to Delta tables.
A
Perfect, thank you very much. And my apologies, there was a bit of a blip a moment ago. Thank you very much. So now, naturally, I'm going to segue over to Tyler. Tyler, can you give us some quick updates from the Delta Rust side of the house, especially with the productionization of kafka-delta-ingest? What are the new plans that the community would love to learn about?
B
Yeah, someone actually asked about this in Slack this morning: kafka-delta-ingest is running in production at Scribd, which is super cool. If you're not familiar with kafka-delta-ingest, it's a Rust daemon that brings data (JSON-formatted data, right now) from Kafka into Delta tables, with extremely low overhead and an extremely low runtime footprint, which is super cool.
B
We had our open development meeting, which was recorded on YouTube earlier this week, and of the things we talked about, probably the most notable is that we're getting ready to do a new release of the Delta Lake Rust crate. We haven't done a release in a few months, because we've been pinning to development versions of the Apache Arrow Rust bindings while working with the Arrow community over the last few months to get improvements for writer support in.
B
Now they've cut their 6.0.0 release to crates.io, which means we can now cut ours. The other thing we talked a little bit about, which is really exciting, though I can't commit to any timelines because they're not mine to commit to: there's finally movement happening around the Azure SDK for Rust, which has never been released to crates.io. Word on the grapevine is that there's some movement within Microsoft to start pushing it along to crates.io, and once that binding gets released, it means we can start to ship a Delta Lake crate with Azure support enabled. Every Delta Lake crate we've released so far has really just supported the AWS file system, and I think we have not yet released the Google Cloud Storage support, but we can release that with this next release of the crate.
B
That will mean that, as a Rust user (no timeline on that part), you'd be able to enable a different compile-time feature flag on the Delta Lake crate in order to get Azure ADLS Gen2 (whatever it's called) support...
B
...plus Google Cloud Storage. So those were a couple of the major things that were talked about. One of the things that's landed in kafka-delta-ingest, which we've been experimenting with recently, is support to start kafka-delta-ingest from a specific offset in Kafka. Really exciting.
B
Let's see if I can butcher this response, because I don't have first-hand knowledge of developing this. Christian did a good segment on it with Denny and me the week before last; October 7th, I remember, is when we talked about it.
B
We
do
the
way
that
we
run
kafka
is
for
that
exactly
one
semantic
and
there's
a
broker
level
configuration
that
you
can
do
there,
but
what
we
do
in
order
to
allow
some
of
the
coordination
between
kafka
delta,
ingest
consumers
that
are
consuming
from
data
and
then
writing
consuming
data
and
writing
into
delta,
we're
actually
using
the
delta
transaction.
D
B
So
that
any
other
writer
can
know
sort
of
who's
at
what
offset
on
what
partition,
without
needing
some
sort
of
third-party
coordination
and
that
that
design
was
actually
created
or
come
up
with
by
qp,
who
is
the
originator
of
the
delta
rs
rest
bindings
he's
another
member
of
my
team
here
at
scribd,
and
it
is
so
cool
that
we
were
able
to
just
like,
like
our
dependencies
for
kafka
delta
ingest,
are
pretty
much
dynamodb
for
locking
to
make
sure
that
we
have
some
concurrent
rights
into
s3,
because
s3
has
problems
with
concurrent
rates,
but
then
outside
of
that
it's
just
kafka,
kafka,
delta,
ingest
and
then
the
delta
table.
D
D
B
Able
to
hijack
or
utilize
the
the
delta
transaction
log
for
a
lot
of
what
we're
doing.
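The offset-in-the-transaction-log idea Tyler describes can be sketched in a few lines of plain Python. This is a toy model, not the actual kafka-delta-ingest code: the in-memory list standing in for the `_delta_log`, the app-id naming, and the file names are all hypothetical. The point it illustrates is that each commit carries a `txn` action recording the writer's Kafka offset, so replaying the log recovers who is at what offset on which partition without any third-party coordinator.

```python
# Toy model: a Delta-style log where each commit records both data files
# ("add" actions) and the committing writer's Kafka offset ("txn" action).

def commit_with_offset(log, partition, offset, files):
    """Append one commit that adds data files and records the offset."""
    log.append([
        {"txn": {"appId": f"kafka-delta-ingest-p{partition}", "version": offset}},
        *({"add": {"path": f}} for f in files),
    ])

def latest_offsets(log):
    """Replay the log and return the highest recorded offset per app id."""
    offsets = {}
    for commit in log:
        for action in commit:
            if "txn" in action:
                txn = action["txn"]
                offsets[txn["appId"]] = max(offsets.get(txn["appId"], -1),
                                            txn["version"])
    return offsets

log = []
commit_with_offset(log, 0, 41, ["part-0.parquet"])
commit_with_offset(log, 1, 7, ["part-1.parquet"])
commit_with_offset(log, 0, 42, ["part-2.parquet"])
print(latest_offsets(log))
# {'kafka-delta-ingest-p0': 42, 'kafka-delta-ingest-p1': 7}
```

In the real project the `txn` action is part of the Delta protocol itself, which is what lets any Delta-aware writer read these offsets back.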
A
All right, definitely. And this is a good call-out: exactly as Tyler said, on October 7th, which is about two weeks ago now, myself, Tyler, and Christian actually went through some of the details. So if you want to learn more about kafka-delta-ingest, that's the best way to do it. Meanwhile, I did want to go ahead and call out one thing.
A
Before I switch back over to LinkedIn: if you're on LinkedIn, or YouTube for that matter, go ahead and keep subscribing to our Delta Lake YouTube channel, or ask questions here on the Databricks LinkedIn; keep chiming in with any of your questions. I've got a couple more questions that showed up, but I did want to call out one other thing which is pretty important: the Presto Delta reader. We're actually working with a bunch of folks in the community on it.
A
Specifically, I'm going to call out Ahana in this case, who are helping us work to get the Presto Delta reader working not just in a functional manner but in a performant manner, all OSS, for the community. So we're going to be pushing out the pull request on that. Actually, the pull request is already out.
A
Sorry, my bad. We're going to be starting many office hours; that's what I was trying to say. Probably in the next two to three weeks (we're just trying to get the schedule nailed down), we're going to be starting community office hours specifically for the Presto Delta connector as well, just like the ones we already have for Delta Rust, and just like the ones we already have for Flink Delta. Now we're also going to do one for Presto, so, woohoo. Okay, and now we're going to switch to questions.
A
One is just a shameless plug for us here at Databricks: James called out that they love the new landing page for Databricks. So yes, thank you very much, that's awesome, though I believe none of us here is responsible for that one. Nevertheless, thank you very much. Okay, so the first question, and I think this will be directed toward you, TD.
A
The question is: what are the most effective ways to merge or update data into a Delta Lake table when we're receiving new data for the table in the raw zone? That's a question from Kenjal, I believe.
D
Yeah, there's a lot there; let me slowly unpack it. Let me interpret your question: basically, you have change data, that is, row-level changes coming into the raw layer, and you want to upsert those row-level changes, or apply them to the Delta table. If my understanding of the question is correct, then the best way to do this is essentially to use Structured Streaming.
D
Structured Streaming can read from arbitrarily different sources, so you can read the change data using Structured Streaming. And Structured Streaming has this sink called foreachBatch; what it does is let you apply standard DataFrame transformations within the streaming query.
D
You apply those on top of each set of changes and then apply the result to the Delta Lake table. Keep in mind that you have to deal with a few things; for example, there might be duplicates. A single micro-batch can have duplicate changes for the same key, in which case you have to deduplicate.
D
Otherwise merge will not work, because merge assumes there's only one match. There are various small complications here and there that you have to think about, like doing the deduping, but once you get all of that set up (Structured Streaming reads the changes, and for each batch, after parsing, you merge into the Delta table), you have an end-to-end change data capture pipeline continuously updating the Delta table.
D
Now, if you want to learn more about it, take a look at the merge operation docs in the delta.io docs, which have a very succinct example of how you can do this merge operation with a very simple dedupe right there. That's a good starting point for you to try out and play around with this idea.
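The dedupe-then-merge recipe TD describes can be illustrated with a small plain-Python model. In a real pipeline this logic would live in a Structured Streaming foreachBatch function calling Delta's MERGE; the record shapes and column names here (`id`, `ts`) are hypothetical. The key step is the same, though: keep only the latest change per key, because MERGE assumes at most one source row matches each target row.

```python
# Plain-Python model of "dedupe the micro-batch, then merge into the table".

def upsert_batch(table, changes, key="id", version="ts"):
    """Apply one micro-batch of row-level changes to `table` (a dict by key)."""
    # A micro-batch may carry several changes for the same key, so first keep
    # only the latest change per key; otherwise the merge would be ambiguous.
    latest = {}
    for change in changes:
        k = change[key]
        if k not in latest or change[version] > latest[k][version]:
            latest[k] = change
    # The merge itself: update matched keys, insert new ones.
    for k, change in latest.items():
        table[k] = change
    return table

table = {1: {"id": 1, "ts": 0, "val": "a"}}
batch = [
    {"id": 1, "ts": 1, "val": "b"},
    {"id": 1, "ts": 2, "val": "c"},  # duplicate key: the later ts must win
    {"id": 2, "ts": 1, "val": "x"},
]
upsert_batch(table, batch)
print(table[1]["val"], table[2]["val"])
# c x
```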
A
Excellent, thanks very much, TD. And one quick call-out, which admittedly is probably another shameless plug: there's a tech talk series called Diving into Delta Lake, where we talk about unpacking the transaction log and enforcing and evolving the schema, and we also happen to have a DML internals session on delete, update and, of course, merge.
A
I'm going to paste that page directly to LinkedIn, because that's where the question came from. The reason it's a shameless plug is that it happens to be TD and myself who did that session; so, just a quick call-out. All right, nevertheless, we're almost done with today's set of questions, so I did want to ask one more before we wrap up, which is a little bit more on Delta Lake 1.1, for TD, or Scott for that matter.
A
Just a quick ask: what do you think is the rough timing for Delta 1.1? We've got some questions about Spark 3.2 support, so that's the reason I ask.
D
Okay, let me talk about the Delta 1.1 stuff, and Scott can then pick up on the Delta Standalone writer and the Flink-related stuff, since those are two different Maven artifacts and therefore have two different release timelines. So, on Delta Lake 1.1: Delta Lake 1.1 is going to be on Spark 3.2. Spark 3.2 was released and announced just a couple of weeks back, and we're in the process of doing the final round of QA and testing after updating to Spark 3.2.
C
Yep. The standalone writer we're hoping to release in November; by the end of November is the goal. The Flink connector we'll see beyond that; it might take a couple of extra weeks of QA just to verify functionality, but it's coming along very well.
D
Yeah, and the release of the standalone writer will also include some improvements to the standalone reader that help improve the memory usage of reading a Delta table, parsing the Delta log and so on. Those will all go out as a single release, which we can take advantage of in the subsequent connectors like Presto, etc.; we're also building the Presto reader and the Flink sink, which is the writer. So a lot of improvements coming along.
D
That is an absolutely fantastic question, and finally somebody's giving me the hard questions; I was getting tired of everybody giving me all the love. Okay. Yes, you have correctly identified that this is a hard challenge we have to deal with right now, since the whole Delta Lake effort started with Spark to begin with.
D
Yes, as we make the standalone reader and writer more robust, we're thinking about how different common pieces can be pulled out as separate common Maven artifacts and shared. One of the things we're thinking of, looking a little bit forward, well into next year, is the storage support, which on the Delta Lake Spark side used to be the LogStore.
D
We're adding essentially similar APIs in the standalone project, but right now they're independent, kind of duplicating the APIs. Going forward, we definitely want to refactor it so that the same LogStore implementation can be commonly used by Delta Spark, as well as by Delta Standalone and any connector using Delta Standalone. Those are some of the refactorings we have in the plan and will slowly keep doing, to minimize code duplication both from our maintainability point of view and from the user's ease-of-configuration point of view. The fewer moving parts and the less duplication there are, the easier it is for the user to configure and use (ideally not configure at all), and the fewer things there are to break.
D
Sorry, yeah, so this answers the Java side of things. For the Rust side of things, I think it's definitely something we should have further conversations on: how do we work towards lowering the maintenance overhead across multiple projects? So yes, really looking forward to those conversations.
A
Scott, go ahead.

C
Yeah, what I was going to add is that there's development compatibility between the different projects we're producing, and then there's actual data-correctness compatibility. I just want to assure people that the data you write with all these different writers, and the data you read with all these different readers, is still valid data; we're still adhering to the Delta protocol.
C
Different readers may have different functionality, and those differences are correctly specified using the Delta protocol, which is metadata in the table. Over time, as we add more features to Delta OSS and to the standalone reader and writer, we'll be bumping the protocol versions they use to maintain read and write compatibility between them.
D
Yeah, the protocol already contains enough information for us to judge whether version X of writer one is compatible with version Y of reader/writer two, and so on. The protocol supports that, and I think that will be the way we guarantee compatibility, or flag incompatibility, whenever we make breaking changes that need version upgrades.
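As a rough sketch of that judgment call: a Delta table's protocol action records `minReaderVersion` and `minWriterVersion` (those field names are from the Delta protocol), and a client compares them against the protocol versions it implements before reading or writing. The client version numbers below are illustrative, not tied to any particular release.

```python
# Minimal model of the protocol compatibility check a Delta client performs.

def can_read(client_reader, protocol):
    """A reader may read the table if it supports the table's reader version."""
    return client_reader >= protocol["minReaderVersion"]

def can_write(client_reader, client_writer, protocol):
    """A writer must also be able to read the table it writes to."""
    return (can_read(client_reader, protocol)
            and client_writer >= protocol["minWriterVersion"])

protocol = {"minReaderVersion": 1, "minWriterVersion": 2}
print(can_read(1, protocol))      # True
print(can_write(1, 1, protocol))  # False: writer version too low
```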
A
All right, perfect. First of all, TD, I want to take particular offense: it's hard work on my part to make sure I give you only the lowball questions. But no, in all seriousness, if you have more questions about the Delta protocol, the one thing I'd definitely like everybody here to do is join us at delta.io, and join us on Slack. We're all there; all of us you see here typically interact with everybody.
A
So if you want to dive into these details, whether it's the Rust API, Delta Spark, the standalone writers, or the protocol itself, we encourage you to join us on Slack. For that matter, we'll be doing these office hours every two weeks anyway. So that's it, everybody. I think we're actually running a little late, so I'm going to shut it down now. If there are any more questions, again, join us on Slack and we'll answer them there.
A
Otherwise, join us for the next Community Office Hours; it'll be in two weeks at 9:00 a.m. All right, thanks very much, everybody. Appreciate your time!