From YouTube: Delta Lake Community Office Hours (2022-02-03)
Description
Join us for the next Delta Lake Community Office Hours and ask us your #DeltaLake questions. Thanks!
B: All right, hi everybody. We are joining our second office hours for this year. We have Scott, Ryan, Denny, and myself on the panel. So why don't you introduce yourself, Scott? You want to go ahead?
B: Awesome, thank you, Scott and Ryan. Denny, why don't you introduce yourself?
A: Oh, thanks very much, Vinnie. Hi everybody, my name is Denny Lee. I'm a long-time Brickster, long-time Spark and Delta Lake guy, just here to answer some questions. I did want to call out right away that you probably have some questions on the 2022 H1 roadmap that was published in a blaze of glory last night, so we're going to post it in there. But Vinnie, why don't you go ahead and start the show, and meanwhile I'll go ahead and post the roadmap to both LinkedIn and YouTube concurrently.
B: Awesome, thank you for the introductions, all the panelists. We have attendees from Virginia, Bangalore, Seattle, Argentina (oh my god, nice), England, Bristol, Atlanta. Wow, so many folks from around the globe; so excited to have you here. So, as Denny mentioned, we just released the roadmap yesterday night. If you have any questions about the roadmap, please ask away, and I will be monitoring the YouTube as well as LinkedIn channels, so please post your comments over there.
B: So let's begin with our first question. Scott, do you want to provide an update on the Flink Delta connector? I think that's really hot in the market right now.
C: Yeah, for sure, Vinnie, I'd be happy to. So for those that don't know, just this past year we finally completed both the read and the write functionality of our Delta Standalone library. What this library does is let any connector, in a single JVM and without using Spark, actually integrate with Delta Lake, which is great because it's really helping to add a lot more connectors to our ecosystem.
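Conceptually, what Delta Standalone (and any non-Spark connector) has to do is replay the table's JSON transaction log to work out which Parquet files are currently live. The real library is Java/Scala; here is a minimal, self-contained Python sketch of just that log-replay idea, with hypothetical file names:

```python
# Hypothetical commit entries, in the spirit of Delta's _delta_log JSON files:
# each commit is a list of actions; "add" makes a file live, "remove" retires it.
commits = [
    [{"add": {"path": "part-000.parquet"}}, {"add": {"path": "part-001.parquet"}}],
    [{"remove": {"path": "part-000.parquet"}}, {"add": {"path": "part-002.parquet"}}],
]

def live_files(commits):
    """Replay commits in order to compute the current snapshot's file list."""
    files = set()
    for actions in commits:
        for action in actions:
            if "add" in action:
                files.add(action["add"]["path"])
            elif "remove" in action:
                files.discard(action["remove"]["path"])
    return sorted(files)

print(live_files(commits))  # ['part-001.parquet', 'part-002.parquet']
```

Any engine that can do this replay (plus write new commit files atomically) can read or write a Delta table, which is why one small library can power so many connectors.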
C: So one of those connectors that we're working on right now is the Flink sink; we're also working on a source as well. The Flink sink is coming along: we're just finalizing the public API, making the API as distinct and useful to people as it could possibly be, and we're finishing up some Javadocs and some examples. So we hope to be releasing that soon.
B: So a follow-up question, Scott. Does it mean that if I have Flink as my system, would I be able to read from Delta Lake, or would I also be able to write to the Delta Lake?
C: So there's both the sink, which is writing, and the source, which is reading. From our feedback from the community, the sink was a higher priority, so that's what we're doing first: we're letting Flink connectors and Flink engines write into Delta. And then soon enough, as we come along with the source, you'll also be able to read from Delta Lake as well.
B: That's awesome. Yeah, so for all the streaming folks out there, if you have, like, lots of streams going on, you know, if you're connected to a stream, this is exciting news. Now you can speed up your queries and make sense of your data using Delta Lake. Exciting, awesome. And then, Denny?
A: No problem. Actually, you know what, Vinnie, why don't you go ahead and figure out a question; I did want to add one little tidbit on behalf of Scott. Scott did forget to call out that his Delta Standalone project is actually so crucial that, in fact, it's not just the Flink connector that builds on it.
A: It actually is the basis for the Hive connector, right, that was recently released, and also the PrestoDB reader, which we announced, I think, sometime in December. So that's already been merged in, and if you're specifically looking for it, it is part of the 269 snapshot of PrestoDB. I don't know why I remember the number, but it is what it is. But nevertheless, it is actually crucial.
A: Also, we're working with the Pulsar team and some other teams using the exact same thing for integrations, so yeah, just wanted to let you know. We actually had a blog post, which I think all of us here wrote, that basically covered what we deemed the ubiquity of Delta Standalone; that actually calls this out. So I just wanted to add that little bit of a call-out.
B: Thank you, Denny, for adding that. So one question for you would be: as you were working on, you know, publishing this roadmap from the community, what are some of the highlights from the roadmap that the community can look forward to?
A: Sure. Ryan, I'm actually going to target you first, but I'll do it this way: why don't we talk a little bit about the optimize features, just because I know you're heavily involved with those. So, basically the performance side: specifically OPTIMIZE, OPTIMIZE ZORDER, and data skipping. Let's start with that with you, Ryan.
D: Yes. So first we will try to open source the OPTIMIZE command, which, for the first motion, will support file compaction: you can run the OPTIMIZE command to compact small files. And in the next step we will support Z-order, for example OPTIMIZE with ZORDER BY. Z-order basically tries to sort your table in a clever way, so we can do, like, much better data skipping.
D: Basically, we are also going to open source the file statistics for Delta writes: we generate the stats for everything you write to the Delta table, and then we will leverage these file stats to do, for example, data skipping. There will be a lot of performance improvements here, and we are pretty excited to see how fast queries can become after all these features are released.
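To make the data-skipping idea Ryan describes concrete, here is a small sketch with made-up file names and numbers (not Delta's actual stats format): each file carries a min/max for a column, and a reader can prune files whose range cannot match the predicate, without ever opening them.

```python
# Hypothetical per-file statistics, like those Delta records for each write.
file_stats = {
    "part-000.parquet": {"min": 0,   "max": 99},
    "part-001.parquet": {"min": 100, "max": 199},
    "part-002.parquet": {"min": 200, "max": 299},
}

def files_to_scan(stats, lo, hi):
    """Keep only files whose [min, max] range can overlap values in [lo, hi]."""
    return sorted(path for path, s in stats.items()
                  if s["max"] >= lo and s["min"] <= hi)

# A query like WHERE id BETWEEN 120 AND 180 only needs one of the three files.
print(files_to_scan(file_stats, 120, 180))  # ['part-001.parquet']
```

The speedup comes entirely from reading less: the more tightly the data is clustered (which is what OPTIMIZE ZORDER helps with), the narrower each file's min/max range, and the more files can be skipped.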
B: Yep, perfect, thanks, Ryan. So there was a Z-order component in the Delta engine which was available in the Databricks Runtime. So if you were already using that co-locating of data through Z-order, that's now going to be available in Delta Lake. Any other call-outs, Denny?
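The "co-locating of data" mentioned above is often implemented with the classic bit-interleaving (Morton / Z-order) trick: sorting rows by a key whose bits alternate between two columns keeps rows that are close in both columns close on disk, which is what makes multi-column data skipping effective. A toy sketch of that idea, not Delta's actual implementation:

```python
def z_value(x, y, bits=8):
    """Interleave the bits of x and y into a single Z-order (Morton) key."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)       # even bit positions come from x
        z |= ((y >> i) & 1) << (2 * i + 1)   # odd bit positions come from y
    return z

points = [(0, 0), (1, 1), (0, 7), (7, 0), (1, 0), (0, 1)]
# Sorting by the interleaved key clusters points that are near in BOTH
# dimensions, unlike a plain sort on x alone or y alone.
print(sorted(points, key=lambda p: z_value(*p)))
```

A lexicographic sort on one column gives perfect skipping on that column and none on the other; the interleaved key trades a little of each for useful skipping on both.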
A: Oh, sure. So basically, if you look at the link (hopefully you all have it; if you need me to send the link again to LinkedIn, sometimes it doesn't come through, just let me know and I'll post it back out there), the roadmap is drawn from the Google Group messages, the community AMAs we've done in the past, and the survey. We had this huge survey last year, where we sent out 700 t-shirts, and you all should have them now (and we will fix that, by the way, so they'll get out faster). We're pretty sure this is what you all want. But it is a discussion, so by all means please chime in on that GitHub repo and add comments on things that we may have missed.
D: Yeah, I can talk about this. Basically, we are adding support for, for example, renaming a column in your Delta table to a different name, or dropping a column: say a column holds sensitive data and you don't need this column anymore, then you can just drop this column. This is, I think, a long-standing ask in the community, because it is pretty hard.
B: Perfect. Greg says Databricks is going to be a long project for him; that's awesome, Greg, we would love to have your feedback on the roadmap as well, as it's been a long-term project for you. Great. There is a question about Kafka Delta Ingest, and the question is: same question for a Confluent Kafka source and sink from and into Delta Lake.
A: No problem. So actually we recently, I think late last year, did a tech talk specifically on Kafka Delta Ingest; I'll try to see if I can find it and post it directly inside LinkedIn here, since the question came from there. So the question basically is: can I go ahead and write from Kafka directly to a Delta Lake? And so, yes, you absolutely can.
A: In the past, a lot of people would often use Kafka through Spark Structured Streaming into Delta Lake, and yes, that's a normal tactic too. But the real ask is like: no, no, I just want to do it natively. And so some of our Delta committers from Scribd and Back Market created a project called kafka-delta-ingest. It's actually built on top of the Delta Rust API, so it's very, very memory-optimized. And so, long story short...
A
It
actually
goes
ahead
and
allows
you
to
take
your
coffee
topics
and
write
directly
to
the
to
delta
lake
following
the
delta
protocol,
and
so
it's
built
on
top
of
the.
As
I
noted
it's
built
on
the
delta
rest
api,
but
what's
interesting
about
the
delta
rest,
api
itself
can
read
it
can't
write,
but
the
kafka
delta
ingest
can
go
ahead
and
write.
A
So
one
of
the
call
outs
that
we
did
in
the
the
the
road
map
actually
is
the
delta
rust
folks
are
working
on
taking
the
high
level
apis
that
are
in
the
kafka,
delta,
ingest
and
merging
them
back
into
delta,
rust
itself
such
that
delta
rust.
You
can
go
ahead
and
read
and
write.
Why
is
that
important?
Because
in
addition
to
the
core
rust
api
being
able
to
do
that,
so
you
now
you
can
use
rust
to
go
ahead
and
read
and
write,
don't
forget
the
rest.
A
Api
also
has
python
bindings
and
also
ruby
bindings.
So
subsequently
that
means
you
can
go
ahead
and
write
to
delta,
using
through
the
rust
api
use
it
with
python
and
ruby.
Subsequently.
Now,
of
course,
there's
still
a
little
bit
work
to
do.
You
have
to
update
the
bindings.
We
will
always
invite
you
all
to
go
ahead
and
join
us
in
the
community
to
talk
to
it.
A
We
actually
have
delta
rust
meetings
every
two
weeks,
so
if
you
want
to
go
ahead
and
chat
with
us,
then
by
the
same
token,
there's
the
delta
rush
channel
just
join
us
there
too.
So
hopefully
that
answers
that
question
I
I
did
want
to
call
out
actually-
and
this
is
the
scott
now
also
on
the
the
roadmap.
We
had
called
out
something
super
important
in
this
case.
It's
the
s3
multi-cluster
rights,
so
scott
you've
been
super
involved
with
that
project
working
with
the
community
as
well.
C: Yeah, for sure, Denny, thanks. So late last year we had an open source contributor make a PR to add S3 multi-cluster write functionality to Delta Lake. What this is essentially solving is the fact that S3 doesn't give us one of the critical things that Delta Lake needs when performing writes, which is mutual exclusion.
C: The way that this open source contributor solved that is by using another external store, DynamoDB, to give us that mutual exclusion. So they created the PR, and I was the engineer on the ecosystem team who was assigned to review it and tweak the design a little bit, and so now we're fully working with them.
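The core trick Scott describes is a conditional "put if absent" on an external store: whichever writer registers commit file N first wins, and every other writer for that version must retry with N+1. A minimal in-memory sketch of that mutual-exclusion idea (in the real connector, DynamoDB's conditional writes play the role of this toy store; names here are illustrative only):

```python
class CommitStore:
    """Stand-in for an external store (e.g. DynamoDB) with atomic put-if-absent."""
    def __init__(self):
        self._entries = {}

    def put_if_absent(self, key, value):
        # In DynamoDB this would be a conditional write; here, a plain dict
        # check suffices because this sketch is single-threaded.
        if key in self._entries:
            return False  # someone else already committed this version
        self._entries[key] = value
        return True

def commit(store, version, writer):
    """Try to claim commit file `version`; the caller retries on False."""
    return store.put_if_absent(f"{version:020d}.json", writer)

store = CommitStore()
print(commit(store, 3, "writer-A"))  # True: writer-A wins version 3
print(commit(store, 3, "writer-B"))  # False: writer-B must rebase and retry
print(commit(store, 4, "writer-B"))  # True: retry at version 4 succeeds
```

This is needed on S3 specifically because S3 has no atomic "create only if not exists" rename; other object stores that do provide it don't need the external table.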
A: Apologies, I'm actually looking at the YouTube link, and since it involves Scott, I'm just going to stick with him and then send it back to you, Vinnie. There's a great question from Oliver: any visibility for the Standalone connector (the one, you know, your project) for Scala 2.13, basically Spark 3.x? Just any visibility on that.
C
So
2.13,
I
believe,
was
actually
added
recently
and
the
next
connectors
release
like
that
will
be
available
and
in
terms
of
spark
integration,
you
don't
need
to
worry
about
spark
integration,
because
the
standalone
doesn't
use
spark
so
no
dependency
issues
there.
So
yeah
the
next
connectors
you're,
not
4.0
release
you
can
you
can
expect
scala
213.,
I'm
great
that
someone
asked
about
that.
Actually,
because
we
were
just
talking
about
that
in
one
of
my
meetings.
So
I'm
glad
someone
cares.
B
Awesome
there
are
a
few
questions
about
what's
the
topic,
so
these
are
the
delta
office
hours,
so
delta
lake
is
a
project
for
the
community
who
are
doing
you
know,
data
pipelines,
so
any
you
know
any
challenges
you
face:
building
delta
lake
or
setting
up
your
pipeline
or
maybe
even
like,
connecting
to
other
tools
in
the
big
data
ecosystem.
Please
bring
that
away
and
we
always
host
the
sessions
we
also
host,
like
you
know
some
demos,
as
well
as
tech
talks
on
specific
topics.
B
So
if
you
are
just
getting
started-
and
you
want
to
learn
more
about
the
delta
lake
project,
you
can
definitely
you
know,
tune
in
on
our
youtube
or
tune
in
to
our
slack
channel
google
groups
etc.
B
Hope
that
answers
some
of
your
questions
on
the
comments
and
then
there
was
a
question
about
I'm
currently
needing
to
transform
the
spark
delta
data
frame
to
pandas
data
frame
to
torch
for
modeling.
However,
the
data
size
is
too
big
for
trying
to
call
two
pandas
function.
Is
there
a
way
to
get
around
this?
I
think
the
question
is
around
like
how
you
can
use
the
pandas
frame
for
the
big
data,
so
I
posted
a
link
on
this
last
youtube
channel.
B
Please
check
that
out.
There
is
a
config
release
on
the
pi
pi
page,
but
anybody
from
the
panel
have
you
seen
this
issue
occurring
with
big
data
on
pi
data.
Sorry,
on
python,
api
is
koalas
the
right
option,
danny.
A
It
should
be
qualis
is
the
right
option,
though.
We've
called
it
now.
The
pan
is
api
for
spark,
but
this
is
a
delta
lake,
since,
let's
skip
the
spark
questions
for
now,
if
that's
okay,
we
have
spark
sessions
separately
for
that,
so
let's
definitely
stick
to
delta
lake
yeah.
So
if,
if
there
are
no
other
questions,
one
thing
one
thing
we
can
definitely
do
is
also
talk
about
integrations
if
you'd
like
and
by
the
way
I
did
notice
folks
on
linkedin,
there's
there's
actually
two
small
problems.
A
One
thing
I
just
don't
realize
that
the
linkedin
actually
still
says
databricks
university
alliance,
so
maybe
that
may
have
explained
why
there
there
was
some
confusion.
So,
yes,
this
is
what
we're
currently
talking
about
delta
lake
and
then
shiv
our
one
of
our
old
buddies
from
our
previous
community.
Amaze,
yes,
go
ahead
and
ask
your
question:
please
pop
it
down
linkedin!
Meanwhile,
let
me
go
ahead
and
at
least
provide
a
little
context
on
the
integrations
aspect
of
the
the
delta
roadmap.
A
Okay
and
so
as
as
we
already
noted,
we've
already
talked
about
presto
db
and
we've
already
talked
about
flink.
Okay,
now
the
to
be
being
a
little
bit
more
specific
when
it
comes
to
flink.
What
we're
doing
is
we're
talking
about
it
from
the
standpoint
of
the
streams
api,
which
is
flank
1.12,
1.13,
well,
more
1.12,
anyways,
we're
also
currently
working
in
terms
of
working
on
a
timeline.
A
I'd
say
around
q2
q3,
the
flink
sync
for
the
table
api
as
well,
which
which
is
basically
we're
talking
about
1.14
1.15
versus
apache,
flink,
okay,
so
so
for
those
folks
that
are
interested
inside
the
github
repo,
we
actually
also
link
directly
to
the
flink
delta
connector
channel.
So
you
can
talk
to
us
and
work
with
us
there,
whether
you
want
to
test
you
whether
when
to
contribute
things
of
that
nature.
A
Okay,
there
is
an
apache
data
delta
source
for
apache
pulsar,
the
the
beta
code's
actually
already
up
and
running,
but
they're
updating
it
to
work
with
the
latest
version
of
delta
standalones
for
member
optimization.
So
once
that's
done,
we'll
be
able
to
announce
that
as
well
so,
but
the
code
base
is
actually
already
sitting
in
the
apache
pulsar
repo.
Let's
see
what
else
we
have.
A
Oh
just
as
scott
noted,
we
also
are
talking
about
the
delta
source
for
apache
flink,
okay,
I.e
the
ability
for
apache
fund
to
read
from
delta
right
now.
The
thing
we've
been
talking
about
is
writing
to
apache
flint
for
apache
flink
to
right
to
delta.
Now
we're
going
to
be
talking
about
delta
being
able
to
apache
flink
to
read
from
delta,
so
that
is
the
delta
source
for
apache
flank.
That
is
also
targeted
for
q2q3
time
frame
as
well.
A
Okay,
I
already
talked
about
the
delta
rust
writer,
so
that's
cool
with
that
one,
but
then
the
other
key
thing-
and
actually
this
is
just
literally
an
update-
the
community
we've
specifically
florian.
So
I'm
going
to
call
our
buddy
florian
and
hopefully
I
can
get
him
to
come
to
our
next
community
ama.
A
So
he
can
go
ahead
and
ask
questions
so
because
he's
much
more
knowledgeable
design
than
I
am,
but
we
actually
have
started
working
on
the
idea
of
a
delta
source
for
google
bigquery,
okay,
so
that
basically
allows
bigquery
to
natively
read
delta
lake
tables.
Okay,
so
we
actually
already
created
a
delta
bigquery
connector
channel.
I
think,
but
there'll
be
more
about
that
soon,
but
the
we're
targeting
also
q2q3
but
florian
is
taking
the
lead
on
this
one
he's
one
of
the
delta
committers
who
is
based
out
of
back
market
in
france.
A
B: Great, thanks for the insights on the connectors, Denny. Now, the next question is: any plans for point-in-time join queries to be supported natively by Delta?
D: Maybe I think we need some clarification about what point-in-time join means. If it is a Spark feature, probably it should be asked of Spark, not Delta Lake. I think, basically, Delta doesn't do, like, the SQL planning of this stuff; we basically rely on Spark to build the SQL plan and execute it, and then just do the read from Delta or the write to Delta.
B: Sorry about that. Okay, so I think the next question is: "When I Z-order a table and do a SELECT statement on it using my Z-ordered column in the WHERE clause, it is super fast. But if I do a WHERE value IN and then put in the list of values, it takes the same time as individual queries would take. Why is this so?"
B: Got it. Thank you, Ryan, that's a helpful suggestion. And he also responded that, you know, we can skip it if it's not relevant, and he can reach out to us on Slack. Thank you, awesome. And then there are a few questions about Databricks-specific Delta, so I'm going to post a link in the LinkedIn chat where you can ask and get involved in the Databricks community as well. So let me, you know, take the next question.
A
Yeah,
I
suggest,
being
sickly,
because
we
only
have
two
minutes
left.
Why
don't
we
just
answer
the
last
one
more
question
and
I
just
noticed
the
question
youtube.
So
why
don't
we
just
do
that?
One
which
is
alexandra
yeah
had
basically
said
he
saw
zeorda
on
the
robot,
it'll
be
open
source
and
yes,
the
the
simple
answer
to
that
question
is
yes,
it
will
be
open
source
we're
targeting.
A: OPTIMIZE file compaction is targeted for Q1, and Z-order specifically for Q2. So yes, it is; by all means, please go ahead and chime in directly on the GitHub, or the Delta users Slack for that matter.
A: By the way, there is a 2022 H1 roadmap channel as well, so you can even just chime in there specifically to ask us questions on that. But go ahead and chime in there, because, yeah, it is unequivocally being open sourced; we are working with the community rapidly on this as well. So yeah.
B: Yes, thank you, Denny. And I posted some links in the chat; hopefully you find them useful. I think a lot of your questions were around, you know, some connectors as well as the latest in the roadmap, so please follow us on those links and just have a discussion with us. Another cool thing, you know, as Denny mentioned: I think in Q1 the main highlight is data skipping, and I'm really excited for that feature. So anyway.
C: I'll just say thanks for having us, Vinnie; glad to answer everyone's questions, and see everyone in two weeks.