From YouTube: Delta Lake Community Office Hours (2022-04-28)
Description
Join us on April 28, 2022 at 9 AM PST/12 PM EST for the Delta Lake Community Office Hours! Ask your Delta Lake questions live, including but not limited to the Apache Flink, PrestoDB, and TrinoDB connectors. Join Fabian Paul from Ververica, and Scott Sandre, Denny Lee, and Vini from Databricks!
The session will be hosted live and the recordings are available on the Delta Lake YouTube channel.
A
And we are also live on Zoom, so we have three channels today. Please say hi and tell us where you're from. For those of you who are new: these sessions are live and occur every two weeks on Thursdays at 9 AM Pacific time. We bring a panel of contributors and champions of Delta Lake to answer your questions about Delta Lake, open source software, and the things we are doing with other open source ecosystem projects. So welcome, all of you.
A
All right, so without further ado, let's go ahead and get started with introductions. Fabian, do you want to kick it off?
B
Sure, hi everyone. I'm Fabian, I'm currently an active committer and contributor to the Apache Flink project, and I'm currently employed with Ververica. I'm basically here today to talk about the new Flink Delta sink, where you can write your data from a Flink pipeline into your Delta Lake.
A
Awesome, thank you for being here, Fabian. Scott, you're up next.
C
Hi everyone, I'm Scott. I'm a software engineer on the Delta ecosystem team here at Databricks, and I'm here today to talk about a bunch of things, including Delta Lake 1.2.0, a huge new release which includes a lot of features that our team worked on, including data skipping and adding multi-cluster writes to S3, as well as our connector ecosystem, which includes the Flink sink that Fabian's here to talk about as well.
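For readers who want to try the S3 multi-cluster writes Scott mentions, here is a minimal sketch of how the experimental feature is enabled in Delta Lake 1.2. It assumes the DynamoDB-backed LogStore artifact is on the classpath; the bucket name, DynamoDB table, and region are placeholders, so double-check the 1.2 release notes for the exact configuration keys.

```python
# Hedged sketch: Delta Lake 1.2's experimental S3 multi-cluster writes,
# which coordinate concurrent writers through a DynamoDB table.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("delta-s3-multicluster")
    # Route transaction-log writes for s3a:// paths through the
    # DynamoDB-backed LogStore (the mutual exclusion S3 itself lacks).
    .config("spark.delta.logStore.s3a.impl",
            "io.delta.storage.S3DynamoDBLogStore")
    .config("spark.io.delta.storage.S3DynamoDBLogStore.ddb.tableName",
            "delta_log")   # placeholder DynamoDB table name
    .config("spark.io.delta.storage.S3DynamoDBLogStore.ddb.region",
            "us-west-2")   # placeholder region
    .getOrCreate()
)

# Any cluster configured this way can now safely append to the same table.
spark.range(100).write.format("delta").mode("append").save("s3a://my-bucket/events")
```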
A
Awesome, awesome. Denny?
D
That's awesome, I'm excited for this panel. So, a mini recap from last session: we had TD and Ivana, who provided amazing insights on the key features of Delta 1.2, which included column stats generation and data skipping, which is a big boost to read performance. We also discussed the file compaction feature and improvements to restoring a Delta table to earlier snapshots.
A
We also chatted about the Delta connectors for other open source tools, like the Delta Flink sink and updates on the Delta Flink source, and we got a few insights on future releases, like Z-ordering and some of the other features we can still hear about from our panelists today. In case you missed it, the recording is still available; I'll share that in the chat.
A
So this brings us to today's session, where I'm excited to talk about other things: Delta Lake, data processing, streaming, and the Flink connector, with Fabian from Ververica, the original creators of Apache Flink itself. And then we have other contributions from Scott, Denny, and other Delta Lake contributors. So we will discuss all that without further ado.
B
Yeah, maybe let me start with a few quick words about Apache Flink. Apache Flink is very similar to what probably most of you have already seen or heard from Spark: it's a unified batch and streaming engine, but lately we are mostly focusing on our streaming execution, because we see batch mainly as a special case of a streaming application.
B
So the new connector can basically be used in both streaming and batch use cases, but we are mostly focusing on streaming use cases, where you can ingest into your data lake with low latencies and high throughput.
B
So the idea is: currently we are still working on different API levels of the Delta sink, but in the current situation you can use it right away. I think it has been released with one of the Delta releases; probably Scott knows exactly which release it came with. So yeah, I'm looking forward to your feedback from using the project.
A
Awesome. Denny and Scott, from your standpoint, how did the release go? What were the features, or things, that got you excited while working on this project?
C
We want it to work with every compute engine, so the delta connectors repo contains both Delta Standalone, which is our Spark-less writer and reader for all the Delta log metadata, as well as all of our future connectors, which right now includes the Flink sink that was just released; and then in the next release, in the coming months, we also hope to release the Flink source. So a lot of exciting work going on there.
D
I mean, really, I think Fabian and Scott covered most of the call-outs, but if you are interested in testing, or interested in participating and doing some additional PRs, please join us in the Delta Users Slack. I'm probably going to be a broken record on this one, but join us there. There's actually a Flink connector channel specifically to talk about these things as well, and we'd love to get your feedback. This has been a very popular project.
D
Lots of folks have been asking us tons of questions about this from the community in general, so we'd love to have you test it. That's pretty much my little addition.
A
Awesome. So with that note, Denny, there are some questions around: what are the limitations of this connector?
B
Yeah, I can start. I think one of the current limitations of the connector is that in the first version we only support appending to tables. So basically we can create a table and then append to existing ones, but in the future we also plan to support upsert statements, where we can basically update existing tables. That is currently not supported yet.
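For context on what upsert support would mean, this is the kind of statement Delta Lake already handles today through Spark SQL; a minimal sketch where "events" and "updates" are placeholder table names. The plan Fabian describes is to bring equivalent semantics to the Flink connector.

```python
# Upsert into a Delta table with MERGE INTO via Spark SQL; "events" and
# "updates" are placeholder tables assumed to be registered already.
spark.sql("""
    MERGE INTO events AS t
    USING updates AS s
    ON t.id = s.id
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *
""")
```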
A
Got it. And are there any others? And if this is the limitation, what is the plan for making it available, in the next release or so? Can you give us some idea?
B
Yes, so I think the next few releases will mainly concentrate on bringing the Delta sink to the other APIs that Flink supports. One very interesting project that is currently being worked on is bringing it to SQL, so that it becomes way easier for businesses to work with the data.
B
You can just use your usual Flink SQL statements, and you do not have to worry about any internals or any of the connector APIs. And one thing that is becoming more and more important in modern businesses is the use of catalogs, so that you can share the different Delta tables within your organization or between different teams. I think this feature will also go live once the support for Flink SQL is released.
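As a rough illustration of where this is heading, here is a hypothetical PyFlink sketch of registering and appending to a Delta table with plain Flink SQL. The SQL support was still in progress at the time, so the 'delta' connector identifier and the 'table-path' option are assumptions for illustration, not a released API.

```python
# Hypothetical sketch only: Flink SQL support for Delta was unreleased
# at the time, so the connector options below are assumptions.
from pyflink.table import EnvironmentSettings, TableEnvironment

t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

# Register a Delta table so that plain Flink SQL can write to it.
t_env.execute_sql("""
    CREATE TABLE events (
        id BIGINT,
        ts TIMESTAMP(3),
        payload STRING
    ) WITH (
        'connector' = 'delta',
        'table-path' = 's3a://my-bucket/events'
    )
""")

# Append-only insert, matching the sink's current capabilities;
# assumes a table named source_stream was registered elsewhere.
t_env.execute_sql("INSERT INTO events SELECT id, ts, payload FROM source_stream")
```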
B
Yeah, I can start, and then probably Scott can take this over. On the question of Spark versus Flink: I would say there's no clear winner in terms of performance or any of these kinds of metrics. I think it really depends a bit on whether you're working in a larger organization.
B
What kind of expertise is already in your company? It depends on whether there are more people already familiar with Spark or more people already familiar with Flink. And I would probably say that the current Delta Lake support is still better with Spark, which is where it originated.
D
Yeah, I'll just add a little bit. It's exactly as Fabian called out: if you're already familiar with the Flink ecosystem, that makes a ton of sense; if you're already familiar with the Spark ecosystem, that makes a ton of sense. Each has its own strengths. We're not going to do a competition here.
D
What we're really talking about is the fact that both systems require reliability from the underlying storage. The fact is, your cloud object store, no matter which distributed system you're working with, will ultimately have issues when it comes to orphaned files, and that's where a transaction log is going to be super helpful to ensure the reliability of your data. So that's what we're really focusing on today, just like Fabian called out.
D
There are some aspects where, because Delta Lake was initially built with Spark, we have a little bit more code base available for that, admittedly. But as you can tell, Flink is super important and super popular. So we started with the DataStream API for starters, which is around Flink 1.12, and then we're currently working on the Table API; that's what Scott had already called out.
D
That's Flink 1.14 onwards; correct me if I'm wrong with the version numbers, by the way, since I'm trying to do this all from memory. We're also trying to work on the Flink Delta source, so we're getting these features in. And that's actually what I meant by pinging us on the Delta Users Slack, or especially the Flink connector channel: if there are other features that you would like to see faster, and especially if you're interested in making pull requests of your own, please do chime in, because we'd love to, number one, get the help, and number two, get your opinion and feedback on what's working and what should be prioritized.
D
To the person who asked the question: are you by any chance referring to Kedro, the McKinsey open source project?
D
If it's that: Kedro is sort of a project that actually works on the MLflow side. That's not to say it can't work with Delta Lake; it can, actually. But I'm not entirely sure what the question is, so I apologize; if you can clarify it a little bit, I can probably help explain that one. But long story short, with Kedro we actually have a partnership.
D
Well, a community, really, that's working together with Kedro and MLflow and Delta Lake, but it's still very early stages. So I didn't want to promise something yet, because again, very early stages.
A
Yeah, I think that helped, Denny, and you hit on the right product, Kedro, which works with them also. I think Deepish is nodding: yes, this is what I was asking about. Awesome. There's another question on when we should use Delta Lake when we can store data in ADLS Gen2.
D
I can tackle that, and others can chime in afterwards. So in general, ADLS Gen2 works perfectly great with Delta Lake. Whether you're working within the context of, for example, something like Azure Databricks, or within the context of Delta OSS, where you're running your own VMs within the Azure environment and saving directly to ADLS Gen2, it's actually still the same process. Now, ADLS Gen2, as opposed to S3, and this is similar for GCS as well, is different in one respect.
D
They actually have this concept of put-if-absent consistency that S3 does not have. So it does actually have that full transactional consistency, which allows us to have multi-cluster writes: multiple JVMs, multiple drivers writing to the same storage system. In fact, Scott, you can probably chime in afterwards to talk about the S3 multi-cluster writes, by the way. But to answer your question about ADLS Gen2: irrespective of the fact that it does have those consistency guarantees, there is more to it.
D
The thing is, there's still the issue that, again, whether it's Flink or Spark or any other distributed system doing the write, you could still leave orphan files, and a transaction log will protect that data. So because of that, I would almost always run my stuff as a Delta Lake table on ADLS Gen2 to protect the data. A bit long-winded, I realize, but the context is: yes, use Delta Lake even with the put-if-absent consistency that ADLS Gen2 has.
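A minimal sketch of what writing a Delta table straight to ADLS Gen2 looks like from open source Spark, assuming the hadoop-azure and Delta packages are on the classpath; the storage account, container, key, and the df DataFrame are placeholders.

```python
# Hedged sketch: writing a Delta table to ADLS Gen2 from Delta OSS.
# Account, container, key, and df are placeholders.
spark.conf.set(
    "fs.azure.account.key.mystorageaccount.dfs.core.windows.net",
    "<storage-account-access-key>")

(df.write
   .format("delta")
   .mode("append")
   .save("abfss://mycontainer@mystorageaccount.dfs.core.windows.net/events"))
```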
C
Yeah, I would just add on some more pros of using Delta Lake in general, of course, whether it's on ADLS Gen2 or any other cloud storage. It's not just reliability, as Denny said. There are also features: time travel, schema evolution, schema enforcement. There are all these great things that can help you protect your data and query it more efficiently.
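A quick sketch of the time travel Scott mentions, assuming an existing SparkSession with Delta Lake configured; the table path is a placeholder.

```python
# Read an earlier snapshot of a Delta table by version or by timestamp.
df_v0 = (spark.read.format("delta")
         .option("versionAsOf", 0)                # first version of the table
         .load("/tmp/delta/events"))              # placeholder path

df_apr = (spark.read.format("delta")
          .option("timestampAsOf", "2022-04-01")  # snapshot as of this date
          .load("/tmp/delta/events"))
```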
A
That's awesome, Scott. So I think that brings us to the next question. Drew is asking: in the last few sessions it was said that OPTIMIZE for Delta Lake was going to be made open source in Q1; any updates on the roadmap? That's perfect, because we already released it. So, Scott, do you want to tackle that question: what OPTIMIZE does, and expand a little bit on that feature?
C
Yeah, for sure. So OPTIMIZE was released in Delta Lake 1.2, and what OPTIMIZE does is help you compact your files. What this does is help you solve the small file problem: you have these streaming writes, you have all these files coming in, which is great, Delta Lake is able to handle that, and we partition our data, but compacting the files into larger ones helps with reads and helps with the listing problem on various cloud stores. I'm sure Denny can add more benefits of this feature.
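A minimal sketch of the compaction Scott describes, using the OPTIMIZE SQL command that shipped in Delta Lake 1.2; the table path and partition predicate are placeholders.

```python
# Compact small files in a Delta table (Delta Lake 1.2+).
spark.sql("OPTIMIZE delta.`/tmp/delta/events`")

# Optionally limit compaction to recent partitions to bound the work.
spark.sql("OPTIMIZE delta.`/tmp/delta/events` WHERE date >= '2022-04-01'")
```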
A
Great. Denny, do you want to add more?
D
No, actually, I think Scott said it succinctly. The reality is that the blog for Delta 1.2 is coming out, and we're going to have a lot more explanations and a deep dive into it. So if you want to wait till our next community AMA, we'll dive even deeper as well. But the context is: it's a great feature that allows for faster performance, and just like Scott called out, if you have any more questions in the interim, either ping us here right now or in the Delta Users Slack.
A
Awesome, Denny. So questions are flooding in; I'll ask another one: moving from a traditional RDBMS data warehouse to a lakehouse with Delta Lake, what data modeling changes should we make, if any, and what are some considerations we should take into account when doing this migration?
D
Oh, that's a wonderful question. Zachary, I think you're the one who asked the question live, so again, I'll provide the context. By the way, part of the reason why I love it, Zachary, is because I'm actually formerly from the SQL Server team, so I used to build data warehouses; I completely grok where you're coming from. So the question really is: if I'm coming from a traditional relational data warehouse, and hey, I'm going to move to the lakehouse with Delta Lake, that's awesome.
D
What considerations do I have to take into account? Okay, so from a data modeling perspective, whether we're talking about OLAP dimensional models, or third normal form, or things of that nature: in general, that's not necessary. That doesn't mean you can't do it, by the way; I want to be very clear about this. It's not that you can't build a star schema within the context of a lakehouse. It's just no longer necessary to do this.
D
By the same token, if you are going ahead and taking an existing system, I would actually transition it almost exactly as you would have designed it in a relational data warehouse first, purely from the standpoint of making that transition easier. But then, over time, you'll recognize there are various advantages, especially when you're working with Spark or Flink or any of these other systems that are interacting with Delta Lake. For example, things like: do I need the equivalent of identity columns, or UUIDs?
D
Do I need to build surrogate keys? Some of them will be necessary, and in fact we actually have multiple sessions, with myself and Douglas Moore, about how to take data warehousing concepts and do exactly that: surrogate keys, type 2 slowly changing dimensions, things of that nature. I'll find some time to drop those links into the LinkedIn and Zoom chats. But again, these are common processes that you traditionally do with RDBMSes.
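As one small illustration of the surrogate key point, here is a hedged sketch of generating a key while loading a dimension into Delta; source_df and the column names are placeholders, and monotonically_increasing_id yields unique but non-contiguous values.

```python
# One possible way to add a surrogate key to a dimension table in Spark.
from pyspark.sql import functions as F

dim_customer = source_df.withColumn(
    "customer_sk", F.monotonically_increasing_id())  # unique, not sequential

(dim_customer.write
    .format("delta")
    .mode("overwrite")
    .save("/tmp/delta/dim_customer"))  # placeholder path
```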
A
Awesome, thank you very much. There's another question on how we are making integrations with Power BI simple, so that it can be used over Delta Lake.
C
A great question. So we do already have a Power BI connector in the delta connectors repository; feel free to check that out. It was actually contributed by an external contributor, so the author would know more about it than I would. But if you have any questions about it, again, join us on the Slack; feel free to ask about the specific connector, and I'm sure we can help get you some more information.
A
Yeah, that's awesome. And we have also released a lot of other integrations with the open source ecosystem, as well as reporting tools and data processing tools, so definitely check out our website, where we have all this information, and engage with us on Slack or GitHub so that we can see what you are considering building; we will be happy to follow along. Great.
A
So there is another question: you mentioned bringing more performance into Delta Lake, and you also mentioned the transaction log. What are some of the features released in 1.2 that people can find beneficial for processing big data?
C
Yeah, great question. So we've already mentioned OPTIMIZE and talked a little bit about that, and that will give you huge performance improvements, but so will data skipping. Data skipping and file stats collection are pretty much the same solution: this major feature we added lets you collect per-column stats as you're writing to the Delta log, with very minimal write overhead. What this means is that, in addition to partition skipping during reads, we can actually look at the per-column stats to determine which data files to skip. Various benchmarks have concluded different things about the actual read improvement.
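Data skipping needs no extra API; a selective read like the sketch below can prune whole files whose per-column min/max stats rule them out. The path and column names are placeholders.

```python
# Data skipping happens automatically at read time: files whose recorded
# min/max stats cannot match the predicate are skipped entirely.
df = (spark.read.format("delta")
      .load("/tmp/delta/events")                        # placeholder path
      .where("event_date = '2022-04-28' AND user_id = 42"))
df.show()
```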
A
That's great. And then there are some questions about how to join this live channel: I pasted a link in the chat for Slack and Git, so please check that out. I think this remains a very popular question for all our AMAs. Next: any support for Python connectors? Most existing connectors are JVM-based.
D
I can probably chime in a little bit on this one. So, long story short: there actually is a Python binding that's part of the Delta Rust API, so you can check that out. The delta-rs API actually has Python bindings that allow for reads, and there's currently work being done on the delta-rs writer side as well; subsequently, we'd love to have help advancing those Python bindings to support writing too.

D
There's also Ruby support, in addition to, of course, Rust itself, and there's also work being done by the community for Golang as well. By the way, just as an FYI, it's still very much early stage; in other words, we can't put it up on the website yet, but there is actually work being done on that front as well. So just as an FYI.
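A short sketch of the delta-rs Python bindings Denny describes, installed with `pip install deltalake`; the table path is a placeholder, and the writer side was still in progress at the time, so this shows reads only.

```python
# Reading a Delta table with the delta-rs Python bindings (no JVM, no Spark).
from deltalake import DeltaTable

dt = DeltaTable("/tmp/delta/events")           # placeholder path
print(dt.version())                            # current table version
print(dt.files())                              # data files in this snapshot
df = dt.to_pyarrow_table().to_pandas()         # materialize as pandas
```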
A
Awesome. There is a question about migration.
C
Yeah, so one thing I would love is help on getting more Python bindings. For example, we have the JVM-based Delta Standalone library in the delta connectors repository, and what we want this library to be, and what it is right now, is the source of all metadata interactions with the Delta log not involving Spark.
C
This is our Spark-less standalone library, and currently it's JVM-based, but we would love for an open source community member to go and help us, or make a project plan or a design doc, to add Python bindings to it. Because, again, this is the source of all metadata interactions, and we'll be adding more features to it over time for a Spark-less connector, adding Python bindings would have a much larger scope of impact, since our Hive, Flink, etc. connectors are already using it as well.
A
That's an awesome call-out, I think. So thank you, Mayank and Andy, for asking great questions. There's another question from Drew; he's asking: when is the Rust API for writing scheduled for release?
D
Right now it's currently planned for Q2, which is basically, the way I figure it, around June or July, roughly. We actually have regular updates on it, so just join the Delta Users Slack; there's actually a delta-rs channel as well, so you can definitely ping your questions there. But right now, at least, that's where we're currently at.
D
Okay, so it's always "it depends," okay? Part of the reason I'm saying "it depends" is because for some people, some designs, what they have done is they've left the dimension data in their relational store, and so what they're doing is running queries that join the relational data with their lakehouse, with the actual fact data sitting in Delta Lake. That's a perfectly valid system, and especially common when you do a lot of ad hoc queries.
D
So this is a completely normal system, especially when they're using some other system to ensure the dimensional data integrity. In other words, if you have something like type 2 slowly changing dimensions, you will over time need to update them, and this is actually controlled by a third-party system, not necessarily by the same folks who run the data platform where Delta Lake is residing.
D
On the other hand, if you do have full control of the system, it is also just as common to say: forget it, I actually don't need a type 2 slowly changing dimension, because I'm actually just storing the actual dimensional value directly in the Parquet files that make up the Delta Lake table. That's actually a very valid approach too. The concern usually is, when you're looking at your favorite BI tool, like Power BI or Looker or Tableau, that the generation of the actual dimension data then becomes complicated.
D
So then, again, all of a sudden you are inclined to build a star schema, but that schema may be built after the fact. When I talk about the medallion architecture, you're building it as post-gold tables versus actually building them with it. Now, suffice it to say, I gave you an answer, but it isn't a straightforward one, is it? And that's more or less the point.
D
It really depends on your environment and how your pipelines are structured, and on whether it makes sense to actually stay with a star schema or to go ahead and remove it. And so, like I said before, if you already have it, I wouldn't necessarily throw it away right away; I would definitely allow yourself to migrate slowly off of it.
D
Like I said before, there are distinct advantages to having a star schema, and there are advantages to not having one. And, incidentally, all the advantages and disadvantages of a star schema, pretty much every one that you can think of within a database, like, for example, the replication of data and things of that nature, actually apply to the lakehouse. So, frankly, you probably already know the answer, which is pretty sweet. Hopefully that answers that question.
A
That's awesome. I think that gave a very good overview and answered a bunch of questions. Denny, thank you. So with that, just one last question to all three of you: if you all can take a moment and tell us what is coming up next, what are the features that you are excited about and working on? Fabian, I'll start with you.
B
I think I'm really excited about seeing the Delta source in place, because that basically enables a nice end-to-end experience. You can basically use your Flink pipeline to process any tables you already have and write into other Delta tables. I think this really enables using Flink in a full processing mode for Delta tables.
A
So, end-to-end streaming dreams coming true with Flink and Delta, right? That's awesome! How about you, Scott?
C
Yeah, two exciting features I'm really looking forward to. One is Z-order, which will help you really optimize how you run the OPTIMIZE command, the compaction that solves the small file problem; that's one really exciting feature I'm looking forward to. The other is change data feed, which will actually let you capture row-level changes from your Delta table, which is something that a lot of users have been asking for, and that's actually what I'm working on right now.
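Since change data feed had not shipped at the time of this session, the following is only a hedged sketch of how reading row-level changes might look; the table property and reader options are assumptions here, and the path and version are placeholders.

```python
# Hypothetical sketch: change data feed was still in development, so
# treat the property and option names below as assumptions.
spark.sql("""
    ALTER TABLE delta.`/tmp/delta/events`
    SET TBLPROPERTIES (delta.enableChangeDataFeed = true)
""")

changes = (spark.read.format("delta")
           .option("readChangeFeed", "true")
           .option("startingVersion", 5)   # placeholder starting version
           .load("/tmp/delta/events"))
# Each row would carry change metadata such as the change type,
# commit version, and commit timestamp.
changes.show()
```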
D
Fabian and Scott already took my two favorite ones, but in addition to that, I actually did want to call out all these awesome integrations. We recently released the Presto one, we recently released the Trino one, and there's actually an upcoming one with Pulsar. For me it's less about the newer features and more about just integrating more and more, and improving the features we currently have: for example, improving the Presto connector and working with the Trino community on improving their reader and writer.
A
That's awesome. Thank you all; this was a wonderful session. We got a lot of insights, and I'm pretty sure there were a lot of people who asked great questions and got amazing insights into our project. It seems like a lot of people asked about the Slack channel link, so please do join; we welcome all the new integrations and requests from you. Very exciting. So thank you all for joining. Again, a reminder:
A
Please
join
us
through
all
previous
channels.
I
posted
a
link
in
the
linkedin
for
all
different
channels
that
are
available
for
delta
lake,
and
then
this
sessions
are
live
every
two
weeks
on
thursdays.
9
am
so
please
look
forward
to
joining
us
next.