From YouTube: Delta Lake Community Office Hours (2022-05-26)
Description
Join us on May 26, 2022 at 9:00 AM PST for the Delta Lake Community Office Hours! Ask your Delta Lake questions live and join our guest speakers, Scott Sandre and Ryan Zhu, alongside Vini Jaiswal from Delta Lake!
Ask us your #DeltaLake questions. These sessions let our community ask questions about Delta Lake OSS and learn what we are building, what we plan to build, and what has recently been released. These sessions are live, and the recordings are available on the Delta Lake YouTube channel.
Quick links:
https://delta.io/
https://github.com/badal-io/datastream-deltalake-connector
https://groups.google.com/g/delta-users
A
In the meantime, please tune in to our Databricks LinkedIn page and YouTube channel. That's where you can find questions, or you can post your own questions. And please say hi and tell us where you're from, for those who are new to our session.
These sessions are live and occur every two weeks on Thursdays. We try to keep it at 9 a.m. Pacific time, or 12 p.m. Eastern time, and we bring a panel of contributors and champions of Delta Lake to this session, who answer your questions live about the Delta Lake open source software, and any questions about how we are working with the rest of the open source ecosystem.
So, a mini recap from our last session: we had a fun Trino plus Delta session, where we did some live music. We discussed Trino and the connector, and we hope to have fun on this session too. So, please ask away your questions.
A
We will be monitoring the channels on LinkedIn and YouTube. Awesome. And without further ado, let's start by introducing our panel. We have awesome contributors, Scott and Ryan, with us today, who will be answering your questions. So Scott, why don't you go ahead and introduce yourself?
B
For sure. Hi everyone, and thanks, Vini, for having me. My name's Scott. I've been working on Delta Lake for just about a year now. Currently I'm working hard on the next release, on some exciting new features that I'm sure I can talk about today. And yeah, looking forward to answering any of your questions.
C
Yeah, hello, I'm Ryan. I'm a software engineer at Databricks, and I have been working on Delta Lake since the beginning, which is already almost five years. And I'm pretty excited to be here. I would like to hear a lot of feedback on the project, and I can also try to help if you have any questions.
A
Awesome, awesome. Thank you, Scott and Ryan. So we have a few feeds coming in from LinkedIn. Wow, there are people joining from everywhere. All right, the first... oh, the questions are already coming in. So I'm going to read out the first question.
C
No, basically, right now, for the Delta Lake open source project, we focus on building Delta Lake itself. It's kind of like a data lake format, so we don't want to make our scope much larger right now.
A
Yeah, and just to add to that: we do have other efforts in the streaming area, but I think on these office hours we will focus on the things we are working on in Delta Lake, and we will help you out with a lot of the ins and outs of what is happening within the software. So ask away those questions. All right, the next question is: can we use Delta Lake for a heavy-volume load of data, like 600 gigabytes of data, for atomic operations?
B
I'm curious if this is a 600 gigabyte commit. But again, writing the data files is different than actually committing to the transaction log, which is just metadata, and which is very quick and easy. So again, it depends on the exact use case.
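To illustrate the point: a large write produces many Parquet data files, but the commit that makes them visible is a single small JSON entry in the table's `_delta_log` directory. A minimal sketch with PySpark and delta-spark (the paths here are hypothetical):

```python
from pyspark.sql import SparkSession

# Spark session with the Delta Lake extensions enabled.
spark = (SparkSession.builder
         .appName("big-atomic-write")
         .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
         .config("spark.sql.catalog.spark_catalog",
                 "org.apache.spark.sql.delta.catalog.DeltaCatalog")
         .getOrCreate())

# Writing even ~600 GB is just a normal append; all the new data files
# become visible atomically when the single commit succeeds.
big_df = spark.read.parquet("/data/staging/events")  # hypothetical input
big_df.write.format("delta").mode("append").save("/data/delta/events")
```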
A
Yes, and also, Delta Lake is built for scaling your storage, operations, and transactions. So it's definitely meant for big data workloads. Awesome. So we already have the next question: please explain the difference between Serializable and WriteSerializable isolation.
A
Yeah, I agree, Ryan. I think we also have a talk around this, so I will share the document and the talk in the channel shortly after, and in our emails.
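For readers following along: WriteSerializable is Delta Lake's default isolation level, and Serializable is the stricter option that enforces a total serial order across reads and writes. The level is set as a table property; a minimal sketch (table path hypothetical):

```python
# Tighten a table from the default WriteSerializable to Serializable.
spark.sql("""
    ALTER TABLE delta.`/data/delta/events`
    SET TBLPROPERTIES ('delta.isolationLevel' = 'Serializable')
""")
```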
Another question we have is: do we have connectors to bring metadata into Purview or any other catalog services?
C
So basically, currently Delta Lake tries to manage the metadata by itself, which means we can ensure it's atomic, and you won't see any inconsistencies like you might between, say, a Hive metastore or another external metastore service.
A
Yeah, so just adding on top of that: if there is any kind of connector that you don't see, or if you want to see any kind of feature built into Delta Lake, we have a great SLA and our teams are working hard to determine the priorities. So please engage with us on GitHub and Slack. We will be happy to answer your questions.
A
Awesome. There is a flood of questions, so it's hard to pick, but the next question is: how different is Delta Lake from a lakehouse? Can you give some use case examples? Pretty cool question, actually. Who wants to take this?
B
For sure. I'm sure, Vini, you can comment on it as well, but Delta Lake is just an underlying technology that enables you to build a lakehouse. The lakehouse is an architectural paradigm for managing all of your end-to-end data use cases, and Delta Lake is just one technology in that chain that enables you to do so.
A
Exactly, and I think I will add on to what Scott said by giving some use case examples. You know, lakehouse is a term that we realized; it's not something that we developed, it's something that we realized from a lot of use cases, working with customers over the last four years. It just started becoming a thing: people are doing BI, analytics, machine learning, AI, everything together on a single platform, and that helped us coin a term, which is lakehouse. And Delta Lake is just an ideal enabler of, or foundation of, an ideal open source lakehouse. So hopefully that helps answer your question.
C
Yeah, so basically Delta provides ACID support, which means it supports transactions, and the OPTIMIZE command is also built on top of this, which will ensure your table never gets corrupted. For example, if you cancel an OPTIMIZE command, you will see either that the command completed as one atomic change, or that it never took effect at all.
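As a quick way to see that atomicity: the table history only ever shows commits that fully succeeded. A minimal sketch (path hypothetical):

```python
# Every row here is a completed atomic commit; an OPTIMIZE that was
# cancelled midway simply never appears in the history.
(spark.sql("DESCRIBE HISTORY delta.`/data/delta/events`")
      .select("version", "timestamp", "operation")
      .show(truncate=False))
```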
A
Exactly, yeah, that's right. ACID transactions basically enable us to help with that situation. That's why a lot of users use it: to make sure that their data is not corrupt and that it's reliable. Data reliability is one of the guarantees that Delta Lake provides, so yeah.
C
Yeah, I think this question is probably not really a Delta Lake question.
A
Yeah, exactly. And also, as far as Delta Lake is concerned, it's an independent project, and you can configure any kind of storage system underneath. And whatever storage system you are using underneath, follow the best practices on what its impacts and limitations are.
A
You know, interestingly enough, when you build data pipelines and your data volumes are growing, there might be limits that cloud providers or any vendor will impose on your environment and infrastructure, so be mindful of reading that documentation as well, and I'm happy to share some more. Cool.
C
Yeah, so basically Delta Lake doesn't really support unstructured data today, but with a number of libraries in Spark you can actually easily parse semi-structured and unstructured data into structured data and store it in Delta Lake. And Delta Lake also supports schema evolution, with which you can handle a lot of use cases like this: whenever you have a schema change in semi-structured data, you can still ingest that data into Delta Lake.
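A minimal sketch of that flow, assuming PySpark (paths hypothetical): parse the semi-structured input with Spark, then let Delta's `mergeSchema` option absorb a new column on write:

```python
# Day-two data may carry columns the table has not seen before.
parsed = spark.read.json("/landing/events/day2")

(parsed.write.format("delta")
       .mode("append")
       .option("mergeSchema", "true")  # evolve the table schema on write
       .save("/data/delta/events"))
```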
B
To add on to that, Ryan: Delta Lake works really well with the medallion architecture of bronze, silver, and gold tables. So what Delta Lake really enables you to do is import all of your data as bronze tables, without parsing it or performing any kind of computation or transformation on it, and then gradually aggregate it, parse it, and clean it up into your silver and gold level aggregates, which is just a really efficient pipeline and a really safe way to manage your data.
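A minimal sketch of such a medallion pipeline (the paths and column names are hypothetical):

```python
from pyspark.sql import functions as F

# Bronze: land the raw records as-is, no parsing or transformation.
spark.read.json("/landing/raw_events") \
     .write.format("delta").mode("append").save("/lake/bronze/events")

# Silver: parse, clean, and deduplicate the bronze data.
bronze = spark.read.format("delta").load("/lake/bronze/events")
silver = (bronze.filter(F.col("event_id").isNotNull())
                .dropDuplicates(["event_id"]))
silver.write.format("delta").mode("overwrite").save("/lake/silver/events")

# Gold: business-level aggregates ready for dashboards.
(silver.groupBy("event_type").agg(F.count("*").alias("event_count"))
       .write.format("delta").mode("overwrite").save("/lake/gold/event_counts"))
```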
A
Exactly. Awesome. We have another question, that is: is Delta Lake able to read data from web services, meaning some kind of external protocol?
C
Yeah, I think it basically depends on what the web service is. For example, S3 also provides its data through a web service; actually, it even looks like storage. So if your web service can provide all the storage requirements that we ask for, then we should be able to support it by building, for example, a new implementation of LogStore on top of this web service. But it really depends on what features your web service provides.
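For context, Delta Lake routes its transaction-log writes through a pluggable LogStore, and since Delta 1.0 an implementation can be registered per URI scheme. A hedged sketch (the class name and `mysvc` scheme are hypothetical; the class would have to provide the atomic, put-if-absent style guarantees the LogStore interface requires):

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         # Hypothetical LogStore backed by a custom web service,
         # registered for paths like mysvc://bucket/table.
         .config("spark.delta.logStore.mysvc.impl",
                 "com.example.delta.MyWebServiceLogStore")
         .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
         .config("spark.sql.catalog.spark_catalog",
                 "org.apache.spark.sql.delta.catalog.DeltaCatalog")
         .getOrCreate())
```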
A
Exactly. And, you know, Delta Lake supports a wide variety of sources. You might have Kafka streaming coming in, which is ingesting data from your source systems into Delta Lake. So any variety of systems, for example media files, or an IoT device which is sending sensor data; it could be multiple things, which Delta natively supports. And then another question is: can you explain Delta Lake with one real-time scenario or use case?
A
I can, yeah, I can answer that question. So, for example, say you have a real-time feed from an IoT device. I'm not really sure what your industry is, but I'm just going to take an IoT example. So imagine you have 150 sensors on different trucks, and they are sending data in. You can actually have that feed go into S3, and top it off with Delta Lake.
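A minimal sketch of that ingestion path with Spark Structured Streaming (the broker, topic, schema, and paths are all hypothetical):

```python
from pyspark.sql import functions as F
from pyspark.sql.types import (StructType, StructField, StringType,
                               DoubleType, TimestampType)

schema = StructType([
    StructField("truck_id", StringType()),
    StructField("sensor", StringType()),
    StructField("reading", DoubleType()),
    StructField("ts", TimestampType()),
])

# Continuously pull sensor readings from Kafka and land them in Delta.
readings = (spark.readStream.format("kafka")
            .option("kafka.bootstrap.servers", "broker:9092")
            .option("subscribe", "truck-sensors")
            .load()
            .select(F.from_json(F.col("value").cast("string"), schema).alias("r"))
            .select("r.*"))

(readings.writeStream.format("delta")
         .option("checkpointLocation", "/lake/checkpoints/truck_sensors")
         .start("/lake/bronze/truck_sensors"))
```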
A
How you use Delta Lake in that particular scenario is: whenever you have a detection event, for example a real-time detection event, like a driver's behavior going sideways based on your past analysis, you can actually put models around it and wait for those signals, to help you detect those signals and immediately take action on them.
A
And Delta helps you with that, because you can have either a batch or a streaming feed of data coming in from those devices, and it helps you monitor it, or put it into your machine learning pipeline, in either batch format or streaming format. And you can basically use those events for tracking. Or maybe there has been corrupted data, or maybe you had a new release of your software and something goes wrong.
A
You can roll it back using the transaction log from Delta Lake, and it helps you save those already-detected events. So that's one of the use cases; I'm sure there are so many more that I would be happy to share. We have a whole use case repository, so I'm happy to share that link as well.
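A hedged sketch of that rollback, using Delta's time travel plus the RESTORE command (which shipped in Delta Lake 1.2); the path and version number are hypothetical:

```python
# Inspect the snapshot from before the bad software release.
before = (spark.read.format("delta")
          .option("versionAsOf", 42)
          .load("/lake/bronze/truck_sensors"))

# Roll the live table back to that version in place.
spark.sql("RESTORE TABLE delta.`/lake/bronze/truck_sensors` "
          "TO VERSION AS OF 42")
```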
All right. Another question is: what sort of transaction volume does Delta Lake support for inserts or deletes?
A
Wouldn't this just be a basic property of Delta Lake, regardless of transaction volume? This is a Delta characteristic, right?
A
Yeah, I think that's right. Maybe you have a particular use case in mind, and maybe the preliminary question is how you're defining it; that may be helpful. Yeah. Also, another question is: is it mandatory to have bronze and silver layers? Can we combine them into one storage?
B
So it's certainly not mandatory, but it enables some really powerful features to be built on top of it. And if you have more specific questions about how to build your pipeline, feel free to message us on Slack or post an issue on our GitHub. I'm not quite sure what you mean by combining into one storage, but perhaps this is a discussion we can take offline.
A
Yeah, definitely. And I will also add to this: while Scott said it's not mandatory, it's one of the best practices we have seen working well for customers. That's why in our medallion architecture we consider three types of layers, because we deal with a lot of different use cases.
A
For example, somebody is working on machine learning, somebody is working on analytics. So analytics users, who are building just the dashboards, don't have to have raw data; they would be lost if you gave them raw data. So it's very helpful to give them business-level aggregates, and that's why gold tables are really helpful.
A
Another scenario is: if you have PII and non-PII data, you don't want to give everybody in your organization access to all of your data. So you make a silver table where all the data goes in, and not everybody has access to it. Then you make a bronze table as an extraction of that, and that can be a non-PII table which you can freely give out. And, you know, also work towards having some kind of governance around your data; that helps, just as a best practice. That's why we have that in place.
A
So it's not mandatory, but it is something that we recommend, as we have seen it working well with customers. All right, another question is: is there a plan to integrate any data governance standards or data quality rules, to make data management more robust right from the source through multi-tiered targets?
A
So I think this is going towards expectations, so it's not really a Delta Lake question, but feel free to correct me, Ryan or Scott: this is something in Delta Live Tables. What you can do is define certain expectations and quality rules to ensure that, for whatever you have coming in, you have a live lineage of whatever is happening within your data. So that's the architecture you can follow. All right? Correct, Scott, Ryan, anything you'd add?
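A hedged sketch of such an expectation in Delta Live Tables (DLT is a Databricks product; the `dlt` module is only available inside a DLT pipeline, and the table and rule names here are hypothetical):

```python
import dlt
from pyspark.sql import functions as F

@dlt.table
@dlt.expect_or_drop("valid_reading", "reading IS NOT NULL AND reading >= 0")
def clean_sensor_readings():
    # Rows violating the expectation are dropped and surfaced in the
    # pipeline's data quality metrics.
    return dlt.read("raw_sensor_readings").withColumn(
        "ingested_at", F.current_timestamp())
```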
A
All right, all right. Let me see if there are any questions.
A
Is there any way to restart the pipeline with minimal streaming initialization time?
C
So basically, it would be great to know the exact details of what you mention here. You mention that the data grows; so what kind of data is it? Is it the Delta log, that is, the metadata, or is it just the Parquet files? But in general, if the data growth is just the Parquet files increasing a lot, then this should not be a problem.
A
Yep, that works. All right, another question is: can we use incremental processing in live Delta tables?
A
Yeah, yeah. And I think, maybe, if this is meant as how you separate batch, or how you define a stream: maybe it's like a trigger. If you set up trigger characteristics, that would help you determine when you want data to arrive. So set those trigger points, like how frequently you want to trigger, and the mode: is this happening in batch or stream mode?
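A minimal sketch of those trigger choices on the same Delta source (paths hypothetical): a fixed micro-batch cadence versus a run-once, batch-style incremental pass:

```python
stream = spark.readStream.format("delta").load("/lake/bronze/truck_sensors")

# Streaming mode: process newly arrived data every five minutes.
(stream.writeStream.format("delta")
       .option("checkpointLocation", "/lake/checkpoints/silver")
       .trigger(processingTime="5 minutes")
       .start("/lake/silver/truck_sensors"))

# Batch mode: drain whatever is new since the last run, then stop.
(stream.writeStream.format("delta")
       .option("checkpointLocation", "/lake/checkpoints/silver_once")
       .trigger(once=True)
       .start("/lake/silver/truck_sensors_once"))
```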
A
And then, Scott, you are working on CDF for the next release. Can you touch on that? Like, what's CDF in terms of Delta Lake, and what is that exact feature?
B
For sure. So yeah, the next feature, or the feature I'm currently working on developing, is called Change Data Feed, or CDF for short. And what this enables users to do is actually capture row-level changes, instead of per-file-level changes, for streaming queries, which serves a lot of downstream use cases, and it's been a really highly demanded feature. So I'm really glad we're finally getting around to developing it.
A
That's awesome. And this CDF, or change data feed, means it will only track the changes that happen, and it will add statistics to your transaction logs. Is that the feature?
B
Exactly. So, you know, if there are only deletes, then we'll actually just tell you which rows were deleted, and if there are only inserts, we'll tell you which rows were inserted. But then there are more complicated cases, for example updates and merges, and in cases like that, we'll actually tell you what the pre-update and the post-update values were, and then it's up to you in your downstream application to decide what you want to do with that information.
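A hedged sketch of consuming CDF with the API as it later shipped (Delta Lake 2.x): the table needs the `delta.enableChangeDataFeed` property set, and each change row carries a `_change_type` such as insert, delete, update_preimage, or update_postimage (path hypothetical):

```python
changes = (spark.read.format("delta")
           .option("readChangeFeed", "true")
           .option("startingVersion", 0)   # replay changes from version 0
           .load("/lake/silver/truck_sensors"))

# Pre/post images let downstream apps decide how to apply each update.
changes.select("truck_id", "_change_type", "_commit_version").show()
```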
A
Awesome, that would be a nice feature. Another question was, you know, in terms of the next release that we are prepping: I think we are soon going into our summit, which is happening in the last week of June. What are the other exciting features that you both are working on? Maybe we'll go with Scott first and then Ryan.
B
Sure, yeah. The next really exciting feature is Z-order optimize. So OPTIMIZE is already in Delta Lake, and it allows you to essentially compact your Parquet data files, which reduces the size of the transaction log.
B
Like Ryan said, this speeds up the Delta log operations. And what Z-order does is: it's essentially a new algorithm to improve, or optimize, the OPTIMIZE command, and it improves exactly how we determine the best way to compact your Parquet files. So what it's really useful for is after you have partitioned your data.
B
You now have all of these actual data columns, and it's a really hard problem to decide how to best co-locate or group files together based on n arbitrary data columns. So Z-order is one such algorithm; it's a space-filling curve which is basically able to determine which files are best to compact together. And the main benefit of this shows up in your subsequent queries.
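A hedged sketch of the command as it later landed in open source Delta Lake 2.0 (the table path and columns are hypothetical): compaction plus Z-order clustering on the columns you filter by most:

```python
# Rewrites files so rows are clustered along both columns, improving
# data skipping for queries that filter on truck_id and/or ts.
spark.sql("""
    OPTIMIZE delta.`/lake/silver/truck_sensors`
    ZORDER BY (truck_id, ts)
""")
```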
B
I'm not exactly sure which sources you're referring to. CDF is a feature that can be applied to any change to the transaction log, regardless of where the source is coming from. So yeah, I'm curious what exactly they're referring to.
A
So maybe, if they post it... Adam, please post what you are referring to. This is just an extended feature on how we are evolving the transaction logs, so it doesn't matter: whatever Delta Lake supports, this will support as well. Cool. So the next question is for you, Ryan: what exciting things are you working on?
C
Yeah, for me: while I'm working on Delta Lake in general, I'm also spending a lot of time working on Delta Live Tables, which is a different project, but it's also built on top of Delta Lake. It allows people to, for example, define your tables and tell us the relationships between these tables, and then we will help you manage all of these tables automatically. So basically I'm putting a lot of time into this project as well.
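A hedged sketch of declaring that relationship in Delta Live Tables (again, `dlt` exists only inside a DLT pipeline, and the names are hypothetical): referencing one table from another with `dlt.read` is what lets DLT infer the dependency graph and manage refresh order automatically:

```python
import dlt

@dlt.table
def bronze_events():
    return spark.read.json("/landing/raw_events")

@dlt.table
def daily_event_counts():
    # The dlt.read reference tells DLT this table depends on bronze_events.
    return dlt.read("bronze_events").groupBy("event_type").count()
```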
A
That's awesome. And there are a lot of questions on Delta Live Tables, so... I think we already have a demo and a blog on Delta Live Tables, so I'm happy to post them on the Slack channel; stay tuned in that area as well. And there have been a lot of questions around Databricks SQL and some Databricks-related questions.
A
Unfortunately, this is the forum for only our team, which is working on Delta Lake, so I will paste the channel link so that you can ask the appropriate questions about Databricks and Databricks SQL there. And I will also paste the links for our GitHub and Slack channels: if this was not the forum where your questions could get answered, or if you are joining offline, please feel free to join us on Slack and GitHub. We definitely will get back to you on those. Yeah, and on the recording:
A
We are recording these sessions, and the recording will be available shortly after, Naveen, if that's what you want. Cool, so that's about it. We are almost at our time, and I appreciate everybody who asked great questions. I highly encourage you to keep working with us, and if you are passionate about contributing, we also have some good first issues on GitHub, so please do engage with us. And thank you so much, Scott and Ryan, for being awesome panelists for this session.
A
Hopefully we see you soon, next time. Yep. And one more thing: if you want to join our summit, we will have a lot of meetups, we will have a lot of machine learning and AI visionaries, and I'm super excited. We will also have a contributor meetup, and if you want to meet us in person, I will send you the registration link, so please check that out. Yeah. Thank you all.