From YouTube: Delta Lake Community Office Hours (2022-03-03)
Description
Join us for the next Delta Lake Community Office Hours and ask us your #DeltaLake questions. The Delta Lake community AMAs occur bi-weekly on Thursdays at 9AM PST. These sessions allow our community to ask questions about Delta Lake OSS and learn about what we are building, what we plan to build, and recently released features.
B
Okay, awesome! So if we are live, hi everybody.
B
All right, one second. Okay, if you're live, hi everybody! We are live on LinkedIn as well as YouTube. I hope you have the right links; if not, I'm going to paste them again shortly in the LinkedIn channel. If you can say hi and tell us where you're from, I would love to see what global presence we are getting today. So, let's see.
A
All right, Vinnie, we should be live now on both LinkedIn and YouTube. So hopefully we've got a little bit of traction here. Whenever we're ready, we can start the show, or we can continue talking about Quebec politics, whichever you guys prefer.
B
Let's see, there are some chats coming in, which I love. We have people from Arizona, Toronto, South Africa, wow, Amsterdam, Canada, Brazil. Interesting, nice. I'm joining from San Francisco, live from Denny's office. So why don't we introduce our panel? I will start with Florian. Florian, please introduce yourself to the audience.
C
We have a lot of data pipelines running using Delta Lake technology, mostly the open source one, and I'm also a contributor to delta-rs, the Rust-native integration of Delta Lake. I'm also doing a bit of exploration with BigQuery, to find a good way to have an integration between GCP BigQuery and Delta Lake.
B
Awesome, awesome. Thank you, Florian, amazing to hear what you are working on. And Ryan, next to you.
E
Yeah, hello everyone, I'm Ryan, a software engineer from Databricks. I've mostly been working on Delta Lake for the past several years, and previously I also worked on Structured Streaming, the streaming engine of Apache Spark.
B
Awesome, thanks Ryan. Denny, next.
A
Hey, thanks very much. Hi everybody, my name is Denny Lee. I'm a developer advocate here at Databricks, a long-time Spark and Delta Lake guy. Before that, you can either blame me or thank me, I don't know which one you want to do: I was part of the SQL Server team at Microsoft, and also on the team that created HDInsight, so for that one you probably want to blame me. My apologies. Back to you, Vinnie.
B
That's a great history, isn't it? Don't we have an exciting panel here? And hi, myself: I'm a developer advocate working in the Delta Lake open source community. So that's about us. Let's get started with some questions. A few topics you can ask questions about: anything related to Delta Lake. For example, we have recently released amazing features like performance improvements in MERGE, and we have new features coming in, like Z-ordering and data skipping.
B
We have released, and continue to release, new connectors like Trino, Presto, and Flink, so ask any questions about those. While we wait for some good questions to come in, why don't we quickly get started with some updates on Rust? Florian, why don't you provide some updates on what the Delta Lake community should look forward to? Any insights would be appreciated.
C
Yeah, definitely. We have a lot of improvements in the delta-rs bindings. Mostly, we integrated the new Azure bindings, meaning that you will use a new crate for it. There has been a lot of improvement in integrating the Azure cloud provider with delta-rs, and many more things are coming as soon as we have requests, so feel free to reach out to us. It will be a nice way to improve delta-rs or to bring more features into it.
B
Awesome, that's a great update, Florian. For those who don't know, can you provide some input on what delta-rs does, and a little bit of what they can expect from the connectors, like a specific use case?
C
Yeah, actually I can use Back Market as an example. We use delta-rs to read Delta tables without having a cluster running, meaning that we can fetch a lot of metadata using delta-rs. You can also read Delta tables, list the files of a partition, and read the underlying Parquet files using this library. We also provide a writer to write Delta tables, or to update Delta Lake tables, without requiring Spark to be installed. So mostly the Rust binding is there, and we also have Python and Ruby bindings.
B
That's awesome. So you said a cluster up and running is not required when you are fetching the metadata, which means saving on the cost of running compute. That's awesome.
C
Yeah, exactly. You can read, you can work with a small data set if you want to do some exploration. I'm thinking about the data scientist use case, when you just need to look at or work with a small data set, without launching everything on the Databricks side, or on the open source side with a cluster running Spark. So mostly, it's really good when you just want to explore a small portion of the Delta table.
B
Yeah, I'm pretty sure, Florian, I missed a lot of details there; happy to have a detailed conversation later, that would be awesome for people to know. Awesome. So we have some other questions in the chat. Thomas is asking: anything on the horizon to help with the surrogate key problem?
A
Sure, I can probably tackle that. I do want to call out that there's probably a little bit of bias in my answer, because we actually had, and I'll post it, a Delta tech talk that explains how to work with surrogate keys within Delta Lake. In addition to that, in case you're into Stargate SG-1, there are plenty of Stargate SG-1 references in that particular session. But geekdom aside:
A
The one thing that I did want to call out is that, as part of the roadmap, Delta Lake will actually be including an identity column. So that is one of the things working its way down the pipe, and the combination of what we are planning to work on in terms of the community to address some of these issues will, I think, be helpful for the longer term. Now, for the shorter term:
A
Honestly, and I think it was Thomas that went ahead and asked the question: the reality is, yes, the problem is cumbersome, and it really does involve you figuring out which type of hash, or which kind of surrogate key generation, you want to work with. So honestly, I'm just going to give you the answer that we'll post the link in there. Now, saying this, I'm not trying to avoid providing more details; it's just that it depends on how you want to solve the problem.
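As a sketch of the hash-based approach mentioned above: derive a deterministic key from the business-key columns, so the same row always maps to the same surrogate key. The column values here are made up for illustration:

```python
# Deterministic surrogate key from business-key columns via hashing.
import hashlib

def surrogate_key(*business_keys) -> str:
    # Join with a separator that cannot appear in the keys themselves,
    # so ("ab", "c") and ("a", "bc") do not collide.
    raw = "\x1f".join(str(k) for k in business_keys)
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()

k1 = surrogate_key("customer-42", "2022-03-03")
k2 = surrogate_key("customer-42", "2022-03-03")  # same row, same key
k3 = surrogate_key("customer-7", "2022-03-03")   # different row, different key
```

A hash key like this is stable across reruns, which matters for reproducible MERGE logic; the identity-column feature on the roadmap would instead have the table generate monotonically increasing values for you.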
A
There are different answers, so please join us on the Delta Users Slack. If you join us there, we can dive into those scenarios a little bit better. So hopefully that gives you some context. Oh, thank you, Vinnie, for dropping in that particular video; that's quite helpful.
B
Yeah, thank you, Denny, that was a good answer. I think the next question is about what differentiates Delta Lake from competitor products like Azure ADLS and Snowflake. There are many answers to that, but Ryan, do you want to take a pick at answering that question?
E
I think one major difference is that, for example, Delta Lake is an open source format. We basically documented how to read and write a Delta table, on a website, so this is entirely open, and we are also building a lot of open source connectors. Presto and Flink, for example, can read it directly using the Delta format, so you don't need to worry about lock-in. Compare that with something like Snowflake:
E
If your data is in Snowflake and you decide to use a different product, you probably need to move all your data out of Snowflake. But with Delta Lake, whatever platform you are using, you can easily switch the engine without changing the underlying storage. Yeah, hopefully that answers your question.
C
I would just highlight that PROTOCOL.md is open source, in the Delta repository itself, and it is a great help if you want to add anything related to Delta. That was the case with delta-rs: it was the main source of information if you want to build or change a connector. So it's really a good place for the open source community to see what the requirements, the guidelines, and the information are, if you plan to integrate something with it.
B
Yep, great. There's a question from Sean Collins which is only part of a question, so Sean, if you can please ask the entire question in one chat message, I would be able to help you out. Okay, moving on to the next question: Jether is asking, is Hadoop installed on a Databricks cluster by default?
A
Let me try to dive in here, because in general, let's try to limit our questions to Delta Lake specific questions, okay? But I will answer this particular question. No, Hadoop is not installed on a Databricks cluster, so that's the easy answer; that's why I didn't mind answering it. There's always a small file problem, though, and it doesn't matter whether you've got Hadoop or ADLS or GCS or AWS.
A
The small file problem is rooted in the fact that the listing of files on cloud object stores is excessively slow. Don't forget, when you're dealing with cloud object stores, what you're doing is web-based REST API calls, as opposed to bulk batch calls, which inherently will be slower. So if you've got a lot of small files, you've got a lot of transactions, a lot of pings, a lot of back and forth communication, which will significantly slow the performance down. And Delta Lake does in fact solve this problem.
A
The way it solves it is that, instead of making you list the files directly, the actual file list is inside the transaction log. So whenever Spark, or delta-rs, or the upcoming Flink connector, or the current Presto connector, or any of these systems wants to access Delta Lake, it reads the transaction log. By reading the transaction log, it gets the full list of files, so you don't actually have to use that really cost-prohibitive listing of your cloud object store to get the list of files.
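To make that concrete, here is a minimal sketch of the idea: the `_delta_log` directory holds JSON commits whose `add` and `remove` actions, replayed in order, yield the table's current file list without ever listing the data directory. These log contents are handmade and greatly simplified relative to the real protocol:

```python
# Replay simplified Delta transaction-log commits to get the active
# file list, instead of listing the storage path itself.
import json

def active_files(commits):
    """commits: iterable of JSON-lines commit strings, oldest first."""
    files = set()
    for commit in commits:
        for line in commit.splitlines():
            action = json.loads(line)
            if "add" in action:
                files.add(action["add"]["path"])
            elif "remove" in action:
                files.discard(action["remove"]["path"])
    return sorted(files)

log = [
    # commit 0: two files added
    '{"add": {"path": "part-0000.parquet"}}\n'
    '{"add": {"path": "part-0001.parquet"}}',
    # commit 1: one file rewritten
    '{"remove": {"path": "part-0000.parquet"}}\n'
    '{"add": {"path": "part-0002.parquet"}}',
]
result = active_files(log)
```

The real protocol adds checkpoints, statistics, and further action types; PROTOCOL.md, mentioned earlier, spells those out.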
A
One other notion is that you do want to go ahead and optimize your file system every so often, i.e. compaction, because if you've got lots of little small files, it doesn't matter whether you're listing them from a transaction log or from HDFS or cloud object storage; that's still too many files. And so what file compaction does, and Delta Lake already has the ability to do file compaction, is basically take a lot of small files and make them into bigger files.
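The gist of compaction can be sketched as simple bin-packing: group many small files into batches of roughly a target size, and rewrite each batch as one larger file. The file sizes and the target are made-up numbers; real compaction also handles the transaction log and concurrency:

```python
# Bin-pack small files (name, size-in-MB pairs) into compaction
# batches that each rewrite to roughly one target-sized file.
TARGET_MB = 128

def plan_compaction(files, target=TARGET_MB):
    batches, current, current_size = [], [], 0
    for name, size in sorted(files, key=lambda f: f[1]):
        if current and current_size + size > target:
            batches.append(current)   # close the batch, start a new one
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if current:
        batches.append(current)
    return batches

small_files = [("a", 10), ("b", 70), ("c", 60), ("d", 5), ("e", 100)]
plan = plan_compaction(small_files)  # e.g. [["d", "a", "c"], ["b"], ["e"]]
```

In Delta Lake the rewrite is transactional: the removed small files and the added large file are both recorded as actions in the log, so readers never see a half-compacted table.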
A
So that way, it will actually significantly speed things up. Now, this naturally leads to a question, which is: hey, I heard there's this thing called OPTIMIZE, and I would really love to have OPTIMIZE included in Delta OSS, and oh yeah, also OPTIMIZE Z-ORDER, I would really love to have that too. The good news is we heard everybody's question on that: we had a survey last year, and we combined the survey with the feedback from GitHub.
A
It was the overwhelming winner, to put it rather lightly, so we actually had to put OPTIMIZE in. As part of the current Delta Lake 2022 H1 roadmap (I'm trying to be precise, though it's hard to say exactly), which is targeted for June of this year, we are targeting both OPTIMIZE and OPTIMIZE Z-ORDER. The current timelines are approximately Q1 for OPTIMIZE and Q2 for OPTIMIZE Z-ORDER.
B
Thank you, Denny, that was a wonderful answer. I'll just add a little bit more information there. I think there might be an underlying question as well that you were trying to ask, which is about setting up a data lake. So I put a link for you on getting started with Delta Lake, as well as on how you can manage the small file problem. There are different ways:
B
You
don't
have
to
just
use
hadoop,
it's
a
delta
x
standalone,
so
you
can
basically
have
you
know
any
configuration
underneath
for
your
storage
systems
and
then
the
link
works
on
top
of
that
awesome
and
then
there
was
amazing
roadmap
which
recently
got
launched
and
we
are
happy
to
have
more
discussions.
I
put
a
link
in
the
chat,
so
please
get
involved
in
the
discussions
awesome.
So
moving
on
to
the
next
question,
all
right,
so
there's
a
question
about.
B
I'm thinking how, if you do a SQL Server DACPAC deployment, it largely handles schema changes for you. So it's a question around how we manage schema outside of manual scripting.
B
Do you want me to repeat the question?
A
Okay, cool, cool, cool. All right, the only reason I figured I could tackle this is because, since I was formerly on the SQL Server team, I can claim I know a little bit about SQL Server. So Thomas's question is very prototypical, and that's a good thing, by the way: a prototypical, standard DW/BI type question about how you handle metadata and schema management. Just like in SQL Server, and for that matter other relational databases:
A
The context, at least, is that you're going to write DDL scripts, and you probably save them in GitHub or some version control system, and that's how you manage that. And so there are two things when it comes to Delta Lake. First of all, Delta Lake actually has both schema enforcement and schema evolution.
A
So whatever DDL you are writing, it will basically make the changes to the schema of your Delta Lake table, assuming you want it to. In other words, you have to enable evolution to allow the schema to evolve; by default, it will enforce the current schema. So if you try to put, for the sake of argument, five columns into a four-column table, it will actually prevent that from happening, because you don't want to risk accidentally corrupting your data. Now, to Thomas's question specifically about how to manage that:
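A minimal sketch of that enforcement-versus-evolution behavior (a toy schema check, not the actual Delta Lake implementation, and the column names are invented):

```python
# Toy model of schema enforcement vs. schema evolution on write.
def write(table_schema, incoming_columns, merge_schema=False):
    extra = [c for c in incoming_columns if c not in table_schema]
    if extra and not merge_schema:
        # Enforcement: reject writes that don't match the table schema.
        raise ValueError(f"schema mismatch, unexpected columns: {extra}")
    # Evolution: fold the new columns into the table schema.
    return table_schema + extra

schema = ["id", "name", "city", "signup_date"]

# A fifth column is rejected by default...
try:
    write(schema, ["id", "name", "city", "signup_date", "score"])
    rejected = False
except ValueError:
    rejected = True

# ...but accepted once evolution is explicitly enabled.
evolved = write(schema, ["id", "name", "city", "signup_date", "score"],
                merge_schema=True)
```

In Spark, the corresponding switch is the `mergeSchema` write option: leave it off and mismatched writes fail, turn it on and the table schema evolves with the data.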
A
Typically, you would just keep the DDL scripts that you're using to evolve the schema, for the sake of argument, inside GitHub. Now you're saying: well then, how do I match those two? What's really cool is that there's actually custom metadata that you can put within your Delta Lake transaction log. So a common practice is, as part of the custom metadata, to link directly to the GitHub repo that contains the DDL. So the Delta transaction log not only contains the fact that the schema changed:
A
It also contains a link to the schema itself, so you can just review the transaction log and go get that information. At least, I've seen this happen more recently with various customers doing this. So hopefully that helps answer your question.
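A sketch of that practice, using a handmade, heavily simplified commitInfo entry (the real log entry carries many more fields, and the repo URL is made up):

```python
# Toy commitInfo entry carrying custom user metadata that points at
# the DDL script which produced this schema change.
import json

commit_info = {
    "commitInfo": {
        "timestamp": 1646300000000,
        "operation": "ADD COLUMNS",
        "userMetadata": "https://github.com/example/ddl-scripts/commit/abc123",
    }
}
entry = json.dumps(commit_info)

# Later, reviewing the log tells you both *that* the schema changed
# and *where* the corresponding DDL lives.
meta = json.loads(entry)["commitInfo"]["userMetadata"]
```

So the transaction log becomes the join point between the table's history and the version-controlled scripts that produced it.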
B
Awesome, all right, thank you, Denny. Moving on to the next question. So, Ivana, this question is for you: you are doing a lot of work around streaming. Can you maybe give us a few highlights on some awesome features you have come across, and why you are interested in working on those features?
D
Well, Delta doesn't have only data engineers but also data scientists, and I do a lot of projects where data engineering is involved, so I want to keep my hands dirty with the data engineering stuff as well. So yeah, I do a lot of streaming, and for that, Delta is really the savior.
D
One of the latest features is the change data feed. I find it quite useful, because then I don't have to read all the data. Say I have a stream over the typical bronze, silver, and gold layers in my data lake, and I have to read only the changes from the silver layer, get them into gold, and do some kind of aggregation there.
D
I
don't
want
to
get
all
the
data,
because
that
can
be
a
lot
of
a
lot
of
data,
and
I
would
like
to
only
read
that
the
rose
or
the
the
data
that
has
recently
been
updated
and
with
this
change
data
fit
feature
I
can.
I
can
easily
I
can
easily
get
only
those
rows
and
then
merge
them.
Merge
them
into
my
into
my
gold
layer.
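A tiny sketch of that pattern: consume only the changed rows and merge them into the downstream table by key, instead of re-reading the full upstream table. The change records here are handmade and simplified; `_change_type` is the kind of marker column a change feed exposes:

```python
# Merge a small feed of changed rows into a gold table keyed by id,
# rather than rescanning the whole silver table.
gold = {1: {"id": 1, "total": 10}, 2: {"id": 2, "total": 20}}

# Changed rows since the last processed version (toy change feed).
changes = [
    {"id": 2, "total": 25, "_change_type": "update_postimage"},
    {"id": 3, "total": 5, "_change_type": "insert"},
]

for row in changes:
    data = {k: v for k, v in row.items() if k != "_change_type"}
    gold[row["id"]] = data  # upsert: insert new keys, update existing
```

The win is proportionality: the work scales with the number of changed rows per batch, not with the size of the silver table.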
D
That's one of them. And for the data science part, Delta really helps a lot on that front as well: keeping all the data sets and all the features there, making it easier to build a feature store, and so on.
B
It's present on both fronts. So I guess you touched upon all the layers, bronze, silver, and gold, and as a data scientist you understand how that architecture really helps in setting up your pipeline well, and really helps the end use case: once the data engineering is set up properly, then as a data scientist you can go ahead and make use of the data in a curated manner, right?
D
As
a
data
scientist,
it's
I
think
it's
the
dream
to
have
the
the
data
that
you
have
as
an
input
as
as
with
good
with
good
quality
as
possible
and
delta
lakes.
Actually,
delta
lake
helps
us
with
that.
So
we
can
just
read
the
data
that
the
data
engineers
have
prepared
and
yeah
directly
have
have
the
clean
and
yeah
high
quality
data.
B
That's awesome, Ivana, that's a wonderful insight coming from a data scientist on what data engineers do with Delta Lake. Awesome. Anybody else would like to add anything there for data science?
A
Not to take up the show, but the one thing I definitely want to add is exactly Ivana's point: it's garbage in, garbage out, right? Depending on which machine learning algorithm you're using, literally the sort order of the data coming in could change the hyperparameters and change the meaning of everything, right? So data quality really matters there.
B
So we touched on data quality, reproducibility, and having a clean slate for data science. There is one more question related to this, which Dennis is asking: would you recommend using dedicated buckets for the bronze, silver, and gold layers, for any reason? There are so many answers I could provide, but Ivana, I would love to have you answer this question.
D
Yes, I would put them in different buckets. I think there can be many different opinions on this, but I prefer that they're separated, and sometimes that's also better in terms of security and governance.
B
You have to have some kind of governance. So when the data is coming in, it's always a good practice to curate that data: in the first bronze layer, you can just keep the raw data as is, but then apply transformations, separate out, you know, PII data from regular-use data, and then use specific buckets for the use case that you are trying to solve.
B
For
example,
if
you
have
like
10
different
teams
working
on
different
use
cases,
it's
always
a
good
idea
to
have
separate
buckets
in
the
goal
layer
where
you
can
curate
the
data
add,
do
the
you
know,
merge
of
tables
and
all
that
in,
like
a
specific
use
case
manner,
so
hope
that
helps.
There
are
a
lot
of
different
use
cases
which,
which
you
can
think
of
when
creating
your
delta
architecture
awesome.
So
I
think
that's
about
that's
a
wrap.
We
are
coming
close
to
our
office
hours.
B
I'm
gonna
pay
some
chat
links
in
the
chat
so
that
you
can
keep
continuing
this
discussion
through
various
channels.
We
have
either
through
slack,
google
group
or
github.
You
know
I
love
seeing
how
much
engaged
the
audience
is
on
the
chat.
So
thank
you
for
that.
Any
closing
words
from
the
panel.
A
My
only
call
out
is
join
us
for
the
delta
users
slack.
If
you've
got
any
questions,
I
think
that's
been
posted
in
linkedin
and
youtube
because
I'm
sure
you
all
have
other
great
questions
we'll
be
here
in
another
two
to
three
weeks
to
answer
your
questions
again
with
a
new
panel
so
but
between
now
and
then
please
join
us
in
delta
user
slack.