From YouTube: Delta Lake Community Office Hours (2022-09-22)
Description
Join us on September 22, 2022 at 9:00 AM PDT for the Delta Lake Community Office Hours! Ask your Delta Lake questions live and join our guest speakers, alongside Vini Jaiswal from Delta Lake!
Ask us your #DeltaLake questions. These sessions let our community ask questions about Delta Lake OSS and learn what we are building, what we plan to build, and what was recently released. These sessions are live, and the recordings are available on the Delta Lake YouTube channel.
Quick links:
https://delta.io/
https://go.delta.io/slack
https://github.com/delta-io/delta/releases
https://groups.google.com/g/delta-users
A: Matthew, I'm wearing the same t-shirt from Data + AI Summit.
A: Let me find the exact link of the stream, one second.
A: It doesn't show me the exact link. Okay, no, that's a company page. Okay, got it. All right, sounds good.

So hello, everyone, and welcome to the Delta Lake Office Hours! Please tune in to our feed on the Delta Lake LinkedIn page, where we are live, and we are also live on our YouTube channel. Just for the sanity of these office hours, a quick reminder for those who are new to the session: these sessions are live and occur every two weeks on Thursdays at 9 AM Pacific. We bring a panel of contributors and champions of Delta Lake to answer your questions, whether you are getting started on Delta Lake or wondering what is coming up on the roadmap and what we have built. If you missed any previous AMAs, don't worry about it; we have recordings available that you can watch on either LinkedIn or YouTube on our Delta Lake channels. And since this is an official webinar of Delta Lake, it's subject to a code of conduct. Please do not post anything or ask anything that would be in violation of that code.

So without further ado, we have Scott and Matt on our panel. Why don't I give the room to them for introductions? Scott?
C: Sweet, thanks, Vini! Hi everyone, I'm Scott. I've been on here a few times before. I'm an engineer at Databricks working on Delta Lake; I've been working on this product, this open source project, for over a year, and on a lot of cool features in the past several releases. I'm really excited to be here and answer your questions.
B: Hello, my name is Matthew Powers. I'm a developer advocate at Databricks and have been a big Delta fan for quite some time, since it was released. Recently I've been focusing on a Delta acceptance testing project that we'll be chatting about, and also writing a bunch of Delta blog posts that are getting me even more excited about Delta. Every time I do Delta stuff, it just makes me happier about the product.
A: Awesome. So because you're so happy, Matthew, what are you bringing for the community? I know you are working on some exciting things, so if you want to give a quick sneak peek.
B: Yeah, I'll give a high-level overview of the Delta acceptance testing project. It's something I'm collaborating on with Scott and also a bunch of other members of the community. So basically, at a high level, we want to create some reference tables, and those are going to be Delta Lake tables, and we want the connectors to be able to run integration tests against those reference tables. We have a pretty vast ecosystem of connectors.
B: Like Trino and Presto and Rust and pandas, we want to make sure all those Delta Lake implementations can implement all the core functionality that Delta supports. So we're going to help drive that with the Delta acceptance testing project. I think it'd probably be helpful to have Scott rephrase the same thing in different words.
C: Yeah, the goal here is to make sure that any client implementation that reads and writes Delta tables does so correctly. We want the Delta table to be a single source of truth that looks the same to any client, no matter how you're accessing it or writing to it. So this is just making sure that different clients are able to actually work together, because this is really reflective of how people use Delta. Sometimes you'll have one workload, or one team, that's using one client, and another team that maybe wants to do something a lot more lightweight, so one is using Spark and the other is using pandas or Rust or the Delta Standalone reader, and things like that. So making sure that we have a standardized way to test and validate how these different clients interact is really important to the overall success of the Delta ecosystem.
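To make that cross-client idea concrete, here is a minimal sketch of the kind of check the acceptance tests aim for; the table path and the choice of the deltalake (delta-rs) and delta-spark packages are illustrative assumptions, not the actual DAT harness.

```python
# Read the same reference Delta table with two independent clients and
# assert they surface identical data. Assumes pyspark, delta-spark, and
# deltalake are installed; the table path is a hypothetical placeholder.
import pandas as pd
from deltalake import DeltaTable  # Rust-based client (delta-rs), no JVM
from pyspark.sql import SparkSession

TABLE_PATH = "./reference-tables/basic-append"

# Client 1: Spark with the Delta Lake connector.
spark = (
    SparkSession.builder.appName("dat-check")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)
spark_df = spark.read.format("delta").load(TABLE_PATH).toPandas()

# Client 2: delta-rs reading the very same table.
rust_df = DeltaTable(TABLE_PATH).to_pandas()

# If the table is a single source of truth, row order aside, both
# clients must agree.
cols = sorted(spark_df.columns)
pd.testing.assert_frame_equal(
    spark_df[cols].sort_values(cols).reset_index(drop=True),
    rust_df[cols].sort_values(cols).reset_index(drop=True),
)
```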
A: Does that mean we also provide some kind of base framework for the community? If they want to do any benchmarks or build any connectors, will they be able to use that framework, something like that? Or am I getting it totally wrong?
C: ...the expected, you know, the expected data that's represented the same in your own system as it is in the table format.
A: That's pretty cool. And we also have the benchmark framework, which was released with 2.0. That's great, because it takes away a lot of complex setup from the community. So thank you. Another question is around Flink: I know that we were working on several things with Flink. What are some of the things that are coming up on the roadmap?
C: Yeah, great question. So, for those that don't really know much about Flink: perhaps your version of Delta Lake is the Delta Lake implementation on Spark, which is our most popular repo. But Delta is just a file format, and it should be able to work with any compute engine, not just Spark. Flink is one such compute engine.
C: It's able to provide really, really low latency and was built from the ground up with streaming in mind, whereas Spark was initially geared more towards batch jobs and figured out streaming a little bit later. So the Flink connector is a connector for Delta Lake that supports Flink sources and Flink sinks, and those two were released this year. So we're actively working on it, which is really exciting, and we have our first, you know, companies out in the open source community using it for the first time, and we're getting great feedback from them.
C: So source and sink is what's been developed so far this year. Currently we're working mainly on upgrading the Flink version that we support, just upgrading our versions along the way, and there are some little bumps in the road that we're problem-solving as we go. We're also working on adding SQL and catalog support.
C: So the goal here is for our Flink connector to integrate completely with SQL queries, as well as to provide our own Delta catalog, which is just kind of necessary to solve some of the problems when you're integrating with a metastore. So yeah, those are both actively in development, and there's actually a public design doc in our GitHub issues in the connectors repository, if people want to go in and leave some feedback.
A: So, Scott, one question on that. If I'm using Flink integrated with Delta Lake, is the idea that I can run some SQL queries on Flink to be able to actually query any metadata information from Delta Lake?
C: Any metadata, any actual data, for sure. Okay, and again, a cool thing to highlight here is the fact that all these different compute engines can work together. So one common use case is: you could have the Flink connector appending to an append-only table, and that's just really, really low latency; that could be one of your pipelines. Then you can have a Spark query on the Delta Lake table that's running OPTIMIZE compaction to compact your data for faster reads later on.
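For readers who want to picture the Spark side of that pipeline, here is a minimal sketch; it assumes Delta Lake 2.0+, where OPTIMIZE is available in open source, and the table path is a hypothetical placeholder.

```python
# While a Flink job appends small files at low latency, a periodic Spark
# job can bin-pack them into fewer, larger files for faster reads.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("compaction-job")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

# Compact the table that the Flink connector has been appending to.
spark.sql("OPTIMIZE delta.`/data/events`")
```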
A: Got it, got it, thank you. And in terms of these engines in general, I do see a lot of trends around streaming, and people use different tools. So, Matthew, what are your thoughts on the trends in streaming? What are the popularly used methods there?
B: It's a great question, and I think that in the streaming space there are a lot of different ways you can go with this. Like, let's say you have a Kafka stream and you'd like to get the data into a Delta Lake table.
B: What I understand, Scott, is that you can. And let's say you want to do some transformations to the data in the Kafka stream before you put it in the Delta Lake table: I think that the Flink connector that Scott was just referencing would be one way to do that, and I think another way is that you could read the Kafka stream directly with Spark Structured Streaming, do some transformations, then output it into a Delta Lake table. So I'm actually kind of curious about some of those high-level design decisions, Scott.
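A minimal sketch of the second option Matthew describes, Kafka into Delta via Spark Structured Streaming; the broker address, topic, and paths are hypothetical, and it assumes the spark-sql-kafka package is on the classpath alongside delta-spark.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("kafka-to-delta").getOrCreate()

events = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "events")
    .load()
    # Kafka delivers raw bytes; a simple transformation before landing.
    .selectExpr("CAST(key AS STRING) AS key",
                "CAST(value AS STRING) AS value",
                "timestamp")
    .filter(col("value").isNotNull())
)

(events.writeStream
    .format("delta")
    .option("checkpointLocation", "/checkpoints/events")  # exactly-once bookkeeping
    .start("/data/events"))
```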
C: Yeah, it all depends, on a customer-by-customer or user-to-user basis. It depends what kind of latency you're looking for, or what kind of cost management you want to implement, or what the legacy architecture at your company looks like. You know, Delta and Spark is a very, very robust, battle-tested solution; it's been in the open source community for over four years now, and it's been continuously getting better and better, whereas these new connectors are still being created. You know, we're still working out some kinks along the way, but we still want people using this connector and giving us feedback, right, because they want to make it better. So yeah, that's all I can really say on that.
A: Got it, got it, awesome. There's another question, which is unrelated to this, so feel free to say no: "Should we use DLT or dbt modeling?" So, Abhishek, I'm not really sure what you are trying to build, but... DLT here is Delta Live Tables. Awesome.

Another question: there is very helpful information captured in the operation metrics of Delta Lake. I know that when I used some of the operation metrics, for compliance and things like that, it really helped to understand the user history and all those cool details. Are there any...?
C: We have APIs that expose all this metadata for you; I'm sure you can go on our website and explore the different APIs for getting the table history and for getting operation metrics for a given transaction. Each commit to the Delta log has a commitInfo action under the hood that stores these operation metrics, so you can see the number of target rows deleted, or the number of source rows copied if you're doing a merge, and things like that. All of this is really easily accessible.
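A minimal sketch of pulling those metrics out with the Python API; the table path is a hypothetical placeholder.

```python
from delta.tables import DeltaTable
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("history-demo").getOrCreate()

# Each row of history() is one commit; operationMetrics carries counts
# such as numTargetRowsDeleted or numSourceRowsCopied for a MERGE.
(DeltaTable.forPath(spark, "/data/events")
    .history()
    .select("version", "timestamp", "operation", "operationMetrics")
    .show(truncate=False))
```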
A: Got it, got it. And I remember something that was coming up on the roadmap; I'm not really sure which feature it was, but it was related to some operational metrics. I will be excited to see that feature.
C: Yep. So, for one, if change data feed is enabled on your table and you perform any sort of DML operation, like insert, upsert, merge, delete, etc., we keep special CDC, CDF (change data feed) related operation metrics in that commit as well. Operation metrics are a public API; they're first-class citizens, and they're heavily supported throughout Delta. So when we added a new feature like CDF, we made sure to update the operation metrics accordingly. Does that...?
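And for completeness, a minimal sketch of reading the change data feed itself; it assumes Delta Lake 2.0+ with delta.enableChangeDataFeed set on the table, and the path and starting version are placeholders.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cdf-demo").getOrCreate()

changes = (
    spark.read.format("delta")
    .option("readChangeFeed", "true")
    .option("startingVersion", 0)
    .load("/data/events")
)

# _change_type distinguishes inserts, deletes, and both halves of an update.
changes.select("_change_type", "_commit_version", "_commit_timestamp").show()
```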
A: So, Matthew, maybe I can add to that, if that makes sense. Anytime a user performs an action, Delta stores a record of the adds and deletes, and it also stores what query performed that operation and how many rows changed, how many rows were impacted. So all this information gets collected into operation metrics that a user can query.
A: This allows a user, or, you know, an audit team, to see which user performed what action on a Delta table. For example, say I'm missing a row which was supposed to be GDPR-compliant data for a user, and now that user is requesting, "Do you have my information?" You can show them: hey, we have this record.
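A minimal sketch of that audit angle, filtering the history for data-changing commits; whether fields like userName are populated depends on the writer, and the table path is hypothetical.

```python
from delta.tables import DeltaTable
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("audit-demo").getOrCreate()

# Surface who ran which DML operation, and with what parameters.
(DeltaTable.forPath(spark, "/data/customers")
    .history()
    .filter(col("operation").isin("DELETE", "UPDATE", "MERGE"))
    .select("timestamp", "userName", "operation", "operationParameters")
    .show(truncate=False))
```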
A: Just one thing that we don't capture right now is, I think, the vacuum operation. That's something that I think we are discussing; I'm not really sure, Scott, if we have made any PRs on it, but I'm pretty sure that would be a good enhancement to this.
C: Yeah, for sure. You know, the community can give us feedback; I would love to hear how people want it to work. And Matthew, this seems like a great future blog post for you.
A: All right. I think we are also working on some user-facing blogs and tutorials. So, Matthew, do you want to give us a sneak peek at some of the things that are coming up?
B: Yes, I do. So one blog post I just drafted and will be publishing soon is on how to convert a Parquet table to a Delta Lake table. I think one thing that's really cool is that this is actually an in-place operation. So let's say you have a bunch of Parquet files sitting on disk, and you want to convert them to a Delta Lake table to get all the advantages of the nice features of Delta Lake, but you don't want to do an expensive rewrite of the data. The Delta OSS API exposes this convertToDelta method, and you can kind of just take those Parquet files and convert them to a Delta Lake table. All that entails is making the Delta log directory, and then, all of a sudden, your Parquet data is a Delta Lake table and you can enjoy all the wonderful benefits that Delta provides.
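A minimal sketch of that in-place conversion via the convertToDelta API; the paths and partition column are hypothetical placeholders.

```python
from delta.tables import DeltaTable
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("convert-demo").getOrCreate()

# Writes a _delta_log directory next to the existing Parquet files;
# the data files themselves are not rewritten.
DeltaTable.convertToDelta(spark, "parquet.`/data/legacy-table`")

# A partitioned table needs its partition schema spelled out, e.g.:
# DeltaTable.convertToDelta(spark, "parquet.`/data/legacy-table`", "date DATE")
```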
B: So I will be publishing that post, hopefully today, on the delta.io blog. And in doing that work, I noticed that we hadn't yet exposed an option for users to skip gathering statistics during this conversion, so I opened an open source request, or sorry, an open source issue, for that one, and it's a good first issue. So if somebody wants to tackle that, that's something you could jump on.
A: Awesome, yeah, looking forward to it, Matthew. There are some more questions in our chat. So one of the questions is: "I tried vacuum, but it did not work as we expected, especially for streaming." So, Morgan, if you actually can provide more details; I will paste the link for the Slack channel, because I think I would like to learn more about what exactly your parameters are and what you are trying to run. So hopefully I can help there, or somebody else can, yeah.
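For context on which parameters matter for a question like Morgan's, a minimal vacuum sketch; the path and the 168-hour (7-day) retention are placeholders.

```python
from delta.tables import DeltaTable
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("vacuum-demo").getOrCreate()

# Deletes data files that are no longer referenced by the table and are
# older than the retention window. Retention below 7 days is rejected
# unless spark.databricks.delta.retentionDurationCheck.enabled is false.
DeltaTable.forPath(spark, "/data/events").vacuum(168)
```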
C: Secondly, I'm curious how you're partitioning your data, if at all; that could be kind of a bad smell in the design of your architecture, perhaps, because you could be touching a lot of files with your upserts. And secondly, or thirdly, I wanted to call out a feature that we're working on called deletion vectors, which is designed to really help with these fast writes and to slightly reduce storage costs.
C
In
terms
of
typically,
when
you
do
an
update
now,
you
have
to
completely
rewrite
that
file.
Even
if
you
only
changed
one
row
and
then
it's
just
that
initial
file,
but
with
deletion
vectors,
we
can
make
that
change
of
that
one
grow.
A
very,
very
lightweight
sort
of
metadata
right.
Instead,
which
means
you're
not
fully
duplicating
that
file
and
and
features
like
this
once
they're
once
they're
added
to
Delta
I
think
it
really
help
you
save
some
cost.
There.
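Deletion vectors had no public API at the time of this session, so purely as a conceptual illustration of the idea Scott describes (not Delta's actual implementation):

```python
# Instead of rewriting a whole immutable file to drop one row, the writer
# records deleted row positions in a small sidecar structure, and readers
# apply it as a mask at scan time.
rows = ["alice", "bob", "carol", "dave"]   # contents of one data file
deletion_vector = {1}                      # row position 1 ("bob") deleted

# Read path: skip flagged positions rather than reading a rewritten file.
visible = [row for i, row in enumerate(rows) if i not in deletion_vector]
assert visible == ["alice", "carol", "dave"]
```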
A: Awesome, thank you. There's one more question, and I think that's all we have time for. Gary is asking: is there a difference in performance between using the library directly on Spark and using Databricks?
C: Yeah, so that's a great question, and I would say that any sort of API or feature that's not yet in Delta Lake, we have committed to fully open sourcing and adding eventually. That's what we announced with Delta 2.0, and that's why we're really excited to be adding all these features, like deletion vectors, to Delta Lake.
C
But
then,
when
you
talk
about
this
performance
difference
between
spark
and
databricks,
that's
where
it's
not
really
Delta
lake
anymore
databricks
has
a
proprietary
engine
called
Photon,
and
that
is
much
faster
than
spark
than
Apache
spark
for
for
various
reasons,
but
that's
not
really
related
to
Delta
Delta.
Our
goal
here
is
to
have
the
table
format
and
the
apis
be
fully
compatible.
A: Awesome, yeah. So I pasted the links for our Slack, GitHub, and other Delta Lake channels; feel free to join those and ask as many questions as you like. I know this happens bi-weekly, but you can always reach out any day on Slack or GitHub. So thank you all; we had a wonderful Q&A here. Thank you, Scott and Matthew, for doing your due diligence and answering a bunch of questions.
A: We will have these office hours next on October 6th, so I hope to see you there. Thank you!