From YouTube: 2021-09-17 delta-rs open development meeting
Description
Discussing recent progress made in the project and some production use-cases
A
So we're going live. We're live, hooray! All right, so I'm actually curious to start with Neville. If you wouldn't mind sharing what the empty nested list bug is and how that's going.
B
Cool. The empty nested list bug is a bug that we've encountered, or that Misha and Christian encountered, when writing a list that has a struct, but the struct is empty.
B
So the issue there is that, when calculating the right definition levels to write, if for example you need to have three definition levels, you end up having two definition levels, and then it triggers an assertion error in the parquet code, where you've got, let's say, three values but you're trying to write two definition levels.
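The mismatch described above can be sketched with a simplified model. This is not the arrow/parquet Rust code, just an illustration of the counting that goes wrong: the writer must emit one definition level per value slot, and a null or empty container still consumes a level, so undercounting those slots trips the values-versus-levels assertion.

```python
# Simplified model of Parquet definition levels for an optional
# list<struct> column. Illustrative only, not the delta-rs/parquet code.

def definition_levels(rows):
    """Return one definition level per leaf value (or null/empty slot).

    Levels for an all-optional list<struct{...}> column:
      0 = row is null
      1 = list is empty
      2 = struct element is null
      3 = struct field is present
    """
    levels = []
    for row in rows:
        if row is None:
            levels.append(0)       # null row still emits one level
        elif len(row) == 0:
            levels.append(1)       # empty list emits one level, zero values
        else:
            for item in row:
                levels.append(2 if item is None else 3)
    return levels

rows = [[{"a": 1}, None], None, []]
print(definition_levels(rows))  # [3, 2, 0, 1]
```

If the empty-container branches above were skipped, the level count would no longer match the slot count, which is the shape of the assertion failure being discussed.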
B
So I've been working on it for quite a while, on and off. I've been quite busy with work and other stuff, but I was working on it earlier today, and the test case that Misha created, I got that to pass. But I've got regressions on two other test cases, so I'm just looking at those, and then once I've resolved them I'll be able to submit a pull request.
B
Yeah, and that's me. I haven't been doing anything else, at least nothing arrow-related.
C
The dead-letter queue: so in kafka-delta-ingest we now have an option which enables a dead-letter queue table, which catches those errors and writes the invalid records to another table, so the whole stream will not be aborted just because of a single row. So we've sort of got a workaround for it, just by filtering those records. To be honest, we only hit it for one table, where we had a list of structures, and from what I've seen it's like one or two messages per, you know, several million a day, so not a big deal from the kafka-delta-ingest standpoint.
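The dead-letter pattern just described can be sketched in a few lines: instead of aborting the whole stream on one bad record, invalid records are routed to a separate dead-letter table. The names here are illustrative, not the actual kafka-delta-ingest API.

```python
import json

def process_batch(raw_messages, write_good, write_dead):
    """Route parseable records to the main sink, failures to a dead-letter sink."""
    for raw in raw_messages:
        try:
            record = json.loads(raw)
            write_good(record)
        except ValueError as err:
            # Capture the bad payload and the reason instead of crashing
            # the whole ingest stream.
            write_dead({"payload": raw, "error": str(err)})

good, dead = [], []
process_batch(['{"id": 1}', 'not json', '{"id": 2}'],
              good.append, dead.append)
print(len(good), len(dead))  # 2 1
```

In the real service the two sinks would be delta tables; the point is that one malformed row out of millions lands in the dead-letter table rather than stopping the stream.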
C
Other than that, we went live with 24 streams, with the top message rate at around 300 messages per second. Not that big, but it seems to be working; each one of them runs on a quarter of a CPU, and the top stream consumes about 700 megabytes of RAM, and that's only because the messages in those topics are pretty large, like 3 kilobytes per message. So yeah, that's it.
A
I think, you know, in case Denny watches this later: what that means is that we have production delta writers consuming right now with Rust, which is really cool. We did it.
C
It does not yet work on the arrow 6.0.0 snapshot, which means we have to comment that out of the build, because it has been failing. And we do need this 6.0.0 arrow release, because it has map support, which is essential for checkpoint usage, as the schema has a map structure in it.
C
So until we are on an arrow stable release, especially the datafusion bits, it will be turned off until that point.
D
Yeah, for those who are using the latest master branch, or main branch, of delta-rs: you will have to update to use the latest master branch of arrow-rs as well. So that will be the arrow 6.0.x release. So datafusion will be broken for now, until datafusion catches up to the 6.0 release.
C
Good. Do we have any time estimate for when the 6.0.0 release will land for arrow?
A
Yeah, they haven't even cut a new release of that, so yeah, we would still have to disable the Azure SDK.
A
I think, given all that's changed in the Rust binding, it would be good for us to try to get to a release in the next month or so, because there's some really good stuff there.
D
Checkpoints, well... so technically we can release some of the features. We can release the features that were merged before we merged the arrow 6.0 snapshot branch. So if we want to do that, we can do it. Otherwise we'll have to wait for the main arrow release before we can do another one.
B
Arrow 5.0 was released on the 17th of July. I'd estimate that we've got about three to four more weeks till 6.0 is released.
D
Yeah, so this is not a full list, I might have missed some important ones, but this is what I had in mind. We have Connor, who has been a really active contributor to the project and has been picking up a lot of help-wanted tickets. For example, he added a create-table feature to delta-rs, so you can now create new tables using the Rust binding. Brandon has added GCP support to delta-rs, which is a big milestone.
D
So we now support all major cloud providers. Yuan Zhou, who has been a new and very active contributor, has also been helping with all these help-wanted tickets. He added the batched deletes, which will make vacuum a lot faster for large tables.
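The speed-up from batching deletes can be sketched as follows: group the stale files into fixed-size batches so each storage call removes many objects at once, instead of one round-trip per file. This is illustrative only, not the delta-rs implementation; the batch size of 1000 is an assumption modeled on S3's per-request delete limit.

```python
def batched(keys, batch_size=1000):
    """Yield the keys in fixed-size chunks."""
    for i in range(0, len(keys), batch_size):
        yield keys[i:i + batch_size]

def vacuum(keys, delete_batch):
    """Delete all keys, one storage round-trip per batch; return call count."""
    calls = 0
    for batch in batched(keys):
        delete_batch(batch)
        calls += 1
    return calls

# 2500 stale files -> 3 delete calls instead of 2500.
print(vacuum([f"part-{i}.parquet" for i in range(2500)], lambda b: None))  # 3
```

For a large table with millions of tombstoned files, cutting the number of network round-trips by three orders of magnitude is where the vacuum speed-up comes from.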
D
This idea was suggested by Daniel. He also added a lot of other features; for example, he went through the whole partition serialization codepath and made sure all the types are supported and covered. Florian added a filesystem argument for the Python binding.
D
So when you're loading a delta table, you can pass a filesystem argument, which basically lets you support filesystems that are not supported by pyarrow by default; you can pass any kind of filesystem object to read the objects from any remote store that you want. Florian also added Glue catalog support, so that's also pretty cool, yeah.
D
With that, you don't have to hard-code the table's S3 path, or whatever path you want. Actually it has to be S3, because it's AWS Glue, so it only works with AWS, but you can pass in the table name once you have the Glue integration set up. So that's also pretty cool. Florian also audited all the table fields, or actually all the delta commit action fields.
D
So for the fields that have to be optional, we make sure we are setting them as optional, and for the fields that are not optional, we're not using optional. This has an impact on compatibility with the official reference implementation.
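One way to picture why the optional-field audit matters: an optional action field should be omitted from the serialized JSON entirely when it is unset, rather than written as an explicit null that another reader may choke on. A minimal sketch, with field names borrowed from the Delta commit action shape but not taken from the delta-rs serializer:

```python
import json

# Fields of an action that the protocol treats as optional (illustrative set).
OPTIONAL_FIELDS = {"partitionValues", "size", "tags"}

def serialize_action(action):
    """Serialize an action dict, dropping unset optional fields entirely."""
    cleaned = {k: v for k, v in action.items()
               if not (k in OPTIONAL_FIELDS and v is None)}
    return json.dumps(cleaned, sort_keys=True)

print(serialize_action({"path": "part-0.parquet", "size": None, "tags": None}))
# {"path": "part-0.parquet"}
```

This is the JSON analogue of Rust's `#[serde(skip_serializing_if = "Option::is_none")]`: absent, not null.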
D
We ran into problems where there were fields that were supposed to be optional, and we were not setting them as optional, causing some crashes in the official reader. And I think there is one more thing that Misha did that's really critical, but I forgot.
C
Yeah, maybe that's what we've been doing with Christian.
C
Oh yeah, okay, we've added a config which actually enables a lot more options, such as the retention log policy. I noticed that in Spark, after each checkpoint, they try to remove logs and checkpoints which are older than 30 days. We don't have that, and that would be nice to have.
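The retention behaviour described for Spark can be sketched with a simple cutoff computation: after a checkpoint, delete log and checkpoint files whose modification time is older than the retention window (30 days by default). File listing and deletion are stubbed out here; this is not the actual cleanup code.

```python
from datetime import datetime, timedelta, timezone

def expired_log_files(files, retention=timedelta(days=30), now=None):
    """files: list of (name, modified_datetime) tuples.

    Return the names of files older than the retention window.
    """
    now = now or datetime.now(timezone.utc)
    cutoff = now - retention
    return [name for name, modified in files if modified < cutoff]

now = datetime(2021, 9, 17, tzinfo=timezone.utc)
files = [("00000000000000000001.json", datetime(2021, 8, 1, tzinfo=timezone.utc)),
         ("00000000000000000099.json", datetime(2021, 9, 16, tzinfo=timezone.utc))]
print(expired_log_files(files, now=now))  # ['00000000000000000001.json']
```

A real implementation also has to be careful never to delete log entries newer than the last checkpoint, since readers still need them to reconstruct table state.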
D
Tyler, that would be the "introduce DeltaConfig and tombstones retention policy" one. Yeah, that's the one. So this replicates all the config options from the reference implementation.
C
And then another critical bug with Spark has been fixed by Christian: that's the "fix checkpoint compatibility for remove fields" one. So the issue is that there have been breaking changes from delta 0.7 to delta 1.0, in that they have a dynamic schema.
C
So there's a specific flag, extendedFileMetadata, and we were writing this incorrectly. When Spark reads a delta 1.0 dataset, if extendedFileMetadata is false, then those size, tags, and partitionValues fields should be omitted from the parquet schema, but if it's true, then they should be present, and we were just not writing it right.
C
If that's missing but those fields are in the schema, Spark would fail. So this change introduces a dynamic schema that includes those columns or not, depending on that extendedFileMetadata flag. So we're trying to match the delta 1.0 version.
D
It says you need to provide these extra fields if extendedFileMetadata is set to true, but it didn't say that if it is not set to true, for example if it's false or not set at all, you cannot include any of these fields. Otherwise, you will crash the reference implementation.
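The rule just stated can be sketched as a small builder for the remove action: the extra fields (partitionValues, size, tags) are written only when extendedFileMetadata is true; when it is false or unset they are omitted entirely. Field names follow the Delta transaction log's remove action, but this is a sketch, not the delta-rs checkpoint writer.

```python
def remove_action(path, deletion_timestamp, extended=False, **extra):
    """Build a remove action dict per the extendedFileMetadata rule."""
    action = {"path": path,
              "deletionTimestamp": deletion_timestamp,
              "dataChange": True}
    if extended:
        # Only the extended form carries the file-metadata fields.
        action["extendedFileMetadata"] = True
        action.update({k: extra.get(k)
                       for k in ("partitionValues", "size", "tags")})
    return {"remove": action}

compact = remove_action("part-0.parquet", 1631836800000)
full = remove_action("part-0.parquet", 1631836800000, extended=True,
                     partitionValues={}, size=1024)
print("size" in compact["remove"], full["remove"]["size"])  # False 1024
```

Writing the compact form with the extra fields present (or null) is exactly the shape that crashed the reference reader.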
C
Yeah, that's what I was talking about when we commented out datafusion.
A
Cool. To change topics slightly: one of the things I was speaking with Denny about was scheduling a meetup to talk about kafka-delta-ingest and the road to production.
A
I think it's unfortunately probably going to be too late in the evening for you, Misha, but it'll be right in the middle of the day for Christian, so I think we'll be able to present some of what y'all have done there. I think we're shooting for October 7th, which is a Thursday a couple of weeks away.
D
I think there is also a Delta, oh sorry, not Delta, a Delta Lake office hour that folks might be interested in.
A
Yeah, that's what Denny just announced in the events channel of the Delta users Slack. The Delta Lake community office hours are going to be alternating, I think it's every other week, every two weeks, yeah. So this Thursday, that is September 16th, we'll be starting at 9 a.m. Pacific, and then two weeks from then we will be doing 4 p.m. Pacific, to try to accommodate the global audience. The goal of the office hours is really to have, you know, general Delta Lake discussion. He's going to be bringing some of the committers from Databricks to that. I think it's on my calendar; he's asked me to show up for a couple of those. But yeah, we'll see how it goes.