From YouTube: 2021-01-19 delta-rs open development meeting
Description
Tentative agenda
* Neville shares how Arrow writer support is going
* We discuss the low-level transaction API here: https://github.com/delta-io/delta-rs/discussions/67
* Time permitting, review any bugs/features we think merit discussion.
A
This is not strict, but let me go ahead and click the live button on the YouTubes. All right, we're live. So this is our first open delta-rs development meeting. I figure we're going to try to do these every couple of weeks while there's some significant development going on in the delta-rs library or the bindings on top of it.
A
First up on the list — and Neville is traveling, so hopefully his connection stays pretty strong — is for Neville to share how some of the Arrow writer support that we need is going.
B
Okay, thanks. Hi everyone. So I've been working on the Arrow Parquet write support, mainly for nested lists, structs, and combinations thereof. The current writer — well, before I started with this work — only really supported primitive types and very basic lists, so it's been extending all of that work. I've worked on it for the past few months now, because there have been quite a lot of edge cases to cover and deal with — you know, the different semantics between how Arrow represents the data and how Parquet interprets it.
B
As of this past Sunday/Monday morning, I opened a pull request that's really sort of the penultimate work in getting the write support working. The pull request enables calculating the nesting — the definition levels — for arbitrarily nested list types and combinations thereof. It's the one step that I was missing before fully enabling, you know, nested write support. So I'm nearly there; I'm expecting that in the next two weeks we should have write support in the parquet library.
A
The pull request is this one — that's 9240. Let me open it up real quick. This is compute.
B
So it's mainly getting the pull request reviewed. I've added as much detail as I could, but I'm still going to add some more to make it easier for non-Parquet experts to also review it. Just looking at the logic — it does get a bit intricate when it comes to dealing with, you know, lists on top of lists, but it should be something that one could be able to follow just by going through the pull request.
B
If there are any Parquet experts in the chat — or, if we've got some time, go through it — you can probably suggest some things. So what I said in the PR was that, because I had to iterate on a few different implementations to get it right, I haven't done optimizations like bit-packing the masks, etc. It also makes it easier to review.
B
You
know,
instead
of
looking
at
the
number,
seven
and
figuring
that
it's
it's
one,
one
one
or
something
it's
easier
for
you
to
just
see
true
true,
true
kind
of
thing.
So
once
we've
once
you've
gone
through
that,
then
I
can.
I
can
start
optimizing
it,
but
yeah,
I
think,
as
many
hands
as
possible.
Looking
at
the
request,
there's
also
a
few
other
smaller
stuff
that
I'll
be
opening
more
pull
requests
for
during
the
course
of
the
week,
but
some
of
them
relate
to
the
right
side
of
things.
B
One of the challenges I've had is that it's quite difficult to test that you're writing correctly if you don't have read support. So, outside of using Spark — outside of using PySpark and PyArrow to read files — I've been working also on some of the read support, so that I can just make sure that the Parquet implementation is round-tripping correctly.
D
The writer that I'm targeting now is the Arrow writer, which instantiates a Parquet writer internally, and I'd like to just kind of throw that out there for Neville while we're on the call: is this the right thing for me to be targeting, at a high level, to execute Parquet writes? Is that correct?
B
There are still a few things missing, which I suppose will be the work that I'll be doing after the write support. So you'll notice that it's very heavily tied to a file on the file system. We want to make it more generic, so that the underlying Parquet writer that it creates ultimately allows you to write to a bunch of bytes instead of writing to a concrete file. But yes, that's—
D
The
correct
I
feel
like
I
could
work
around
this
at
present,
though,
because
from
what
I've
seen
what
I'm
seeing
in
the
arrow
writer
it
looks
like
it
does,
take
an
iterator
so,
rather
than
going,
I
know
the
documentation
says
I
need
to
go
through
that
value.
Error.
That's
provided
that
wraps
a
buffered
or
a
file,
but
I
feel
like
I
should
be
able
to
do
this
without
actually
having
a
file
right.
D
I
could
create
my
own
iterator
on
an
in
on
an
in-memory
buffer
of
json
values
and
pass
that
instead
of
evaluator
does
that
sound.
A
Cool. Anything else, Neville? Nothing else — thanks. Cool. Well, what I was hoping we would be able to cover next — and maybe, since screen sharing isn't working well for me — QP, would you be open to sharing your screen and just opening up the discussion that we've had on the low-level write API from the delta-rs side?
E
Yeah,
I
can
try
to
see
if
it
works,
screen,
share.
E
I'm going to jump into the discussion. So we have — Tyler opened this discussion here about the write support.
E
So I guess — I don't know if everyone has gone through this yet, but I can do a quick overview of what the proposal is. Basically—
E
Yeah, basically, we don't have any write support in delta-rs right now, and we are at the stage where we should be able to start implementing the write support, including transaction log commits.
E
So the overall, high-level view we have right now is to first implement a set of low-level APIs. If we look at the Delta protocol, which is open here, there's a set of actions defined for the transaction logs. So the idea is to first implement this set of actions — for example: add file, remove file, change metadata, set transaction identifiers, all that stuff.
E
All these low-level APIs — we'll first implement them, and then we can build higher-level abstractions on top of them. With these low-level actions, we should be able to have a fully working end-to-end demo with pure Rust code. So what I suggested is to use the Scala implementation as inspiration, to see what kind of API we can design to work with a transaction. For example, I gave some pretty rough dummy code here, which is basically: with a table object, we can add some other metadata, and then at the end we can do a transaction commit, which will write all the actions and commit a new transaction log to the Delta table. So this is a rough example that I gave, but I'm happy to see what other people think about it.
D
So I feel like what you wrote here makes perfect sense to me as an overall API. But then, as I'm thinking about our incremental deliverables, I'm thinking: starting a new transaction, to begin with, is not going to do anything special. In the long run, it's going to interact with, you know, a transaction coordinator that can arbitrate between multiple simultaneous potential transaction writers — but we're going to ignore that at first, right? Just to get kind of the baseline in place. Yeah.
E
I think for the initial implementation we should just target single-writer support instead of multi-writer, so there will be no coordination. — Gotcha.
A
Okay — with how you sketched this out, QP: I mean, the transaction commit is when we'd actually be doing a write, but with the add file / remove file, those would be paths of some form that have to go in the transaction log, wouldn't they? Those would be path-like.
A
Would they have to be S3 things? Like, if I had an S3 object — would the low-level API be: I pass a URI in here, like an S3 URI?
E
This is up for discussion. I was—
E
It's a path relative to the table root.
D
Should we handle a case where some other party is trying to leverage the same API and chooses to use absolute paths instead of relative paths? Or would that create more problems than it's worth? But—
E
Anyway, I think when we do the transaction commit — that's when we're going to grab the next transaction id and then try to commit; that's where we do the optimistic write. And then, I think for the first iteration, we will just do it with the local file system, without S3, and then, well—
D
Well, I mean, if I understood the previous conversation with Neville, we don't have support yet to provide a writer that writes to S3.
A
My understanding from what you were proposing last week — when we had chatted a little bit about this privately, QP — was that this API would know nothing about Parquet file writes. This would just be taking, like — when you say add file a.parquet, as an example — that a.parquet file would be somewhere already; like, it would have to already be in the—
A
And the way I'm interpreting that is: the discussion about the native cloud storage providers is sort of moot, in that—
D
So basically — I think I hear what you're saying, Tyler — in this case, the paths might be in S3 that we're adding here to the Delta transaction log within this transaction scope, but it actually doesn't matter at all to the transaction log. We're just updating the transaction log wherever it happens to be.
A
Our simple test would be like — you'd have a Delta table on disk somewhere, and you would go put a Parquet file there, which, from the Delta standpoint, doesn't exist until you go through this transaction and actually add it to the transaction log. Right?
B
Let me find one — there's a PR that's semi-abandoned by someone, where they were implementing the ability to write to a memory buffer. Let me find that one, because I was going to use that as a start and then abstract away the file system after that. I'll post it in the chat now.
A
Saif Islam — and we've got Florian, who's joining us. Welcome, Florian! Hello. So, QP, is there anything else on the — or anybody else? I mean, this topic is open for discussion, but anybody else that wanted to discuss the transaction API?
A
Okay — just a time check for everybody: I think I had scheduled this originally for 30 minutes, so we've got 10 minutes left. We don't have to use the 10 minutes; I wanted to leave the last part of the meeting open to open PRs or bugs or things that we wanted to discuss that might merit a synchronous conversation.
E
One thing that I think is worth working on — but I don't think it's checked yet — is: we have these golden test data sets prepared from Databricks, and I don't think we are passing all of them for reads.
E
I tried a bunch of them, and there are low-hanging fruits — it looks like a lot of them are failing for the same problem, and it seems like a pretty quick fix. So I think it's worth spending some time to get us passing all the tests for the golden data set, and making sure that's part of the CI/CD pipeline.
E
I don't, but I think that's something we should.
A
Yeah — for those of you that may not have seen the pull request: we've got this hard-coded Azure storage account, and we've got a hard-coded delta S3 bucket, both in accounts that I control, and I populated both of those with the golden data set from the connectors repo.
A
I
think
it
is
so
they're
not
they're
not
present
on
disk
for
for
local
development,
but
that's
pretty
easy
to
just
do
a
you
know:
a
get
sub
module
if
we
want
to
go
that
route,
but
they
are,
at
least
in
the
storage
buckets
for
testing.
D
Yeah — I want to—
A
So Neville asked about the medium term — keeping the tests running in Tyler's S3 account. Yes: I actually wrote a blog post about this, which I can share later, but the S3 account that is running those has a maximum budget of five dollars a month. If we somehow ever exceed that budget just storing S3 objects for the golden tests, I'll be very, very surprised. Yeah.
A
So, basically, unless Denny has something official for the Delta project as a whole, I'm perfectly happy to keep hosting these. They won't be disappearing anytime soon.
D
Neville, the file you linked here — just doing a quick peek at it, it's a little different from the API I was expecting. I was thinking there would be, like, a trait that would be passable that would do the write. It's fine that it's not! So basically, what I'm reading here is, in—
B
No — next to the chat and the hand, the hand that you raise, there's a screen icon there where you can share your screen. Well, if it's on my side, it should be the same for others.
D
I was expecting to be able to pass in a trait — or, basically, a struct that implements a trait, right — that would do the write. But it looks like the support that's added by this PR is more in the direction of writing directly to an in-memory buffer, which is fine.
D
I think — because I can write to an in-memory buffer and then handle the file write to S3 myself — but I just wanted to make sure that I was on the same page, because where we were going with our conversations, I thought it was more in the realm of passing in a writer instead. If that makes sense?
B
Yeah,
that
makes
sense
I
can
answer
it.
So
there's
there's
a
bunch
there's
a
couple
of
constraints
that
are
quite
challenging
to
solve,
so
one
of
them
is
that
the
the
one
one
of
the
three
conditions
is
that
you
you
have
to
be.
We
have
to
support
track
clone,
which
makes
it
difficult
for
you
to
to
do
it
arbitrarily.
So
this
pr
sort
of
goes
halfway
there,
because
when
you,
when
you
can
sub
supply,
you
know
just
a
buffer,
a
vec
buffer
the
benefit.
B
There
is
that
in
this
pr,
the
the
triclone
constraint
has
actually
been
abstracted
out
to
moved
out,
making
it
easier.
So
what
I
would
practically
do
here
on
this
pr
is
that
I'm
going
to
take
it
as
it
is,
and
then,
instead
of
passing
in
a
vac,
I'm
going
to
then
pass
a
trait
that
has
a
few
fewer
restrictions
than
what
we
currently
have.
So
this
is
sort
of
the
foundation
to
get
to
the
point
where
we
pass
a
struct.
I'm
sorry,
a
trait
sorry,
something
that
implements
the
right
trait.
B
So
yes,
this
is
when
working
with
this
pr,
it's
not
going
to
be
the
end
of
it,
but
I'm
going
to
take
it
further
so
that
we
can
be
able
to
pass
in
anything
effectively
as
long
as
it
can
write
data.
E
So it doesn't have to be a buffer, right? As long as you pass in a thing that implements the Parquet writer trait — which, after this PR gets merged, will only require the Write and Seek traits — it doesn't really matter what this struct is doing underneath. As long as it implements Write and Seek, this struct can be writing to, you know, memory, or S3, or whatever. Yeah.
A
Okay,
one
thing
I
wanted,
maybe
with
oil
and
qp
here
we've
got
this
open:
pull
request
for
tokyo,
bumping
tokyo
to
1.0,
which
qp-
and
I
already
had
some
discussion
around
this
in
in
the
pull
request.
This
is
pull
request
76
on
delta
rs,
just
so
you
know,
but
this
is
dependent
on
an
aero
aero,
pull
request
to
upgrade
to
tokyo,
1.0
and
I'm
curious
qp
from
your
perspective,
like
how
important
is
it
for
you
or
to
you
that
we
get
to
tokyo,
1.0.
A
Okay — the impression that I got from the Arrow pull request, which I'll drop in our Slack channel, is that we might not see this get merged for a while.
B
Yeah, we will have to wait for that PR. Yeah — I haven't followed the discussion around it. Do you know, QP?
E
There's some discussion about how to do version pinning within the Cargo.toml file. I think that was the main discussion that's remaining, but the PR—
A
What was not clear to me, Neville: there's a comment from alamb — I'll link that directly in the Slack — saying that we should merge this, or that it would be great to merge after 3.0 ships. I don't know enough about the Arrow development timelines, but if Tokio 1.0 doesn't get into Arrow for the next few months, then QP's pull request is probably just going to stagnate while we change delta-rs, because we have a lot of development planned in the next few months.
B
I think Andrew was speaking more from — we sort of had a merge freeze, I can call it that, where we weren't merging any pull requests, but that has been resolved now, because I see that there's a bunch of PRs that have been merged. Typically, once the release maintenance branch has been cut, we can still continue with development as usual. So I'll chime in on the discussion and see what the versioning issue is. I mean, we had a — we had a breaking change in the parquet-format crate, where we sort of upgraded from version 2.6 to 2.7 of the format, and it broke a couple of people's code. So I think that was the concern, but we resolved that. So I'll look into why we're being pedantic about pinning the versions.
A
From a release standpoint, the thing that's important to understand — and this is really for everybody that's contributing — is that we can release new versions of the Python binding, and we currently do; this is what QP cut, I think 0.2.1, over the weekend, with the Rust dependencies pinned to SHAs or branches from git. But in order to release the delta-lake crate — the native Rust binding — we can't have any references to git-based dependencies. Everything has to be a real released dependency on crates.io. So if we wanted to be on Tokio 1.0, for example, we could continue releasing the Python wheel and continue developing delta-rs for some of the downstream projects that we have here at Scribd, but nobody would be able to depend on the native Rust binding from any sort of release standpoint — they would have to point to our git for everything, and to me that's—
B
I think that's fine. So, for Tokio 1.0 — the issues that are there, or the questions about our own pinning — we can resolve that probably by the end of this week and then get that merged in. But it means that we'll then have to really wait for version 4.0 of Arrow.
B
Yeah, I think in general there should be an appetite — if anybody wants to use delta-rs — there should be an appetite to pull from git in the short term. Because, at least from what I've seen, a few relatively big projects that are moving a bit quickly are using Arrow and Parquet from git—
B
—instead of the released version. Okay, because, yeah, the problem is really that, with the other language implementations, there's quite a lot of release work that has to be done; it's not as smooth as what we do with Rust. So, as a result of that, there isn't that much appetite to do releases more frequently. But there are some people—
E
Right, yes.
A
I
just
think
that,
like
I,
I
agree
with
your
viewpoint
much
more
now,
qp
of
let's
keep
these
versions
independent,
okay!
Well,
for
me,
I
don't
know
if
anybody
else
has
topics,
but
for
me
I've,
I've
accomplished
everything
that
I
wanted
with
this
meeting.
C
Thanks for the welcome — my first code with Rust was very complicated on my side, because I'm more comfortable with Java, Scala, and Python, but feel free to share with me details regarding my pull request.
A
Well,
welcome
to
the
party
thank
you
for
the
pull
request
and
yeah
everybody
and
have
a
good
day
have
a
good
good
week
or
so.
The
next
time
this
is
scheduled
is
two
weeks
from
now
at
9am,
pacific
time.
Thank
you
all
for
joining
I'm
going
to
go
ahead
and
end
the
stream.