From YouTube: 2021-03-30 delta-rs open development meeting
Description
Tentative agenda:
* Neville to share the 2.6.0 parquet writer updates
* Misha sharing the DynamoDB lock work
* Review rustdoc progress
* Granting committer privileges to Christian
* Surfacing writer stats through arrow/delta-rs
A
All right, welcome everybody to the regular — as in every-other-week — delta-rs open development meeting. For today's tentative agenda we had a whole bunch of topics suggested, so I tried to cut it down a little bit, and if we get through these then we'll go through everything else that we had in the Slack thread. At a high level: I wanted Neville to share the parquet 2.6 writer updates; I wanted Misha to then talk more about the DynamoDB lock work that he's been doing; QP, I wanted to review some of the rustdoc progress to see what we still need to do there; I also wanted to make sure that we discussed granting Christian committer access; and then hopefully we'll have time to talk about surfacing writer stats through arrow — or sorry, through the parquet crate — or up through delta-rs, which is a topic that Christian will lead. But let's go ahead and get started with you, Neville.
B
C
On my side, for the update: I've materially completed the parquet 2.6.0 write support. There's only one item pending in terms of data types that need to be written, and that's the Decimal128 data type. It's dependent on an open PR on the arrow side, which we're still working through, so it's still in a draft state. With that said, it's really just the logical type that's pending; the underlying type is a fixed-length binary type.
C
We can already write that, so it's just semantics — waiting for it to be ticked off, and then we can say we're completely done with that. I've just attached the umbrella Jira in the chat now; I'll also put it on Slack. It covers the final work that I've been doing on the write support — you know, supporting nanosecond timestamps.
C
I think that was the big one for a couple of people. What I've also done in the past week: when trying to write tests, or benchmarks, or even just check what's working and what's not working, it's been very difficult, because you sort of have to write the schema by hand and then generate the data by hand. So I opened the PR, which I mentioned today, to allow us to generate arbitrary random data.
C
You just provide the schema that you want — very useful if you've got deeply nested structures, you know, with a struct, a struct of a list, etc. — and then you supply the number of records you want, and it generates that for you. With that, I'm also creating some benchmarks to look at a couple of to-dos that I'm left with in terms of cleaning up.
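A rough sketch of that idea, assuming the arrow and rand crates; the function name and shape here are illustrative, not the PR's actual API, and only a flat Int32 column is handled (the real generator covers nested structs, lists, and so on):

```rust
use std::sync::Arc;

use arrow::array::Int32Array;
use arrow::datatypes::{DataType, Field, Schema};
use arrow::record_batch::RecordBatch;
use rand::Rng;

// Build a RecordBatch of random values for a given (flat) schema.
fn random_batch(schema: Arc<Schema>, num_records: usize) -> RecordBatch {
    let mut rng = rand::thread_rng();
    let columns = schema
        .fields()
        .iter()
        .map(|field| match field.data_type() {
            DataType::Int32 => {
                let values: Vec<i32> = (0..num_records).map(|_| rng.gen()).collect();
                Arc::new(Int32Array::from(values)) as arrow::array::ArrayRef
            }
            other => unimplemented!("random data for {:?} not sketched here", other),
        })
        .collect();
    RecordBatch::try_new(schema, columns).expect("schema and columns should match")
}

fn main() {
    let schema = Arc::new(Schema::new(vec![Field::new("value", DataType::Int32, false)]));
    let batch = random_batch(schema, 1_000);
    println!("generated {} rows", batch.num_rows());
}
```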
C
You know, the places where I wrote a vector of booleans instead of a bitmap, just so that I could diagnose things easily without having to worry about bit-manipulation logic. There are a few of those things that I need to clean up; I'll round them up and put them in a single Jira so that I can track them. But that's it from my side.
A
Explained perfectly — well done, thank you! It's fantastic that the support is coming in. I think I can speak to Scribd's specific use case in that I'm pretty sure we don't have any 128-bit decimals floating around, but it's great to know that that work is coming in shortly. Yep.
A
All right, the next topic that we had: Misha, if you could go ahead and lead us through it and share the work that you've been doing on the DynamoDB lock support — and I'll drop this link in the Slack channel as well. Yep.
C
So the pull request is still in draft mode; I just need to figure out integration tests and then the creation of the client within the S3 storage. Other than that, the DynamoDB lock workflow is this: it's ported from the Java implementation, meaning it supports all of the major features such as acquiring the lock, updating, and releasing, with one simple addition from my side.
C
There's a specific scenario where the Java implementation is lacking — and I addressed that in this implementation — which is where the worker that holds a lock dies and then more than one worker tries to acquire this lock. In the Java implementation, when they try to acquire a lock there's a lease duration for which they wait — by default it's 20 seconds — and only then is it able to expire the lock. But there's an issue:
C
if the lock has changed, the wait is not extended by another lease duration. So, for example, if you have more than one worker and both of them wait for 20 seconds, only the faster one will get the lock; the other will fail with a timeout. That shouldn't be the case for us.
C
I extended that logic with a check: if the second worker — the one that does not acquire the lock — sees that the actual lock has changed, i.e. that the record version number has changed, it will then extend its wait by another lease duration. That may cause it to wait for more than 20 seconds in total, but if the faster worker that acquired the lock releases it, then we get the normal workflow anyway.
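A minimal sketch of the acquisition behavior Misha describes, with hypothetical type and field names rather than the draft PR's actual API: keep waiting while the record version number keeps changing, and only take over once a full lease passes with no change.

```rust
use std::thread::sleep;
use std::time::{Duration, Instant};

// Hypothetical lock record as read back from DynamoDB; field names are
// illustrative, not the actual schema used by the draft PR.
#[derive(Clone)]
struct LockItem {
    record_version_number: String,
    lease_duration: Duration,
}

// Instead of failing after one lease duration, restart the wait whenever the
// record version number changes; only acquire once a full lease elapses with
// no change (the previous holder is then presumed dead).
fn acquire_lock(mut observed: LockItem, fetch_latest: impl Fn() -> LockItem) -> LockItem {
    let mut deadline = Instant::now() + observed.lease_duration;
    loop {
        let latest = fetch_latest();
        if latest.record_version_number != observed.record_version_number {
            // The lock changed hands (or was refreshed): wait another lease period.
            observed = latest;
            deadline = Instant::now() + observed.lease_duration;
        } else if Instant::now() >= deadline {
            // No change for a full lease: take over the lock.
            return observed;
        }
        sleep(Duration::from_millis(500));
    }
}
```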
C
That's the small addition from me, for our use case, compared to the Java implementation.
A
Can I ask — and I think, QP, you've been collaborating on this a little bit — I see in this draft pull request that there's a new feature, like a new Cargo feature for dynamodb, that we'd be adding. I was hoping that maybe one of you could share the reasoning behind adding another feature flag here.
C
Yeah, I can answer that. We have not yet integrated the DynamoDB lock into the S3 storage, but the idea is that not every use case will need the DynamoDB lock — for example, if it's a simple single-worker setup, or they might not be using AWS S3 or something like that, so they won't need a DynamoDB lock at all.
C
So we hide that behind the feature, and only those that need a locking mechanism for a multi-worker environment, and have access to DynamoDB, will then leverage the DynamoDB lock.
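A small sketch of what that gating could look like; the feature name `dynamodb` and the types are illustrative, not necessarily what the draft PR uses (in Cargo.toml the feature would pull in the AWS/DynamoDB dependencies):

```rust
// Only compile the lock integration when the (illustrative) `dynamodb`
// Cargo feature is enabled, so single-writer users never pull in that code.
#[cfg(feature = "dynamodb")]
mod dynamodb_lock {
    /// Stand-in for the lock client that only exists in dynamodb-enabled builds.
    pub struct DynamoDbLockClient;
}

fn main() {
    #[cfg(feature = "dynamodb")]
    let _client = dynamodb_lock::DynamoDbLockClient;

    #[cfg(not(feature = "dynamodb"))]
    println!("built without the dynamodb lock feature");
}
```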
C
Also, recently I had a meeting with Christian on kafka-delta-ingest, and we were talking about how we can benefit from a delta-rs DynamoDB lock: there's an additional metadata field where we can store data — for example, for kafka-delta-ingest we might store a Kafka offset, and for delta-rs we might store the latest delta version.
A
Sorry, Misha, I may have lost the key piece of context: that's a metadata field in the DynamoDB row? Yes — okay.
C
It's also in the Java implementation, but for the sake of simplicity I wasn't including it in delta-rs. We recently figured that it might be beneficial for us, though: for kafka-delta-ingest to store the latest offsets, and for delta-rs you might use it for the latest version. So instead of relying on optimistic concurrency — where a worker tries to create a new delta version in a loop with an atomic rename — it could just start from the latest version.
B
C
Also, just to add: in this DynamoDB lock I've tried to copy everything — the DynamoDB schema, the structures, and the final code — to be similar to the Java implementation, so that in case it ever becomes an actual part of that library upstream, we will be able to migrate easily. And yeah, that's it for me.
D
A
Okay, I trust y'all will be able to find each other to have that discussion. Next up — QP, rustdocs: how are we doing?
E
Yep, this part is easy. We have a work-in-progress PR open for adding rustdocs across our Rust codebase. Florian has already added docstrings for all the Python bindings, and the goal is that once we have docstrings added to all the public interfaces, we will enable the missing-docs lint rule, so that all future code changes require docstrings on any new public interfaces. That way we can make sure our public interfaces stay documented.
E
This follows the same rule that the arrow crate has set up; we're just doing the same thing here. Anyone is welcome to send pull requests to the docstring branch we currently have.
E
I think we still have about 160 public interfaces that we need to add docs for, but once we get the CI to pass on that branch, we'll merge it into main.
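For reference, a minimal sketch of the lint setup being described, in the spirit of what the arrow crate does (exact placement in delta-rs may differ):

```rust
// In the crate root (lib.rs): fail the build whenever a public item lacks docs.
#![deny(missing_docs)]

//! Crate-level documentation is itself required once the lint is on.

/// Every public item now needs a doc comment like this one,
/// otherwise CI rejects the change.
pub fn example_public_function() {}
```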
D
I plan on adding a PR for another big pass on this later. I'd like to get this in as soon as possible, because we're going to keep growing the surface area of code where rustdoc isn't yet required, right? So the sooner the better — I'll take another pass this evening.
A
What I'd recommend for anybody that's working on this — and this goes for anybody that's watching the video — is to also announce in the Slack channel what area you're working on for adding rustdoc, to make sure that two folks don't spend a lot of time adding the same rustdocs. QP — so this is pull request 156 — I'm assuming once this goes green we merge it, and then that's that, right? Yep.
A
I believe the next thing that I wanted to have us discuss: QP suggested that Christian — who is the big letter C in this video — be granted commit privileges to the delta-rs repository. I think Florian, QP, and myself are committers; I don't know if Misha or Neville are committers at the moment. Let me have a look.
E
A
So, QP, you put this up, so I'm assuming you're in favor? Yep.
A
All right, I'm also cool with this. So, Christian, I'm clicking the buttons now — hooray! Great power, great responsibility, yada yada. Yep — so Christian's been granted write access; welcome aboard, thank you for your contributions, etc. I guess we should go on to the next topic. This is actually the last topic that I had said we definitely should get through, and maybe, Neville and Christian, this is y'all: the stats support bubbling up through delta-rs, something or other.
D
Yep, I'm mostly just going to defer to Neville on this, but basically the decision we're trying to reach here is: for deriving the stats to include in each add action for a Delta transaction, where do we want that code to live? Do we want it to live in the parquet crate, the arrow crate, or in kafka-delta-ingest, running compute kernels from arrow? Neville's done some research on this, so I'll hand it off to you.
C
Cool, thank you. So broadly there are two options. The first option is to compute the stats using the arrow compute functionality, so we've got the min and max kernels — I think we only need to calculate minimum and maximum; the null count is already provided by the record batch, and then there's distinct counts, which I think is the only other one, but it's more optional than the others.
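A small sketch of option one using the arrow crate's aggregate kernels on a single column; the integration point into the writer is not shown here.

```rust
use arrow::array::{Array, Int32Array};
use arrow::compute::{max, min};

fn main() {
    // One column of a record batch, standing in for e.g. an "age" column.
    let ages = Int32Array::from(vec![Some(1), Some(15), None, Some(7)]);

    // Vectorized kernels from arrow compute.
    let min_age = min(&ages);           // Some(1)
    let max_age = max(&ages);           // Some(15)
    let null_count = ages.null_count(); // 1 — already tracked by the array

    println!("min={:?} max={:?} nulls={}", min_age, max_age, null_count);
}
```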
C
The second option is to surface the parquet statistics for the file that has just been written — because we compute the stats as part of writing the file — and then use those statistics as part of the metadata to populate the Delta statistics.
C
You know, we've got 32-bit int, 64-bit int, float, double, and then we've got byte array and fixed-length byte array. With the byte arrays, we don't actually have any methods in arrow yet to compute the minimum — you know, if you're looking at an IP address, for example — sorry, not an IP address, a MAC address, as an example — which you'd store as a fixed-length binary byte array, we don't have any functionality to calculate what the minimum and the maximum are, and parquet actually requires those.
C
Now, with the parquet option there's a slight challenge. I've been exploring it in the background — I've been thinking about it, but not actually trying it out. The challenge there is that Christian is writing data to an in-memory buffer and then writing that in-memory buffer to S3. So after you write the file and you close the file, we currently don't have a way of returning the statistics of the file without actually reading the file again.
C
Now, to read the file again: if it were a file on disk it would be fine — you'd instantiate the file reader, just read the metadata, and get the statistics. But with an in-memory file that has just been written with a bunch of data, we actually don't have any functionality to read data from an in-memory source.
C
I added some text on this in the Slack channel just before the talk — so that's option two. On that option, I'm looking at the writer: the close function returns an empty result, but I'm looking at whether we could return the file metadata instead. I can potentially share my screen — let me see if I can do that.
A
D
Yeah — it's mostly for optimization of reads when running queries from Databricks. So Spark SQL queries, or Spark queries that you run from a Databricks notebook against a table that we would write from delta-rs, can leverage these statistics to optimize the query and give better read performance.
C
Yeah, to add on to that: you would have a file that has a bunch of chunks — chunk one, chunk two, etc. — and then, if you've got the statistics for each column — and we also do this in DataFusion — let's say, for example, you're only interested in data where somebody's age is between 18 and 20, and here we know that the maximum is 15.
C
Let's say it's one year to 15 years; then you'd effectively skip this chunk instead of touching it at all. So if we have the — so this is at a parquet — can you confirm you can see my screen?
C
Sorry — this is at a parquet level, so the file would have different chunks. But then, beyond that, what Delta would then do is, for the whole file — if this is one of, let's say, 100 files — we'll then say, okay, we've got 2 to 50 here, and, let's do this quickly, 56 to 90 there — very old people here.
C
What you have with these statistics is that you'd have a minimum of one year and a maximum of 90 for the file. So if you're looking for somebody who's, let's say, 92 years old, then in this case you won't even touch this file at all. So we need those statistics; otherwise, if you don't have them, whatever reader reads what we've written will be forced to go in here and scan to find what it needs. Yeah — great.
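A toy illustration of the file-skipping this enables, with made-up stats matching the example above:

```rust
// With a per-file min/max recorded in the Delta log, a reader can rule files
// out before opening them.
struct FileStats {
    min_age: i32,
    max_age: i32,
}

fn may_contain(stats: &FileStats, lo: i32, hi: i32) -> bool {
    // The query range [lo, hi] can only match rows if it overlaps [min, max].
    stats.max_age >= lo && stats.min_age <= hi
}

fn main() {
    let file = FileStats { min_age: 1, max_age: 90 };
    assert!(may_contain(&file, 18, 20));  // overlap: the file must be read
    assert!(!may_contain(&file, 92, 95)); // no overlap: skipped entirely
    println!("ok");
}
```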
B
C
So this guy here, in this trait — the FileWriter trait, before this change that I'm exploring... I'm just letting the compiler guide me to see what else I need to change; if I get stuck, I get stuck — if I perish, I perish. So this guy here: when you close the file, it returned an empty result. I'm just exploring what happens if we return the actual file metadata, because with the file metadata you'll then have — with the file metadata you'll then be able to see this one.
C
Yes, it's this one — sorry, I keep creating random scratch examples just to try stuff out. With the file metadata, what you'll then be able to do — I'm here. So here's an example of the random data generator thing I was talking about: I generate a large enough record batch — I think here we've got like 20 million records — I write it to the parquet file, and then to get the statistics, after closing it, I need to read back the file. But Christian wouldn't be able to do this currently.
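A rough sketch of the change being explored — having close hand back file metadata instead of an empty result — using illustrative stand-in types rather than the actual parquet crate API:

```rust
// Illustrative stand-ins for per-column statistics and file metadata;
// these are NOT the actual parquet crate types.
struct ColumnStats {
    min: Option<Vec<u8>>,
    max: Option<Vec<u8>>,
    null_count: u64,
}

struct FileMetaData {
    num_rows: i64,
    // One Vec<ColumnStats> per row group.
    row_group_stats: Vec<Vec<ColumnStats>>,
}

trait FileWriter {
    // At the time of this discussion close effectively returns nothing; the
    // exploration is to hand back the metadata so callers writing to an
    // in-memory buffer can build Delta `add` stats without re-reading the file.
    fn close(&mut self) -> Result<FileMetaData, String>;
}
```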
C
E
So when you calculate the stats for a byte array in the parquet crate, are you not using arrow to do that calculation?
C
No, we're not using arrow. It's actually a bit inefficient, because we're doing it on a row-by-row basis — okay, now I want to look for the code — we're doing it on a row-by-row basis. It's one of the things that I want to look at improving, because what happens is, when you write the file, you can provide the stats yourself — that's if you've pre-computed them; it's an option.
C
If you provide nothing, it'll then compute the stats for you on a row-by-row basis. So where we could use arrow is actually to compute the stats for the chunks, or the batch or whatever, and then pass them to the writer so that it doesn't compute the stats itself. But the problem with that is that sometimes there's a bit of a disconnect between the arrow record batch size and the parquet row groups. So you would have — in this instance —
It's!
It's
10
million
records
that
I'm
writing
here,
but
the
because
the
rust
implementation
actually
doesn't
split
the
the
batch
into
you
know
smaller
chunks.
It's
writing
the
whole
thing
in
one.
Go
sorry
I'll
stop
scrolling
in
a
bit
once
I
reach
I'm
going
here,
we
are
so
I've
written
a
481
megabyte
file,
which
is
one
row
group.
So
that's
that's
a
problem
from
a
parallelism
for
a
reader
perspective.
C
So if Spark were to read this file that has been written by Rust, it wouldn't be able to parallelize it, because now you've got only one gigantic row group. So this is related to the other item that I was talking about earlier.
C
But now, going back to the discussion at hand, of whether you could compute the statistics from arrow: the problem there is, if you compute the statistics at the parquet-write level — if you compute the statistics for the whole record batch and the record batch has 10 million records — what happens when parquet writes a chunk of, let's say, a hundred thousand records? You still have to compute the stats for just that hundred thousand. So I'm still exploring that, but hopefully that specifically answers your question, QP.
E
So it sounds like the parquet writer has its own row-group split, regardless of the size of the record batch? Yes. Okay, so I guess the stats that are passed in from outside of the writer aren't really useful in that case.
A
Yes — unless... From a performance standpoint, I kind of don't understand why one would pre-compute the stats at the arrow level, because it's not like you would pre-compute on one node and write on another; you'd basically just be deciding where you want the CPU overhead to be within a single process either way, wouldn't you?
C
Yes and no. If you're able to get the record batch sizes to be equal to the chunk sizes, computing that in arrow is more efficient, because if you're calculating, say, the sum of a column, you're going to vectorize the computation, whereas if you're doing it on a row basis — yeah, okay, you probably only save a few milliseconds, well, depending on how big your data is.
E
So the parquet write stats calculation — it's all row by row; there's no vectorization at all?
C
The Rust library computes row by row. It's actually part of the problem — well, the main performance issue — with the Rust parquet writer and reader. Unlike the C++ implementation, for example, where they rewrote the parquet write and read support from the very low level, we took a more convenient route. Where's that thing... you've got the low-level column
C
writer — you've got the low-level column writer, and this is where the stats are actually computed. So you sort of say the minimum is the minimum of — oh no, no, sorry, this is not it — oh yeah, here it is. So this is where you're computing your stats; we do it row by row, and the reason why we do this is that we use this function of writing the batch —
C
the internal batch — instead of what the C++ implementation did, where they reworked the whole support from scratch. That's why it took them over a year or so to eventually get the write support completed. What we did instead is we're using the low-level column functionality.
C
We materialize the arrow values into, you know, the primitive types, which is a bit inefficient because we're creating an allocation; then we compute the definition and repetition levels — we're creating two more allocations — and then we compute the minimum and maximum if we need them.
C
So in the long term — and this is work I'd like to probably do in, yeah, the next year or so — what we need to do is be able to go from an arrow column without computing the definition and repetition levels separately, and, in the ideal state —
C
we would iterate through the arrow column and compute these things as we go along. For primitive values it's easier, but once you get to lists and structs, where it's deeply nested, it becomes a bit tricky. But this is the sort of direction that we'd like to take, and when we take that direction, even the computing of stats will probably be a bit easier, because we would then be able to say: well, for this —
C
for this column chunk that we're about to write, we only want to write, say, ten thousand of the hundred thousand records in this arrow column. So let's do a zero-copy slice of the arrow column, compute the stats for that quickly, and then pass them here as an option, instead of having to calculate them row by row here.
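A minimal sketch of that slice-then-compute idea using the arrow crate; how the result would be passed to the parquet writer is left out, since that part is still being explored.

```rust
use arrow::array::{Array, Int32Array};
use arrow::compute::{max, min};

fn main() {
    // A full arrow column of 100k values...
    let column = Int32Array::from((0..100_000).collect::<Vec<i32>>());

    // ...and a zero-copy slice covering only the 10k rows that would go into
    // the current parquet column chunk.
    let chunk = column.slice(0, 10_000);
    let chunk = chunk
        .as_any()
        .downcast_ref::<Int32Array>()
        .expect("slice of an Int32Array is still an Int32Array");

    // Vectorized stats for just that slice, which could then be handed to the
    // writer instead of being recomputed row by row.
    println!("min={:?} max={:?}", min(chunk), max(chunk));
}
```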
F
Production — yeah, we'll talk about it in the next meeting, I think, but I'd be glad to share insight on the use cases we've faced using delta-rs in production.