From YouTube: 2021-08-31 delta-rs open development meeting
Description
Discuss tombstones and challenges with vacuuming
A
Okay, cool. So, as I explained yesterday on the kdi channel, the problem is that when you've got a list and it's empty, the emptiness of the list is expressed as an offset, but that offset only lives on the list. It doesn't get propagated down to the list's children.
A
So when the list child is a primitive, it's fine, because when we get to the primitive and calculate its definition we'll be able to pick that up. Well, if, for example, your maximum definition level is two: as I explained, if a list is nullable you've actually got three levels. There are three valid definition levels, where zero says that the list is null, one says that the list is empty, and two says that the list has a slot or multiple slots.
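The three levels just described can be sketched as follows. This is a simplified Python illustration with a hypothetical helper name, not the actual arrow/delta-rs code (which is Rust); it assumes a nullable list of required primitives, so the maximum definition level is two.

```python
# Hypothetical sketch: definition levels for a nullable list column
# whose elements are required primitives (max definition level 2).
def list_definition_levels(lists):
    levels = []
    for lst in lists:
        if lst is None:
            levels.append(0)               # level 0: the list is null
        elif len(lst) == 0:
            levels.append(1)               # level 1: the list is empty
        else:
            levels.extend([2] * len(lst))  # level 2: one per occupied slot
    return levels

# A null list, an empty list, and a two-element list:
print(list_definition_levels([None, [], [7, 8]]))  # [0, 1, 2, 2]
```

Note that the null and empty lists each still contribute one level entry even though they contribute no values; that is exactly the information that has to survive when the child is not a primitive.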
So when doing that with a primitive, if you've got a list that has primitives, it's not an issue, because after we calculate the definition of the list we immediately calculate that of the primitive, and that's not an issue. But where you've got a nested type...
A
Well, when we've got another nested type inside the list, that's where it becomes a problem, because you effectively pick up that you've got a zero offset. So you set one value, like an empty slot, to be zero. But then, when you come to calculate, let's say, the struct inside a list, the current way that we're doing it disregards that empty list slot, and then you end up with nothing.
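The failure mode just described can be sketched like this. These are hypothetical helpers in Python, not the delta-rs/arrow code: the child's levels are derived from the parent list's offsets, and a zero-length slot produces no level at all unless null and empty slots are handled explicitly.

```python
# Hypothetical sketch of the bug: deriving a struct child's definition
# levels from the parent list's offsets and validity bitmap.
def child_levels_buggy(offsets, validity):
    # Walks offsets only; a zero-length slot emits no level, so null
    # and empty lists silently disappear from the child column.
    levels = []
    for i in range(len(offsets) - 1):
        levels.extend([2] * (offsets[i + 1] - offsets[i]))
    return levels

def child_levels_fixed(offsets, validity):
    # Null and empty slots still emit one level (0 or 1), so the
    # child's levels stay aligned with the parent list.
    levels = []
    for i in range(len(offsets) - 1):
        n = offsets[i + 1] - offsets[i]
        if not validity[i]:
            levels.append(0)
        elif n == 0:
            levels.append(1)
        else:
            levels.extend([2] * n)
    return levels

# offsets [0, 0, 0, 2] with validity [False, True, True] encodes
# the column [null, [], [a, b]]:
print(child_levels_buggy([0, 0, 0, 2], [False, True, True]))  # [2, 2]
print(child_levels_fixed([0, 0, 0, 2], [False, True, True]))  # [0, 1, 2, 2]
```

The buggy variant loses the two slots belonging to the null and empty lists, which matches the "you end up with nothing" symptom above.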
B
I get the general idea, yeah. So basically it's specifically a problem for lists that have structs as an element type, yeah.
B
What I don't understand, though, is how severe a problem this is to resolve. Is this going to take a while?
A
I don't think so, because I've started working on a solution. Obviously I started with the test case that fails, so instead of writing a whole Arrow record batch to Parquet, I've just reduced it down to the level calculations. I'm probably about 60 percent there with the solution, because I've got it working partially.
A
When you calculate the list, when you go from the list to the struct, it's working correctly now. I'm left with going from the struct to whatever the primitive value might be, or even struct to another list, and making sure that's accurate. So it's just been a matter of not having enough bandwidth. I could probably have finished it over the weekend, but I intend to try to get that done during this week.
B
Okay, excellent, that's good news. In the meantime, you know, we finally realized we're kind of working on some bleeding-edge stuff here and we should expect occasional bugs like this to show up. So we started implementing a dead letter queue in kafka-delta-ingest, and the approach we're taking, basically, as I think you know: we buffer JSON messages in memory, and at certain intervals we write those to Arrow record batches.
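The buffering approach just described can be sketched as follows. This is a hypothetical Python illustration (kafka-delta-ingest itself is Rust and builds real Arrow record batches); the class name, the message-count flush trigger, and the `write_batch` callback are all stand-ins.

```python
# Hypothetical sketch: buffer JSON messages in memory and flush them
# as a "record batch" once a threshold is reached.
class MessageBuffer:
    def __init__(self, flush_every, write_batch):
        self.messages = []
        self.flush_every = flush_every   # flush trigger, in messages
        self.write_batch = write_batch   # stands in for the Arrow write

    def append(self, msg):
        self.messages.append(msg)
        if len(self.messages) >= self.flush_every:
            self.write_batch(list(self.messages))  # one batch
            self.messages.clear()

batches = []
buf = MessageBuffer(2, batches.append)
for m in ({"id": 1}, {"id": 2}, {"id": 3}):
    buf.append(m)
print(batches)  # [[{'id': 1}, {'id': 2}]]  -- third message still buffered
```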
B
You know, to test the validity of it. So the bug only surfaces for us on very rare occasions, when we get a null value for one of these fields, and so the approach we're taking is: don't validate the message up front.
But when we go to write the record batch to Parquet, if that write fails, we back up and basically we're creating a test Parquet buffer.
B
At that point we write each message in the batch to a separate Parquet file buffer for each record, and that way we can sift through the good records and the bad records. Once we have the good ones, we then create a new record batch out of that and write it to Parquet. Given the way we're using Arrow and Parquet to do this...
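The dead-letter fallback just described can be sketched like this. It is a hypothetical Python illustration; the real project does this in Rust against actual Parquet buffers, so a stand-in validating writer plays the role of the Parquet write here.

```python
# Hypothetical sketch of the dead-letter fallback: try the whole batch,
# and on failure retry record-by-record to sift good from bad.
def write_with_dead_letters(batch, write_parquet, dead_letters):
    try:
        write_parquet(batch)
        return
    except ValueError:
        pass  # fall back to per-record writes
    good = []
    for record in batch:
        try:
            write_parquet([record])   # separate test buffer per record
            good.append(record)
        except ValueError:
            dead_letters.append(record)
    if good:
        write_parquet(good)           # re-batch the good records

# Stand-in writer that rejects records with a null id:
def writer(records):
    if any(r["id"] is None for r in records):
        raise ValueError("null id")

dead = []
write_with_dead_letters([{"id": 1}, {"id": None}, {"id": 2}], writer, dead)
print(dead)  # [{'id': None}]
```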
B
I'd like to send you a link to some code in the PR, so that you can just let us know if this seems like a reasonable approach to you. I'll grab that real quick, okay.
B
And this one is interesting because, let's see here.
B
This
this
last
one
is
interesting,
because
what
I'm
doing
is
to
protect
the
clean
buffer,
I'm
storing
off
the
existing
rk
bytes.
First,
I'm
cloning
them
and
then,
if
a,
if
an
error
happens,
then
I'm
replacing
the
existing
par
k
buffer
with
the
copied
bytes
and
a
new
fresh
cursor
re-initialized
to
to
the
good
bites
that
we
that
we
left
off
with
before
we
found
the.
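The buffer-protection trick just described can be sketched as follows. The names are hypothetical and the real code guards an in-memory Parquet cursor in Rust; here `io.BytesIO` stands in for that cursor and a validating callback stands in for the Parquet write check.

```python
import io

# Hypothetical sketch: snapshot the known-good bytes before a write,
# and restore a fresh cursor from them if the write fails.
class GuardedBuffer:
    def __init__(self):
        self.cursor = io.BytesIO()

    def write_guarded(self, payload, validate):
        snapshot = self.cursor.getvalue()  # clone the existing good bytes
        try:
            self.cursor.write(payload)
            validate(self.cursor.getvalue())
        except ValueError:
            # Replace the buffer with a fresh cursor re-initialized to
            # the good bytes we had before the failing write.
            self.cursor = io.BytesIO(snapshot)
            self.cursor.seek(0, io.SEEK_END)
            raise

def reject_bad(data):
    if b"bad" in data:
        raise ValueError("corrupt write")

buf = GuardedBuffer()
buf.write_guarded(b"good", reject_bad)
try:
    buf.write_guarded(b"bad", reject_bad)
except ValueError:
    pass
print(buf.cursor.getvalue())  # b'good'  -- the clean bytes survived
```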
B
So
you,
you
don't
have
to
review
this
right
in
a
second,
but
if
you
get
some
time
today
to
take
a
peek
at
it,
and
just
let
me
know
if
you
see
anything
too,.
B
Misha
had
another
issue
that
popped
up
for
him,
but
it
was,
it
was
related
to
spark
not
aero,
and
it
was
related
to
vacuum,
which
florian
implemented
in
delta
rs,
and
so
we
were
thinking
he
might
have
some
experience
in
case
misha
still
had
any
existing
questions,
but
I
think
where
misha
landed
this
morning,
he
actually
might
not
need
that
assistance
anyway.
So
so
I
think
we're
good
here
for
this
call.
A
Cool, I'll go through this and then we can just continue this in the channel, and then I'll also give you an update on how far I am with fixing the bug.
A
Cool, no worries. And then, have you come across any issues with the map support so far?
A
Now that's great to hear, because I was just a bit worried that maybe I would have introduced some issues or something.
B
Yeah, when we first started deploying it we ran into a couple of bugs, but it turned out they were on our side.
A
Cool, awesome. And then I'm assuming, well, actually yeah, we might have.
A
We might have a similar use case in the near future where we need to use kafka-delta-ingest. One of the teams at work, I think, yeah: Azure has finally enabled, or it's still in preview, but they've finally enabled change data capture on Azure SQL, instead of just the normal Microsoft SQL Server.
A
So I think our current ETL batches are hourly, so we're looking at potentially moving to real time, so it'll be great to introduce the team to kafka-delta-ingest.
B
Gotcha, yeah. So one thing to keep in mind there for Azure: right at the moment, kafka-delta-ingest only has S3 support, so we'll need to get some Azure bits in there before you'll be able to leverage it, most likely.
B
Cool, well, I appreciate the call, sir. I'm gonna stop the live stream and I'll talk to you next time.