DataHub Tech Deep Dives, 9 Dec 2021

From YouTube: New Feature! Redshift Table/View Lineage

Description

Tamás Németh (Acryl Data) talks about how DataHub can now automatically extract Table, View, and S3 lineage using Redshift system tables.

This functionality is available as of v0.8.18

A

On to the Redshift lineage. I'm not sure how much you know about Redshift, so I'll try to guide you through the pain points of building out the Redshift lineage. When we talk about Redshift lineage, we are talking about table lineage; one of the lineages we want to get is the table lineage. Unfortunately, with Redshift we are not as lucky as with Snowflake, where we have everything in one place.

A

So here, the only thing you can do is use some nice tricks with Redshift system tables to get this information. That is the approach I went with, and it's implemented now. We support multiple lineage collectors, depending on the need and use case, and there are pros and cons for each of them. One is the stl_scan based one. There are two tables we use for that.

A

There is stl_insert, which contains all of the inserts. If an insert happens on any of the tables, there will be one record for it. So we can use it to see which table somebody inserted or loaded data into. And there is another table.

A

This is stl_scan, which shows whether a table scan happened on a table. If you run a select query and insert the result into a table, then the table you insert into will be in stl_insert, and the tables you scan, the ones you query, should show up in stl_scan. And this is the approach.
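The pairing described above can be sketched in a few lines of Python. This is only an illustration, not DataHub's actual implementation: the rows are hypothetical, shaped like a simplified subset of the stl_insert and stl_scan columns (a query id and a table id), and the table names are made up for the example.

```python
from collections import defaultdict

# Hypothetical rows: a simplified subset of stl_insert / stl_scan columns.
stl_insert_rows = [
    {"query": 101, "tbl": 3},  # query 101 inserted into table 3
]
stl_scan_rows = [
    {"query": 101, "tbl": 1},
    {"query": 101, "tbl": 2},
    {"query": 101, "tbl": 3},  # the target itself may also be scanned
    {"query": 102, "tbl": 1},  # plain select without insert: no lineage
]
# Hypothetical table id to name mapping.
table_names = {1: "demo.sales", 2: "demo.users", 3: "demo.sales_by_city"}


def build_lineage(insert_rows, scan_rows, names):
    """Pair each inserted table with the tables scanned by the same query."""
    scanned_by_query = defaultdict(set)
    for row in scan_rows:
        scanned_by_query[row["query"]].add(row["tbl"])

    lineage = defaultdict(set)
    for row in insert_rows:
        target = row["tbl"]
        for source in scanned_by_query.get(row["query"], ()):
            if source != target:  # ignore self-references
                lineage[names[target]].add(names[source])
    return {target: sorted(sources) for target, sources in lineage.items()}


lineage = build_lineage(stl_insert_rows, stl_scan_rows, table_names)
print(lineage)
```

Joining on the query id is what ties a write to its reads; a query that only scans (like 102 here) produces no edge.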

A

It is what Zinger is using, and I think Typeform as well; they are already using it with DataHub to get lineage. It works pretty well: it's fast, and it's reliable because it uses Redshift system tables. But there are a few problems with it. stl_scan only works with Redshift tables, so if you have Redshift Spectrum tables, which are external tables, those won't show up.

A

So I tried to come up with a solution that can capture that lineage as well. And another thing you should know: if you query a view, or there is a query which touches a view, then the view itself won't show up in stl_scan, but all the dependent tables under the view will show up as dependencies.

A

So there is another approach which you can set up in the Redshift lineage: SQL parsing. We're using a library called sqllineage, which is a Python library. It works pretty well, and this approach works with external tables as well, like Redshift Spectrum. So you might wonder:

A

Okay, then why are we not going that way and just using that one? There are a few problems with it as well. One, it's slow, because it's doing actual SQL parsing. And of course it's not as precise as using the system tables. One known issue I saw: if a table name is a reserved word, say you have a table called date, then it won't show that table, only the alias, if you have an alias on top of it.

A

So basically it just skips it, because the parser thinks it's a reserved word, I guess. So these are the two approaches you can set up. By default we use the stl_scan based one, but you can set up the SQL parsing based one, and there is a third, mixed one, which first runs the SQL parsing based collector, which can capture all of the tables.

A

And if there is any issue with a table name, say the parser failed to get it, then we also run the stl_scan based collector to fix those issues. That can be set up as well. So that is just the table lineage. Another thing is views. There is another system table where you can get all the views and all their dependent objects. That's easy, and it works pretty well. There is only one catch: late-binding views, which you can also create.
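A rough sketch of the mixed idea, in Python. The talk does not spell out the exact merge rule, so the simple union below is an assumption, and the function name, table names, and input shapes are hypothetical.

```python
def merge_lineage(sql_parsed, stl_scanned):
    """Start from SQL-parsed lineage and add what the stl_scan based run found.

    Both arguments are hypothetical {target_table: set_of_upstream_tables} maps.
    Taking the union per target is an assumed merge rule for illustration.
    """
    merged = {target: set(upstreams) for target, upstreams in sql_parsed.items()}
    for target, upstreams in stl_scanned.items():
        merged.setdefault(target, set()).update(upstreams)
    return merged


# The parser saw the external table but skipped the table named "date".
sql_parsed = {"demo.sales_by_city": {"spectrum.events"}}
# The system tables saw "date" but cannot see the external table.
stl_scanned = {"demo.sales_by_city": {"demo.date"}}

merged = merge_lineage(sql_parsed, stl_scanned)
print(merged)
```

Each collector covers the other's blind spot here: parsing catches external tables, while the system tables catch names the parser skipped.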

A

So basically, you can create a view where the schema is only checked when you run an actual query. Those columns and tables won't show up in this system table. The only thing we can do is detect late-binding views and run SQL parsing on the view creation query to get the tables. And there is the last one.
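As an illustration of that fallback: Redshift late-binding views are created with the `WITH NO SCHEMA BINDING` clause, so one simple way to spot them is to look for that clause in the view definition. The sketch below is an assumption about how detection could look, not DataHub's code, and the regex-based table extraction is a deliberately naive stand-in for a real SQL parser. The view and table names are hypothetical.

```python
import re

# Clause that marks a Redshift late-binding view.
_LATE_BINDING = re.compile(r"\bwith\s+no\s+schema\s+binding\b", re.IGNORECASE)
# Naive stand-in for a SQL parser: grab the identifier after FROM or JOIN.
_TABLE_REF = re.compile(r"\b(?:from|join)\s+([\w.]+)", re.IGNORECASE)


def is_late_binding_view(view_ddl: str) -> bool:
    """Return True if the view definition uses WITH NO SCHEMA BINDING."""
    return bool(_LATE_BINDING.search(view_ddl))


def referenced_tables(view_ddl: str) -> list:
    """Very rough extraction of the tables referenced by the view query."""
    return _TABLE_REF.findall(view_ddl)


ddl = (
    "CREATE VIEW demo.v_sales AS "
    "SELECT * FROM demo.sales s JOIN demo.users u ON s.buyerid = u.userid "
    "WITH NO SCHEMA BINDING;"
)

if is_late_binding_view(ddl):
    print(referenced_tables(ddl))
```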

A

The last thing we wanted to support is loading data into Redshift. There's a COPY command, and there is another system table to get this information as well, including the files that were loaded into a table. And this is how the configuration looks now. As you can see here, there is this table lineage mode, where you can set whether you want to use the SQL based one, the mixed one, or the default one, which is the stl_scan based one.

A

If you are using the SQL based parser, there is another issue. If your queries don't specify the schema for the tables, then we have to know somehow what the default schema is.

A

You can set it in the configuration, as you can see here; there is an option for it. But of course, if you change the default schema in your session to something else and then run a query, we won't be able to capture that with SQL parsing. Let's do the demo quickly.
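The default schema fallback can be pictured with a tiny Python helper. The function name and the `public` default are assumptions for illustration; the point is only that an unqualified table name gets the configured schema, while a session-level schema change stays invisible to the parser.

```python
def qualify(table_name: str, default_schema: str = "public") -> str:
    """Prefix an unqualified table name with the configured default schema.

    Hypothetical helper: names that already contain a schema are left as-is.
    """
    if "." in table_name:
        return table_name
    return f"{default_schema}.{table_name}"


print(qualify("sales"))                 # unqualified, gets the default schema
print(qualify("demo.sales"))            # already qualified, unchanged
print(qualify("sales", "analytics"))    # a different configured default
```

If a session switches its schema before running the query, the query text still only says `sales`, so the helper would keep attaching the configured default, which is the limitation mentioned above.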

B

Can I ask some questions?

A

Yeah, sure.

B

This is very near and dear to my heart, because I used to work extensively in Redshift and tried to figure out the lineage of thousands of tables and views. First of all, I'm very excited about it. Second of all, I'm also very excited about it. Anyway, for the stl_scan approach and also the query parsing, do we have an idea of how much history is available by default? Because I know that's flushed, right?

A

Yeah, that's right. The retention, per the Redshift documentation and as far as I can remember, is two to five days, depending on how many queries are running on the Redshift cluster. So if it's a busy cluster, it will be shorter, like two days, but more than a day for sure. So two to five days.

B

Okay, because as far as I know that's different from BigQuery and Snowflake. Not only are they doing a better job of storing table lineage as a system table, but I think they're retaining a longer history, if not all of it. So I think that's something we should definitely call out in the docs, if we haven't already, to help people understand best practices and the limitations, which hopefully they already know since they're working in it.

B

But I think we should explicitly call out that it's not something DataHub has control over, right? It's specific to your instance.

A

Yeah, actually the Redshift recommendation is to take a backup of that on your own if you want longer retention. That's for the query history, and I guess you should do the same for the other tables. Maybe we can support that, if this is how our customers are using it: if they are backing up, they could specify their own backup table or something like that.

B

Yeah, or dumping it somewhere. At the last two companies I was at, we were dumping it into an S3 bucket and storing it indefinitely. So maybe we need to think through how you point it to either S3 or Redshift.

A

Yeah, my only problem is that for that you also have to back up the other tables, and usually people only back up the query history, I guess.

A

The query parser should work for those use cases. But if you want the stl_scan based one, then of course you have to back up those tables as well. That's right.

B

Okay, cool. Thank you.

A

I'm not sure how much of this you can see. For the demo, I will create a schema and a couple of tables. These are the Redshift demo tables. Then I copy data from S3 into these tables, create a table, sales by city, by running this example query, and also create a view on top of that data. Let's see how it goes.

A

Hopefully all of this runs. It seems like it worked. Just to show you I'm not cheating, I'm showing you my DataHub, and it's currently empty, as you can see. Oh, it's zoomed. Okay. So no cheating, as you can see. This is my ingestion job, which will generate the lineage. I'm going with the default lineage generation, and it uses demo 3, because that was the schema where I loaded the data.

A

Let's just run it the normal way.

A

In the background it now runs four or five queries: one for the views, of course, then the stl_scan one, and the others I mentioned before.

A

Yeah, and it loaded a bunch of lineage there. So let's go here and check out sales by city, because I think it's the most interesting one. There should be some lineage. Let's check the lineage graph, and as you can see...

B

Yes, this is so cool.

A

And I think if I expand it, you can see the view as well.

B

This is awesome. Extracting those S3 buckets is so great. It's like a nice cherry on top of basic Redshift lineage.

A

Sure. So that's the lineage. One more thing I put in there, which I discussed with Shirshanka: if the query parser throws an exception, we add a property to that table saying, hey, these are the queries which failed. This is disabled by default. I don't know if the default should be enabled or disabled, but if you enable it, then in the properties you can find the queries which failed during query parsing, if you are using query parsing.

A

I didn't show the query parser just now, but you should believe me that it works as well.

A

I tried out all three: the mixed, the stl_scan based, and the SQL based. I would assume that most of our customers would use the stl_scan based one, and I think that should be the most reliable as well. But if it turns out they have, I don't know, external tables, we can still say: okay, if you want to support external tables, then you should use the SQL parser.