From YouTube: Fine-Grained Lineage & Timeline API
Description
Shirshanka Das & Ryan Holstien (Acryl Data) review Fine-Grained Lineage & demo the new Timeline API during February 2022 Town Hall.
Learn more about DataHub: https://datahubproject.io
Join us on Slack: http://slack.datahubproject.io
Follow us on Twitter: https://twitter.com/datahubproject
Shirshanka: This is the most interesting part of my day, because I get to tell you all of the cool stuff that has happened in DataHub and is about to land. First thing is column-level lineage; we're actually calling it fine-grained lineage.
Where are we at? We have committed the model, and it supports both dataset-to-dataset as well as dataset-to-job lineage. We have APIs and documentation for how to add and query this data. So if you have programmatic integrations that you are dying to kick off, go at it; you'll find that documentation exactly where Maggie was talking about.
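As a rough sketch of what a programmatic fine-grained lineage entry looks like, here is a stdlib-only Python mock-up of one column-level mapping. The URNs, dataset names, and field names here are hypothetical, and the exact payload structure may differ by DataHub version, so check the dataset docs page mentioned above for the authoritative examples.

```python
# Hypothetical sketch of one fine-grained (column-level) lineage entry for a
# dataset's upstream lineage. Field names follow the style of the docs'
# examples but are illustrative, not an exact spec.

def field_urn(dataset_urn, field_path):
    # Schema-field URNs wrap a dataset URN plus a column path.
    return f"urn:li:schemaField:({dataset_urn},{field_path})"

# Hypothetical upstream and downstream datasets.
upstream = "urn:li:dataset:(urn:li:dataPlatform:hive,fct_users_created,PROD)"
downstream = "urn:li:dataset:(urn:li:dataPlatform:hive,user_summary,PROD)"

fine_grained_lineage = {
    "upstreamType": "FIELD_SET",        # lineage from a set of upstream columns...
    "upstreams": [field_urn(upstream, "user_id")],
    "downstreamType": "FIELD",          # ...to a single downstream column
    "downstreams": [field_urn(downstream, "user_id")],
}

print(fine_grained_lineage["upstreams"][0])
```

The same shape applies whether the lineage hangs off a dataset or off a job that reads and writes datasets.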
If you go to the doc site and you look on the left rail, drop down to Metadata Modeling > Entities, and drop into Dataset, you'll see a section there on fine-grained lineage. Over there you'll have examples of exactly how you can add this lineage, both to a dataset or to a job that reads or writes to a dataset, and then how to query that lineage information.
So that's just the beginning, obviously, but it's a good checkpoint for us. The next step is going to be actually building out integrations with existing sources and producing this kind of fine-grained lineage from those existing emitters. We're still doing an inventory of which sources are actually good targets to start out with, so please let us know if there are ones that you think are good ones for us to go after.
Next slide. All right, so we've talked about schema version history ad infinitum, and as we started building it, we actually built one version of it and then realized a better way to do it. So we're actually calling this the Timeline API. What do we mean by that?
It's a unified timeline of changes to entities in the metadata graph, and it's computed across all of the individual, fine-grained changes that are happening to entities. So you essentially get a unified timeline of any entity, like a dataset, and you can then filter, as a consumer, on whatever categories you care about. So, for example, we've got this entity here; this looks like a Hive dataset, and we want to see what happened to this dataset from seven days ago.
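"Seven days ago" ends up as an epoch-millisecond start time on the request. Here's a stdlib-only sketch of building such a query; the endpoint path, host, and parameter names are illustrative assumptions (the server's Swagger page, shown later, is the authoritative reference), not taken from the talk.

```python
import time
from urllib.parse import quote, urlencode

# Hypothetical dataset URN; the route and parameter names below are
# illustrative -- check your server's Swagger page for the real ones.
urn = "urn:li:dataset:(urn:li:dataPlatform:hive,fct_users_created,PROD)"

# "Seven days ago" as epoch milliseconds.
start_ms = int((time.time() - 7 * 24 * 3600) * 1000)

query = urlencode({"categories": "TECHNICAL_SCHEMA", "startTime": start_ms})
url = f"http://localhost:8080/openapi/timeline/{quote(urn, safe='')}?{query}"
print(url)
```

From there it's an ordinary GET request; the response is the list of change transactions described below.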
As part of this change, we also finally added an OpenAPI server to the DataHub backend. I don't know how many people love Rest.li, but it's a very hard-to-use REST API.
A
So
we've
we
want
to
move
towards
open
api
as
our
public
facing
rest,
endpoint
and
as
part
of
the
timeline
api
work
we
actually
put
in
the
work
to
add
in
an
open
api
server.
If you want to think about how it's technically built: the metadata model is split up into aspects, and each aspect is very fine-grained. So, for example, we have an aspect called schemaMetadata, we've got another aspect called globalTags, and we've got another aspect called datasetProperties, and there are lots and lots of aspects, and individual changes are happening to each one of these aspects. So you might have a schema changing, and that might be happening over in the schemaMetadata aspect.
You might have tags being added, and sometimes that happens through the schemaMetadata aspect, and sometimes it happens through the globalTags aspect. So there's a lot of complexity in the lower-level model and how changes are happening in there, but what we wanted to provide was actually a semantic way to look at all of these changes in a single way.
So at the top level there are categories like technical schema, tag, and documentation, and these high-level categories actually span across these fine-grained aspect changes, so you can then create a singular, unified timeline across all of these categories. You can almost think about this unified timeline as a projected view that is merged across all of these individual version timelines.
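That "projected view" can be pictured as a k-way merge of per-aspect change streams into one timestamp-ordered timeline. This is a toy sketch of the idea, not DataHub code; the aspect names and event tuples are made up for illustration.

```python
from heapq import merge

# Toy per-aspect change streams, each already sorted by timestamp:
# (timestamp_ms, category, description).
schema_metadata_changes = [
    (100, "TECHNICAL_SCHEMA", "field user_id added"),
    (300, "TECHNICAL_SCHEMA", "field email removed"),
]
global_tags_changes = [
    (200, "TAG", "tag pii added"),
]

# The unified timeline is just the merged view of the individual
# version timelines, ordered by timestamp.
unified = list(merge(schema_metadata_changes, global_tags_changes))
print(unified)
```

A consumer filtering by category would simply drop the streams (or events) it doesn't care about before merging.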
So it's pretty cool. It was a lot of fun working on this; a huge shout-out to Ryan and Surya, who I collaborated with in building this, and it's just the beginning.
I think now that we have this, we're going to be able to build simple things, like schema version history, documentation history, and all sorts of different experiences, on top of this much more general API. So that's kind of the high-level visual of how to think about it.
But how do you get it? Well, it's already there. The CLI was released last night, so you can try it on the quickstart using the latest CLI, 0.8.27.1. The server code is also in, obviously, so it's part of quickstart; it will be released officially as part of 0.8.28, so that's the next release coming up. In terms of caveats, it's supported only for datasets right now, so the entity type has to be a dataset, and it supports these change categories: technical schema, tags, documentation, glossary terms, and ownership.
Ryan: I've got the UI up here, and you'll notice that we've added a couple of links in the dropdown here; specifically, I'm interested in the OpenAPI one. So this is our new Swagger page that Shirshanka mentioned. It has the OpenAPI spec for the timeline endpoint, and here you can explore it just like any other Swagger page.
So if we add a URN and a category of change that we want to search for, then we can get this nice little response: a list of changes that have happened to that particular dataset. As for the schema of the response: basically, we split each change into a transaction-level change that includes a list of change events, and those list out the individual changes that have happened in each category.
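A toy illustration of that nesting, with made-up field names modeled on the description (transaction-level entries, each carrying a list of change events):

```python
# Hypothetical response shape: the key names here are illustrative,
# not the exact OpenAPI spec -- see the Swagger page for the real schema.
response = [
    {
        "timestamp": 1644537600000,   # transaction time, epoch millis
        "semVer": "1.0.0",            # semantic version computed for it
        "changeEvents": [
            {"category": "TECHNICAL_SCHEMA", "operation": "ADD",
             "description": "field user_id added"},
            {"category": "TECHNICAL_SCHEMA", "operation": "ADD",
             "description": "field email added"},
        ],
    },
]

# Flatten transactions into per-event lines.
for txn in response:
    for event in txn["changeEvents"]:
        print(txn["semVer"], event["category"], event["description"])
```

Each transaction groups everything that changed together, while the events inside it give the fine-grained, per-category detail.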
We have this datahub timeline command that we've added. It has two mandatory parameters, URN and category, and you can specify a start and end time and a few other parameters as well. So if we send that in, we get a list of all the changes that have happened to the technical schema, which is linked to the schemaMetadata aspect.
So we'll see, for the first set of changes, that we got all of these fields added, and then we've put in some changes where we modified and removed some of the fields in there. One of the things you'll probably notice is that we have a timestamp of each transaction that happened, and a semantic version that was computed.
These are all relative; basically, we compute this as a result of what changes happened. So if a major version change happened, like a backwards-incompatible change where a field is removed, then we'll bump it to a major version. But if something less significant happens, then it would be a minor or a patch.