From YouTube: NEW! Lineage Impact Analysis
Description
Gabe Lyons & Dexter Lee (Acryl Data) give a demo of Lineage Impact Analysis - using DataHub to understand the impact of changes on downstream dependencies.
Learn more about DataHub: https://datahubproject.io
Join us on Slack: http://slack.datahubproject.io
Follow us on Twitter: https://twitter.com/datahubproject
A: So, continuing the lineage theme, Dexter and I are going to talk to you about a new feature: Lineage Impact Analysis. I'm really excited to talk about this feature at town hall, especially because I know it has been requested so much by folks in the community, and I'm really happy to be able to finally present it. We've incorporated a lot of that feedback, and I feel like this has been a really collaborative project between us and all sorts of folks in the community. So thank you, and I'm excited to demo it.
So what is Lineage Impact Analysis? Essentially, for a dataset, you can now view a collection of all of its downstreams together in one grouping. Not just that first layer that we were showing you previously, but all the layers, brought together into one collection.
A: Within this collection you can browse, but you can also filter, using all the different types of filters you're used to from search. You can filter by tag, platform, and entity type, and, with the recent update we shipped, by owner. There's also a free-text search box to search across all of the downstreams. And we added a new filter just for the impact analysis section: the level of dependency.
A: So you can say, for example, that you only want to see things that are two or three layers away from the current entity you're looking at. Finally, once you have that collection and you're viewing it as one thing: what are you going to do with it? We've added the ability to download the collection you've filtered as a CSV, so you can take it with you and drive any other action based on it.
A: On the motivation front, I think about two main use cases here: a proactive side and a reactive side. The reactive side is the DataOps use case. Maybe something's gone wrong with your dataset and you need to know: what do I do about it? Who do I alert? Who do I inform? How do I make this right? So you can go in and find who's depending on your dataset.
A: The proactive side is similar: who do you have to discuss a change with, and who might be impacted? If you're going to deprecate a dataset, change a column, do a backfill, or something like that, who do you need to reach out to? And again, thanks to the community for the great conversations that helped us understand how to build this feature and what exactly impact analysis should be.
A: He was super helpful in helping us clarify how we could make this most useful. All right, so now I want to go into a live demo of this feature so you can see it in action. I've made this raw events Kafka dataset. Maybe this is some event stream that we have going on, and all of a sudden I realize the event stream is delayed and I need to know what to do about it.
A: Now I can jump into the Impact Analysis section of the Lineage tab and see not just that first layer of lineage, but all the layers beyond it. So I jump in, we do a little search across the graph, and actually I have to give a shout out to ibru: this is their little animated gif that they contributed.
A
Now
is
our
loading
indicator
for
the
impact
analysis
section,
and
this
after
I
jump
into
that
impact
analysis
view.
I
see
all
the
different
entities,
all
the
different
layers
deep.
So
you
can
see
here
in
the
top
right
corner
of
the
entity.
Pro
is
how
far
away
the
connection
is
from
this
source
data
set.
So,
for
example,
dim
users
is
a
few
steps
away
and
I
can
open
this
filter
panel
and
say
you
know.
A: So that's essentially the end of the demo, but as one final easter egg: we thought this download-as-CSV feature was super cool, and so something we did just for fun was add it to the search page. If I search for raw events here, you'll see that same menu, and I can download any group of search results as a CSV.
A: It also shows up in all the different embedded search elements. So if I go, for example, to my ownership page, you can download a CSV there too, and whatever filters you've applied will be included as well. I'm really excited to hear your feedback on this universal download-as-CSV, so let us know what you think, and of course on impact analysis as well. Yeah, so that's the demo.
A
What's
on
this
slide,
but
it's
also
determined
by
you
folks
in
the
community,
so
please
let
us
know,
try
it
out
and
I
really
look
forward
to
reading
your
feedback
now
I'll
hand
it
over
to
dexter,
and
he
can
talk
about
the
amazing
engineering
and
architecture
that
went
into
this
feature
before
you
switch
over.
Would
you
mind
talking
about
the
api
really
quickly?
Oh.
A
Yeah,
so,
in
addition
to
being
able
to
view
this
through
the
ui,
we
also
have
an
api
that
you
can
get
to
query
and
get
this
information
and
in
the
api,
it's
not
just
those
columns
that
you
see
in
the
csv,
but
you
can
fetch
any
metadata
about
these
entities,
so
you
can
programmatically
say
you
know.
I
want
to
get
all
the
down
streams
across
all
levels
for
this
data
set
and
just
give
me
you
know
you
can
provide
filters
and
search
queries,
just
like
you
can
in
the
ui.
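As a rough sketch of what such a call might look like, here is a query built against DataHub's GraphQL `searchAcrossLineage` API. The query shape, field names, endpoint path, and the example urn are assumptions for illustration and may differ between DataHub versions:

```python
import json

# Hedged sketch: GraphQL document for fetching all downstreams of a dataset.
# Field names follow DataHub's searchAcrossLineage API but may vary by version.
SEARCH_ACROSS_LINEAGE = """
query impact($input: SearchAcrossLineageInput!) {
  searchAcrossLineage(input: $input) {
    total
    searchResults {
      degree
      entity { urn type }
    }
  }
}
"""

def build_impact_request(dataset_urn, query="*", start=0, count=100):
    """Build the JSON body for a downstream impact-analysis query."""
    return {
        "query": SEARCH_ACROSS_LINEAGE,
        "variables": {
            "input": {
                "urn": dataset_urn,
                "direction": "DOWNSTREAM",  # all levels, not just one hop
                "query": query,             # free-text search, as in the UI
                "start": start,
                "count": count,
            }
        },
    }

body = build_impact_request(
    "urn:li:dataset:(urn:li:dataPlatform:kafka,raw_events,PROD)"  # example urn
)
payload = json.dumps(body)
# An HTTP client would POST `payload` to <your-datahub-host>/api/graphql
# with an Authorization header; omitted here.
```

Filters (tag, platform, owner, degree of dependency) would go into the same `input` object, mirroring the filter panel in the UI.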
B: Awesome. All right, so let me go really quickly through how the back end works. To do this, we had to change some fundamentals of how lineage works in the back end. First, let me go over how it was working before. After the no-code change that we did last year, we were able to put Relationship annotations on our models, to note, for example, that the field called dataset in an UpstreamLineage aspect marks a DownstreamOf lineage relationship between these entity types. These relationship annotations are converted into edges in the graph index. Here are some example edges: a logging event is DownstreamOf the sample Hive dataset, the user-creation data job Consumes this dataset, and so on and so forth.
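The example edges above can be sketched as plain (source, edge type, destination) triples. The entity names and the `neighbors` helper are illustrative stand-ins, not DataHub's actual identifiers or code:

```python
# Illustrative edges as they might appear in the graph index,
# abbreviated names instead of full urns for readability.
EDGES = [
    ("logging_events", "DownstreamOf", "SampleHiveDataset"),
    ("user_creation_job", "Consumes", "SampleHiveDataset"),
    ("user_creation_job", "Produces", "logging_events"),
]

def neighbors(edges, node, edge_type, outgoing=True):
    """Follow edges of one type out of (or into) a node."""
    if outgoing:
        return [dst for src, typ, dst in edges if src == node and typ == edge_type]
    return [src for src, typ, dst in edges if dst == node and typ == edge_type]
```

Note that a single stored edge can be read in either direction, which is exactly the subtlety discussed next.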
How is each edge produced? The entity on the left side is the source urn, and the entity on the right side comes from the values inside the aspect. So which entity is the source urn and which is a value in the aspect decides the direction of the edge in the graph index.
Now, the problem is that this is not necessarily the direction of the edge in the lineage graph. If you look above, we go from dataset to dataset, and then dataset to data job, and so on, in the lineage graph on the top right.
B: Oh sorry, that's a typo; it Produces this dataset, and that is the same as the lineage direction on the graph above. So what we had to do before was have the front end whitelist a bunch of these edges and figure out how to query for them on our graph service. Our graph service has no knowledge of how lineage works; it only knows about these edges.
B: So we went through an initial set of iterations (go to the next slide, Gabe) to build this lineage registry, where it's all no-code. Given a relationship annotation, we add a few fields: isLineage: true means the edge will show up in the lineage graph, i.e. it's an actual lineage edge, along with some other metadata about the relationship. With that, we can build a lineage registry that says: given an entity type, what are its upstream entities?
B: How do we query for them in the graph database? And what are the downstream edges, and how do we query for those? For example, if you look at this data job's lineage registry, it says that to get the upstream entities of the data job, you need to look for these types of edges in the graph database. One is Consumes: basically, what the data job consumes is actually an upstream of the data job.
B: As well as DownstreamOf: whatever this data job is DownstreamOf is an upstream of the data job, right? It's a little confusing, but bear with me. And then the same thing for downstreams: here are the edge types to look for in the graph database to find the downstreams.
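That registry idea can be sketched as a mapping from entity type and lineage direction to the graph edges to follow. The class and field names here are illustrative assumptions, not DataHub's actual code:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class EdgeSpec:
    name: str        # edge type in the graph index, e.g. "Consumes"
    outgoing: bool   # True: follow edges out of this entity; False: into it

# Hypothetical lineage registry: which edges mean "upstream" or
# "downstream" for each entity type, regardless of stored edge direction.
LINEAGE_REGISTRY = {
    "dataJob": {
        # What a data job Consumes, or is DownstreamOf, is its upstream.
        "UPSTREAM": [EdgeSpec("Consumes", True), EdgeSpec("DownstreamOf", True)],
        # What a data job Produces, or what is DownstreamOf it, is its downstream.
        "DOWNSTREAM": [EdgeSpec("Produces", True), EdgeSpec("DownstreamOf", False)],
    },
    "dataset": {
        "UPSTREAM": [EdgeSpec("DownstreamOf", True)],
        "DOWNSTREAM": [EdgeSpec("DownstreamOf", False), EdgeSpec("Consumes", False)],
    },
}

def edges_for(entity_type: str, direction: str) -> list[EdgeSpec]:
    """Which edges to query in the graph index for a lineage direction."""
    return LINEAGE_REGISTRY[entity_type][direction]
```

The key point is that the "confusing" direction flip lives entirely in the `outgoing` flag, so callers never need to reason about it.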
B: Now the contract between the front end and the back end is much simpler. The front end just asks for all the upstream and downstream lineage, without knowing anything about edge types or anything like that. The lineage registry in the back end decides which edges need to be fetched from the graph index, and fetches those specific edges. As a side effect, our lineage graph has also improved, since it now surfaces all of these edges.
B: So basically we had to do a simple BFS across the graph index. Given the set of entities from the last hop, fetch all the downstream edges: that's one hop to the next set of entities. We keep track of all the visited nodes and keep hopping until we reach the leaf nodes. We batch all the requests to minimize the number of queries, so we don't query per entity; we query per hop.
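The per-hop batched BFS can be sketched like this. This is an illustrative re-implementation, not DataHub's actual code; a plain dict stands in for the graph index, and the dataset names are made up:

```python
def downstreams_per_hop(graph, start):
    """graph: dict mapping urn -> list of direct-downstream urns.
    Returns dict mapping each reachable urn -> degree (hops from start)."""
    visited = {}         # urn -> degree of dependency
    frontier = {start}
    degree = 0
    while frontier:
        degree += 1
        # One batched "query" per hop: expand the whole frontier at once,
        # rather than issuing one query per entity.
        reached = set()
        for node in frontier:
            reached.update(graph.get(node, []))
        # Track visited nodes so cycles and diamonds terminate.
        frontier = reached - set(visited) - {start}
        for node in frontier:
            visited[node] = degree
    return visited

graph = {
    "raw_events": ["clicks", "views"],
    "clicks": ["dim_users"],
    "views": ["dim_users"],
}
result = downstreams_per_hop(graph, "raw_events")
# result == {"clicks": 1, "views": 1, "dim_users": 2}
```

The recorded degree is exactly the "level of dependency" number shown on each entity in the impact analysis view.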
B
So
perhaps
this
one
request
elastic
search
and
then
we
cache
the
final
set
of
earns
all
the
errors
that
are
impacted
by
this
data
set.
Now,
once
we
have
the
final
set
of
earns,
we
query
the
search
request,
so
we
do
search
across
entities
the
same
way.
We
do
the
search
on
the
main
search
page,
but
with
this
added
urn
filter,
saying
anything
returned
by
the
search
needs
to
be
among
these
urns
that
are
impacted
by
the
data
set.
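That urn filter can be sketched as follows. In DataHub the filter is pushed down into the Elasticsearch query itself; here it is shown as a plain post-filter for clarity, with a stand-in search function and made-up urns:

```python
def search_with_urn_filter(search_fn, query, impacted_urns):
    """Run a regular search, then keep only hits whose urn is in the
    precomputed impacted-urn set from the lineage BFS.
    search_fn(query) -> list of result dicts with a 'urn' key."""
    allowed = set(impacted_urns)
    return [hit for hit in search_fn(query) if hit["urn"] in allowed]

results = search_with_urn_filter(
    lambda q: [{"urn": "u1"}, {"urn": "u2"}, {"urn": "u3"}],  # stand-in search
    "*",
    {"u1", "u3"},  # impacted urns cached from the traversal
)
# results == [{"urn": "u1"}, {"urn": "u3"}]
```

Because this reuses the ordinary search path, every existing facet, filter, and the CSV export work unchanged inside the impact analysis view.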
B: By doing so, we're able to support the embedded search experience you saw in the demo. Once we have the Neo4j implementation of that first part, we'll have this working on Neo4j as well. If you have any questions on this, please ping either Gabe or me; we're open to any suggestions.