From YouTube: dbt Integration Improvements
Description
Shirshanka Das and Gabe Lyons (Acryl Data) share recent improvements to the dbt + DataHub integration.
Learn more about DataHub: https://datahubproject.io
Join us on Slack: http://slack.datahubproject.io
Follow us on Twitter: https://twitter.com/datahubproject
Gabe: So Shirshanka and I are very excited to talk about some improvements that we've made to the dbt integration with DataHub. Just to set the context: dbt, as many of you folks know, is the tool that allows you to write transformations on your data inside of a data warehouse. However, dbt is not the one storing the data itself; you'll be running dbt on something like Snowflake, BigQuery, Postgres, et cetera, where the data is stored.
Gabe: Now, the downside of doing both of these ingestions historically has been that we've created independent dataset entities for the dbt nodes and the warehouse nodes. This comes up in a couple of places. In search, you would see duplicate results: the data warehouse table and the dbt table both coming up. You'd also see a more expanded lineage graph, with lineage edges between the data warehouse nodes and the dbt nodes themselves. And finally, you'd see independent entity pages: the dbt nodes and warehouse nodes would each have their own entity page, with some of the metadata on the dbt page and some of the metadata on the data warehouse page.
Gabe: We listened to your feedback and decided to make an improvement that really addresses these concerns. This is our solution: you can see here the dbt node on the left and the warehouse node on the right, and we thought, if you view these as the same thing, let's just bring them together. And you can see that once we brought them together, they're just as happy as these two primates.
Gabe: So after this we have merged entity pages, merged search results, and merged lineage, and we really get that joy you saw in those two primates. Of course, it would not be a town hall without a live demo, so let's cut to that and I can show you what this looks like in action. Going to my local DataHub, I've loaded in the jaffle_shop dbt project, run on BigQuery, and I can show you how exploring this metadata looks in this merged world.
Gabe: So if I type "customers" into search, instead of seeing distinct results, I'll see combined results. I can see here the merged dbt and BigQuery nodes for customers, customer source, et cetera; instead of duplicate search results, I'm just seeing one for each. In addition, I want to call out that we still preserve the filters you had before: if you just want to search for dbt nodes, or just for BigQuery nodes, you'll still be able to discover these entities with either filter.
Gabe: In addition, going to the entity page, you can see that it shows metadata merged from both the dbt node and the BigQuery node, so you get the best of both worlds. Here in the schema, you can see schema descriptions that are coming from dbt and also usage information that's coming from BigQuery. Similarly, going through the various tabs, we're able to pull in merged metadata from both BigQuery and dbt: I'm getting view definitions and properties from dbt, but also things like queries and stats that are coming from BigQuery. Finally, jumping into lineage, you get to see the merged lineage UI in action: when I look at the lineage between various entities, I no longer see duplicate nodes.
Gabe: Instead, I see merged nodes. One thing I want to call out, as you can see here: this is an ephemeral dbt node, and since it doesn't actually have any backing in BigQuery, you're not going to see a BigQuery counterpart, simply because there is no equivalent BigQuery table. So the system is able to understand that some nodes are persisted to both dbt and BigQuery, while others only exist in the dbt world.
Gabe: I just wanted to briefly talk about how this works under the hood. We have these various pages, search, the entity view, and lineage, and they all present this concept of a merged entity to you. To make this happen, we created a new aspect on entities called siblings, which lets an entity declare that it has another sibling entity that exists inside of our metadata graph. It also lets us annotate which sibling is the primary, for breaking ties between metadata, and which is the secondary.
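To make that concrete, here is a minimal sketch of what writing that aspect could look like with the DataHub Python emitter. The URNs, dataset names, and server address are illustrative placeholders, and the actual dbt ingestion source wires this up for you, so treat this as a sketch of the mechanism rather than the integration's own code.

```python
from datahub.emitter.mce_builder import make_dataset_urn
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import SiblingsClass

# Hypothetical URNs for a dbt model and the BigQuery table it materializes.
dbt_urn = make_dataset_urn(platform="dbt", name="jaffle_shop.customers", env="PROD")
bq_urn = make_dataset_urn(
    platform="bigquery", name="my-project.jaffle_shop.customers", env="PROD"
)

emitter = DatahubRestEmitter(gms_server="http://localhost:8080")

# Each side declares the other as a sibling; the dbt node is marked
# primary, so its metadata wins when the two sides disagree.
emitter.emit(
    MetadataChangeProposalWrapper(
        entityUrn=dbt_urn,
        aspect=SiblingsClass(primary=True, siblings=[bq_urn]),
    )
)
emitter.emit(
    MetadataChangeProposalWrapper(
        entityUrn=bq_urn,
        aspect=SiblingsClass(primary=False, siblings=[dbt_urn]),
    )
)
```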
Gabe: In terms of next steps, there are a few things we want to keep improving when it comes to merging the dbt and warehouse nodes. One is that we haven't completed the visual merging: there are a few places where you'll still see them as distinct objects, namely in autocomplete, and the browse cards are also still showing individual objects, so you'll still see the separate dbt nodes in browse.
Gabe: This is something we plan on adding shortly. And then finally, we also want to explore how we can use the siblings metadata pattern for other types of relationships. As I showed you on the previous slide, we have this concept of associating entities together and then presenting a combined view of them, and this isn't necessarily a dbt-specific concept; we see it being valuable in other areas as well.
Gabe: For example, you may have ingested multiple datasets or remote entities for something that you still view as a single concept. Think of sharded datasets, or datasets where one reference was brought in from one source and another reference from another source; although the references were slightly different, you may view them as the same thing.
Gabe: So at this point I'm going to hand over to Shirshanka, and he's going to share some other very cool updates on the dbt front.
Shirshanka: So now that we've seen all of the amazing stuff, there's a lot of love in the chat, Gabe, so check it out. And Mark, to your question about whether we can apply the same thing to Kafka and Hive and other kinds of sibling entities: let's chat. We definitely feel like this might be an interesting way to combine views together and things like that.
Shirshanka: So, moving on to dbt tests. I think most of you who are dbt power users are hopefully also familiar with dbt tests, but I'll give you a quick primer. You define tests right alongside your dbt model, but the way we were treating them was just like datasets: when you ingested your dbt catalog, the tests would show up in the catalog, except they would be subtypes of tests. So you'd search for "customer" and all these tests would show up.
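For a sense of where those test entities come from, here is a minimal sketch, assuming a compiled dbt project so the usual target/manifest.json exists, that lists the test nodes dbt records alongside the models:

```python
import json
from pathlib import Path

# dbt writes manifest.json into target/ when you compile or run the project.
manifest = json.loads(Path("target/manifest.json").read_text())

# Test nodes live alongside models in the manifest, tagged by resource_type;
# these are the nodes that were showing up in the catalog as datasets.
for unique_id, node in manifest["nodes"].items():
    if node["resource_type"] == "test":
        print(unique_id)
```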
Shirshanka: So next, this is how you run a dbt test: you just run dbt test, and it runs the tests and tells you what happened. The challenge with a lot of this, obviously, is that the output only stays there; it's very hard to know where all of it went, to keep a record of every single test you've run against your warehouse, and to do something actionable with it.
Shirshanka: So here's what we're doing now. The nice thing about dbt test is that once you run it, it actually generates results in the target directory, in the run_results.json file. So when we built the dbt tests integration, we basically asked: what's the simplest way for people to consume this?
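For reference, run_results.json is plain JSON, so a few lines are enough to see what it holds; a minimal sketch assuming a standard dbt project layout:

```python
import json
from pathlib import Path

# dbt test writes its outcomes to target/run_results.json.
results = json.loads(Path("target/run_results.json").read_text())

for result in results["results"]:
    # Each entry carries the test's unique_id and a status such as
    # "pass", "fail", or "error".
    print(f"{result['unique_id']}: {result['status']}")
```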
Shirshanka: You have the timeline view, so if you click into any of these, you should see when those evaluations were last run and whether they succeeded or failed. On some of them you'll see the view logic node, and if you click on it and scroll up a little bit, you'll be able to see the logic for the node as well. So, for example, in this case there was a SQL statement, and you can basically see the SQL that backs the assertion.
Shirshanka: So the best way to integrate with the DataHub ingestion system is to produce your catalog.json, your manifest.json, and the run_results.json that comes out of the dbt test command. You can either keep them on a local file system or drop them into S3, and then in the dbt recipe you can just point at those S3 artifacts; the recipe will pull them in and publish these results into DataHub. It's a great follow-on step to running your dbt model generation and your dbt tests.
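A minimal sketch of that recipe, written here as a Python dict run through DataHub's programmatic ingestion pipeline. The bucket, paths, and server are placeholders, and the exact config key for test results has varied across DataHub releases, so check the dbt source docs for your version.

```python
from datahub.ingestion.run.pipeline import Pipeline

pipeline = Pipeline.create(
    {
        "source": {
            "type": "dbt",
            "config": {
                # Artifacts produced by dbt; local paths or s3:// URIs both work.
                "manifest_path": "s3://my-bucket/dbt/manifest.json",
                "catalog_path": "s3://my-bucket/dbt/catalog.json",
                "test_results_path": "s3://my-bucket/dbt/run_results.json",
                # The warehouse the dbt models actually run against.
                "target_platform": "bigquery",
            },
        },
        "sink": {
            "type": "datahub-rest",
            "config": {"server": "http://localhost:8080"},
        },
    }
)
pipeline.run()
pipeline.raise_from_status()
```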
Shirshanka: So you can just have a three-step process now: dbt run, then dbt test, and then push metadata to DataHub from the artifacts you just generated, so you'll get them into DataHub pretty much live and instantaneously.
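That three-step flow, sketched as a small script; the recipe filename is hypothetical, and in practice you may well want to publish results even when tests fail.

```python
import subprocess

# 1. Build the models, 2. run the tests, 3. push the generated artifacts
# to DataHub with a recipe like the one sketched above.
subprocess.run(["dbt", "run"], check=True)
# check=False so a failing test still lets us publish its results.
subprocess.run(["dbt", "test"], check=False)
subprocess.run(["datahub", "ingest", "-c", "dbt_recipe.yml"], check=True)
```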