DataHub DataHub Basics, 17 Aug 2022

Previous Meeting Next Meeting

⏯

youtube image

►

From YouTube: DataHub 201: Metadata Enrichment

Description

Maggie Hays (Acryl Data) provides an overview of all of the ways in which you can enrich your metadata within DataHub during the July 2022 Town Hall.

Learn more about DataHub: https://datahubproject.io
Join us on Slack: http://slack.datahubproject.io
Follow us on Twitter: https://twitter.com/datahubproject

A

All right so we're going to move on to our last session um every once in a while, we will do uh kind of a data hub 101, which is kind of more of like the basics of the fundamentals of data hub today. We're actually doing a data hub 201, so a little bit more advanced and we're going to focus on metadata enrichment um and, like shanka said today, is very heavy on it. We're talking a lot about ingestion, we're talking a lot about metadata ingestion or enrichment. Excuse me! So, let's, let's dig on into this.

A

um So let's set the scene really quick! You and your team have successfully ingested a metadata into data hub. We have. uh Your stack is pretty straightforward. You have snowflake for your warehouse, so we've ingest, we've ingested, uh schemas tables usage, profiling. Basically everything that shoshanka just covered from dbt we've ingested our models and dbt test outcomes and looker we've ingested explores, looks and dashboards.

A

So that's great. We have kind the foundation there, um but now it's actually time to push it one step further and start to enrich your metadata. So if we take a look at our um at just a sample page, so this is our pet profiles. Data set. What do we mean by metadata enrichment like what are? What are the ways that you can do it? What are some of the options there?

A

So, first and foremost, when we talk about metadata enrichment, we mean applying kind of a plain language definition, so that could be at the in this example, the data set level itself or even down to the column level.

A

The next piece of it is ownership in data hub. We have a few different concepts of ownership, so it could be anything from a technical steward to a business user or a stakeholder, but really just helping people understand who to go to when you have questions and who's responsible for those entities.

A

The next part of it is tags, so these are more kind of like social, uh social tags or kind of ways to kind of organize or group. Your um your entities together in a less governed way, so a little bit more flexible.

A

There and again you can add tags at the data set level or even down to the column level if you'd like the next one is glossary terms, so these are going to be more of our governed business terms that really describe how the how the data entity or the um the data hub entity excuse me relates back to a business, either measure or definition, etc. Again, those can be defined at the data set level or the com level and then, last but not least, the domain.

A

So the domain is really popular in data mesh implementations and just kind of providing this logical grouping of entities within within an organization.

A

So those are all the ways that you can enrich metadata. But what are the ways that you like practically? Do it right, practically speaking?

A

So, first and foremost, um we talk about this idea of shift left a lot um and we're going to go into detail, but so, first and foremost you uh you can enrich the source and then ingest. You can transform your metadata prior to ingestion.

A

You can do a bulk import using our csv enrichment. You can do ongoing enrichments vr api or you can just do manual enrichment, ad hoc as needed from the hub ui. So let's actually dig into these a little bit when we talk about shift left. What we're saying is if and when you are defining your metadata or enriching that metadata at its source. We want to extract that as much as possible, so for this. In this example, in this you know very simple stack. We have snowflake dbt and looker um in snowflake.

A

One way that you can start to define or enrich this metadata is by providing column, level descriptions or table level descriptions which we will automatically extract um for dbt. There's this idea of meta annotations. So this is where we can start to map ownership. We can start to map tags terms domains it's super flexible.

A

This is also true for our support for protobuf, where you can annotate the schema there and do some mapping that way, and then last but not least in looker. If you are annotating or adding descriptions within your look, ml will automatically extract that. So it could be anything from that field. Level description, adding tags to it, etc.

A

So the idea here is that um and really this is the the preferred and kind of like most um bulletproof way. I'd say to to consistently enrich your metadata, where you really want to do it, where the code is changing and evolving as the um as the ingestion or the creation or transformation of data evolves over time.

A

So this provides us a really flexible way to extract from that code base make it available in data hub going forward.

A

The next part is this idea of a transformer, so um this is something that's supported uh within data hub. The idea is that when you are connecting to a source- so let's say snowflake, maybe there's some logical grouping of ownership by schema, or maybe there's some logical um or maybe there's like a pattern around assigning pii to columns that have the word email in them. The idea here in a transformer is that, when you're ingesting from a singular source, you apply patterns or kind of rely on those patterns to apply either.

A

You know ownership domain domain description. You know kind of all. The things we had talked about and the benefit here is that this is executed every time your ingestion runs. So as there's, let's say a new table added to your schema. If it has that pattern and it matches it, it's going to automatically apply your tag or your glossary term, your owner, etc.

A

um So it really just keeps up with the evolution of the data at source, but- and this is a great- this is a great uh use case if you're, maybe not actually uh managing those descriptions or owners or annotations within the source itself. So it's just kind of that intermediate step before you actually ingest it into datahub.

A

um We've talked about csg csv import a couple of times during town hall. This is really great for, after the fact enrichment, so after you've actually ingested your data into data hub. um Maybe you actually have a you know a google sheet, doc, that's floating around and people are kind of maintaining some idea of like ownership or descriptions there. This makes it really easy to kind of crowd source that information and then bulk ingest. It back into data hub and what's also nice here- is that it can go across sources, so you can do it.

A

You can have. You know, do a single bulk update for snowflake and dbt and look are all at the same time um and it isn't specif, it's not specific to um like you, don't have to be super uh specific about um pattern matching or you know kind of source by source. So you know this is an example here of us enriching resources across looker across snowflake and dbt we're applying glossary terms, we're applying one or many tags. You can also do the same thing apply one or many terms um same thing around owner.

A

You can do one or many and you can provide the description. So all of this is additive um as you ingest it. Maybe as you're collecting this information, you can ingest and add to your your kind of uh metadata model there and, like I said this is just really great for kind of like bootstrapping, your initial um ingest of of uh entities into data hub and then crowdsourcing from uh folks within your within your organization.

A

The next one is um api enrichment, so this is going to be more kind of like ongoing or programmatic enrichment. um This is really great when folks are starting to manage definitions as code. So as that code evolves, you can actually emit or, as those definitions evolve, you can emit those changes back into data hub. um Maybe you have some very predictable kind of deployment processes and you know going from maybe testing and staging them into production.

A

You can just programmatically emit those tags to like, for example, attach a tag that says now it's in production and you don't have to keep track of where the deployment is and then go manually apply it. Instead, you can just apply that as it evolves and it's always going to be up to date there.

A

This is also uh relevant to our kind of like circuit breaker, similar to our circuit breaking.

A

So you know when you're, when your dags evolve as those relationships evolve, you can emit lineage edges and you know really just manage these things as the operational tasks evolve and the pipelines that support them over time and then last but not least, you can always just go into the ui, and you know manage these definitions or manage these um uh this this enrichment manually, um and so this is where you know we provide the ability to edit table level description, calm level, description from the ui.

A

uh You can also add links. So if there may be some external resources that are helpful, you can point out to those you can add your owners. You can add your tags, you can add your terms and add your domain and all of that fun stuff. Of course, this is going to get a lot easier once we have the bulk editing rolled out that john demo'd, um but this is just a great way to uh you know kind of like last, but do a little bit more ad hocker as needed enrichment as you go.

A

So with that in mind, there's no kind of one right way or one perfect way to enrich your metadata uh within within data hub. So we encourage you to um you, know kind of mix and match those different uh ways, depending on your stack, depending on your um your workflows and then really just find what works for you. So you can keep it as up-to-date and um accurate as things go along now. Let's go enrich some metadata folks.