From YouTube: Conquer Data Governance with Acryl Data’s Metadata Tests
Description
Maggie Hays & John Joyce (Acryl Data) share a framework for conquering Data Governance with Acryl Data’s Metadata Tests during DataHub's October Town Hall.
Presentation Deck:
https://docs.google.com/presentation/d/1bE_rY9dZCfrDcfRVR4-0G1XTNTaXPnhe9nTikeOUR6Y/edit?usp=sharing
Learn more about DataHub: https://datahubproject.io
Learn more about Acryl Data: https://www.acryldata.io
Join us on Slack: http://slack.datahubproject.io
Follow us on Twitter: https://twitter.com/datahubproject
Maggie Hays: All right, we are going to switch gears. We're going to talk about a feature in Acryl Data's Managed DataHub called Metadata Tests. After that, we're going to have a sneak peek of an upcoming feature called Saved Views, and then we will be talking about performance improvements, so you know what you're in store for. Earlier this month, John and I spoke in Budapest at a conference called the Crunch Data Conference.
What we're going to walk through is a subset of what we presented there; we'll also link to the full talk if you want to see it. But the idea is that we, collectively, as a community and as an industry, are all at the bottom of this Mount Governance, and we don't really know how to tackle it. We talk a lot about how to roll out governance programs.
We talk about what the key components of governance are, but the reality is that a lot of us are starting at the bottom of this huge mountain of governance debt. We have thousands, if not hundreds of thousands, of datasets and pipelines that we need to document, and it can be a really daunting task to figure out how you even start.
So what we're going to walk through are some examples of how you can start to iteratively address data governance, both through a working framework and through automated workflows within Managed DataHub.
When we talk about incremental, automation-driven governance, there are four principles for us to consider. First and foremost, we want to set really clear goals. We then want to narrow the scope, and then, surprise, narrow it a little bit more. We want to drive incremental action and then, as we go, measure progress to make sure we're hitting the goals we set. So what are some examples of this?
I've personally gone through this exercise of rolling out governance initiatives at a handful of companies, many times, and what I've found is that you really need a clear goal of why this is important, why governance is worth tackling. That could be coming from your organization, or from your leadership team.
It could be coming from your data team realizing there are some problems. The idea is to identify the problem and then set a goal for what a positive, targeted outcome looks like. Maybe within your organization there are a lot of questions or concerns around compliance; in that case, you might want to start by ensuring that everything has a security classification on it.
Maybe you're actually hitting a lot of issues with unreliable data, so an ownership initiative is the right path for you. The idea here is just to set a really targeted and concise goal. The next step is to narrow the scope.
If you think about the entirety of your data stack, I guarantee you will not be successful if you seek to add ownership to 200,000 entities in a week or a month. It is not going to happen.
It's just not going to happen, so narrow the scope. Some ways you can start to focus on the data that matters most: maybe you want to target a specific platform; maybe you want to target a specific domain, which can be really popular in companies that have adopted data mesh; or maybe you want to target entities that have high usage, regardless of platform or domain.
The idea is that, now that you have your goal, set a very finite scope for how to start addressing it, and then cut your scope a little bit more so you can start building out this workflow. More often than not, these are going to be initiatives where you're asking other teams to do something: you're asking them to own data, to document data, to classify it. So really what you're doing is testing the waters in the early stages.
How do you roll out these initiatives in a way that's successful? I've been most successful when I've teamed up with highly motivated stakeholders that are attached to that problem. We've set the goal up top; maybe compliance is the issue, so maybe you want to work with your compliance or legal team. The idea is that you should start with a small set of stakeholders that are both invested in the outcome and familiar with what the initiative should seek to do.
The other thing is really setting clear expectations about what you're asking them to do. Asking someone to be an owner of data? Sure, nobody will say no to that, because it's so nebulous and honestly doesn't really mean much. So when you ask someone to be an owner, what are you explicitly asking them to do? What are the expectations? It should be very clear what they're expected to do. And the last part is to create rapid feedback loops.
If you're having a hard time thinking through which stakeholders you might want to target, my advice is to start by making it very obvious: work with stakeholders that are really aligned with why this is worth doing. Make it easy: help them understand exactly what you're asking of them. But also make it collaborative, so don't go in and say, "Here's exactly what I'm looking for you to do, and here's how you should do it."
They already understand why they need to do it; they already understand what you're asking them to do; so don't over-prescribe how they do it. I've personally walked into a room with Google Sheets and asked engineers to document tens of thousands of data columns, and they literally laughed me out of the room. I will never forget that moment; it was horrible. So please do not over-prescribe the how.
Last but not least, you want to measure your progress. That goal you set up top: are you actually hitting it? Is it having the impact you're looking for? Do you still have the organizational support you need? Also, don't get too married to those goals. Circumstances change, and you're going to learn more throughout the process.
Just because you set a goal up top doesn't mean it's necessarily the right goal indefinitely, so walk into it with some flexibility. And when you are able to measure progress, automate that as quickly as possible. That way you can track things systematically and keep an eye on how things are progressing.
So those are steps one through four. Lather, rinse, repeat: iterate through a different subset of data, a different set of stakeholders. I realize I've walked through this incredibly quickly.
John Joyce: Yeah, thanks Maggie. What we're going to do now is look at a practical application of the steps that Maggie just introduced.
The first step: we're going to set clear goals. So imagine we have a company; we'll use the example of Long Tail Companions, which we usually use. Let's imagine that we've set clear governance goals at Long Tail: every dataset must have at least one owner assigned; it must have well-structured semantic documentation that describes the purpose of the data; and finally, it must have a classification, maybe labeling from a centrally managed taxonomy. Step two: we're going to narrow the scope.
Here we're going to look at Acryl DataHub's Metadata Tests feature, which basically allows you to first select a subset of the assets inside your ecosystem and then do something with them. So we're going to build a selection criteria that finds all of the Snowflake tables that are in the top 25 percent of most used and also have a significant unique-user count.
So we're basically taking the scope and criteria we've outlined and remodeling it in DataHub, using DataHub to help us find those data assets. We're going to look for those which have a query count percentile greater than 75, meaning the top 25 percent of most used. And finally, we're going to use another metric that DataHub will automatically surface for us, the unique users in the last 30 days, and we're going to say that must be greater than one.
Once we've defined this criteria, the next thing we're going to do is organize and track these assets. We're going to talk about rules in the next section, so we'll skip that for now and move right on to the actions piece. What we're going to do is automate the process of adding a glossary term to all of those data assets, and we're going to say that all of the data assets that were identified are in Tier 1.
So we're going to add what we call an action, which means that any asset that matches the criteria is automatically given a Tier 1 label, and any that falls out of the criteria is removed from Tier 1. We can use this to enrich our metadata in real time. So we're going to create a test here, and after some time you'll see that the test begins to run across your entire data catalog.
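As mentioned at the end of the talk, tests are represented in YAML or JSON under the hood. A test like the one just described (select highly used Snowflake tables, then apply a Tier 1 action) might be sketched roughly like this. This is a hypothetical illustration only: the field names, operators, and action types here are assumptions, since the published specification is not shown in this talk.

```yaml
# Hypothetical metadata test sketch -- not the published format.
name: tier-1-tagging
description: Tag highly used Snowflake tables as Tier 1
on:
  types: [dataset]
  conditions:
    - property: platform
      operator: equals
      value: snowflake
    - property: usage.queryCountPercentileLast30Days   # top 25% of most used
      operator: greater_than
      value: 75
    - property: usage.uniqueUserCountLast30Days        # significant unique users
      operator: greater_than
      value: 1
actions:
  passing:
    - type: add_glossary_term      # matching assets get the Tier 1 label
      values: ["Tier 1"]
  failing:
    - type: remove_glossary_term   # assets falling out of scope lose it
      values: ["Tier 1"]
```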
So once we've identified the scope of the assets that need to be governed for our initiative, we're going to move on to driving incremental action. This is the third step that Maggie talked about. Once we know the assets, we're going to identify the experts for the data in that scope, the people that should own those 91 assets. You can do this in two ways.
One is you can look at actual Snowflake access history manually, or you can use something like DataHub, which surfaces that for you. Or you can use tribal knowledge: you can go and do the manual work of finding those people, but I don't recommend that.
Then finally, you're going to have to actually do the work to get ownership, documentation, and classification on those 91 assets. There's no way around that: you have to have human intervention in this process. And then, finally, what we can do with DataHub is measure our progress against our governance goals. What we're going to view here is defining yet another metadata test that will allow us to track the data assets that meet our governance criteria.
The first thing we're going to do is again define a selection criteria. In this case we're going to identify all of the data assets that are tagged with Tier 1; you'll remember we grouped everything into Tier 1 first, and now we're selecting all of those things. In this case we're going to add some rules, which are basically conditions that all Tier 1 datasets must match. We're going to say it has to have a description.
It also has to have a glossary term from the Classification term group, and finally it has to have at least one owner. So we're just modeling our definition of success in governance inside DataHub. Now we can use DataHub to try it out on some sample data and see which of the assets in scope are actually matching; you can see one failed and one passed. In this case we'll just skip the actions, because really we just want to monitor compliance.
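A monitoring test like this (select Tier 1 assets, then assert governance rules against them) might be sketched as follows. Again, every field name and operator here is a hypothetical assumption for illustration, not the published format.

```yaml
# Hypothetical monitoring test sketch -- not the published format.
name: tier-1-governance-check
description: Tier 1 datasets must be owned, documented, and classified
on:
  types: [dataset]
  conditions:
    - property: glossaryTerms      # scope: everything we grouped into Tier 1
      operator: contains
      value: "Tier 1"
rules:                             # conditions every selected asset must pass
  - property: description          # has well-structured documentation
    operator: exists
  - property: glossaryTerms        # has a term from the Classification group
    operator: contains_any
    value: [Classification]
  - property: owners               # has at least one owner assigned
    operator: exists
```

No actions are attached here, matching the demo: the test only reports pass/fail so remaining work can be monitored.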
We want to know how many assets are compliant versus non-compliant, so that we know what's remaining. And you can see here that, over time, you'll be able to track those assets. You can actually say, "here are all the things that do not meet my governance standards," and watch that number tick down as you run your governance initiative. I think this is really useful when you're trying to iteratively track your progress and actually report on that progress to external stakeholders in governance.
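That reporting step can also be scripted. As a minimal sketch, assuming you can export per-asset pass/fail test results from the catalog as simple records (the record shape and URNs below are invented for illustration), you could compute a compliance rate like this:

```python
# Minimal sketch of tracking governance compliance from exported
# test results. The record shape here is a hypothetical assumption.
from collections import Counter

def compliance_summary(results):
    """Count passing vs. failing assets and compute a compliance rate."""
    counts = Counter(r["status"] for r in results)
    total = sum(counts.values())
    rate = counts.get("pass", 0) / total if total else 0.0
    return {"pass": counts.get("pass", 0),
            "fail": counts.get("fail", 0),
            "rate": rate}

# Example export: two compliant assets, two still missing metadata.
results = [
    {"urn": "urn:li:dataset:pet_profiles", "status": "pass"},
    {"urn": "urn:li:dataset:adoptions",    "status": "fail"},
    {"urn": "urn:li:dataset:payments",     "status": "pass"},
    {"urn": "urn:li:dataset:clickstream",  "status": "fail"},
]

print(compliance_summary(results))  # {'pass': 2, 'fail': 2, 'rate': 0.5}
```

Running this on each export lets you watch the failing count tick down as the initiative progresses, echoing Maggie's advice to automate measurement as quickly as possible.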
I think that's pretty much it for the demo. Finally, I'll just leave you with what is going to be available in open source. What we just saw was a demo of Acryl DataHub, which provides the entire experience: defining a test, running the test against your entire catalog, and reporting the results. What will be available in open source is a couple of things. First is the specification of the test format.
Under the hood, all tests are going to be represented in YAML or JSON, and we're going to publish that format. The second is the model itself in GMS for metadata tests and metadata test results, so that you could presumably ingest metadata tests, and their results, into DataHub. And then finally, UI support for actually rendering those test results that you saw at the end on the entity page. All right, and with that, I think we can conclude this one.