From YouTube: Great Expectations Outcomes in DataHub
Description
John Joyce (Acryl Data) gives a demo of Displaying Data Quality Checks in the DataHub UI during the February Town Hall.
Learn more about DataHub: https://datahubproject.io
Join us on Slack: http://slack.datahubproject.io
Follow us on Twitter: https://twitter.com/datahubproject
Basically, Great Expectations is a way to define assertions, or tests, on particular data assets and then evaluate them repeatedly over time to track the quality of a dataset.
The goal is to maintain the dataset's quality over time as it changes. Here are some examples I pulled from the Great Expectations documentation. You would define these in your Python code: basically an expectation of what you would expect from a dataset, either at the table level or at the column level.
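As a rough sketch of the kind of examples shown (the file, table, and values below are illustrative, not from the talk, and the quick pandas-backed API is used for brevity):

```python
import great_expectations as ge

# Load a batch of data to validate (illustrative CSV path).
df = ge.read_csv("yellow_tripdata_sample_2019-01.csv")

# Table-level expectation: the row count should fall within a range.
df.expect_table_row_count_to_be_between(min_value=1000, max_value=50000)

# Column-level expectation: values should come from a finite set.
df.expect_column_values_to_be_in_set("vendor_id", value_set=[1, 2, 4])
```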
So what we got from the community was a request to display the outcomes of Great Expectations assertion suites in DataHub, for all the perfectionists out there. We had to make one modification to that request. The solution they wanted to see was, as an end user, to be able to see the types of assertions, and specifically the results of running the assertions, associated with a dataset inside of DataHub.
A
Additionally,
with
the
requirement
to
be
able
to
see
the
assertion
runs
over
time
or
over
history,
and
so
now
I'm
just
going
to
jump
right
into
a
demo
where
I'll
talk
about
how
to
actually
configure
the
integration
and
then
I'll
run
some
expectations
and
show
you
what
the
output
looks
like
once.
We've
ingested
that
into
data
hub,
so
I'm
going
to
step
over
to
a
local,
great
expectations,
project
that
I've
got
here.
We've already gone ahead and defined a set of expectations that I'd like to run against this table: basically, just some tests or assertions. A few of the ones we have in here are, you know, expecting the table columns to match a predefined list, expecting the row count to be between a particular minimum and maximum, and expecting a column to always have values that fall into a finite set.
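A minimal sketch of a suite with those three expectations, assuming the v3 API and using illustrative column names and bounds (the transcript doesn't show the actual values):

```python
from great_expectations.core import ExpectationSuite
from great_expectations.core.expectation_configuration import ExpectationConfiguration

suite = ExpectationSuite(expectation_suite_name="taxi_suite")  # illustrative name

# Table columns must match a predefined, ordered list.
suite.add_expectation(ExpectationConfiguration(
    expectation_type="expect_table_columns_to_match_ordered_list",
    kwargs={"column_list": ["vendor_id", "pickup_datetime", "fare_amount"]},
))

# Row count must fall between a minimum and a maximum.
suite.add_expectation(ExpectationConfiguration(
    expectation_type="expect_table_row_count_to_be_between",
    kwargs={"min_value": 1000, "max_value": 50000},
))

# A column must only ever contain values from a finite set.
suite.add_expectation(ExpectationConfiguration(
    expectation_type="expect_column_values_to_be_in_set",
    kwargs={"column": "vendor_id", "value_set": [1, 2, 4]},
))
```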
Now I'm going to go into a checkpoint file that I've configured. You can see that we configure this to run against the taxi Jan '19 table, and we're using that suite, which is just the group of expectations that I previously showed you. There's also an interesting configuration called action_list, and this is where we're going to configure the integration with DataHub. Actions in Great Expectations are a way to run code once a checkpoint has been hit.
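The demo uses a checkpoint YAML file; as a sketch of the same configuration expressed through the Python API (the checkpoint name, datasource details, and server address below are assumptions, not values from the talk), the DataHub action is just one more entry in action_list:

```python
from great_expectations.data_context import DataContext

context = DataContext()

context.add_checkpoint(
    name="taxi_jan_2019_checkpoint",  # illustrative name
    config_version=1,
    class_name="Checkpoint",
    validations=[{
        "batch_request": {
            "datasource_name": "my_sql_datasource",  # assumption
            "data_connector_name": "default_inferred_data_connector_name",
            "data_asset_name": "taxi_jan_2019",      # assumption
        },
        "expectation_suite_name": "taxi_suite",
    }],
    action_list=[
        # Standard actions: persist results and rebuild Data Docs.
        {"name": "store_validation_result",
         "action": {"class_name": "StoreValidationResultAction"}},
        {"name": "update_data_docs",
         "action": {"class_name": "UpdateDataDocsAction"}},
        # The DataHub action pushes assertion results to a DataHub instance.
        {"name": "datahub_action",
         "action": {
             "module_name": "datahub.integrations.great_expectations.action",
             "class_name": "DataHubValidationAction",
             "server_url": "http://localhost:8080",  # your DataHub GMS endpoint
         }},
    ],
)
```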
So now I'm going to run through the process of ingesting an assertion run into DataHub. The first thing we'll need to do in our Great Expectations environment is actually just install the Great Expectations plugin of Acryl DataHub. Now, I've already got that installed, so I'm just going to skip that step. But once we've done that, we can run this checkpoint, and hopefully this will execute the suite as well as push data into my local DataHub.
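For reference, the install and run steps look roughly like this (the checkpoint name is the illustrative one from the sketch above):

```shell
# Install the DataHub plugin for Great Expectations
pip install 'acryl-datahub[great-expectations]'

# Execute the checkpoint; the DataHub action fires after validation
great_expectations --v3-api checkpoint run taxi_jan_2019_checkpoint
```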
A
What
you
can
see
inside
of
here
is
a
top
level
summary
saying
that
all
of
the
assertions
that
datahub
is
aware
of
are
currently
passing
for
this
data
set.
You
can
also
see
kind
of
human
english
descriptions
of
each
type
of
assertion,
and,
if
you
hover
over
it,
you
can
see
the
native
grade
expectations
operator
that
was
run.
Now I'm going to go ahead and show you what failing expectations look like, and for that I'll go to this Feb '19 table. We can see that this one is actually failing one of the nine assertions that we know about, and we can go over here to see that this is the assertion that's failing.
So, yeah, this is pretty much the demo. Initially we will support Great Expectations, but we've modeled this in a fairly general-purpose way with this concept of assertions, such that we can support things like Deequ, among other types of validation systems. So now I'm going to navigate back to the presentation here.
A quick configuration recap for Great Expectations in particular: in your Great Expectations environment, you're going to want to install the Acryl DataHub Great Expectations plugin, add the DataHub validation action to any checkpoints you have, and execute them. Then you can view the results in DataHub as assertions.
Just briefly, I'll talk about how this works under the hood, particularly the modeling. We have a new entity on DataHub that we call Assertion. An assertion can be associated with other entities, and it does exactly what you would think: it just defines conditions that are executed against a particular entity.
A
We
also
have
a
special
time
series
aspect:
we've
added
called
assertion
run
event,
and
this
is
basically
what
powers
that,
over
time
view
historical
view.
Every
time
an
assertion
runs,
an
assertion
run
event
will
be
produced
to
give
different
information
about
the
assertion
run
like
its
status,
its
results,
maybe
how
much
time
it
took
things
like
that,
and
in
this
case
we
have
just
a
basic
relationship
between
the
assertion
entity
and
the
data
set.
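As a sketch of what one such event might carry (the field names here are assumptions based on the description in the talk, not the actual aspect schema):

```python
# Hypothetical payload of a single assertion run event, shown as a Python dict.
assertion_run_event = {
    "timestampMillis": 1645000000000,       # when this run happened
    "runId": "2022-02-16T10:00:00Z",        # identifier for the evaluation run
    "asserteeUrn": "urn:li:dataset:(...)",  # the entity the assertion targets
    "status": "COMPLETE",                   # lifecycle status of the run
    "result": {
        "type": "SUCCESS",                  # SUCCESS or FAILURE
        "nativeResults": {                  # engine-specific details, e.g.
            "observed_value": "42013",      # the value Great Expectations saw
        },
    },
}
```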
Now I'll quickly cover the availability. In DataHub version 0.8.28, which is the next release, we will be shipping support for dataset table- and column-level assertions from Great Expectations; support for the Great Expectations v3 API, which is their latest API; pushing assertion results, as you saw, in real time via that checkpoint action; and support for the SQLAlchemy execution engine inside of Great Expectations. Now, there are other engines, like Spark and pandas, but those will not be in the v1 support.
Finally, we'll have a GraphQL API which will allow you to check the assertion status for a particular dataset. I think this is actually pretty powerful, because it allows you to build automated workflows that only proceed if, say, their input datasets are passing their most recent assertions.
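A minimal sketch of that gating pattern, assuming a local DataHub GraphQL endpoint and an illustrative query shape (the endpoint, URN, and field names are assumptions; consult the DataHub GraphQL docs for the real schema):

```python
import requests

# Assumed local DataHub GraphQL endpoint and an illustrative dataset URN.
GRAPHQL_URL = "http://localhost:8080/api/graphql"
DATASET_URN = "urn:li:dataset:(urn:li:dataPlatform:postgres,public.taxi_jan_2019,PROD)"

# Illustrative query: list the assertions attached to a dataset.
QUERY = """
query datasetAssertions($urn: String!) {
  dataset(urn: $urn) {
    assertions(start: 0, count: 100) {
      assertions {
        urn
      }
    }
  }
}
"""

def fetch_assertion_urns(urn: str) -> list:
    """Fetch the URNs of assertions DataHub associates with a dataset."""
    resp = requests.post(GRAPHQL_URL, json={"query": QUERY, "variables": {"urn": urn}})
    resp.raise_for_status()
    payload = resp.json()["data"]["dataset"]["assertions"]
    return [a["urn"] for a in payload["assertions"]] if payload else []

# A pipeline could gate on this: fetch each assertion's most recent run
# event (same pattern, one more query) and proceed only if all passed.
if not fetch_assertion_urns(DATASET_URN):
    print("No assertions registered for this dataset; no quality gate to apply.")
```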
Finally, I'll just talk about where we see this feature going, starting with some improvements we already have to make to the Great Expectations connector. We want to support other execution engines, like the Spark and pandas engines I alluded to, and support for legacy APIs, depending on the feedback we get from the community.
I think there are some people who are still working with the v2 APIs, as v3 is actually kind of new. We also want support for cross-dataset assertions, which is an advanced feature in Great Expectations, and support for conditional expectations, which are basically expectations that only apply to a subset of a table based on some filtering criteria.
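For context, a conditional expectation in Great Expectations attaches a row_condition to an ordinary expectation; a small sketch with illustrative file and column names:

```python
import great_expectations as ge

df = ge.read_csv("yellow_tripdata_sample_2019-01.csv")  # illustrative file

# Only rows matching the row_condition are checked; all others are ignored.
df.expect_column_values_to_be_between(
    "fare_amount",              # illustrative column
    min_value=0,
    max_value=500,
    row_condition='vendor_id==1',
    condition_parser="pandas",  # row_condition uses pandas query syntax
)
```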