From YouTube: Preventative Metadata: Building for Data Reliability with DataHub, Great Expectations, and Airflow
Description
John Joyce & Tamás Nemeth go in-depth about how you can use DataHub + Airflow + Great Expectations to scalably address data reliability.
Learn more about DataHub: https://datahubproject.io
Join us on Slack: http://slack.datahubproject.io
Follow us on Twitter: https://twitter.com/datahubproject
John: Hopefully everyone can see my screen. We're going to give a condensed, lightning-talk version of a talk we gave on Monday at Airflow Summit, about building data pipelines with data reliability in mind using a set of tools: DataHub, Great Expectations, and Airflow. All right, let's get right into it. The first thing we're going to talk about is what this concept of data reliability even means and why you should care.
When we were thinking about what data reliability means, we looked at two definitions. The first is "reliable": consistently good in quality or performance, able to be trusted. The second is "reliability": the overall consistency of a measure. Applied to data, we think a good way to frame it is that data reliability is the overall consistency of data quality.
Data quality, in turn, can be thought of as a set of dimensions. Availability simply means that the data is available for access. Timeliness is about how soon after something happens the data actually reflects that it happened; data typically models some set of events, so how long after those events occur is the updated data ready to use? Correctness is about whether the data accurately represents what it's supposed to represent. Data reliability, then, can be thought of as the consistency of data quality over time. What we're looking at here is something that would be less reliable, where the quality varies dramatically over time, versus something more reliable, where the quality stays stable and high over time.
So there are two characteristics: the consistency, and the fact that the quality remains high. All right, so with that in mind, how do we realize data reliability? We try to build availability, timeliness, completeness, and correctness into all of our data, but we have to do this across time, as our ecosystem changes and as our data changes, which, as we all know, is a really tough challenge by itself, and then, finally, at scale.
Why should you care? The strongest reason is that data is becoming a product. What I mean by that, in practical terms, is that data is now on the value-creation chain for many businesses in a very direct way, serving both human and machine consumers: human consumers both inside a company and outside of it as part of a product, and machine consumers which can be internal or, increasingly, external. What we're seeing is companies sharing data and building data dependencies across one another, very similar to how many companies have built online dependencies across each other's services. What that means is that these attributes, availability, timeliness, and correctness, are just going to continue to grow in importance over time. As a data platform team, there are various challenges we have to face in data reliability. The first, which we've touched on, is scale: collecting data has never been easier, and more people are interfacing with digital products today than ever before.
Our typical response is to try to build automation and deal with it using technical solutions, but oftentimes what I've seen and experienced is that the filtering part, figuring out what actually matters, is one of the hardest pieces. You come in Monday morning and you've got all kinds of DAGs breaking, and it's sometimes really difficult to understand where to even start, given the scale and the volume.
So, for example, we have one DAG that produces a derived table, another which produces a table that serves as input to a Looker chart, and another which generates some emails based on that table. One key characteristic here is that the producer of the upstream table is independent of the consumers; there's decoupling between the two. Really, this is the only way we've seen to achieve scale in data democratization. The first pattern for reliability we've seen, which naturally emerges from this type of architecture, is what we call consumer-side validation: each of the consumers runs some validations or pre-flight checks on the input table it depends on before it runs. For example, the first one may check that the input table simply exists, and maybe the second one does something a little more involved.
This is what we call contract-on-read. What I mean by that is that the contract of the data (not just its structure, the columns and so on, but the actual row profile as well, the whole semantic contract of the table) is verified by the consumers when they read the data. There are some downsides to this approach that we've seen. One is that it's just ad hoc and inconsistent: some consumers may only depend on a handful of columns from the table, so maybe they only check that those columns look good, and maybe others don't do any validations at all.
The alternative is to validate on the producer side. This is what we call contract-on-write, because when the table is written we verify certain attributes of it, establishing that it's trusted. There are several upsides to this type of pattern: there's consistency, there's better coverage of the contract because the owner has all of the context about what the table should look like, being theoretically closer to the source, and, finally, it's centralized.
We don't have to have the consumers verifying things. In this approach there are a few patterns you can use. One is that you could fail the upstream DAG if the validations don't pass, maybe running them as the final operator in the DAG; that's pretty much the main one. But there's not really any way to communicate the status of the upstream DAG to the downstream consumers, and this is what we call the communication problem.
It's really: how do the downstream consumers know whether the upstream validations passed or failed for a given day? There's not a clear way to communicate that. So what we're proposing is introducing a metadata platform in between these two decoupled sides, the producer and the consumers, and we call this metadata-driven orchestration. Here's how it works. As a producer DAG, you add a final step after you've produced a table and run validations, which is called the reporting step.
During the reporting step you push a signal to DataHub which basically says, "I have produced this downstream table and it has passed (or failed) its validations." Then, on the other side, you add a very thin preflight check which simply verifies that the upstream data serving as input has indeed passed and is up to date, based on the consumer's expectations.
For example, the producer may say, "Hey, I inserted 45 rows into the table, and the table passed its entire validation suite," and then the preflight check asks the corresponding questions: is the input data up to date, meaning has it been updated in the last 12 hours, and is it passing its validations or failing them? Then there are basically two outcomes: if the checks pass, we run the pipeline; if they fail,
we can circuit-break, just to make sure that bad data is not propagating downstream and we don't have a huge mess to clean up. Obviously this is a very simple case, but there are cases where many, maybe hundreds of, downstream Looker charts or dashboards depend on an upstream table, and you can imagine the difficulty and complexity of going up that entire dependency chain and backfilling when bad data makes it all the way down.
So, a practical example. We're going to continue with what Maggie introduced earlier, the idea of Long Tail Companions, a pet adoption company, and we're going to use a very basic pipeline as an example. We have a set of pet profiles which is updated from an online store and periodically ETL'd into S3.
Then we have a Snowflake loader, which loads this into a Snowflake table, and finally we have an email sender job, which sends emails to prospective buyers or adopters. This pipeline has been working out pretty well so far, but one day there's a problem: the pet profiles data is delayed because the ETL job that copies it from the online store didn't work today. So what happens?
Well, what happens is that the Snowflake table is also delayed, so it's not updated for the day, and our email sender job doesn't know about this, of course, so it still runs. What does that mean? It means we're pushing bad recommendations to prospective adopters, maybe for pets that have already been adopted, and that's pretty bad, because we don't want to pull on their heartstrings or make our customer support team upset. So we really want to prevent this type of thing.
So we start to think: okay, how can we prevent a situation like this in the future? This is where the first abstraction provided by DataHub comes into play, which we're calling DataHub Operations. The first step of using DataHub Operations is to report when something happens in an upstream DAG. For example, in that Snowflake loader DAG, we could add a final piece of code that pushes the fact that it inserted a thousand new rows into pet profiles at some point in time, and then, on the other side, check for that.
Tamás: Yeah, let me show you how you can apply this technique to your Airflow pipeline. For simplicity, these two DAGs, the Snowflake loader and the email sender, each have only one task, and, also for simplicity, both tasks are simple bash operators. For this task, which is the pet profiles load, we define inlets and outlets, which are provided by the Airflow lineage API, and we define datasets as inputs.
The input dataset in this case is an S3 location, and the outlet is the Snowflake table we load the data into. On success, we want to send out operational metadata, and sending it is pretty easy; we've been working on a new API for this. First you create a DataHub connection, then you use the operation reporter and go through all of the outlets, sending operational metadata for each of them to DataHub. In this example we're sending the URN of the dataset we want to attach the operational metadata to and the type of operation, which in this case is an insert. Of course, you'll also be able to send other information, like partitions, the number of rows affected, and whatever other key-value pairs you want to send.
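A minimal sketch of what such a reporting step could look like, assuming the open-source DataHub Python emitter and the lineage entities shipped with acryl-datahub (in some plugin versions the entities live under datahub_airflow_plugin.entities instead); the DAG, dataset names, and callback wiring are illustrative, not the exact plugin API shown in the talk:

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator

from datahub.emitter.mce_builder import make_dataset_urn
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import OperationClass, OperationTypeClass
from datahub_provider.entities import Dataset  # Airflow lineage entities for inlets/outlets


def report_operation(context):
    """Reporting step: push an INSERT Operation aspect for the produced table."""
    emitter = DatahubRestEmitter(gms_server="http://localhost:8080")  # DataHub GMS endpoint
    now_ms = int(datetime.utcnow().timestamp() * 1000)
    urn = make_dataset_urn(
        platform="snowflake",
        name="long_tail_companions.adoption.pet_profiles",  # placeholder table name
        env="PROD",
    )
    operation = OperationClass(
        timestampMillis=now_ms,
        lastUpdatedTimestamp=now_ms,
        operationType=OperationTypeClass.INSERT,
        numAffectedRows=1000,  # e.g. rows inserted by this run
    )
    emitter.emit(MetadataChangeProposalWrapper(entityUrn=urn, aspect=operation))


with DAG(
    dag_id="snowflake_load_pet_profiles",
    start_date=datetime(2022, 1, 1),
    schedule_interval=timedelta(hours=12),
    catchup=False,
) as dag:
    BashOperator(
        task_id="pet_profiles_load",
        bash_command="echo 'load s3 pet_profiles into snowflake'",  # stand-in for the real load
        inlets=[Dataset("s3", "long-tail-companions/pet_profiles")],
        outlets=[Dataset("snowflake", "long_tail_companions.adoption.pet_profiles")],
        on_success_callback=report_operation,  # the reporting step
    )
```

The key design point is that the operation is emitted only after the load succeeds, so downstream consumers can treat its presence in DataHub as a freshness signal.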
John: Awesome. So, once we've posted that to DataHub, we can add a small snippet on the email sender side which verifies that that information has been reported. You can also verify custom metadata: as Tamás was alluding to, you can verify that a particular partition has landed, or that the upstream job was a scheduled run rather than a backfill, that sort of thing. Now we'll look at how you can do that verification.
Tamás: Yeah. To protect us from a premature pipeline start, we set up a DataHub operation circuit breaker sensor for the email sender. It's a sensor that sits at the beginning of your pipeline; it checks whether the operational metadata exists, and the pipeline continues only if it does. Creating it is very easy: you create the operation circuit breaker sensor, pass in a DataHub connection and the list of URNs for the datasets whose operational metadata you want to check, and you can also set a time delta, the time window within which you accept operational metadata. In this case it's 12 hours, because, as you can imagine, you don't want to start your pipeline if the only data you have was created two days ago. This makes sure your pipeline won't start in that case.
Right now this is a sensor, but we're going to provide other Airflow primitives, like an operator, as well. Under the hood, both the operator and the sensor use a Python API, and you'll be able to use that directly too, so you can integrate this not just with Airflow but with any other system.
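As a rough illustration of what such a check does under the hood, here is a hand-rolled sensor sketch; it assumes the DataHubGraph client and its get_latest_timeseries_value helper are available in your installed acryl-datahub version, and it is not the packaged circuit-breaker sensor itself:

```python
from datetime import datetime, timedelta
from typing import List, Optional

from airflow.sensors.base import BaseSensorOperator
from datahub.ingestion.graph.client import DataHubGraph, DatahubClientConfig
from datahub.metadata.schema_classes import OperationClass


class OperationFreshnessSensor(BaseSensorOperator):
    """Waits until every upstream dataset has a recent Operation reported in DataHub."""

    def __init__(self, *, datahub_url: str, urns: List[str],
                 time_delta: timedelta = timedelta(hours=12), **kwargs):
        super().__init__(**kwargs)
        self.datahub_url = datahub_url
        self.urns = urns
        self.time_delta = time_delta

    def _last_operation_ms(self, graph: DataHubGraph, urn: str) -> Optional[int]:
        # Latest Operation timeseries aspect reported for the dataset, if any.
        op = graph.get_latest_timeseries_value(
            entity_urn=urn, aspect_type=OperationClass, filter_criteria_map={}
        )
        return op.lastUpdatedTimestamp if op else None

    def poke(self, context) -> bool:
        graph = DataHubGraph(DatahubClientConfig(server=self.datahub_url))
        cutoff_ms = int((datetime.utcnow() - self.time_delta).timestamp() * 1000)
        for urn in self.urns:
            last_ms = self._last_operation_ms(graph, urn)
            if last_ms is None or last_ms < cutoff_ms:
                self.log.info("No fresh operation for %s yet; holding pipeline.", urn)
                return False
        return True
```

Placed as the first task of the email sender DAG, pointed at the pet_profiles URN with a 12-hour window, this holds (or, with a sensor timeout, fails) the run whenever the upstream load has not reported.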
John: Awesome. I think we'll just talk through the demo because we're short on time, but you can imagine it: you push this metadata, then you run the downstream DAG, and it either works or fails depending on the presence of the pushed metadata. Okay, so things are running well, but a few months later we have another problem in our pipeline. This time the timestamp column we're depending on has changed from seconds to milliseconds because of an upstream change in the online service.
We've solved the timeliness problem with Operations, but this is really a correctness problem, something about the data itself, and this is where our second abstraction comes in: DataHub Assertions. Some of you may already be familiar with, or even using, the assertions integration with Great Expectations. It allows you to run validations at the end of the producer job and push those validation results into DataHub; examples are assertions about the columns. Then, again, we verify them before the email sender runs. We can zoom through this piece: the first step is defining assertions in something like Great Expectations, and in the future we'll probably support other frameworks, for example dbt tests.
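For reference, hooking Great Expectations up to DataHub is typically done by adding the DataHubValidationAction that ships with acryl-datahub to a checkpoint's action list; the checkpoint, datasource, and suite names below are placeholders, and this assumes a recent Great Expectations release, so treat it as a sketch rather than the exact configuration from the demo:

```python
import great_expectations as ge

context = ge.get_context()

checkpoint_config = {
    "name": "pet_profiles_checkpoint",
    "config_version": 1.0,
    "class_name": "SimpleCheckpoint",
    "validations": [
        {
            # Placeholder batch request: point this at your own Snowflake datasource.
            "batch_request": {
                "datasource_name": "long_tail_snowflake",
                "data_connector_name": "default_inferred_data_connector_name",
                "data_asset_name": "adoption.pet_profiles",
            },
            "expectation_suite_name": "pet_profiles_suite",
        }
    ],
    "action_list": [
        {
            # Pushes every validation result to DataHub as assertion run events.
            "name": "datahub_action",
            "action": {
                "module_name": "datahub.integrations.great_expectations.action",
                "class_name": "DataHubValidationAction",
                "server_url": "http://localhost:8080",  # DataHub GMS endpoint
            },
        }
    ],
}

context.add_checkpoint(**checkpoint_config)
context.run_checkpoint(checkpoint_name="pet_profiles_checkpoint")
```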
Once those assertions are run, we push the results into DataHub, and some of you may already be familiar with the screen on the right side, where we surface those results. The final step is again asking DataHub: hey, are my upstream assertions passing, and are there recent results that I should be able to trust? Tamás, do you want to talk about this part?
Tamás: Yeah. Here, for the email sender, we introduce a DataHub assertion circuit breaker operator; as you can see, we provide operators as well. Setting it up is quite similar to the operation circuit breaker: you create the assertion circuit breaker operator, specify the DataHub connection and the list of URNs, and there's an option called verify_after_last_update, which fetches the latest operational metadata and checks that there is an assertion result newer than the latest operation. We don't want to accept an assertion that was created before the last update, because it's most probably an outdated data quality check.
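A sketch of what such an operator boils down to is shown below. The query against DataHub for assertion results is left as a clearly hypothetical helper (get_assertion_results), since the packaged operator encapsulates that call, and the class and parameter names are illustrative; the real verify_after_last_update option compares against the latest reported operation, whereas this sketch simplifies that to a fixed look-back window:

```python
from datetime import datetime, timedelta
from typing import List

from airflow.exceptions import AirflowException
from airflow.models import BaseOperator


def get_assertion_results(datahub_url: str, urn: str, since: datetime) -> List[dict]:
    """Hypothetical helper: return assertion run results for `urn` newer than `since`.

    A real implementation would query DataHub's API; it is stubbed out here
    because the exact call is encapsulated by the plugin.
    """
    raise NotImplementedError("query DataHub for assertion run events here")


class AssertionCircuitBreakerOperator(BaseOperator):
    """Fails the downstream pipeline if upstream assertions are missing or failing."""

    def __init__(self, *, datahub_url: str, urns: List[str],
                 lookback: timedelta = timedelta(hours=12), **kwargs):
        super().__init__(**kwargs)
        self.datahub_url = datahub_url
        self.urns = urns
        self.lookback = lookback

    def execute(self, context):
        since = datetime.utcnow() - self.lookback
        for urn in self.urns:
            results = get_assertion_results(self.datahub_url, urn, since)
            if not results:
                raise AirflowException(f"No recent assertion results for {urn}; circuit broken.")
            if any(r.get("success") is False for r in results):
                raise AirflowException(f"Failing assertions on {urn}; circuit broken.")
```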
John: All right, one more demo we're going to skip. So here we have a few abstractions that help us work up this pyramid: DataHub Operations, which helps on the availability and timeliness side, and DataHub Assertions, which helps more on the correctness aspect of things. I think we're doing an okay job on the through-time aspect of reliability, but one thing we haven't really touched on is how to actually do this at scale. We've instrumented one DAG, but how do you handle hundreds or thousands of DAGs in your ecosystem?
The way this works is basically two steps. The first step is defining a DataHub test, and here's what the definition language will initially look like: it's a YAML file, and there are two key blocks I want you to look at. The first is the "on" block, which filters down to the entities in your metadata graph that the test applies to. So that's the first step, the test definition. The second step is that DataHub runs these tests continuously over time on your graph as the metadata changes, and we surface the results in real time on the entity pages to which they've been applied; here's what that would look like on the right side. All right, that's great, but again, how do you integrate this seamlessly into your Airflow environment? Tamás, one more time, can you walk us through this one?
Tamás: Yeah. As John mentioned, for metadata test validation you don't want to let your DAG owners control whether it's enabled or disabled; you want to make sure it runs for all the tasks in all of your DAGs. There is one way to do that in Airflow, called cluster policies. If you're not familiar with cluster policies: you create policies that get applied to all of the tasks in Airflow. Here we create a task policy that receives every task, and with this policy you can mutate all of your tasks. In this case we add a pre-execution method to every task, where we run a metadata test circuit breaker. Setting it up is pretty easy; here you'll see we use the Python API. You just set up a DataHub connection, create a metadata test circuit breaker, then get all the inlets, the input datasets of the task, and run the circuit breaker for each of those datasets; if any dataset has a failing test, we fail your task. The cool thing about this is that you don't have to change any of your DAGs or any of your tasks; it gets applied automatically.
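As a rough sketch of that wiring, the snippet below uses Airflow's cluster policy hook (task_policy in airflow_local_settings.py) to wrap every task's pre_execute; the actual metadata-test check against DataHub is stubbed as a hypothetical helper, since the real circuit-breaker call is part of the plugin:

```python
# airflow_local_settings.py (picked up by Airflow wherever DAGs are parsed)
from airflow.models import BaseOperator


def check_metadata_tests(task: BaseOperator, context) -> None:
    """Hypothetical helper: for every inlet dataset of `task`, ask DataHub whether
    its metadata tests are passing, and raise an exception if any are failing."""
    for inlet in getattr(task, "inlets", None) or []:
        # A real implementation would resolve the inlet to a dataset URN and
        # run the metadata-test circuit breaker against DataHub here.
        pass


def task_policy(task: BaseOperator) -> None:
    """Cluster policy: Airflow calls this for every task, so the check below
    is attached to all DAGs without touching the DAG files themselves."""
    original_pre_execute = task.pre_execute

    def circuit_breaker_pre_execute(context):
        check_metadata_tests(task, context)  # break the circuit before the task runs
        return original_pre_execute(context)

    task.pre_execute = circuit_breaker_pre_execute
```

Because cluster policies run at parse time in every Airflow component, the pre-execution check applies uniformly across the whole deployment, which is the scaling point being made here.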
John: Awesome. All right, one more demo we're going to skip; at this point you're realizing that I am not good at the lightning portion of these talks, mostly just the talking. As we move up this pyramid, what we're trying to show, or tell a story about, is really preventative metadata.
I think that's the next release, which is this Friday. The circuit breaker APIs will come in the next release, which is June 10th, I believe; that's going to include the operation circuit breaker you saw, the assertion circuit breaker, the test circuit breaker, and then deep integration with Airflow, supporting pushing this metadata from your sensors and operators. The test APIs will be coming a little bit later as well, in v0.8.37 on June 10th.
All right, we'll go through these one more time. To summarize what we talked about: data quality can be thought of as availability, timeliness, and correctness; data reliability can be thought of as data quality over time; and we presented a new approach to building data pipelines with data reliability in mind, using metadata-driven orchestration.