From YouTube: DataHub 201: Data Debugging
Description
John Joyce (Acryl Data) presents DataHub 201: Data Debugging - Learn how you can prevent and triage data issues using DataHub as part of your core workflows during the March 2023 Town Hall.
DataHub Public Roadmap: https://feature-requests.datahubproject.io/roadmap
Presentation Deck:
https://docs.google.com/presentation/d/1Xe1HZ11zpP7BPyXUwVYoHFEr7fHLjSUaZlKaae2KuM8/edit?usp=sharing
Learn more about DataHub: https://datahubproject.io
Join us on Slack: http://slack.datahubproject.io
Follow us on Twitter: https://twitter.com/datahubproject
For my session, I'm going to try to keep it relatively quick (it's quite a few slides), but we're going to be talking about debugging data issues with DataHub. First we're going to talk about what the ecosystem looks like for most companies and why these issues happen in the first place, and then we're going to go through a few different common types of issues we see and how you can actually start to triage or debug them through DataHub.
What I mean by data debugging is the process of identifying and addressing errors in data. We'll talk about what types of errors we can see, but most often this is the expectations of a data consumer not being met for one of a few reasons.
Let's take a look at the data ecosystem pattern that we see most often at most organizations. Most often we have an online service or application. Let's say, in an e-commerce site, you would have purchases and products being served online, and typically these applications are going to be generating two types of data.
Why go through all this trouble? Well, it's because at the end of the day, someone's going to be making decisions based on the data we're producing. We're enriching data, we're adding value to it, so that some operator, whether it's an internal person like a CEO looking at a dashboard or a user of our product, can make some decision. So this is the data ecosystem and data transformation landscape that we see at most companies, at a high level.
So we're running this statement every single day, where we're inserting into our product purchases table by selecting from the purchase events table and joining it with the products table. You can imagine we're creating an enriched table called product purchases, and we own it.
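Concretely, the daily statement might look something like this; a minimal sketch, where the table and column names (purchase_events, products, product_purchases, event_date, and so on) are illustrative assumptions rather than taken from the talk:

```sql
-- Hypothetical daily enrichment job; all names are illustrative.
-- event_date is an assumed partition column; purchased_at is the
-- epoch timestamp column discussed later in the talk.
INSERT INTO product_purchases
SELECT
    e.purchase_id,
    e.purchased_at,
    p.name,
    p.description
FROM purchase_events AS e
JOIN products AS p
    ON e.product_id = p.id
WHERE e.event_date = CURRENT_DATE;
```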
All right. Now, what are the things that can go wrong on a day-to-day basis if we're trying to generate this data every single day? Well, the first thing is that we can have unexpected schema changes in the data that we depend on, in those upstream tables. The first example I'll give is a column removal. Let's imagine that the upstream products table has an ALTER TABLE statement made on it, say one where the description column was dropped. This is pretty significant, so it's actually maybe not all that common. Something that's more common is a column name change.
Let's imagine that the upstream products table has a name change where the description column is renamed to edited_description. Now, there are other types of schema changes that we may see, like column type changes; you can imagine a BIGINT column going to an INT column in a lossy fashion, something like that. But in most cases these are the types of changes we'll see.
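For concreteness, those upstream changes might be issued as statements like the following; a sketch, with the quantity column invented for the type-change case and syntax that varies by SQL dialect:

```sql
-- The three kinds of upstream schema changes discussed above (illustrative).
ALTER TABLE products DROP COLUMN description;                           -- column removal
ALTER TABLE products RENAME COLUMN description TO edited_description;  -- column rename
ALTER TABLE products ALTER COLUMN quantity TYPE INT;                    -- lossy BIGINT -> INT
```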
Now let's talk about how we would detect an issue like this. Actually, the query engine would catch this for us right at query time, before the impact of that change really gets to spread to any of the downstream use cases that are depending on our data, so it's actually kind of a nice property. Once we know that we've had an unexpected schema change, how can we begin to debug it? This is where I think DataHub can help; there are two features that can help us here.
One is lineage, and the second is a feature called schema version history. I'm going to illustrate the point by jumping over to my DataHub, where I've got this exact architecture set up. I've got my table here and I've got some input tables, and you can imagine that my query failed today. Now, how would I go about debugging it?
If I go to the schema tab, I'll be able to see immediately that there's this edited_description column, and maybe that doesn't look familiar; maybe it's why my SQL statement is breaking. What I can do is actually open up this column history and see that this edited_description column was just added, in fact only 14 hours ago. This is a bit fishy.
I can dig a little bit deeper and see that the previous version of the schema had a description column, so this has recently changed. This is one way that we can try to understand what's gone wrong when our query fails because of an unexpected schema change. It also applies to cases like a column type change, of course.
The second type of issue that we see is delayed data, which is a little bit different of a problem. You can imagine that the purchases table we depend on is updated every single day, and on March 18th, March 19th, and March 20th everything is good, but on March 21st the data doesn't appear.
So what can we do to detect a case like this? Well, let's revisit the SQL query that we're using to generate our data. One thing you'll notice right off the bat is that this query won't necessarily fail like the previous one would; it'll just generate zero rows. This is a bit trickier of a case, because now we're not catching it early. In fact, we may not even notice that anything is wrong.
So what are our options to detect this more proactively? Well, maybe we can pre-validate the input: we'll run a check that says, hey, if there are no rows from the 21st in our upstream table, then maybe we shouldn't run the job at all. Or maybe we'll do the opposite and post-validate: once we generate some rows, we'll make sure that there are more than zero of them.
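Those two checks might look something like this; a sketch, reusing the assumed event_date partition column and table names from the earlier example:

```sql
-- Hypothetical pre-validation: only run the job if the upstream
-- actually has rows for the target day.
SELECT COUNT(*) AS upstream_rows
FROM purchase_events
WHERE event_date = DATE '2023-03-21';

-- Hypothetical post-validation: confirm the job produced rows.
SELECT COUNT(*) AS output_rows
FROM product_purchases
WHERE event_date = DATE '2023-03-21';
```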
But you can imagine the issue with this: there may actually be valid cases where there aren't any rows produced. Maybe there were no purchases on the 21st, so it's not a foolproof way to protect against this. In fact, these types of issues are very difficult to catch. And option three, which is probably the most likely, is that somebody notices the missing data and comes to tell us: hey, why are there no rows today? I'm looking at this dashboard and it doesn't look correct. So it's sometimes possible to catch this near query time, when we're generating our data, but it's not always possible.
In fact, it's actually pretty rare that it's possible. But let's imagine that someone did come tell us, hey, something looks wrong. How can we use DataHub again to begin to debug this case? Well, we can again use the lineage feature, combined with two important DataHub features: one is called Operations, which allows you to understand the changes (the inserts, the updates, the deletes) that have happened on a table, and the other is Incidents, which allows us to proactively communicate that there is an issue on a dataset or a table.
So let's actually go and check out those two features really quick. Okay, the first one is Operations. You can imagine that I'm again the owner of this product purchases table, the upstream purchases table was delayed, and someone told me: hey, this is delayed. Well, now what I can do is come in and actually see the last updated time, and I can see that the last time it was changed is 3/20, so we're missing that 3/21 data. Obviously something is a little bit wrong.
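Outside of DataHub, the manual equivalent of this freshness check would be a query like the following; illustrative only, mirroring what the Operations view surfaces automatically:

```sql
-- Per-day row counts for recent days; the missing 3/21 partition
-- would show up as a gap. Table and column names are assumptions.
SELECT event_date, COUNT(*) AS row_count
FROM purchases
GROUP BY event_date
ORDER BY event_date DESC
LIMIT 5;
```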
I imagine that this table is going to be updated every day. So what can I do if I've detected that something is wrong? Well, maybe the best thing I can do is let my stakeholders know as soon as I possibly can, so that I can mitigate the effects. So maybe, as the owner of a table (let's say the upstream owner actually notices this first), what I can do is use this feature called Incidents to basically say that this SLA was missed.
So maybe a bug in application code caused an incident here. And the really nice thing about this feature is that once we create the incident, we will actually notify you, on Slack or on your preferred notification channel, that something has happened to an upstream dependency. So, for example, if you're an owner of the downstream product purchases table, you'll actually be notified when there's an incident on the upstream purchases table, and similarly you'll be notified once that issue is ultimately resolved.
Awesome, let's continue back to the slides. I know I'm going a little long; I'm going to try to speed things up a bit here. All right, so the third case, and this is probably the hardest to detect and the hardest to catch, is unexpected semantic changes to data. What do I mean by that? Well, let's talk about the first type, which is a column semantics change.
Let's imagine there's a purchased_at column in the purchase events table that we depend on, and it's traditionally been in seconds. But the engineers that own the upstream application decided that milliseconds is actually the industry standard, and maybe they should have been putting milliseconds in this column all along, so they make that change. Now all of the downstreams have to be able to deal with it. The second case is a row semantics change.
Let's imagine that the purchase events table has traditionally represented only online purchases, those made from our web store, and today we made a change that includes mobile purchases in the purchase events table as well. Now the meaning of a purchase event row has changed to include mobile purchases. So you can see the key characteristic of this type of change: the structure of the data didn't change; it's really the meaning of the data that has changed in some way.
So how do we detect cases like this? Well, it's the most difficult to detect, because our query will not only run successfully, it will also produce rows. It will actually run and produce data, and so this can possibly lead to cascading effects, where downstreams consume our data and in turn produce bad data.
So what are our options to detect these cases? Well, one is, maybe we could pre-validate the columns before we use them as the consumer, but that seems kind of expensive. Maybe we could pre-validate the row count, but again, is that really foolproof? Can we really know how many purchases were made yesterday? Or if there was a spike because of the holidays, can we really account for that?
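For the timestamp case, a consumer-side column pre-validation might look like this; a hypothetical sketch, but the threshold works because epoch-second values stay below 10^10 until roughly the year 2286, while epoch-millisecond values today are around 1.7 x 10^12:

```sql
-- Flag purchased_at values that look like epoch milliseconds
-- rather than epoch seconds (names are assumptions).
SELECT COUNT(*) AS suspect_rows
FROM purchase_events
WHERE purchased_at > 10000000000;
```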
This is very rarely possible to catch at query time, but conversely it can be very expensive if you don't catch it, because of the downstream effects it can have. All right, so if we know something has gone wrong, let's say we know that the data looks weird, how can we detect that there was an unexpected semantic change using DataHub? Well, you'll notice a theme here: we can use the combination of lineage and two other key features of DataHub.
Those are the data profiles feature and the assertions feature, if we're lucky, to debug these types of changes. So, one final pop over to DataHub. You can imagine that we know something's wrong with our table downstream of the purchases table. Maybe what we can do is come and actually look at the profile of this data, and maybe we look at the purchased_at column; this is the one that's been changed, and nothing looks particularly weird with the sample values.
So we go over to the history, and if we look at it, we can actually see that the values have changed quite dramatically; in fact, they've changed by a factor of a thousand. So that's one thing we can do. We can also look at things like the row count over time, and we can see that maybe some new events are coming in.
Maybe that's caused by the mobile purchase events being added. And if we're really lucky, our upstream data owner will be super responsible and will have defined some tests. This is what we call assertions in DataHub, and we can see that there was an assertion defined on that purchased_at column, which basically enforces that it's in seconds; previously it was passing and now it's failing. So this is less common; this is the ideal state.
What you'll notice is that we have this little Data Insights panel over here, which is provided by the DataHub Chrome extension, and when we click it, what it will do is basically cross-reference the dashboard with the information in DataHub. It will use the lineage graph to tell you that there's something wrong with an input table which is used to produce the adoptions dashboard, and we can see that there's one table upstream of this that has failing assertions.
From here, we can actually go in, see the failing assertions, and begin to triage that problem using DataHub. So I'm really excited about this particular feature set, or roadmap track, within DataHub, and there's one thing that we're working on right now which will make this even better: the ability to search and filter across lineage for entities that are failing their tests or have active incidents.