From YouTube: DataHub 101: Data Profiling
Description
Maggie Hays and Tamás Németh (Acryl Data) provide an overview of Data Profiling and Usage Stats in DataHub during the January 2022 Town Hall.
Learn more about DataHub: https://datahubproject.io
Join us on Slack: http://slack.datahubproject.io
Follow us on Twitter: https://twitter.com/datahubproject
A: Now we're going to switch gears a little bit and talk about the basics. One of my personal favorite features within DataHub is the ability to quickly view the profile of a dataset, the shape of the data, to really minimize the amount of ad hoc discovery you need to do within it. You can just get a glance at what that data looks like. So Tamás and I are going to give you a rundown of all things data profiling and usage stats within DataHub.

Let's just start with the basics: why do we care about data profiling? I think the main driver here is answering questions like: Can I trust this dataset? Is it fresh? How large is it? Are there dupes I need to worry about? Are there null values I need to worry about? What's actually meaningful about it, i.e. which column is going to be a unique identifier versus a categorical variable that I can pivot the data around?
A: What you'll see here is a sample dataset, and in our Stats tab you'll see a quick, high-level view of all stats related to this table. By looking at the latest stats or the historical stats, I can start to answer: What does this data look like now versus some time in the past; has it evolved over time? How large is this dataset? Am I dealing with a ton of records, where I'm going to need to think about query performance or processing time, or is it small enough that I can just zoom right through it? How up to date is this data? When was it last changed, when was it last updated? That gives me a sense of data freshness.
Zooming in on the output, we can start to look at column-level stats, answering questions about data quality issues. For example, with this location description column, we see that there are about 8,700 null values within this dataset. A data practitioner can look at that and say, "oh no, that's untrustworthy" — but in the grand scheme of things, since there are 7.5 million records, that's only about 0.1 percent of the dataset; maybe it's not something I need to worry about.
A: I can start to look at what date ranges are represented within the dataset. This one contains records from early 2001 all the way to the end of 2021, which gives me a sense of the breadth of history included here. I can also start to understand whether there are duplicates: looking at our unique key, I see that it's a hundred percent distinct. So great, it is unique; I don't need to worry about dupes there.
A: Thinking about feature categories or reporting categories: which categories are actually meaningful? If, within a dataset of 7.5 million records, there are 61,000 categories of "block", that tells me it's pretty high cardinality, and maybe something that would be useful for reporting or modeling around. The other thing is that we see sample values, so you can get a sense of what data is actually in here.
A: So maybe you just see a column called "block" — what does that even mean? Oh, now I see that this is an actual physical street block within the city, or these are the types of categories that are going to be measured here.
A: One thing to note: this is all configurable. Some teams are worried about accidentally displaying the wrong information, so you can configure each of these to show the sample values or not; it's up to your discretion.

Then, on the other side of things, we think about usage stats. These start to answer questions like: How is this data generally used? What columns are most important? Even if the table's been updated recently, are people actually querying it? Is it something that is used within my community of data practitioners at my company? And then questions like: who's using this, so I can go ask them how to interpret its output? Going back to our Schema tab with the same dataset, we have this idea of query usage at the column level.
A: This isn't supported for all of our SQL stores, but where it is available we do surface it, so you can get a sense of the relative popularity of a given dataset.
A: You can also see its top users — who's querying it the most — and those are maybe folks you can go ask questions. Then we start to look at actual sample queries: in our Queries tab, we surface the most popular queries over a period of time, so you can start to understand what the most common calculations are, and also what has already been calculated.
B: Previously, I showed you these two lines. Now you can add even more lines to enable or disable all of this profiling, and as I go through them you will see what kinds of profilers we have. Basically we have options for everything, so it's really up to you what you want to turn on and off.
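The "two lines" refer to turning profiling on inside an ingestion recipe. A minimal sketch, assuming a BigQuery source (the source type, project name, and server address are placeholders — use whatever connector and endpoint you actually ingest with):

```yaml
# Hypothetical recipe fragment; connection details are placeholders.
source:
  type: bigquery
  config:
    project_id: my-project        # placeholder
    # The "two lines" that turn profiling on:
    profiling:
      enabled: true
sink:
  type: datahub-rest
  config:
    server: http://localhost:8080 # placeholder DataHub GMS endpoint
```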
B: For all of the columns we calculate null counts, and you can enable or disable that with the include-field-null-count option set to true or false. For all the numeric columns we calculate the minimum value, maximum value, mean value, and median value; for integers we also generate the standard deviation if you need it; and for timestamp fields we calculate the minimum and maximum values. As you can see here, I think it's quite straightforward: you can enable and disable each of those. There is an option as well for the field quantiles.
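As a rough sketch, those per-metric switches sit under the same profiling block in the recipe (option names as I recall them from this DataHub release — verify against the current ingestion docs):

```yaml
profiling:
  enabled: true
  include_field_null_count: true    # null count per column
  include_field_min_value: true     # numeric minimum
  include_field_max_value: true     # numeric maximum
  include_field_mean_value: true
  include_field_median_value: true
  include_field_stddev_value: true  # standard deviation (integer columns)
```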
B: That covers the 5th, 25th, 50th, 75th, and 95th percentiles. It's actually disabled by default, because it's not shown in the UI; if you enable it, the back end stores and calculates this information, it's just not visible on the UI currently. Then there is the distinct value frequencies option.
B: That's calculated for low-cardinality numeric fields — sorry, low-cardinality fields actually — and this, as well, is not shown on the UI currently, so it's disabled by default. Then there's the field histogram: for numeric fields it can automatically generate a histogram for you; this, too, is not currently shown on the UI, so it's disabled. And there are the field sample values: this is what you saw in Maggie's screenshot, where you had those sample values.
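A sketch of those remaining switches, with the defaults as described in the talk (names assumed from the ingestion config of this release — check the docs for your version):

```yaml
profiling:
  enabled: true
  # Disabled by default because the results are not yet shown in the UI:
  include_field_quantiles: false                   # 5/25/50/75/95th percentiles
  include_field_distinct_value_frequencies: false  # low-cardinality fields only
  include_field_histogram: false                   # numeric fields
  # Shown in the UI; turn off if sample values may be sensitive:
  include_field_sample_values: true
```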
B: That's one you can control if you want to enable or disable it. So those are the profilers. Also, to make sure that these profiling queries run effectively and fast enough, we introduced a query combiner, which basically tries to make sure that all these profiling queries run in optimal batches, without doing too many round trips to your data warehouse; this is on by default.
B
You
also
can
set
a
limit
which
basically
add
the
limit
to
the
profiling
query.
So
if,
if
you
set
the
limit
thousand,
that
would
mean
only
thousand
lines
will
be
used
for
the
profiling.
That
also
means,
if
you
set
it
like
2
000,
and
then
you
have
like
2
000
lines
in
your
in
your
table,
then
in
the
end
the
total
account
with
the
a
thousand,
because
that
that
was
the
limit
for
what
we
used
offset
is
basically
just
add
the
sql
offset.
So
basically,
it's
not
start
from
that
offset.
B
It
starts
the
profiling.
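The batching and sampling knobs just described might look like this in the recipe (a sketch; the limit and offset values are arbitrary examples):

```yaml
profiling:
  enabled: true
  query_combiner_enabled: true  # batch profiling queries; on by default
  limit: 1000                   # profile at most 1000 rows
                                # (reported row count is then capped at 1000)
  offset: 500                   # skip the first 500 rows before profiling
```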
B: But if you don't want to mess with all of these, and you just feel that "hey, my profiling is slow", we have a nice option: turn off expensive profiling metrics. If you turn it on, it will disable the expensive profilers and also set the maximum number of fields to profile to 10.
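That single switch, sketched in recipe form (the field cap can, as I understand it, also be set on its own — names assumed from this release's config):

```yaml
profiling:
  enabled: true
  # One switch instead of tuning individual profilers:
  turn_off_expensive_profiling_metrics: true
  # The switch above also caps profiled fields at 10;
  # the cap is adjustable separately:
  max_number_of_fields_to_profile: 10
```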
B
One
thing
you
should
know
that
these
profilers
running
on
the
whole
table,
which
means,
if
you
have
like
a
huge
click
stream
data
on
hive
with
multiple
partitions,
then
you
should
think
size.
If
you
want,
you
really
want
to
do
profiling
on
on
the
top
of
that
in
the
future,
we
would
like
to
support
these
partition
tables,
but
now
you
should
know
about
this
limitation
and
if
everything
went
well,
then
I
think
the
profiling
finished,
as
you
can
see,
and
now,
if
I
go
to.
B
Table
and
all
the
stats
should
be
here,
But one thing you can't see here yet is the usage, and you might ask: okay, but how can I enable the usage statistics? That's one thing people sometimes get confused about: you need a different recipe and a different source, namely the usage sources.
B: You basically just install your source — in this case, you need to install the BigQuery usage source. Then, if you don't set anything, it will get the usage for the previous day. If you want, you can set a start and an end time, which I think can be quite useful if you want to bootstrap your data: you run it over multiple periods, and after that you only run it incrementally.
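A sketch of such a separate usage recipe for BigQuery (project name, window, and server are placeholders; with no window set, it pulls the previous day):

```yaml
# Install the usage source first, e.g.:
#   pip install 'acryl-datahub[bigquery-usage]'
source:
  type: bigquery-usage
  config:
    projects:
      - my-project                       # placeholder
    # Optional window, useful for bootstrapping history:
    start_time: "2022-01-01T00:00:00Z"
    end_time: "2022-01-08T00:00:00Z"
sink:
  type: datahub-rest
  config:
    server: http://localhost:8080        # placeholder
```

It runs like any other recipe (e.g. `datahub ingest -c usage_recipe.yml`); for incremental runs, drop the explicit window and schedule it daily.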
B: And if I can tweak my editor... So then I just need to run the ingestion the normal way, and it should collect all the usage stats for that table.
B: If you want to get usage stats but you can only see the last two or three days, that's most probably because of the retention period you have in these systems. Now, if I just hit refresh, I should be able to see all the queries here, and if I go to the schema page I should see all this usage next to the columns, based on the queries. So at a high level — or maybe not that high a level — this is how you can use profiling and usage.
B: As you can see, it's super easy to use, and we are continuously improving it. For profiling, for example, if you used the profiler four months ago and you use it now, you might wonder how much faster it has become: we now use approximate queries wherever we can, and we disabled the profilers whose output you can't currently see on the UI. That made, I would say, a significant improvement to profiling performance.