Description
Harshal Sheth (Acryl Data) describes how Dataset Popularity is implemented in DataHub and gives a demo.
Harshal: Yeah, so I'm going to be talking about dataset popularity: how we use it in DataHub, as well as how we derive it from query logs and query activity.
So, first off: why do we care about this? Dataset popularity implies different things for different people. For the data platform owner, it enables them to understand what's going on within the enterprise: how is data being used, and in what systems is it being used?
It can also point you in the right direction for where to dedicate more resources, both in terms of people and in terms of compute power, so that you never have an issue as you operate.

For data engineers, dataset popularity means something a little different.
It helps them understand how people are using the things they're producing, kind of an impact analysis within the company, and it helps them prioritize among the data assets they produce: which ones are most important, which ones are actually getting used, and which ones could use more documentation to improve usage, because, you know, it's a great dataset that you produced.
It can also help you streamline the deprecation process. Let's say a dataset that you're trying to deprecate still has, you know, 100 queries a week. Well then, you probably don't want to deprecate it just yet. Instead, you want to look at the popularity and the usage, figure out who exactly is still querying it, reach out, and help them migrate to a different solution.

For data scientists, popularity and usage is a major trust signal.
It helps you understand: is this something that someone put out a year ago and hasn't touched since, or is it something that is being regularly updated and regularly used, something you can rely on, given that other people are also relying on it?
The other thing you can do is look at the other queries that people are issuing against that dataset and figure out which other tables are relevant: what is it commonly joined with, on which keys, and so forth.
So you can determine not just whether to query that dataset, but also how to. And then, helping everyone: we can use usage and popularity data to improve search rankings and improve the ordering of things in lineage visualization and so forth. So lots of product improvements for DataHub can also come out of the usage statistics.
So let's look at what we're collecting and how we're doing that. Right now we support BigQuery and Snowflake for usage stats. For BigQuery, we're using the BigQuery logs and parsing those out; for Snowflake, we're using the access history and query history views, joining these together and getting our popularity and usage data that way. For each dataset we can collect per-user usage frequencies: person A is using it this much, person B is using it this much.
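To make the Snowflake side concrete, here is a minimal sketch of that join, assuming the snowflake-connector-python package and access to the SNOWFLAKE.ACCOUNT_USAGE share (the ACCESS_HISTORY view needs Enterprise edition and elevated privileges). The exact columns and filters are illustrative, not DataHub's actual query:

```python
# Join Snowflake's query history with access history to see which user ran
# which query against which objects. Credentials below are placeholders.
import snowflake.connector

conn = snowflake.connector.connect(
    account="my_account", user="my_user", password="my_password",
)

USAGE_QUERY = """
SELECT
    q.query_id,
    q.query_text,
    q.user_name,
    q.start_time,
    a.direct_objects_accessed  -- JSON array of the tables/columns touched
FROM snowflake.account_usage.query_history q
JOIN snowflake.account_usage.access_history a
  ON q.query_id = a.query_id
WHERE q.start_time >= DATEADD('day', -1, CURRENT_TIMESTAMP())
"""

for query_id, query_text, user_name, start_time, objects in conn.cursor().execute(USAGE_QUERY):
    print(user_name, start_time, objects)
```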
We can also collect how they're using it, what queries they issued, with a lot of granularity here, even to the extent of which columns they're frequently querying versus which ones are not being used. And once again, we roll this up and can get the frequent queries across all the people using the dataset together as well.
Think about the scale of a large enterprise: say 500,000 queries per day against BigQuery or Snowflake or a similar data warehouse, and maybe 10k users. This is kind of our north star for what sort of scale we might want to support. They'd want to retain a decent bit of historical data, so they can view the historical usage over time, let's say a year here, but it varies across enterprises.
Some might only care about 30 days, some might care about many years. And the last thing is that we want to avoid refetching the same data from the same source system repeatedly.
What this means is, if we're collecting data, we only want to pull a given piece of the usage log or the query log, or whatever it is, once, and not have to pull it repeatedly.
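One simple way to get that property (a sketch of the general idea, not necessarily DataHub's internal mechanism) is to keep a high-watermark timestamp per source and only request log entries from completed time buckets past it:

```python
# High-watermark sketch for incremental log pulls. The state file and the
# one-hour bucketing are illustrative assumptions.
import json
from datetime import datetime, timedelta, timezone
from pathlib import Path

STATE_FILE = Path("usage_watermark.json")  # hypothetical state location

def floor_to_hour(ts: datetime) -> datetime:
    return ts.replace(minute=0, second=0, microsecond=0)

def next_window() -> tuple[datetime, datetime]:
    """Return the [start, end) window of log entries not yet fetched."""
    end = floor_to_hour(datetime.now(timezone.utc))
    if STATE_FILE.exists():
        start = datetime.fromisoformat(json.loads(STATE_FILE.read_text())["watermark"])
    else:
        start = end - timedelta(hours=1)  # first run: just the previous bucket
    return start, end

def commit(end: datetime) -> None:
    # Only advance the watermark once the fetched window is safely stored.
    STATE_FILE.write_text(json.dumps({"watermark": end.isoformat()}))
```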
So, given all this, here are some of the decisions we made. The first is that we're going to start with a batch-based system: you can configure it to run hourly or daily, whatever you'd like, and we'll pull the most recent queries in history.
We have a memory-efficient algorithm for pulling this in while preventing the memory usage of ingestion from blowing up, and then we do some pre-aggregation here on a per-dataset level: we roll it up so that we can get the frequent users of a dataset, the frequent columns used in a dataset, and the frequent queries of a dataset.
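In spirit, that roll-up looks something like the following sketch, which assumes the query-log entries have already been parsed into records with dataset, user, columns, and query fields (the field names are hypothetical, and the real ingestion code is considerably more involved):

```python
# Per-dataset roll-up sketch: count frequent users, columns, and queries.
# Input records are assumed to be pre-parsed dicts; field names are hypothetical.
from collections import Counter, defaultdict
from dataclasses import dataclass, field

@dataclass
class DatasetUsage:
    user_counts: Counter = field(default_factory=Counter)
    column_counts: Counter = field(default_factory=Counter)
    query_counts: Counter = field(default_factory=Counter)

def aggregate(records):
    usage = defaultdict(DatasetUsage)
    for rec in records:  # rec: {"dataset", "user", "columns", "query"}
        agg = usage[rec["dataset"]]
        agg.user_counts[rec["user"]] += 1
        agg.column_counts.update(rec["columns"])
        agg.query_counts[rec["query"]] += 1
    return usage

stats = aggregate([
    {"dataset": "all_entities", "user": "harshal",
     "columns": ["entity", "urn"], "query": "SELECT entity, COUNT(*) ..."},
])
print(stats["all_entities"].column_counts.most_common(5))
```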
We then take that information and push it through GMS into Elasticsearch, and that's where we store these aggregate statistics, do additional aggregations on them, and then surface them in the UI, as you might expect.
Shirshanka: I guess one more interesting constraint in the design that you probably had was not adding one more moving part to DataHub, like, oh, you've got to go run a Spark job or some other big-data processing job to compute this stuff, right?
Harshal: Yeah, I think absolutely. And that's actually a good segue into the demo piece. I wanted to show how BigQuery and Snowflake usage work. For the Snowflake one, I'm actually going to show how it looks when scheduled with Airflow, because that's the common use case here, you schedule it on a daily basis, and then how it looks in the UI.
So we can start with how BigQuery usage works. Right now I have a little recipe configuration. It works the same way as most other sources do: you just have a new plugin type called bigquery-usage, and you can put in the project id for BigQuery. I just have a playground instance that I'm using, and unfortunately I haven't queried this instance in a few days.
So what is it doing here? It's pulling the BigQuery usage logs from the Cloud Logging product from Google, then doing a little bit of pre-aggregation, and then dumping that into a file here.
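Run programmatically, the demoed recipe amounts to roughly the following; the same configuration can live in a YAML recipe for the datahub ingest CLI. The bigquery-usage config fields shown are from memory, so treat them as assumptions and check the source docs:

```python
# Roughly what the demoed recipe does when run programmatically; field names
# for the bigquery-usage source may differ by version.
from datahub.ingestion.run.pipeline import Pipeline

pipeline = Pipeline.create(
    {
        "source": {
            "type": "bigquery-usage",
            "config": {
                "projects": ["my-playground-project"],  # hypothetical project id
            },
        },
        # The demo dumps the aggregated usage events to a file instead of a
        # live DataHub server.
        "sink": {
            "type": "file",
            "config": {"filename": "./bigquery_usage.json"},
        },
    }
)
pipeline.run()
pipeline.raise_from_status()
```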
I actually added it to our demo instance here; it's pretty straightforward how it looks. This time we're running the ingestion using direct code, because we want to do that inside of Airflow, and it's remarkably similar: we first ingest Snowflake and then we add Snowflake usage in kind of a pipeline, so you get both of them at once. Once again you set your configuration, and then I wanted to get a bunch of historical data.
So I set the start time manually, but beyond this I might just leave it blank and it will automatically do the current day.
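The Airflow setup described here might look something like this sketch: two chained tasks that run the snowflake and snowflake-usage sources daily. The credentials, server address, and exact config keys are placeholders to check against the current source docs:

```python
# Sketch of the Airflow scheduling described above: two chained daily tasks
# run the snowflake and snowflake-usage sources. Config values are placeholders.
import os
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator
from datahub.ingestion.run.pipeline import Pipeline

def run_ingestion(source_type: str) -> None:
    Pipeline.create(
        {
            "source": {
                "type": source_type,
                "config": {
                    "host_port": "my_account",  # hypothetical Snowflake account
                    "username": "datahub_ro",
                    "password": os.environ["SNOWFLAKE_PASSWORD"],
                    # start_time/end_time left unset: defaults to the current day.
                },
            },
            "sink": {
                "type": "datahub-rest",
                "config": {"server": "http://localhost:8080"},
            },
        }
    ).run()

with DAG(
    "snowflake_usage_ingestion",
    start_date=datetime(2021, 6, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    metadata = PythonOperator(
        task_id="ingest_snowflake",
        python_callable=run_ingestion,
        op_args=["snowflake"],
    )
    usage = PythonOperator(
        task_id="ingest_snowflake_usage",
        python_callable=run_ingestion,
        op_args=["snowflake-usage"],
    )
    metadata >> usage  # usage ingestion runs after regular metadata ingestion
```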
Now that we've run this successfully, we can see the little green box there, we can head over to the demo instance, and we can see where this is surfaced in a couple of places.
The first is that we immediately see the number of queries that have been issued against this dataset. We can see that it's 78 in this time period, and we can also see that I and another user have issued queries against it. So we can see the top users, and this is going to be ordered by frequency.
So you can see that I've done the most and who's done the second most. Beyond that, we can also see per-column stats: entity and urn are the two columns here, and we can see that the entity column has had 78 queries per month while the urn field has had only 43 queries per month.
So with your standard select count, group by entity kind of query, here's where we might guess that the entity field is being used more frequently than the urn field or other fields. And then we can also see that people are creating other generated tables that reference this all_entities table, so we can understand how people are using it, how they're joining on the urn, and so forth. We get much richer information about how people are using this dataset.
We're also looking at line charts, so that if you're expecting a certain dataset to be deprecated, you can watch the usage per day taper off as you migrate people over; and then finally, expanding our time-series metadata piece to add mechanisms for extending it with a similar no-code approach. So yeah, with that, I'd love to hand it over to Shirshanka.
Shirshanka: Awesome, thank you, Harshal, for that. If people have questions about how to use it, feel free to ask. My first question was: why are snowflake and snowflake-usage two different sources? There's actually a good reason for it: in some of these sources, the place where you get usage data from is actually different from the place you get the metadata from, and in some cases you actually need elevated privileges to get usage data out, so it makes sense to separate out those two pathways.