Description
Surya Lanka and Shirshanka Das (Acryl Data) give a demo of how stateful ingestion works in DataHub, ensuring that you ingest only net-new metadata to minimize redundancy and optimize ingestion performance.

A
Hello, everyone. One of the things that we've done recently and are about to roll out in the next week or so is support for stateful ingestion. It really improves our ability to not unnecessarily ingest old stuff again and again, and it gives new ingestion connectors the ability to essentially pull only relevant information every time they run. So, let's get into it.

A
So let's start with an example of a connector that we already have: the Snowflake usage source. Let's look at how it works today. The Snowflake usage source starts up, and it has in its configuration a variable called start time, and that start time can be specified in your configuration. So you can say the start time is yesterday, or the start time is the day before yesterday, and depending on what you specify, the Snowflake usage connector says: okay, I'm going to start from this time and pull all usage logs from this start time onward, bucketed at a granularity, and the default granularity is one day. So essentially, every time it starts up, the Snowflake usage connector pulls a window of data from the source, and that window is defined by the start time parameter.

A
Now, the start time can be specified in config, or, if you don't specify it, the source will just default it to now minus one day, so essentially it'll pull data for the previous day. Once it has done all of that, it does aggregations in memory and then pushes grouped usage stats, grouped by dataset and bucketed at this day granularity, into DataHub, and that powers a bunch of features. If you go to the DataHub demo, you'll see monthly queries, you'll see top users, and you'll also see column-level stats and recent queries.
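
As a concrete illustration of the recipe being described, a Snowflake usage recipe with an explicit time window might look roughly like the sketch below. This is only a sketch: the connection values are placeholders, and field names such as start_time, end_time, and bucket_duration should be checked against the DataHub ingestion docs for the version you run.

```yaml
# Illustrative sketch only -- connection values are placeholders and option
# names may differ slightly between DataHub versions.
source:
  type: snowflake-usage
  config:
    host_port: example.snowflakecomputing.com   # placeholder Snowflake account
    warehouse: COMPUTE_WH
    username: datahub_reader
    password: "${SNOWFLAKE_PASSWORD}"
    start_time: "2021-09-23T00:00:00Z"   # explicit window start ("yesterday")
    end_time: "2021-09-24T00:00:00Z"     # window end; defaults to now if omitted
    bucket_duration: DAY                 # the one-day granularity mentioned above
sink:
  type: datahub-rest
  config:
    server: http://localhost:8080        # DataHub GMS endpoint
```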

A
So if you rerun today's job twice, the ingestion connector is stateless, and it just says: okay, I guess I'll pull data for today again. And this is not just a problem for the usage source. Pretty much every single source today is memoryless, which means it starts up, crawls the whole world, and ingests metadata.

A
So we wanted to let sources remember where they last left off, to allow for incremental ingestion, so that we can reduce load on source systems and only produce relevant changes. Of course, when sources are pushing metadata out, this is all trivial, because you're right there in the context of actually making a change, so you can push metadata out when it changes. But for sources that are crawled, you actually need state to be able to ingest only the things that have changed. We also want these sources to be really simple: self-healing cron jobs.

A
What were the requirements? Well, first, we need temporal storage: we need state that is remembered across time. Second, we want to be able to query that state to get the last successful run for a particular window of time. And of course, there are a few options for storing this state. You could store it in your local file system, or in S3 or some other blob store, or in a database, or you could store it in DataHub, and we were like: that's interesting.

A
So what we ended up building just leverages the same time series metadata support that we added to DataHub to actually store state about metadata ingestion runs. If that's not meta enough for you, I don't know what is. So here's an example of how the Snowflake usage ingestion changes once we add this capability.

A
The first thing it does when it starts up is get the last successful run state from DataHub, and DataHub internally has temporal state for all of the previous runs. Once it figures out the start time based on the last successful run state, it then issues the query to Snowflake and gets the deltas. It does the same thing as before: it groups the usage stats and sends them over to DataHub. But in addition to all the metadata it produces, it also produces another piece of metadata, which is the ingestion run state, and that includes things like the run id, the pipeline id, the status, how long it took, other metrics, as well as the start time and the granularity. So that's basically the window for each of these individual ingestion runs.
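
As a rough mental model of the run state just described, you can picture a record like the sketch below. The field names here are purely illustrative and are not the actual DataHub aspect schema; they simply mirror the items listed in the talk (run id, pipeline id, status, duration, other metrics, window start, and granularity).

```yaml
# Illustrative only -- NOT the real DataHub aspect schema, just the shape of
# the run state described above.
ingestion_run_state:
  pipeline_id: snowflake_usage_demo      # hypothetical pipeline identifier
  run_id: "2021-09-22-run-001"           # hypothetical identifier for this run
  status: SUCCESS                        # outcome of the run
  duration_ms: 84000                     # how long the run took
  metrics:
    events_produced: 1250                # other run metrics
  window:
    start_time: "2021-09-21T00:00:00Z"   # start of the ingested window
    bucket_duration: DAY                 # granularity of the window
```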

A
Surya is going to do a quick demo for you here, and just before he gets started, I want to give you an example of what we will be demoing. So today is September 24th in UTC time, and we wanted to show you what it would look like if ingestion was run two days ago, then rerun, and then we move time forward to now and run ingestion again. So we basically want to show how ingestion runs.

B
So for this demo we have done something very interesting: we have set up a Docker container that can fake time with a library, so you can see what it looks like. Before we do the ingestion two days ago, if I just ask, like this, what's the date, it says today is September 22nd.

B
And if I ask for minus three, what is it? It's the 21st. So it's actually changing the system time, basically. So now let's go ahead and do an ingestion. That's two days ago, okay, so from Snowflake.

B
Okay, done. So it has remembered what it did before for this day. Now we can actually, oh.

A
Yeah, it might be a good idea to just look at that recipe quickly.

B
So, okay, the only thing from here that will survive is this: the only additional piece of information this recipe needs is where the GMS server is. That's where the time series stats live, and in order to query these aggregations, or even push them, this is the endpoint we need. The other configuration is going to go away, so yeah, this is what it is. One thing that's interesting for you to notice is that there is no time specified anywhere here.
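
For reference, a stateful recipe of the kind shown in the demo might look roughly like the sketch below. Everything here is illustrative: the connection values are placeholders, and option names such as pipeline_name and stateful_ingestion should be verified against the DataHub ingestion docs for your version. The point is simply that the recipe carries the GMS endpoint but no explicit start time; the window is derived from the state of the last successful run stored in DataHub.

```yaml
# Illustrative sketch only -- check the DataHub ingestion docs for the exact
# stateful ingestion settings in your version.
pipeline_name: snowflake_usage_demo      # stable name so later runs can find their prior state
source:
  type: snowflake-usage
  config:
    host_port: example.snowflakecomputing.com   # placeholder Snowflake account
    warehouse: COMPUTE_WH
    username: datahub_reader
    password: "${SNOWFLAKE_PASSWORD}"
    stateful_ingestion:
      enabled: true                      # remember the last successful run in DataHub
    # note: no start_time / end_time -- the window comes from the stored state
sink:
  type: datahub-rest
  config:
    server: http://localhost:8080        # GMS endpoint where the run state is stored and queried
```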

A
So, any questions so far?

A
Right. What is interesting, I think, is that even the DataHub ingestion itself will produce a pipeline instance, which will then allow us to look at the ingestion process itself as a pipeline that runs and produces metadata. So now you'll be able to look not only at your data pipelines and your datasets and the lineage across all of them, but also at your metadata pipelines, and make sure that they are running on time and that they are actually running successfully.

A
Cool. So, Maggie, I think we can go back to the slide deck. This will get rolled out to all the usage sources: Snowflake, BigQuery, and Redshift.

A
Another thing that we want to roll this out to is the dataset profiling sources. These tend to be our most expensive sources, and we want to start doing incremental profiling as well, and also start converting other sources, such as our SQL sources that are ingesting schemas and our local sources, to be stateful and only perform incremental ingestion. This will reduce your API bills as well as the time that it takes for these sources to run.