From YouTube: Automatic Spark Lineage
Description
Mugdha Hardikar and Shirshanka Das team up to give a demo of Spark Lineage during the January Town Hall
Learn more about DataHub: https://datahubproject.io
Join us on Slack: http://slack.datahubproject.io
Follow us on Twitter: https://twitter.com/datahubproject
A: The jar is essentially a listener for Spark application events; it produces metadata events and sends them over to the DataHub metadata service. Mugdha actually recorded a couple of nice demos for us to walk through. It basically produces metadata for data jobs: the Spark application becomes a pipeline, along with the input and output datasets that the job consumed and produced.
B: Let's see how we can use our DataHub Spark lineage jar in a Jupyter notebook. Here we have a couple of imports, specifically for PySpark, and after that, when we create a Spark session using the SparkSession builder, we specify the master, whichever we want: it can be local, or it can be a cluster setup as well. Then comes the app name. The app name is important, because whatever app name is specified here will be reflected in DataHub.
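As a rough sketch of what that notebook setup could look like (not the exact demo code): the master and appName are set as just described, and the remaining keys are the ones documented for the datahub-spark-lineage agent; the jar version, master URL, and DataHub server address below are placeholders.

```python
from pyspark.sql import SparkSession

# Hedged sketch: jar version, master URL, and DataHub server address are placeholders.
spark = (
    SparkSession.builder
    .master("local[*]")                    # or a cluster master, e.g. spark://spark-master:7077
    .appName("flight_departure_analysis")  # this name becomes the pipeline name in DataHub
    # Pull in the lineage jar and register its listener for Spark application events
    .config("spark.jars.packages", "io.acryl:datahub-spark-lineage:<version>")
    .config("spark.extraListeners", "datahub.spark.DatahubSparkListener")
    # Where the emitted metadata events are sent (the DataHub metadata service)
    .config("spark.datahub.rest.server", "http://localhost:8080")
    .enableHiveSupport()                   # assumption: needed later for saveAsTable into Hive
    .getOrCreate()
)
```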
B: The airports CSV contains information about all the airports, which city and state each airport is in, and so on, whereas the flights CSV contains information about the flight: the day of the month, the carrier, what the origin airport was, what the destination airport ID is, and whatever delays are present, like departure and arrival. In this particular notebook I'm going to work on departure delays and see how the departures are affected by the airport and the carrier together.
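Just to illustrate, continuing from the session above, loading the two CSVs might look roughly like this; the HDFS paths and column comments are assumptions based on the fields mentioned, not the demo's actual files.

```python
# Hypothetical HDFS paths; the demo reads both inputs from HDFS.
airports = (
    spark.read.option("header", "true").option("inferSchema", "true")
    .csv("hdfs:///data/airports.csv")   # airport id, city, state, ...
)
flights = (
    spark.read.option("header", "true").option("inferSchema", "true")
    .csv("hdfs:///data/flights.csv")    # day of month, carrier, origin/dest airport ids, dep/arr delays
)
```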
B: So what I have done here is merge the two CSVs to get all the data together. After that I applied an aggregation function over the delay, the mean departure delay, and whatever information we got back we have written into a table. That kind of information will be useful for further analysis, let's say in model training, where, instead of just giving the departure airport, if we provide the average delay at that airport for the given carrier, it might be better for those models.
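A minimal sketch of that join-and-aggregate step, assuming hypothetical column and table names; the real notebook may differ in detail.

```python
from pyspark.sql import functions as F

# Join flights to airports on the origin airport id (column names are assumptions)
merged = flights.join(
    airports, flights["ORIGIN_AIRPORT_ID"] == airports["AIRPORT_ID"], "inner"
)

# Mean departure delay per origin airport and carrier
avg_dep_delay = (
    merged.groupBy("ORIGIN_AIRPORT_ID", "CARRIER")
          .agg(F.mean("DEP_DELAY").alias("mean_dep_delay"))
)

# The write is the only step the lineage listener records as a task;
# the reads above show up as the upstream datasets of this write.
avg_dep_delay.write.mode("overwrite").saveAsTable("flight_dep_delay_by_airport_carrier")

spark.stop()  # stop the session so the application run is finalized
```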
B: So that's why we created this and stored it into a Hive table, which will be utilized further by our pipeline, and then we just call spark.stop. I'm going to execute this notebook, and then we will see how this executed notebook gets reflected in DataHub. So let me open DataHub, and you will quickly see how it is being picked up. Right now it's executing.
B: Now here is the pipeline. If we go to the pipelines, we see Spark, because the pipeline was executed over Spark. Here it shows whatever master we have; if I had, let's say, spark-master:7077, then that is what would pop up here. I just click on local, and this is our pipeline. This is the name of the pipeline, which is reflected from our app name, flight departure analysis. So let's go inside flight departure analysis, into its documentation.
B: Here we can see when the job was started, what the application name was, who created or triggered the job, and what the ID of that app was inside the Spark context. In addition, here we will see tasks. In our notebook we have only one task which actually writes or updates something, and everything before it is read-only, so those things won't get recorded; only the step where we are actually storing something will get recorded.
B: So, as we can see, we have a saveAsTable, or whatever the native save method was. Here we can see it has properties, and in the properties we can see things like the application name and the complete description. The important thing here is that we have the query plan: the logical query plan is shown, along with the query execution ID. So we can actually relate this query execution ID, as well as the app ID, to the Spark history server, in case we want to link the two things together.

B: Inside this job the lineage is being created, so there are two upstream lineages and one downstream, because we read from the two CSV files and we write to one dataset. If we look, this is the dataset that was written, which is a Hive dataset, and the two before it are HDFS datasets, because we are reading them from HDFS and processing them. As you can see, they have different lineages: there is no upstream, there is only one downstream, which is being created by this job. So all the information related to this is present inside it, and we can see what...
A: Let's get out of this. Cool. So I will post both of these videos: one that shows how to do this with the notebook, and the other with the spark-submit option and how to add the jars via spark.conf. The documentation for this is also available on the website. What's coming next is Spark 3.x support, support for Databricks environments, and column-level lineage. There were a couple of questions around column-level lineage as well.
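For the spark-submit variant mentioned above, the same settings could be supplied as --conf flags (or in spark-defaults.conf); a rough sketch, with the package version and script name as placeholders:

```bash
spark-submit \
  --packages io.acryl:datahub-spark-lineage:<version> \
  --conf spark.extraListeners=datahub.spark.DatahubSparkListener \
  --conf spark.datahub.rest.server=http://localhost:8080 \
  flight_departure_analysis.py
```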