DataHub Tech Deep Dives, 8 Mar 2022

From YouTube: YAML-Defined Dataset Lineage

Description

Edward Vaisman (Wavelo) gives a demo of how you can define Dataset-to-Dataset lineage via YAML during the February 2022 Community Town Hall.

Learn more about DataHub: https://datahubproject.io
Join us on Slack: http://slack.datahubproject.io
Follow us on Twitter: https://twitter.com/datahubproject

A

Okay, so hi, my name is Edward Vaisman and I'm a senior Kafka administrator at Wavelo, here to talk about managing your lineage in DataHub through the use of YAML files. Just a quick note about Wavelo: we are a new software business that's on a mission to make telecom a breeze. All that means is that we want to provide flexible and modern software to communication service providers, or in even shorter terms, software as a service for telecom.

A

Okay, also, for those of you who have been on the internet for a while, Wavelo is a part of the Tucows family. So if that interests you at all, Tucows is hiring. All you have to do is go to tucows.com/careers, or you can message me directly on Slack in the DataHub community, and we can iron out some of the details.

A

Okay, so let's talk about DataHub. Being a member of the Kafka team here at Wavelo, we wanted to provide a data discovery platform to our organization, to help find out a little bit more about the event streams that we've been curating within our platform.

A

Over the past year, that's when we came across DataHub, and we've been toying around with it for the last month or so. But for anybody that's familiar with Kafka or messaging systems in general, there's actually very little out-of-the-box tooling when it comes to linking your datasets together.

A

Naming standards only really get you so far, which is why, when you run the DataHub recipe for Kafka metadata, at most you'll get the individual Kafka topics ingested and maybe some schema-related information if they make use of the Schema Registry.

A

So to help bridge that gap, we recently contributed a file-based lineage source to DataHub, which will allow you to start linking your datasets together. I would say it's best used as a sort of duct tape while teams are still going through the process of adopting a DataOps approach, but it's good to at least get you off the ground.

A

Let's take a look at DataHub right now. I've already gone through the Docker quickstart and I've ingested some data in here. Being a Kafka guy, I'm going to be interested in data inside Kafka, so I'm going to quickly browse, and I see the sample Kafka dataset. Cool.

A

It already has some lineage-associated information, but when I browse it, I notice that the sample Kafka dataset doesn't have all the information. It's actually a little bit misleading, in the sense that it actually sourced its data from other Kafka topics, which isn't apparent just by looking at the UI here. So let's go ahead and add some new lineage to it. The first thing I'm going to look at here is the recipe file.

A

The only thing you have to pay attention to in the recipe file is that there are two parameters: there's file, which is just a path to our lineage file in YAML format, and there's a field called preserve_upstream, which I'll talk about at a later point. The type up here you would set to datahub-lineage-file in the latest release on pip; right now, I'm just using the custom plugin that I have. Okay, so let's take a quick peek at the example lineage file.
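A recipe along the lines described above might look like the following sketch. The source type and the two parameters (file, preserve_upstream) come from the talk; the file path, the recipe filename, and the whole sink section are assumptions for illustration (datahub-rest on localhost:8080 is the usual target for a local Docker quickstart).

```yaml
# recipe.yml (hypothetical filename)
source:
  type: datahub-lineage-file
  config:
    # Path to the YAML lineage file; this path is an assumption.
    file: ./file_lineage.yml
    # Whether to keep lineage that already exists in DataHub.
    # The talk states that true is the default.
    preserve_upstream: true

# Assumed sink for a local quickstart instance; adjust to your deployment.
sink:
  type: datahub-rest
  config:
    server: http://localhost:8080
```

A recipe like this would typically be run with the DataHub CLI, for example `datahub ingest -c recipe.yml`.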

A

So your lineage file will kind of look like this: at the root you'll have a version (right now, that's only version 1), and you'll have lineage, and under lineage you'll have a list of entities and their upstream nodes. An entity comprises the name, the type, the environment, and the platform.

A

Currently, the only type that works is dataset, but that's perfect for what we need to do, and optionally you can provide upstreams for that dataset. So let's go back to the sample Kafka dataset. I'm going to grab this name.
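The structure just described can be sketched as below. The top-level version and lineage keys and the entity fields (name, type, environment, platform) are from the talk; the exact key spellings `env` and `upstream` are recalled from the DataHub file-based lineage documentation and should be checked against the release you use. The dataset names are placeholders, not the ones shown in the demo.

```yaml
version: 1
lineage:
  # One list item per dataset whose lineage you want to declare.
  - entity:
      name: downstream_topic      # placeholder name
      type: dataset               # the only supported type at the time of the talk
      env: PROD
      platform: kafka
    # Optional: the datasets this entity reads from.
    upstream:
      - entity:
          name: upstream_topic    # placeholder name
          type: dataset
          env: PROD
          platform: kafka
      - entity:
          name: upstream_bucket   # placeholder name
          type: dataset
          env: PROD
          platform: s3
```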

A

Sample dataset. I'm okay with this, I'm okay with this, I'm okay with this, and I want to add a new upstream for it. So I know that there's my Kafka upstream somewhere there, and maybe it's actually also grabbing data from another data source, so I'm going to keep this S3 dataset over here. I'm going to quickly just run this through datahub ingest.

A

That's with the REST sink.

A

Okay, so it's already produced one work unit for this URN. Let's go back to DataHub, I'm going to do a quick refresh, and boom: now we have lineage between our sample Kafka dataset and our other datasets. Currently only dataset-to-dataset connections work, but depending on how that data process PR goes,

A

We might also want to include something like an application in between, which is very important for an event streaming world where we have microservices that we may want to link in between topics. And as I promised, let's talk about preserve_upstream. By default this is true, and all this field does is determine whether or not you want to keep the lineage that already exists within DataHub.

A

So if I was to go inside this lineage file and I wanted to modify one of my upstreams, let's say remove that one and just keep my new upstream, and run the ingestion again.

A

The work unit is replaced, and when I refresh, the upstream just gets appended on. If you wanted to hard replace it, let's say we accidentally messed up and those upstreams are not current, we would set this to false and run it again.
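The hard-replace behaviour corresponds to flipping one flag in the recipe's source section. This is only a sketch: the file path is an assumed placeholder, and the sink section of the recipe is omitted here.

```yaml
source:
  type: datahub-lineage-file
  config:
    file: ./file_lineage.yml   # assumed path to the lineage file
    # false: upstreams already in DataHub are replaced by those in the file.
    # true (the default per the talk): the file's upstreams are appended.
    preserve_upstream: false
```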

A

Do a quick refresh, and now our dataset only has one upstream associated with it. And that's really all it takes to get lineage hooked up through a YAML file. Just a quick caveat: it only goes one layer deep, so you can't just start adding another upstream nested here. It won't look through it. So if you wanted to add lineage for this new upstream, you would start a new entity: name, whoops, type,

A

Dataset, environment PROD, oops, platform Kafka, and then you would provide another upstream for that. And that's all she wrote. Thank you.
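The one-layer-deep caveat can be illustrated with the sketch below: rather than nesting an upstream under an upstream, each dataset that has upstreams gets its own top-level entry under lineage. All dataset names are hypothetical, and the `env` and `upstream` key spellings are recalled from the DataHub file-based lineage documentation rather than taken from the talk.

```yaml
version: 1
lineage:
  # First entry: the dataset and its direct upstream.
  - entity:
      name: sample_dataset        # hypothetical name
      type: dataset
      env: PROD
      platform: kafka
    upstream:
      - entity:
          name: my_new_upstream   # hypothetical name
          type: dataset
          env: PROD
          platform: kafka
  # Second top-level entry: lineage for the upstream itself.
  # Nesting this under the upstream above would not be picked up.
  - entity:
      name: my_new_upstream
      type: dataset
      env: PROD
      platform: kafka
    upstream:
      - entity:
          name: another_upstream  # hypothetical name
          type: dataset
          env: PROD
          platform: kafka
```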