DataHub Metadata Day 2022, 27 May 2022

From YouTube: Community Favorite Hack: Single Click Data Exploration in DataHub

Description

Dane Curran, Edward Morgan, Levi Le from Raft submitted this hack during the inaugural Metadata Day 2022 Hackathon, winning Community Favorite Hack!

"We decided to tackle one of the biggest asks we've had from data scientists and analysts interested in using Datahub to look for interesting datasets: how can I easily get from a user-friendly metadata UI (Datahub) to a place where I can interact with the data? To accomplish this, we wanted to add a new "Visualize with Superset" button in Datahub that links to the dataset loaded in Apache Superset, a tool that can be used for SQL exploration and data visualization."

See the code submission here: github.com/raft-tech/datahub-metadata-day-2022

A

All right, hi, and thank you for reviewing our submission. My name is Edward Morgan and I am a data engineer at Raft.

A

So what we decided to do for our hackathon proposal was to add a feature to DataHub that we've had requested more than once: how do you get from the DataHub UI, which is this really rich way to visualize, view, and search for metadata about a dataset, to actually playing with the data? Even if it's in a web or browser-based fashion, how do I get to actually see what is in those datasets?

A

So that's what we did. Just to take you through where we're at: we are deployed on Kubernetes. I'm running locally in kind, which is Kubernetes in Docker, and we've got Kafka, Elasticsearch, Postgres, Superset, and Trino up. Trino and Postgres are where we're sourcing data from; Trino is SQL over everything.

A

The datasets that we're actually looking for are in Postgres. We ingested them using the Trino plugin, and that's how we'll access them. Superset is an Apache open source project, and it provides a really easy way to explore and visualize your data. It has a UI in the browser where you can look at things in a SQL editor and then generate charts from those. So what we wanted to do was have a single-click way of going from DataHub to Superset.

A

So let's say I'm a data user. I'm looking in Trino, and there are these two datasets that I'm interested in, these AIS datasets from NOAA. We can see the schema and maybe some documentation, some properties, some ownership. We know how large they are, but if we actually want to visualize them and view them, that's where we're adding some functionality.

A

So we see up here we added a new button to the sidebar of the dataset entity to visualize with Superset. If we click that, what happens behind the scenes is that the DataHub front end calls a mutation that we made to go back to DataHub GMS. On GMS, it carries out a sequence of steps to set up the data connection in Superset to Postgres, where the data actually resides.

A

It's also connecting through Trino to do this. We use Trino quite heavily at Raft, so we figured it would be good to include that in our hackathon.

A

So the steps that we have to go through for setting up this connection are more than just opening a link with a query.

A

In addition to taking information like the dataset name and maybe the location, the DNS name, we also have to set up the connection from Superset to where the data resides, and we have some information about that in DataHub.

A

That comes from the dataset name, the data platform name, and information that we can include in the deployment, like environment variables for the DataHub GMS pod, which is how we passed in some information. That's something we could expand on outside of a hackathon: how do we pass that in a more dynamic way, for example using secrets for passing in credentials? So from the data user's perspective, we are in DataHub, we click a button, and we're in Superset. We can run a query and see the results in Superset.
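The derivation described above (dataset name and platform from DataHub, connection details from environment variables) can be sketched roughly as follows. This is a minimal illustration, not the team's code, which lives in the linked repository. The environment variable names (`TRINO_HOST`, `TRINO_PORT`, `TRINO_USER`), the helper names, and the example dataset name are all hypothetical, and the sketch assumes a three-part `catalog.schema.table` dataset name and the `trino://user@host:port/catalog` SQLAlchemy URI form.

```python
import os
import re
from dataclasses import dataclass
from typing import Mapping, Optional

# DataHub dataset URNs look like:
#   urn:li:dataset:(urn:li:dataPlatform:<platform>,<name>,<env>)
_URN_RE = re.compile(
    r"^urn:li:dataset:\(urn:li:dataPlatform:"
    r"(?P<platform>[^,]+),(?P<name>[^,]+),(?P<env>[^)]+)\)$"
)


@dataclass(frozen=True)
class DatasetRef:
    platform: str
    catalog: str
    schema: str
    table: str
    env: str


def parse_dataset_urn(urn: str) -> DatasetRef:
    """Split a dataset URN into platform, catalog, schema, table, and env."""
    match = _URN_RE.match(urn)
    if match is None:
        raise ValueError(f"not a dataset URN: {urn!r}")
    parts = match.group("name").split(".")
    if len(parts) != 3:
        # Assumption: Trino datasets are named catalog.schema.table.
        raise ValueError(f"expected catalog.schema.table, got {match.group('name')!r}")
    catalog, schema, table = parts
    return DatasetRef(match.group("platform"), catalog, schema, table, match.group("env"))


def trino_sqlalchemy_uri(ref: DatasetRef, env: Optional[Mapping[str, str]] = None) -> str:
    """Build a connection URI from (hypothetical) deployment environment variables."""
    env = os.environ if env is None else env
    host = env.get("TRINO_HOST", "trino")
    port = env.get("TRINO_PORT", "8080")
    user = env.get("TRINO_USER", "datahub")
    return f"trino://{user}@{host}:{port}/{ref.catalog}"


def sample_query(ref: DatasetRef, limit: int = 100) -> str:
    """Starter query to preload in the SQL editor."""
    return f'SELECT * FROM "{ref.schema}"."{ref.table}" LIMIT {limit}'


if __name__ == "__main__":
    ref = parse_dataset_urn(
        "urn:li:dataset:(urn:li:dataPlatform:trino,postgres.public.noaa_2020,PROD)"
    )
    print(trino_sqlalchemy_uri(ref))
    print(sample_query(ref))
```

Reading credentials from plain environment variables mirrors the hackathon shortcut mentioned in the talk; a secrets mechanism would replace the `env` lookup.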

A

You can also do things like create charts and dashboards and then view that lineage in DataHub, which is really nice. And just to show that it's not all hard-coded: we've got this NOAA 2020 dataset, and there's also a NOAA 2021 dataset, so we can do the same thing, and we see that the NOAA 2021 dataset is what's being selected here. It's reusing the database connection that it set up previously, so it's sort of lazily evaluating that and then setting up the sample query and everything from there.
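The "reuse the connection if it already exists" behavior described here is a get-or-create pattern. The sketch below shows the idea with an in-memory stand-in rather than a real Superset instance. The `SupersetClient` interface, its method names, and the example connection name and URI are hypothetical and do not represent Superset's actual API or the team's implementation.

```python
from typing import Dict, Optional, Protocol


class SupersetClient(Protocol):
    """Hypothetical minimal interface for the calls this step needs."""

    def find_database(self, name: str) -> Optional[int]: ...

    def create_database(self, name: str, sqlalchemy_uri: str) -> int: ...


class InMemorySuperset:
    """Stand-in client so the sketch runs without a Superset deployment."""

    def __init__(self) -> None:
        self.databases: Dict[str, int] = {}
        self.create_calls = 0

    def find_database(self, name: str) -> Optional[int]:
        return self.databases.get(name)

    def create_database(self, name: str, sqlalchemy_uri: str) -> int:
        self.create_calls += 1
        db_id = len(self.databases) + 1
        self.databases[name] = db_id
        return db_id


def ensure_database(client: SupersetClient, name: str, sqlalchemy_uri: str) -> int:
    """Return the existing connection id, creating the connection only if missing."""
    existing = client.find_database(name)
    if existing is not None:
        return existing
    return client.create_database(name, sqlalchemy_uri)


if __name__ == "__main__":
    client = InMemorySuperset()
    uri = "trino://datahub@trino:8080/postgres"
    first = ensure_database(client, "trino-postgres", uri)   # creates the connection
    second = ensure_database(client, "trino-postgres", uri)  # reuses it
    print(first == second, client.create_calls)
```

With this shape, clicking the button for a second dataset on the same platform only needs a new sample query; the connection setup is skipped.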

A

So we intend to expand upon this to make it a little more full-featured, and also to support data sources other than Postgres over Trino. Trino is nice because it runs SQL over everything, but including other data sources like Kafka would be useful. We decided to tie this all into DataHub because DataHub is the central way in which people are viewing datasets, and we think this is a really great addition because it allows people to go directly from dataset exploration to data exploration.

A

We think that's really important. So I think that's about all for our hackathon topic. We've pushed our code up, and we hope that you take a look and that you enjoy it. Thank you for giving us the opportunity.