From YouTube: DataHub @ hipages Case Study: Oct 29 2021
Description
Chris Coulson from hipages shares their experience adopting DataHub to supercharge data discovery, lineage, quality, and ownership. Chris shares their core use-cases, an overview of their modern data stack, and how they are leveraging the DataHub lineage graph to identify the most influential datasets & assign ownership.
Join us at our next Town Hall - RSVP here: https://forms.gle/g8EpCLnohtPLLtdg6
So my team is responsible for data orchestration around the company and looks after all of the machine learning components, both internally and externally facing. We've recently been on a journey where we did a technology evaluation, found that DataHub was the best tech for some of the problems we wanted to solve, and subsequently deployed DataHub and integrated it with our Airflow instance.
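Just to make that integration concrete, here is a minimal sketch of the kind of Airflow-to-DataHub hookup this enables, using the inlets/outlets lineage support from the acryl-datahub Airflow integration. The DAG, platform, and table names are placeholders, not our actual pipelines.

from datetime import datetime
from airflow import DAG
from airflow.operators.bash import BashOperator
from datahub_provider.entities import Dataset  # shipped with acryl-datahub[airflow]

# Declaring inlets/outlets on a task lets the DataHub lineage backend
# record table-level lineage each time the task runs.
with DAG(
    dag_id="copy_users",  # hypothetical DAG name
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
) as dag:
    copy = BashOperator(
        task_id="copy_users_to_lake",
        bash_command="echo 'placeholder for the real copy job'",
        inlets=[Dataset("mysql", "hipages.users")],       # hypothetical source table
        outlets=[Dataset("athena", "lake.users_daily")],  # hypothetical lake table
    )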
I just wanted to talk you through some of the things we've achieved and some of the cool things we've implemented along the way. Before we do that, though:
Just to give you a quick introduction to hipages. hipages is a two-sided marketplace in Australia, where we connect consumers, people who want a DIY job done or want something built, with tradesmen who can do that work. The company was founded 16 years ago and we recently listed on the ASX. We are growing, which is really cool, and our data team is still growing as well. To give you an introduction to the data team: this is the whole team.
A
We
are
focused
on
helping
the
company
start
to
make
data
driven
decisions
and
the
way
we're
doing
that
is
by
opening
out
all
of
our
data
assets
and
hopefully
helping
people
from
all
parts
of
the
company
consume
data
and
care
about
data,
and,
through
that
start,
making
decisions
based
on
real
insights
and
information
that
we
draw
around
how
people
interact
with
our
site
or
how
people
how
some
of
our
consistent
customers
use
our
platform.
With that in mind, people use data across the whole company, but some questions come up continually, and they can broadly be broken down by the specialism people work in.
The first group to think about is our data analysts and data scientists. They're principally interested in where they can find data and how that data is generated: what tables are available, what information is in those tables, and how that information was produced.
When you talk to our data engineers, they're really interested in what happens if a pipeline goes down: who do they need to alert? Who is the downstream consumer of that data, both in terms of processes and, when it's surfaced up to dashboard level, who actually cares about that data? What we don't want is for consumers of the data to be the ones telling us there's a problem; we want to be ahead of that, and we want to be able to fix it.
The other problem we have is with complex, interdependent pipelines. It can sometimes be quite difficult to work out, when there's an upstream failure, what we need to re-run in order to recover all of the downstream data. So we need to worry about lineage from that point of view, and we also care about how we can prevent these failures in the future.
These are a collection of common questions and issues we saw across our data ecosystem, and as we thought about how to solve them, we could break them down into essentially four different problems. Those problems really highlighted that we needed some kind of system, a metadata cache, that would help us surface all of these things, and when we came across DataHub we found it was the solution for that.
So, talking through the different problems we were seeing and how we were going to solve them: the first thing that became clear is that we needed to find a way of making our data discoverable, so that people can search through the information and data we have available and find what they need.
The other thing we found is that we need to be proactive about quality, so we needed to start thinking about profiling, and understanding what our data looks like when it's in flight. The final thing is ownership. When we start thinking about ownership, we start making people accountable and encourage responsibility for their datasets and data assets, and we start encouraging them to document those assets properly and really explain them. That all comes together as we migrate towards a mesh architecture.
We actually developed our own Helm charts to bring up the persistent storage layers, and then used the community Helm charts to deploy the DataHub components. We also switched on things like the authentication layers, just to make sure it was all nicely secured, and it seems to be working well at the moment.
So, to go back to the problems we were trying to solve, let's first look at discovery. The thing with discovery is we wanted to make sure that people could find all the tables, or find all the dashboards that have already been created, so that we could start sharing ideas and knowledge, take it out of that tribal world, and bring it into a more discoverable pattern.
What we've used is the recipes available through DataHub to ingest all of our metadata, from both our consumption and processing workloads, and really start surfacing the information people need. To give you an example, we can now use the free-text search to find tables, dashboards, and processing steps related to specific keywords, and through that we're starting to share ideas in a better, more controlled way.
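For a flavor of what one of those ingestion recipes looks like, here is a minimal sketch of a MySQL ingestion run expressed through DataHub's Python Pipeline API rather than a YAML recipe; the host, database, credentials, and GMS address are placeholders.

from datahub.ingestion.run.pipeline import Pipeline

# Pull table and schema metadata from a MySQL source and push it to DataHub.
pipeline = Pipeline.create(
    {
        "source": {
            "type": "mysql",
            "config": {
                "host_port": "prod-db.example.com:3306",  # placeholder host
                "database": "hipages",                    # placeholder database
                "username": "datahub_reader",             # placeholder credentials
                "password": "change-me",
            },
        },
        "sink": {
            "type": "datahub-rest",
            "config": {"server": "http://datahub-gms:8080"},  # placeholder GMS address
        },
    }
)
pipeline.run()
pipeline.raise_from_status()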
The next thing to talk about is lineage, and we've done a lot of work in this space, principally because the way our system is architected made it quite easy to bring up lineage covering quite a lot of our processing stack very quickly. We'll break this down into the three main areas; we've done a bit more work than this, but just to give you a flavor, here are the three main things we've done. First, we've enabled RDS log inspection to give us lineage inside the production environment.
To break that down: we currently run a lot of cron jobs on our MySQL instance, and what we did is add labels to those cron jobs so that we'd be able to identify them in the RDS logs. Now we run inspection on the RDS logs to analyze all the queries being executed in that environment, which provides lineage for the queries operating in production without really touching the production code, and without having to inject extra dependencies or affect performance on the production side.
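As an illustration only, not our exact parser, here is a sketch of that inspection idea: assume each cron job tags its SQL with a comment such as /* job:nightly_rollup */, scan the RDS log for tagged statements, and hand each one to a SQL lineage parser (the open-source sqllineage package is used here for illustration).

import re
from sqllineage.runner import LineageRunner  # third-party SQL lineage parser

# Hypothetical label convention: each cron job prefixes its SQL with
# a comment like  /* job:nightly_rollup */  so it shows up in the logs.
LABEL = re.compile(r"/\*\s*job:(?P<job>[\w-]+)\s*\*/")

def lineage_from_log(log_lines):
    """Yield (job_name, source_tables, target_tables) for each labelled query."""
    for line in log_lines:
        match = LABEL.search(line)
        if not match:
            continue
        parsed = LineageRunner(line)
        yield (
            match.group("job"),
            [str(t) for t in parsed.source_tables()],
            [str(t) for t in parsed.target_tables()],
        )

# Example over a single hypothetical log line:
log = ["/* job:nightly_rollup */ INSERT INTO stats.daily SELECT * FROM prod.events"]
for job, sources, targets in lineage_from_log(log):
    print(job, sources, targets)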
The next thing we looked at is how we do our snapshots. We use a generation pattern to generate DAGs that copy tables from our MySQL instance into our lake, so we were quickly able to add lineage for all of the source-to-lake transformations that take data from the operational data store and put it into the data lake.
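A minimal sketch of what each generated copy task can emit, using DataHub's Python REST emitter; the platform names, table names, and GMS address are placeholders.

from datahub.emitter.mce_builder import make_dataset_urn, make_lineage_mce
from datahub.emitter.rest_emitter import DatahubRestEmitter

emitter = DatahubRestEmitter(gms_server="http://datahub-gms:8080")  # placeholder address

def emit_copy_lineage(source_table: str, lake_table: str) -> None:
    """Record that a lake table is a straight copy of an operational table."""
    mce = make_lineage_mce(
        upstream_urns=[make_dataset_urn("mysql", source_table)],  # operational store
        downstream_urn=make_dataset_urn("athena", lake_table),    # lake copy
    )
    emitter.emit_mce(mce)

# e.g. one call per generated copy task (hypothetical table names):
emit_copy_lineage("hipages.jobs", "lake.jobs_snapshot")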
The next thing we've done, as we migrate towards a mesh architecture, builds on something we've actually had in place for a long time: the idea of SQL templating. People can write SQL, and if it describes something like a daily snapshot, we can template it with Airflow using the temporal values; for example, we might say: do this lake-to-lake transformation on a daily basis.
By analyzing the templated SQL, we can then generate the lineage of that lake-to-lake transformation as a pre-execution step in the Airflow instance and emit the lineage into DataHub. The last bit is still in flight: tracing lineage all the way back from the dashboards, from the dashboard queries that generate all the insights and visualizations, right back through to the source.
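To make the templated-SQL step concrete, here is a sketch under stated assumptions: render the Airflow-style {{ ds }} placeholder the way Airflow would, then parse the rendered SQL for source and target tables (again using the sqllineage package for illustration) so lineage can be emitted before the query ever runs. The SQL and table names are made up.

import jinja2
from sqllineage.runner import LineageRunner

# Hypothetical daily lake-to-lake transformation written by an analyst.
TEMPLATED_SQL = """
INSERT INTO lake.daily_job_stats
SELECT job_id, COUNT(*) AS events
FROM lake.job_events
WHERE event_date = '{{ ds }}'
GROUP BY job_id
"""

def pre_execution_lineage(sql_template: str, ds: str):
    """Render the template as Airflow would, then extract table-level lineage."""
    rendered = jinja2.Template(sql_template).render(ds=ds)
    run = LineageRunner(rendered)
    return run.source_tables(), run.target_tables()

sources, targets = pre_execution_lineage(TEMPLATED_SQL, "2021-10-29")
print(sources, targets)  # -> [lake.job_events] [lake.daily_job_stats]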
We'll quickly rattle through the last two bits we've done. First, data quality. DataHub recently released data profiling steps for tables and we're trying to use those, but it turns out that it's quite expensive and computationally complex to analyze all of your tables.
So what we've done is try to identify some of our key tables and just profile those. We tried profiling in Athena and it was very expensive, both computationally and, because of the way Athena is architected, from an AWS billing point of view as well: it was taking a lot of time to process even though not a lot of data was being generated. So we needed to identify and target our most influential tables.
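Sketching how that targeting might look in an ingestion recipe, restricting DataHub's SQL-source profiling to an allow-list of key tables; the region, workgroup, and table patterns are placeholders, and the key tables themselves come out of the graph analysis described next.

from datahub.ingestion.run.pipeline import Pipeline

# Profile only an allow-list of key tables instead of the whole lake.
pipeline = Pipeline.create(
    {
        "source": {
            "type": "athena",
            "config": {
                "aws_region": "ap-southeast-2",  # placeholder region
                "work_group": "primary",         # placeholder workgroup
                "profiling": {"enabled": True},
                "profile_pattern": {
                    # hypothetical key tables picked via the lineage-graph analysis
                    "allow": ["lake\\.jobs_fact", "lake\\.tradie_leads"]
                },
            },
        },
        "sink": {
            "type": "datahub-rest",
            "config": {"server": "http://datahub-gms:8080"},
        },
    }
)
pipeline.run()
pipeline.raise_from_status()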
We took our lineage graph and started running some graph analytics over the top of it. When it comes to identifying key tables, you've got three types. The first is tables that we know are key to the business and are consumed by some of the most important decision makers around the company, so we profile those. The second is tables that are consumed a lot by different ETL processes.
If a table is consumed in a lot of different processes, that suggests it's a very key table, and that can be identified in the lineage graph by looking at the out-degree of the node, that is, the number of downstream processes consuming it.
The last thing you can do is look for influential nodes, using things like centrality measures on the graph. By doing that, we can identify tables that might not be consumed by a lot of processes directly, but are very influential downstream.
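Here is a small sketch of both measures over a toy lineage graph, using the networkx library; the edges are made up for illustration.

import networkx as nx

# Toy lineage edges as (upstream, downstream) pairs.
edges = [
    ("src.jobs", "lake.jobs"),
    ("lake.jobs", "lake.job_stats"),
    ("lake.jobs", "dash.jobs_report"),
    ("lake.job_stats", "dash.exec_kpis"),
    ("lake.job_stats", "lake.weekly_rollup"),
]
G = nx.DiGraph(edges)

# Heavily consumed tables: many direct downstream consumers (out-degree).
most_consumed = sorted(G.out_degree(), key=lambda kv: kv[1], reverse=True)

# Influential tables: high centrality even with few direct consumers.
centrality = nx.betweenness_centrality(G)
most_influential = sorted(centrality.items(), key=lambda kv: kv[1], reverse=True)

print(most_consumed[:3])
print(most_influential[:3])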
The last thing we can look at is ownership. Again, this is a project that's in flight and in part depends on the wider engineering team. What we can do is use our lineage graph to start looking for clusters of tables that all get processed together. By doing that, we can identify groups of processing steps, or groups of tables, that are commonly used together; from there we can start defining domains within our tables and then assigning ownership of those domains.
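Again as a sketch, community detection over the same kind of graph is one way to propose those domains; this uses networkx's modularity-based communities over made-up edges.

import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

# Toy lineage edges; in practice these come from the DataHub lineage graph.
edges = [
    ("src.jobs", "lake.jobs"),
    ("lake.jobs", "lake.job_stats"),
    ("src.tradies", "lake.tradies"),
    ("lake.tradies", "lake.tradie_leads"),
    ("lake.job_stats", "dash.exec_kpis"),
]
G = nx.DiGraph(edges)

# Tables that are processed together cluster into the same community;
# each community is a candidate domain to hand to an owning team.
for i, domain in enumerate(greedy_modularity_communities(G.to_undirected())):
    print(f"domain {i}: {sorted(domain)}")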
What we've got here is a lineage graph generated using all the processes I've just described. You can see that this is a source table in MySQL; an Airflow process links it through, and from there you can start tracking all the lake-to-lake transformations and how it's consumed downstream. You can see that this table informs a lot of other tables, and the way it's consumed becomes quite complex.
You can see all these downstream processes here. And just to give you a flavor of what we're working on at the moment: this is a recent feature we've released, and it's being tracked by some dashboards in Redash. You can see here that the Redash dashboard tracks through to a chart, and that the chart consumes this specific dataset here, which we'll link in once we move this to production.
Just to say thanks to the team. I obviously didn't do all this work; this is the talented group of people that actually did the work I'm presenting, so thanks to all of them. And yeah, thank you for your time. Sorry if I'm not quite awake, it's quite early in the morning for me, so I'm sure I look very sleepy. If you need to get hold of me, reach out to me on the DataHub Slack channel. Other than that: we're hiring.