From YouTube: Etsy Adoption Journey
Description
Vishal Shah, Data Engineer, describes the adoption journey of DataHub at Etsy.
Recorded at: DataHub August 2022 Town Hall
So for anyone who is not familiar, Etsy is a two-sided, global marketplace for handcrafted and antique pieces, founded in 2005. It is an e-commerce platform based out of Brooklyn, New York, with offices and staff around the world.
To give an idea of the size and complexity of our data: Etsy has over 7 million active sellers, 93 million active buyers, and over 100 million unique listings.
Our production data is stored across hundreds of shards in MySQL, and we have a data warehouse in BigQuery, among other data sources as well.
Our platform has seen tremendous growth over the years, and with that, so has our data; and with the data, so has our metadata, which takes me to our journey to a new data catalog.
Both tools had been in maintenance mode for some time, and we knew it was time to reinvest in data discovery. Cut to April 2021: the data engineering teams at Etsy wrapped up our data warehouse migration from Vertica to BigQuery and formed our new Data Discovery team.
We learned that the main issues were around discovery and trust. It was hard to find the right datasets, and it was unclear which datasets were reliable, where they came from, and how they were being used. Much of this information came from tribal knowledge, which, at this stage of growth, was no longer sustainable for us.
I want to call out that, while this wasn't complex on the engineering front (there really wasn't much engineering at all that entire month), these interviews helped our team learn about the problem space better and become invested in our team's mission. So the next step was to find a solution, and we split up into squads to investigate all of our options.
There were quite a number: what if we extended Schemer? What if we built a new in-house tool? What if we paid for a fully managed solution that we could get off the ground faster? Or what if we used an open-source solution? So we dove into 30 tools over the course of a month and POC'd a couple of them. Our proof of concept of DataHub involved setting up a local instance, adding a custom source, and adding a custom aspect to the metadata model.
So after months of research, we were finally ready to move forward with DataHub. Now for the fun engineering part. Along with DataHub, there were many technologies that our team did not have much experience with, but we did have the advantage of other teams at Etsy owning instances of GKE and Kafka, and having expertise in BigQuery, to guide us along the way. We also set up a Cloud SQL database and used managed Elasticsearch in our infrastructure setup. So we started small and integrated along the way for ingestion.
After months of implementation work, by April of this year we were ready to launch DataHub at Etsy. To date, we have over 600 total users, and about 45 are active each month. We've ingested 11,000 datasets across five data platforms, and the number increases each day.
We're also well on our way to turning off our in-house lineage tool to consolidate all of our data discovery efforts into DataHub. It wouldn't be quite a journey if there weren't any bumps along the way, so I want to take the next couple of slides to highlight some of our learnings as they pertain to implementation and governance.
The cool part about using an open-source solution was that we weren't left in the dark, having to figure out every solution by ourselves, or having to fully rely on another team to fix all the issues for us. We were able to fit DataHub into our ecosystem, and made a couple of upstream contributions along the way.
Maybe the one with the most impact for us was around the BigQuery ingestion. We have projects at Etsy used only for storage, and other projects used for running queries and jobs. This didn't immediately work with the BigQuery ingestion, because at that time it only took in a single project ID.
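For context, a DataHub BigQuery ingestion recipe is a small YAML config along the lines of the sketch below. This is a minimal, illustrative fragment: the project and server values are placeholders, and field names vary between DataHub versions (at the time described here, the source accepted only a single project ID per recipe).

```yaml
# Minimal sketch of a DataHub ingestion recipe for BigQuery.
# Values below are placeholders; exact config fields depend on the
# DataHub version in use.
source:
  type: bigquery
  config:
    # At the time, only one project could be ingested per recipe,
    # which is awkward when storage and compute live in separate projects.
    project_id: my-storage-project
sink:
  type: datahub-rest
  config:
    server: http://localhost:8080
```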
However, I should add that not all changes were as seamless. For ingestions such as LookML, and some custom sources that read from GitHub repositories, we were surprised that we had to clone these repos into a custom image. And lastly, for MySQL ingestion, a hurdle that we came across was around how to profile sharded databases.
Profiling off of one host would not capture the full statistics for a table that is sharded over hundreds of hosts, and profiling against our production database was turning out to be a bit complex and expensive, and a concern for our production systems. So for now, this one is still on hold and something that we're discussing on our team.
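To make the sharding problem concrete, here is a minimal sketch (not Etsy's actual profiler) of why one host isn't enough: simple statistics like counts and min/max can be merged across shards, but only if you collect them from every host.

```python
from dataclasses import dataclass

@dataclass
class ShardProfile:
    """Column statistics as a profiler might collect them from one MySQL host."""
    row_count: int
    null_count: int
    min_value: int
    max_value: int

def merge_profiles(shards: list[ShardProfile]) -> ShardProfile:
    """Combine per-shard profiles into full-table statistics.

    Counts add across shards; min/max take the extremes. Note that
    distinct counts and quantiles do NOT merge this simply (they need
    sketches such as HyperLogLog), which is part of why profiling a
    sharded table is genuinely hard.
    """
    return ShardProfile(
        row_count=sum(s.row_count for s in shards),
        null_count=sum(s.null_count for s in shards),
        min_value=min(s.min_value for s in shards),
        max_value=max(s.max_value for s in shards),
    )

# Two shards of the same logical table: either one alone under-reports
# the row count and misses the true min or max.
shards = [
    ShardProfile(row_count=1_000, null_count=10, min_value=5, max_value=900),
    ShardProfile(row_count=2_000, null_count=0, min_value=1, max_value=450),
]
full = merge_profiles(shards)
print(full)  # ShardProfile(row_count=3000, null_count=10, min_value=1, max_value=900)
```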
So our approach is to start with ownership: to make adding an owner to a dataset extremely easy and accessible, directly in the creation code for that dataset, and then to add more governance rules from there. Even for lineage, for Airflow the question comes down to: do we ask users to add inlets and outlets manually to their DAGs?
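The ownership-at-creation idea can be sketched as follows. This is illustrative only: it builds the shape of a DataHub-style dataset URN and ownership payload as plain dicts, whereas a real pipeline would emit them through the DataHub SDK. The dataset name and owner username below are made-up placeholders.

```python
def make_dataset_urn(platform: str, name: str, env: str = "PROD") -> str:
    """Build a DataHub-style dataset URN for a given platform and name."""
    return f"urn:li:dataset:(urn:li:dataPlatform:{platform},{name},{env})"

def ownership_payload(owner_username: str) -> dict:
    """Ownership metadata: a single technical owner, keyed by corp-user URN.

    In the approach described above, a call like this would live directly
    in the code that creates the dataset, so ownership is never an
    afterthought.
    """
    return {
        "owners": [
            {"owner": f"urn:li:corpuser:{owner_username}",
             "type": "TECHNICAL_OWNER"}
        ]
    }

# Hypothetical example values, not real Etsy identifiers:
urn = make_dataset_urn("bigquery", "my-project.sales.orders")
aspect = ownership_payload("jdoe")
print(urn)
print(aspect)
```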
So that's all I have for today. I want to give a huge thank you to the DataHub team for all of the help and support along our journey, with our many questions. Also a special shout-out to the Data Discovery team at Etsy for their support with this presentation. I can't wait to see all the cool stuff that we work on with DataHub, and if you have any questions or success stories that you would want to share with us, feel free to reach out to me on the DataHub Slack.