From YouTube: Case Study: Contributing to DataHub
Description
Eric Cooklin (Stash) shares his experience collaborating with the DataHub Community and contributing back to the Open Source Project.
Learn more about DataHub: https://datahubproject.io
Join us on Slack: http://slack.datahubproject.io
Follow us on Twitter: https://twitter.com/datahubproject
Recently at Stash we added some dataset transformers to the upstream repository, and we were asked if we would like to explain why we did that, what the use case was, and what our experience was like, because this was also our first time contributing to an open source project. So how easy was it? Were there any snags? Spoiler alert: it was pretty easy.
A little bit of background: my name is Eric, and I'm joining you from Austin, Texas. I've been a data engineer at Stash for two years. Stash is an industry-leading subscription platform empowering everyday Americans to build wealth. As you may know, middle-class Americans are struggling to build wealth: the U.S. stock market is the largest generator of wealth in our history, and yet 45 percent of Americans aren't currently invested, and the majority of Americans live paycheck to paycheck.
Stash is a personalized financial platform with tools for investing, banking, insurance, and more that helps people grow their wealth for the long term (more on that after the presentation). This journey started back in Q2 of last year, when our team was looking at the data catalog offerings out there, both from paid vendors and from the open source community. Like many of you, we ended up choosing DataHub because we really fell in love with the product.
It was one of the only options that really matched our technology footprint, and the openness of the platform meant it could grow as we grew. Some of the cool features, like data lineage and the business glossary, were really appealing to us, and we knew we wanted something with strong support. This open source community seemed really active and really strong, and as we learned, that's a huge benefit for us and probably for a lot of others as well.
In Q3 we started using the tool internally and began onboarding some of our business teams, trying to capture what they know by working with their SMEs and analysts.
We identified some low-hanging fruit to add to DataHub to make sure the out-of-the-box user experience would be really strong. On top of the metadata that comes from the various platforms we're ingesting, we wanted to add Confluence links, create some tags, and add to the business glossary, to really help the first users get familiar with the platform and have that wealth of data right at their fingertips.
But the big question was: how do you add these into DataHub efficiently? From the meetings with the business teams we knew which datasets needed which terms, tags, and links, but there wasn't really an out-of-the-box way to get that information into DataHub.
A
There's
the
graphql
api,
which
kind
of
was
the
obvious
solution,
we'll
just
read
it
in
send
it
up
blah
blah
blah
it
works,
but
reading
into
kind
of
the
source
code
we
identified
that
hey.
There's
data
set
transformers
out
there
that
already
modify
the
data
set
entities.
What
if
we
can
just
use
those
ourselves
to
add
like
tags
or
business
glossary
terms,
so
that's
kind
of
talking
with
the
project
management.
So we used a dataset transformer that was already available, pattern_add_dataset_ownership, whose config is on the screen right here. In the rules section there is a regex pattern and then an array of URNs, and the dataset transformer applies all the URNs in the array to whatever datasets match the regex pattern. So we figured this should be pretty straightforward to copy over for glossary terms, tags, and so on.
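The rules mechanism described above can be sketched in plain Python. This is only an illustration of the pattern-to-URN mapping logic, not DataHub's actual config classes; the dict keys and helper name here are hypothetical:

```python
import re

# Hypothetical stand-in for the transformer's "rules" config:
# each regex pattern maps to an array of URNs to attach to matching datasets.
rules = {
    r".*\.users$": ["urn:li:corpuser:data-eng"],
    r".*\.payments.*": ["urn:li:corpuser:payments-team"],
}

def urns_for_dataset(dataset_urn: str, rules: dict) -> list:
    """Collect every URN whose regex pattern matches the dataset URN."""
    matched = []
    for pattern, urns in rules.items():
        if re.search(pattern, dataset_urn):
            matched.extend(urns)
    return matched

print(urns_for_dataset(
    "urn:li:dataset:(urn:li:dataPlatform:snowflake,prod.payments_ledger,PROD)",
    rules,
))  # -> ['urn:li:corpuser:payments-team']
```

Because the mapping is just "regex over the dataset URN, then append URNs", the same shape works whether the URNs being applied are owners, tags, or glossary terms.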
And it turns out it was pretty straightforward. If you're not familiar with how this works, here's a very quick overview, because I don't want to take too much time here: when the pipeline runs, right before it writes each record to whatever sink is defined, it passes the record to a transform function, and the transform function simply says: if there's a transformer, transform the record, as shown on the right side of the screen.
It does that using a method called transform_one, which, as you can see, takes in a metadata change event and outputs a metadata change event. So that's what we'd be extending. As an example, here's what we did for dataset terms: on the left side of the screen, the dataset terms class just extends that dataset transformer and, most importantly, its transform_one method actually does something.
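In spirit, the extension looks something like the sketch below. The class and method names mirror what the talk describes (a base transformer whose transform_one maps one metadata change event to another), but the types are simplified stand-ins rather than DataHub's real MCE and builder classes:

```python
import re
from dataclasses import dataclass, field

@dataclass
class MetadataChangeEvent:
    """Simplified stand-in for an MCE: a dataset URN plus its glossary terms."""
    dataset_urn: str
    glossary_terms: list = field(default_factory=list)

class DatasetTransformer:
    """Base class: transform_one maps one MCE to one MCE (default is a no-op)."""
    def transform_one(self, mce: MetadataChangeEvent) -> MetadataChangeEvent:
        return mce

class PatternAddDatasetTerms(DatasetTransformer):
    """Attach glossary-term URNs to every dataset whose URN matches a rule's regex."""
    def __init__(self, rules: dict):
        self.rules = rules  # regex pattern -> list of glossary-term URNs

    def transform_one(self, mce: MetadataChangeEvent) -> MetadataChangeEvent:
        for pattern, term_urns in self.rules.items():
            if re.search(pattern, mce.dataset_urn):
                mce.glossary_terms.extend(term_urns)
        return mce

# The pipeline calls transform_one on each record just before writing to the sink:
transformer = PatternAddDatasetTerms({r".*\.accounts$": ["urn:li:glossaryTerm:PII"]})
mce = transformer.transform_one(MetadataChangeEvent("prod.accounts"))
print(mce.glossary_terms)  # -> ['urn:li:glossaryTerm:PII']
```

The design point the talk makes is that all the plumbing (the pipeline hook, the config parsing) already exists in the base and the ownership transformer, so a new entity aspect only needs its own transform_one body.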
It doesn't just pass: we use the MCE builder to add a glossary term to the MCE and then pass it back. That was already implemented by the pattern_add_dataset_ownership class, so for us it was literally copy and paste; that's how straightforward and easy it was. And on the right side of the screen is the magic that makes pattern_add_dataset_ownership work: there's already a config model that takes in the regex patterns and parses them through.
So we really just copied that implementation and used it, and that's the crux of it all. This was our first time contributing, so we didn't go for some huge feature. We just thought: we have this problem, and maybe others have the same problem, so let's explore it and contribute it back upstream. If the community doesn't like it, they can always just pass on it, so the risk was really low.
We had the flexibility from project management and, most importantly, I think we had the community to back us up. Reading through the Slack channels and through the GitHub was pretty invaluable to our experience: we could see that other people were hitting the same problems, like setting up the dev environment, and that they were really easy to fix. With the documentation, the environment setup, the testing, and all that, the contributing experience was pretty straightforward and really easy.
I think we had a few comments from Shirshanka and the team, but honestly it was really straightforward, and I really can't stress how invaluable that Slack communication was. Using the search function, there's a lot of good information in the community already.
Obviously we probably wouldn't have gotten as far as we did without that, even with an addition as simple as ours. Thank you to the Stash team for all your support. And if you're interested in joining our team, there's a link on your screen right there, stash.com/about/careers. We're growing, and if our mission sounds like something you'd love to help build out, please reach out to me either on Slack or on LinkedIn.