DataHub Humans of DataHub, 22 Apr 2022

From YouTube: Humans of DataHub - Steven Po, Senior Data Engineer from Coursera

Description

Elizabeth Cohen and Maggie Hays sat down with Steven Po, Senior Data Engineer from Coursera - a global online learning platform - to hear about how he and his team are using DataHub to manage their increasingly complex metadata landscape.

Learn more about DataHub: https://datahubproject.io
Join us on Slack: http://slack.datahubproject.io
Follow us on Twitter: https://twitter.com/datahubproject

A

Well, welcome, folks, to Humans of DataHub. Today we are joined by Steven. Steven, I'll pass it over to you: if you could give us an introduction about who you are and what you do, we'll go from there.

B

Absolutely. Hi, everybody, I'm excited to be here with you all today. I'm Steven Po, a senior data engineer on the data engineering team at Coursera. Previously I've worked at various companies, ranging from early-stage startups to mature enterprises. For those who might not be familiar with Coursera: Coursera was founded by Daphne Koller and Andrew Ng in 2012, and today we're a global online learning platform and also a B Corp. We offer anyone, anywhere, access to online courses and degrees from leading universities and companies.

A

So then, Steven, tell us a little bit about your experience finding DataHub. How did you learn about it? What brought you to the community?

B

Oh yeah, absolutely. For a bit of context, our data stack at Coursera has evolved over time. We started off with a few online databases, pulling data into Redshift and then exposing it to Looker, and all of that was orchestrated within an internal solution that we have. Over time, the number of tools we have has expanded: we've added components like Databricks, Airflow, Amplitude, etc. The number of data assets has grown over time as well.

We're hitting pretty solid figures, I think five figures, from a storage perspective as well as a visualization perspective. All of that calls for an increased focus on metadata around our data stack, so we started looking into what open source solutions are out there. We came across a few, including DataHub, and while we were researching and conducting a POC, we found that DataHub aligns with our needs a lot in terms of data documentation, a unified search experience, lineage information, and additional metadata.

After that POC, our interest in DataHub increased. We were also very impressed with the vibrant and supportive community, and thanks also to Acryl Data for helping to drive that open source community.

A

Yeah, awesome. So where are you all right now in your adoption journey? You're still in the evaluation stage, is that right?

B

Yeah, we're at the tail end of evaluation. We do have a roadmap for adopting some sort of metadata solution, but there were some specific things that we wanted to enable within our organization, and those fell into, I would say, three major buckets: discoverability of data, data management, and data governance.

Some example use cases: for discoverability, if we have datasets available, where are they, and who do we go to with questions? Even that would be a major win. In terms of data management, if a production job goes down, who do we end up notifying about the downstream impact on Looker or whatnot? And if something upstream changes, what happens downstream? In terms of governance, there are use cases around GDPR: what happens if we delete or anonymize a certain record in the online database, and how does that trickle down into all of our downstream visualizations? All of those questions are things that we wanted to enable with DataHub.

A

Awesome. Can you share a little bit about what you enjoy most about the DataHub community?

B

Oh yeah, absolutely. As alluded to earlier, I'm very excited to see a lot of talented individuals, everybody willing to share their ideas and experiences. Overall it's very responsive and very vibrant. And special kudos to Acryl Data again for helping drive that community and increase adoption, and also for being supportive, as always, with our non-stop questions.

A

Yeah, it's been a lot of fun. A really big shout-out right back to you all: you and your team have been really great partners in giving us feedback and helping us understand use cases and flesh them out, so love right back to you.

B

Nice, absolutely.

A

So I'm curious: I know we talked a little bit about discovery and a little bit about governance. I'm going to flip the question that we had on its head a little bit. Who are the end users of DataHub within your organization, either currently or targeted in the near future?

B

Yeah, that's a good question. I think we would envision a phased rollout plan. Because we have a lot of internal data team use cases, we would likely start with data engineering and data science as the first batch of users. From a longer-term perspective, we envision that anybody who works with data as a critical part of their day-to-day responsibilities would be a consumer of DataHub.

We also have metadata around our online databases, and we're envisioning whether we can incorporate some of that, with our production platform engineering being able to help us input metadata, documentation, and so on. We're also thinking about perhaps opening it up to business users, where they could input documentation about what they're seeing and how certain datasets could be used for their use cases.

So I think, TL;DR: we'll start with the data team, we would like to roll it out to the broader company, and we would also like to roll it out to our content production platform engineers to help us prepare data as well.

A

That's awesome, great. And speaking of different features and use cases, what is your favorite DataHub feature or use case?

B

Yeah, I think that answer will change over time, as new features are constantly being added. To answer the question for now, I'll start with what we were looking for initially, which is the unified search experience. I find that to be very delightful, and the team finds it delightful too. And the newly added lineage downstream impacts: it's something that we have always held as a high priority, and again, shout-out to Gabe and to [inaudible] for working on this. I'm really excited about it. It will help us enable our operational use cases and also that collaboration between different stakeholders across our company. So I would focus my answer on those for now, but I'm sure the answer will change over time as new features are added. It's very hard to pick a single favorite feature.

A

Yeah, I'm really excited about the impact analysis. Man, it's something that I wish I'd had access to five years ago. I would have had a fundamentally different experience in my role in analytics and BI engineering. All right, so then, thinking a little bit about the future: is there anything on your radar for 2022 that you're excited to see this year, either in DataHub the product, the platform, or within the DataHub community?

B

Yes, that's a good question. In terms of specific features on the roadmap, there are a couple that we're excited about, such as column-level lineage. I know that's a work in progress, and I'm super excited for that one. For your information, we had an internal poll at Coursera around what use cases we were looking for, and that one actually came out really high, higher than I had originally anticipated.

Role-based access controls would be awesome for controlling which groups of users can view which schemas or which documentation. I think that would be very high on our list as well.

Selfishly, a Delta Lake integration, just because we're going to be using Databricks Delta Lake a lot, so that's exciting. And also potential integrations with Slack and Jira for collaboration use cases: notifications for changes, and maybe surfacing previous questions or tickets on certain data assets. Those were highly exciting features, among others.

In terms of themes, one that we're excited about is: how do we leverage metadata to continuously improve our data architecture? For example, now that we have that end-to-end picture of our data, is there a possibility that we could automatically optimize how pipelines are scheduled to meet certain SLA criteria and reduce concurrency issues and costs? And could it suggest where we could consolidate our data assets to avoid having multiple sources of truth? I think those are topics that we can consider as well.

The other theme would be governance: potentially looking into some sort of centralized data policy and access control, where we consolidate enterprise-wide data policies and make sure that we can control access to the data contents themselves in a centralized fashion. That would be super helpful, as would ongoing automation and notification when changes happening in our data stack may lead to downstream impact or introduce a risk of us being non-compliant. Those are all areas that we're very excited about.

A

Yeah, I think you just built out our 2022 roadmap for us.

No, those are some amazing use cases, and I'm also really excited to see how we can start to build out some of these alerting frameworks or recommendation frameworks to really manage all of the entities everywhere. Right now we're very focused on surfacing entities regardless of platform, regardless of source, but how do we help people curate that a little bit better, and actually remove redundancies, or even remove high-risk data entities, so we minimize potential data governance or compliance risk? I just think there are some really exciting ways for the community to build out best practices or frameworks around those things.

Very cool, super excited about that. I'm so excited for this year. It's going to be amazing.

Steven, I don't know if you noticed, but as you were sharing your answer, Maggie was just nodding and going 'yes.'

B

Absolutely wonderful.

A

What is your favorite DataHub Slack channel, and why?

B

Yeah, that's tough to answer as well. A lot of the channels are very helpful for finding experiences and answers. If I had to pick one, perhaps the announcements channel. It helps us keep up to date on developments in the DataHub product, so I find that very helpful, and it's just very exciting to see how the product evolves over time.

A

What advice would you give to someone who's joining the DataHub community or starting to work with DataHub? What advice would you give them in the early days?

B

Yeah, that's a good question. From our experience, number one, the documentation is very helpful, so make some time to dig through that documentation. What we also find particularly helpful is that the official DataHub Slack has all of the previous messages, and those are not cut off by the limit Slack has, where only the last ten thousand messages or so are retained. So I find looking through those Slack channels for previous responses super helpful. And lastly, don't be shy about raising questions to the community. I've noticed that a lot of people raise questions, which I think is really helpful, because a lot of the time we all have similar questions. Sometimes we jump on existing question threads and work with those to get more details. So that would be my advice, from my experience.

A

Awesome, all right. Wonderful. Thank you so much, Steven, it was great chatting with you this morning.

B

Absolutely, it was. Thanks for having me.