Description
John Cragg describes the DataHub adoption journey at Depop, a UK-based peer-to-peer social shopping company.
A
Yep, here we are, and oh, another handoff. We have John, who's going to talk to us about the DataHub hackathon that the Depop folks did. It was pretty cool to see them pop, literally pop, into the community channel and say: "Hey, we're doing a hackathon," and within a few days they were contributing a Glue integration back to us, and the Klarna folks helped them out. So thanks for that collab; I think it was really nice to see that happen.
B
Can you see my screen? Yep, awesome. A bit of a tough one to follow; that one looked really good, so I hope I don't disappoint you here. Anyway, my name's John, I'm the lead data engineer here at Depop, and I'm here to talk about the hackathon that we did with DataHub.
B
Just a quick intro to Depop and who we are: we're a fashion marketplace for the next generation to buy, sell, and discover unique fashion. We're an app, basically; we provide the ability to sell predominantly second-hand and sustainable fashion. You can think of it as a bit like eBay mixed with Instagram.
B
That's how my mum described it when I joined, anyway. Lots of people in the UK are using Depop, and it's growing around the world, including the US. We're growing very fast, and our data needs are growing as well.
B
So why did we look at DataHub? Well, we need to enable the business to use data in a self-service fashion, and we need a single location for all of our data needs. Shout out to the design crew who made this slide; it certainly wasn't me. I'm going to walk us through some of the problems that we're trying to solve here, as seen in various Slack messages from around the company.
B
We've got issues with data discovery. Somebody new joins and wants to know about data for our CRM, and they don't really know where to find it, which is a bit of a shame.
B
People want data about recently viewed items, or data about banning people in the trust platform, and we don't have a single location for search, so DataHub would be pretty useful there. Then there's the data lineage aspect.
B
I
don't
really
need
to
speak
about
this,
as
have
you
seen
a
perfect,
a
demonstration
of
how
how
that
works,
but
generally
producers
and
consumers,
and
seeing
who,
where
data
starts
and
where
it
ends
up
all
the
way
through
to
our
looker
instance,
that
would
be
very
useful
for
our
business
users
and
depop's,
a
startup
or
a
scale-up,
and
and
we've
got
lots
of
knowledge
in
our
heads
when
we
have
one
in
the
the
sort
of
documentation
phase.
B
So
the
tribal
knowledge
is
pretty
rife,
and
this
this
table
has
a
a
a
column
called
active
status
which,
over
the
years
has
has
baffled.
Many
people
in
the
business,
including
this
guy,
who
said
active
status,
could
just
about
mean
anything.
So
documentation
is
pretty
important
for
our
for
our
users.
B
What did we try to do? Well, both DataHub and Amundsen have local setups that use Docker, and we tried to go from zero knowledge of these products to getting as much production data into them as we could, inside two days.
B
So I'll just change the screen; I'm only going to show the demo part for DataHub. Obviously this is what we managed to do, and then I'll slip back in afterwards. I'll stop now. I might need to reshare my screen, actually, two seconds.
B
Oh no, that's a nightmare. No worries. Could you refresh it?
C
DataHub didn't have any Glue support, so we spent the last two days figuring out how we could ingest data from Glue into DataHub, and we managed to do it, so that's good. If you look in Datasets, we have this broad view: we have Glue here, and this then goes down to a database level.
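To make the Glue-to-DataHub ingestion the team describes more concrete, here is a minimal sketch (not Depop's actual code) of listing tables from AWS Glue and turning each one into a DataHub dataset identifier. The URN layout follows DataHub's documented `urn:li:dataset:(urn:li:dataPlatform:<platform>,<name>,<env>)` convention; `fetch_glue_tables` stands in for a real boto3 Glue client call.

```python
def glue_table_to_urn(database: str, table: str, env: str = "PROD") -> str:
    # DataHub identifies a dataset by platform, qualified name, and environment.
    name = f"{database}.{table}"
    return f"urn:li:dataset:(urn:li:dataPlatform:glue,{name},{env})"


def fetch_glue_tables(glue_client, database: str):
    # With boto3 this would be glue_client.get_tables(DatabaseName=database),
    # paginating via the returned NextToken; stubbed here for illustration.
    response = glue_client.get_tables(DatabaseName=database)
    return [t["Name"] for t in response["TableList"]]


if __name__ == "__main__":
    print(glue_table_to_urn("daily_compacted", "product_create"))
```

Each Glue database then becomes a browsable container in DataHub's Datasets view, with the tables it holds nested underneath, which matches the database-level drill-down shown in the demo.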
C
So,
for
example,
if
we
just
click
in
random
one
like
daily
compacted
here's
all
of
the
tables
that
are
in
there,
if
you
search
for
product,
create
and
then
go
to
the
one
in
compacted
so
yeah.
This
has
the
search
as
well.
So,
for
example,
you
can
have
see
we
added
a
description
here.
Most
of
our
data
isn't
really
well
documented,
doesn't
have
descriptions,
but
in
the
schemas,
for
example,
in
glue
you
can
have
descriptions
for
each
field.
I
know
there's
often
confusion
about
like
what
a
user
id
actually
is.
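The per-field descriptions mentioned here come straight from Glue's column metadata. As a hedged sketch, this is roughly how a Glue column definition (the `Name`/`Type`/`Comment` shape returned by boto3's `get_table`) maps onto the name, type, and description shown in DataHub's schema view; the output dict shape is illustrative rather than DataHub's exact aspect model.

```python
def glue_columns_to_schema_fields(columns):
    fields = []
    for col in columns:
        fields.append({
            "fieldPath": col["Name"],
            "nativeDataType": col["Type"],
            # Glue's optional Comment becomes the field description, which is
            # where per-field docs (e.g. whether user_id means the seller)
            # would surface in the DataHub UI.
            "description": col.get("Comment", ""),
        })
    return fields


if __name__ == "__main__":
    glue_cols = [
        {"Name": "user_id", "Type": "bigint", "Comment": "id of the seller"},
        {"Name": "created_at", "Type": "timestamp"},
    ]
    for field in glue_columns_to_schema_fields(glue_cols):
        print(field)
```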
C
Is
it
the
seller?
So
all
that
can
be
documented
got
the
name
of
each
field,
the
type
on
the
left,
the
descriptions.
So
that's
the
schemas
there's
also
ownership
that
we
could
pull
out.
All
of
ours
are
apparently
owned
by
owner.
So
that's
not
that
helpful,
but
that
can
be
changed
and
then
properties
has
just
probably
about
the
table
that
get
pulled
out
so
just
extra
information
in
there.
And then you have the option to add documents, but we didn't have any for those. I think we're going to open a PR in the DataHub repo for the Glue support, and that's pretty much it. I don't know if I missed anything.
D
Yeah, sure. So basically our team, Rob, Abby, and myself, worked on integrating the Redshift tables and Looker. Similar to the Glue schemas, it basically pulled the tables from Redshift, and again it has the schema: the types and the names of the fields.
D
If I search for a keyword here, I'm able to see all the entities that have this particular tag, and I can even do that the opposite way, by going to the tag and then looking up everything that has this tag.
Similarly, if I look for another keyword here: in this case, for example, it appears in the table name, but in this other case there is a match for a column.
D
So
it's
really
interesting
to
see
that
the
search
is
very
inclusive
for
the
looker
implementation.
So
there's
another
area
here
that
we've
been
able
to
integrate
a
particular
dashboard
here
with
a
description
to
scroll
in
I'm
able
to
see
the
obviously
tags
and
owners
and
everything
can
be
added,
I'm
able
to
see
the
actually
the
actual
looks
that
are
part
of
this
dashboard.
D
In this case we just provided a few examples, but if I scroll to one of them, I'm able first of all to see tags, and then the actual tables, so the data source for this particular Look in Looker, and obviously I can scroll and see that information. I don't want to leave the app, but there's a direct link to the Look, which is really nice. I think I've covered most of it here.
D
If
you
check
confirm
sign
ups,
you
can
see
documentation
and
lineage
yeah.
Yes,
so
yeah.
This
is
a
ratchet
table
and
any
documentation.
Let's
say
it's
an
etl-based
store.
You
know
any
logic
that
is
part
of
that
creation
of
the
table.
We're
able
to
see
that
and
each
entity
here,
you're
able
to
see
the
upstream
dash
independency.
D
So
that's
very
useful
for
later
lineage.
B
Cool, so that's the majority of our demo. There are some FAQs afterwards, but we were presenting to the business, so I won't show you those. So, what we achieved during the hackathon: we ingested all of our production data into the local instances, so that was Redshift, Glue, and Kafka; they all came into our local instance of DataHub. We also linked the Looker charts in.
B
We used metadata change events to create lineage, tags, documentation, and owners, and we created a pull request that got merged, which was pretty nice.
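As a rough illustration of the metadata change events (MCEs) mentioned here, this sketch builds the JSON shape of an MCE carrying an upstream-lineage aspect, the mechanism the team describes using to create lineage. The fully qualified key names follow the MCE format DataHub's Kafka-based ingestion consumed at the time; treat the exact keys and example URNs as illustrative assumptions.

```python
import json
import time


def lineage_mce(downstream_urn: str, upstream_urns: list) -> dict:
    # Every upstream edge carries an audit stamp saying who asserted it and when.
    stamp = {"time": int(time.time() * 1000), "actor": "urn:li:corpuser:datahub"}
    upstreams = [
        {"auditStamp": stamp, "dataset": urn, "type": "TRANSFORMED"}
        for urn in upstream_urns
    ]
    # An MCE proposes a snapshot of one entity with a list of aspects;
    # here the single aspect is the UpstreamLineage of the downstream dataset.
    return {
        "proposedSnapshot": {
            "com.linkedin.pegasus2avro.metadata.snapshot.DatasetSnapshot": {
                "urn": downstream_urn,
                "aspects": [
                    {
                        "com.linkedin.pegasus2avro.dataset.UpstreamLineage": {
                            "upstreams": upstreams
                        }
                    }
                ],
            }
        }
    }


if __name__ == "__main__":
    mce = lineage_mce(
        "urn:li:dataset:(urn:li:dataPlatform:redshift,analytics.confirm_sign_ups,PROD)",
        ["urn:li:dataset:(urn:li:dataPlatform:kafka,user.signed_up,PROD)"],
    )
    print(json.dumps(mce, indent=2))
```

Emitting one such event per table is enough for DataHub to render the upstream and downstream dependency view shown in the demo.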
So I think the most important thing for us, and probably the best advice I could give people here who haven't decided yet, is why we actually picked DataHub.
B
Most of the problems we had with Amundsen were around the lack of Kafka support, and when we tried to integrate that with DataHub, it just worked straight away. And as you can see, we added the Glue integration, which was really easy; the process for adding a new ingestion type was super easy and very straightforward. The docs were set up nicely, and, as I think Pedro said earlier, the support from the team was just immense.
B
It was amazing: we were messaging at all times and we were getting responses, and pushing that PR was really trivial; thanks to Klarna for helping us out there as well. The data lineage aspect is really important for us, because we have several layers of transformations at a business level, and Amundsen didn't really support that very well.
B
Looker
was
a
work
in
progress,
and
I
know
you
said
it's
in
the
in
the
contra
folder
at
the
minute,
but
we're
really
excited
to
see
that,
and
just
all
of
the
other
bits
that
you've
seen
already.
It
was
just
super
good
and
we
had
a
really
good
time
doing
it
and
contributing
back
and
we're
looking
forward
to
integrating
into
our
production
stack
in
in
the
next
couple
of
months,
so
yeah.
Thank
you
very
much
for
your
help.
B
I'm really pleased to be working with you; it's been an absolute pleasure. And shout out to the team. I think Maria is here. Hey Maria! Thank you very much.
A
Thanks, John. We really enjoyed all the energy that the Depop team brought into the project, so keep that coming. Awesome. So now that we have just a few minutes left, I wanted to do one of the things that we had promised we would do for the community.