Description
Taufiq Ibrahim (Bizzy - Warung Pintar Group) shares Bizzy's experience choosing a data catalog and working closely with the DataHub open-source community
A: Good morning, everyone. I'm Taufiq, from Indonesia. Right now I'm working at Bizzy, now part of Warung Pintar Group, and I'm going to share our case study with DataHub and how we developed the Redash source connector. Next.
A: Yeah, here are a few things about Bizzy, which is now part of Warung Pintar Group. Bizzy was founded in 2015 as a B2B marketplace, and then we went through multiple company restructures: there was a merger, there were sales, and then we were acquired by Warung Pintar in early 2021.
A: Now we are serving around 600 FMCG brands, and around 230k retailers across Indonesia. So actually we have two kinds of business here. One is the supply side, which is working with the distributors and FMCG brands, and the other is the retail part, where we work with what we call warung, which is actually an Indonesian word for grocery retailers. Yeah, thanks.
This is the data ecosystem at Warung Pintar Group and Bizzy.

A: So we have several legacy stacks coming from existing platforms from corporate enterprises, like SAP, but we also have more modern architecture, like cloud-based applications. So we have a mix of technology stacks: you can see that we have Airflow, and we still have SSIS here. Then we broke this stack into an operational part and an analytical part, and we also have the operational domain, which is actually the ERP and the application databases.
A: We do some batch processing in operational data engineering, and stream processing, which is actually quite different from most analytical data engineering.
A: We also touch the production databases, like updating data and synchronizing data from multiple sources, and then we also do change data capture from the application databases using Kafka Connect, sinking it into multiple destinations, like operational reporting DBs, and also writing into BigQuery, which is processed by Airflow to be served by several BI and reporting tools.
A: You can see that we have multiple reporting services: we have the old legacy stack, SQL Server Reporting Services; we have Metabase; we have Redash; and we also have Jupyter. Why do we have so much stack here? Because we've been through multiple mergers and sales, and we need to maintain most of it, because the users still need to use it. That's why metadata, and the lineage things, are really important here.
A: So we can understand all the data more easily. Yeah, next. So why do we need a data catalog at Bizzy? The first reason is that we have endless repeated questions from anyone: where the data is, how it is produced, who owns it. The questions are repeated every day by different people, and we keep answering them. It's also difficult to do lineage and impact analysis, because we have lots of data sources and a lot of reporting that uses the data.
A: It's quite difficult to search. If we want to change or modify some data, what is the impact on the other applications, on the reporting, something like that. Yeah.
A: So this is our journey with the data catalog, starting at the beginning of 2020.
A: We just created a simple manual data lineage in Google Sheets, and then we moved on to do a PoC with Apache Atlas, but we found that it was too complex and too Hadoop-centric at the time, so we stopped that PoC. Then we also did a PoC with Amundsen, but at the time it wasn't really answering what we needed. Then, at the end of 2020, we found DataHub, and we started doing a PoC and then development with DataHub. Next.
A: So these are some reasons why we chose DataHub, mostly because DataHub pretty much matches our data stack, especially Kafka Connect and BigQuery and Kafka, because DataHub uses Kafka a lot, right? So it really matched our requirements. And then the no-code ingestion, the YAML recipes, that's really, really helpful for us, and then the development of source connectors and sink connectors.
A: It's really, really helpful. The documentation was really helpful, and the other feature that we really love is that we can show the dashboard link from the app, click on it, and we will be brought right into the dashboard itself. And now we have role-based access, and we can limit what users can do. That's just really awesome right now. Yeah, next.
A: So this is our DataHub integrations usage here at Warung Pintar Group. We have databases, mostly RDBMS, like MySQL, SQL Server, Postgres. We also have BigQuery and Kafka, and there are two source integrations that we contributed: Kafka Connect and Redash.
A: The basic thing is actually: as long as you can construct the URNs, then you'll be fine. Previously we had several legacy lineages stored in a Google Sheet, and we just parsed them into MCEs and wrote that lineage right into DataHub, even though the source is not a working plugin.

A: We just push it directly to DataHub as MCEs.
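That "parse the sheet, push MCEs" step can be sketched roughly as below. This is a minimal illustration, not Bizzy's actual script: the CSV column names and the helper function are hypothetical, and a real version would wrap each pair in DataHub's UpstreamLineage aspect and emit it via the DataHub Python emitter.

```python
import csv
import io

def make_dataset_urn(platform: str, name: str, env: str = "PROD") -> str:
    """Build a DataHub dataset URN from its parts."""
    return f"urn:li:dataset:(urn:li:dataPlatform:{platform},{name},{env})"

def sheet_to_lineage(csv_text: str):
    """Parse a CSV export of a legacy lineage sheet into
    (upstream_urn, downstream_urn) pairs, ready to be wrapped in an
    UpstreamLineage aspect and emitted to DataHub."""
    pairs = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        up = make_dataset_urn(row["src_platform"], row["src_table"])
        down = make_dataset_urn(row["dst_platform"], row["dst_table"])
        pairs.append((up, down))
    return pairs

# Hypothetical sheet contents, exported as CSV:
sheet = """src_platform,src_table,dst_platform,dst_table
mysql,sales.orders,bigquery,warehouse.fact_orders
"""
print(sheet_to_lineage(sheet))
```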
Yeah, thanks. So why did we develop the Redash integration? Because after the merger, we found that Warung Pintar Group uses Redash a lot: from data analysts to product teams to HR teams, they use Redash a lot. They practically love to learn SQL, and they can use Redash quite well. It's actually developed based on the Superset source, and the other reason is that it actually helped the PoC to be approved internally.
A: Yeah, this is an example of the recipe for our Redash source. You can find it in the documentation on GitHub. Basically, what you need is the connection URL of the Redash server itself (this is the self-hosted one, I mean not the hosted Redash, but the open-source one), and we need the API key. Then we can limit the pages, for testing purposes.
A: Optionally, there is a setting that defaults to true; if you want to ingest draft or unpublished dashboards and charts, you can set it to false.
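A recipe along the lines described might look roughly like this; the values are placeholders, and the exact option names should be checked against the DataHub Redash source documentation:

```yaml
source:
  type: redash
  config:
    # URL of the self-hosted, open-source Redash server
    connect_uri: http://localhost:5000
    # Redash user API key
    api_key: REDASH_API_KEY
    # Limit the number of API pages fetched; handy for testing
    api_page_limit: 1
    # Defaults to true; set to false to also ingest
    # draft/unpublished dashboards and charts
    skip_draft: true

sink:
  type: datahub-rest
  config:
    server: http://localhost:8080
```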
A: Yeah, with this dashboard we can see what is inside, but unfortunately, in the current development, we haven't ingested the ownership yet. Then we have the "View in Redash" button here, so we can see what it actually is.
A: This is the Redash dashboard that is actually querying the usage of Redash itself. So if you see here, we can see that.
A: But currently we haven't done things like mapping to the actual tables, because that's going to require SQL parsing, like LookML does, but we haven't developed that yet.
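The table-mapping idea could eventually look something like the naive sketch below. This is purely illustrative (a regex over FROM/JOIN clauses); a real implementation would need a proper SQL parser, since this misses CTEs, quoting, subqueries, and much more:

```python
import re

def referenced_tables(sql: str) -> set[str]:
    """Very rough extraction of table names following FROM/JOIN.
    Real ingestion would use a proper SQL parser instead."""
    pattern = re.compile(r"\b(?:from|join)\s+([A-Za-z_][\w.]*)", re.IGNORECASE)
    return set(m.group(1) for m in pattern.finditer(sql))

print(referenced_tables(
    "SELECT q.id FROM queries q JOIN users u ON q.user_id = u.id"
))
```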
B: I think someday you can probably demo us the massive lineage graph that you showed me once, which I...
A: If you want to, yeah, I can do that right now; I already prepared it for you. So one of the few things that we hope DataHub can address is lineage visualization. I think for now, most of the data catalog tools actually have the same problem.
A: If we see here, this is actually the lineage that came from our legacy lineage Google Sheets, which I just pushed into DataHub, and you can see that this is quite a large lineage graph, and when you have this...
B: Yeah, I think we will ship people Oculus glasses so they can fly through these kinds of lineage graphs, yeah.
B: Like someone once demoed.
B: All right, but yeah, point completely taken. I think lineage graphs look beautiful until they become incomprehensible, and I think that's something we as an entire industry have to actually tackle. Yeah, cool. Let's move on to the rest of the slides, and I will share them here.
A: On the DataHub development experience, coming from me and our team: actually, the contribution of the Kafka Connect source was my first open-source contribution ever on GitHub, yeah, and I thought that the community is very welcoming; they are very supportive. I even got some private messages, like Shirshanka asked me whether I still want to contribute, something like that. It's very, very supportive, yeah, and the documentation is very helpful, like how to add a new ingestion source.

A: It's really, really helpful, in a standard way.
A: Yeah, these are our to-dos and future works. Internally, we are currently in a PoC state.

A: We are still in the PoC stage, but we will socialize it and get user feedback starting from next week, and I hope that this will have an impact on our organization and, as expected, for DataHub, yeah.
We already talked about the lineage for large graphs, and then we are also interested in operational data quality metrics, something like lag metrics and row counts, just to check for anomalies on a daily basis, something like that, yeah. That's all for me. Thank you so much.