From YouTube: Grab & DataHub Community Case Study
Description
Grab team members Alex Dobrev, Harvey Li, and Amanda Ng share how they are using DataHub to democratize discoverability with computational data governance.
Learn more about DataHub: https://datahubproject.io
Join us on Slack: http://slack.datahubproject.io
Follow us on Twitter: https://twitter.com/datahubproject
A: Hi everyone, and thanks to Maggie and the rest of the wonderful team at Acryl Data for having us today. I'm sorry this will be a recording, but our team is based in Singapore and we're trying to keep a healthy schedule. Just kidding; obviously, working in data engineering, we're quite nocturnal, but we just wanted to save you from our late-night phases. Okay, in any case, let's get started.
A: My name is Alex. I work on the data platform product team here at Grab, and together with my teammates Amanda and Harvey, today we're going to tell you about our journey to improve data discoverability with computational data governance here at Grab.
A: In order to cater for this, and after evaluating multiple options in 2021, we decided to invest in DataHub for three main reasons. First, the autonomy and flexibility to customize it according to the needs of the data community here at Grab and the specifics of our data infrastructure. Second, the maturity of the technology and the advanced metadata architecture. And third, and potentially foremost, the amazing community behind DataHub that is driving the solution forward.
A: I'd be surprised if there is another metadata management solution out there that has almost 5,000 people behind it. With that, let me pass it over to Harvey and Amanda for the more interesting and practical part of today's talk. Thank you.
B: Thanks, Alex, for the intro. Indeed, DataHub is so extensible, and it has a strong community that is pushing it towards the forefront of metadata management. At Grab, we forked the DataHub repo, and we're rebranding it as Hubble. Since DataHub was deployed at Grab a few months ago, tens of releases have been made internally. We established a streamlined release process to ensure we have an easy way to merge our code changes with new DataHub releases.
B: Our release cadence closely follows the community's. Each release is rebased against a recent community version, with additional features developed internally at Grab; some of those have been contributed back to the community. Next, Amanda and I will highlight some features that we extended on top of DataHub for data discovery and data governance use cases.
B: Scalability has never really been an issue for DataHub. Today we ingest 3.5 million metadata change proposals (MCPs) into GMS, either synchronously or asynchronously, on a daily basis; that translates to around 40 MCPs per second. We managed to ingest metadata for over 100,000 people within 15 minutes with parallelism in place.
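For a sense of what this looks like in practice, here is a minimal sketch of emitting a single MCP with the open-source DataHub Python emitter; the GMS endpoint and the dataset URN are placeholders rather than Grab's actual setup, and whether proposals are processed synchronously or asynchronously is decided on the server side.

```python
# Minimal sketch: emit one metadata change proposal (MCP) to GMS over REST.
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import DatasetPropertiesClass

emitter = DatahubRestEmitter(gms_server="http://localhost:8080")  # placeholder

mcp = MetadataChangeProposalWrapper(
    entityUrn="urn:li:dataset:(urn:li:dataPlatform:hive,demo.orders,PROD)",
    aspect=DatasetPropertiesClass(description="Orders fact table"),
)
emitter.emit(mcp)  # one REST call per proposal
```

Throughput like 40 MCPs per second comes from running many such emitters, or the ingestion framework itself, in parallel rather than from any single call.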
Besides the Presto-on-Hive plugin, which has now been open-sourced, we also created additional entities and aspects in the metadata graph. One of them is a generic entity type, which we call Others, to cover custom entities at Grab.
B
I'll
share
more
later
tooth
here,
I'd
like
to
share
how
we
scale
Hive
injection,
to
reduce
injection
time
from
over
20
hours,
using
original
hard
plugin
to
50
minutes
with
a
new
plugin
that
we
initially
developed
called
Preston
height
from
the
original
hype.
Plugin
metadata
is
fetched
by
looking
through
every
schema
table
and
view.
If
there
are
X
Gamers
white
tables
and
zip
views,
there
will
be
in
total,
X
Plus
y
plus
Z
described
queries
sent
to
high
server
2
to
parallelize.
The
injection
there's
no
effective
way
to
shut
new
and
existing
tables
as
well.
B
So,
instead
of
connecting
to
Hive
service,
we
fetch
metadata
directly
from
hack
meta
store,
DB
we're
starting
to
achieve
ingestion
in
parallel.
Now,
instead
of
using
X,
Plus
y
plus
Z
queries
as
we
saw
when
using
High
plugging,
it
only
requires
three
queries
to
fetch
metadata
for
schemas
tables
and
Views.
With
multiple
parallel
tasks
running.
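To make the three-query idea concrete, here is an illustrative sketch, not the actual plugin code, of reading schema, table, and view metadata straight from a MySQL-backed Hive Metastore using its standard tables (DBS, TBLS, SDS, COLUMNS_V2); the connection string is a placeholder.

```python
# Illustrative sketch: fetch all Hive metadata with three queries against the
# metastore database itself, instead of X + Y + Z DESCRIBE calls to HiveServer2.
from sqlalchemy import create_engine, text

engine = create_engine("mysql+pymysql://reader:secret@metastore-db:3306/hive")

SCHEMAS_SQL = "SELECT NAME FROM DBS"  # one row per schema

TABLES_SQL = """
SELECT d.NAME AS schema_name, t.TBL_NAME, c.COLUMN_NAME, c.TYPE_NAME
FROM TBLS t
JOIN DBS d ON t.DB_ID = d.DB_ID
JOIN SDS s ON t.SD_ID = s.SD_ID
JOIN COLUMNS_V2 c ON s.CD_ID = c.CD_ID
WHERE t.TBL_TYPE != 'VIRTUAL_VIEW'
ORDER BY d.NAME, t.TBL_NAME, c.INTEGER_IDX
"""

VIEWS_SQL = """
SELECT d.NAME AS schema_name, t.TBL_NAME
FROM TBLS t
JOIN DBS d ON t.DB_ID = d.DB_ID
WHERE t.TBL_TYPE = 'VIRTUAL_VIEW'
"""

with engine.connect() as conn:
    schemas = conn.execute(text(SCHEMAS_SQL)).fetchall()
    tables = conn.execute(text(TABLES_SQL)).fetchall()
    views = conn.execute(text(VIEWS_SQL)).fetchall()
```

Because each query returns full result sets keyed by schema and table, the rows can then be partitioned into shards and converted into MCPs by parallel workers.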
B: One of my favorite things in DataHub is its super-extensible metadata model, where you just need to define the PDL schema and update the entity registry to add a new entity or aspect, and everything else is then taken care of by auto code-gen. Next, I'd like to highlight two examples we did on this front. The first is on the dataset entity: we added a time-series aspect to support sample events for Kafka topics.
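Grab's sample-events aspect is custom to their fork, so as an illustration of the same time-series mechanism, here is a sketch that emits DataHub's built-in datasetProfile time-series aspect against a placeholder Kafka dataset.

```python
# Illustrative sketch: time-series aspects carry a timestampMillis and are
# stored per point in time. Grab's sample-events aspect is custom; the
# built-in datasetProfile aspect is used here only to show the mechanism.
import time

from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import DatasetProfileClass

emitter = DatahubRestEmitter(gms_server="http://localhost:8080")  # placeholder

profile = DatasetProfileClass(
    timestampMillis=int(time.time() * 1000),  # the time-series key
    rowCount=12345,
)
emitter.emit(
    MetadataChangeProposalWrapper(
        entityUrn="urn:li:dataset:(urn:li:dataPlatform:kafka,orders_events,PROD)",
        aspect=profile,
    )
)
```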
B
As
more
use
cases
are
onboarded,
we
often
have
to
make
the
Judgment
call
on
whether
to
reduce
some
existing
entity.
Add
new
aspects
through
an
existing
entity
or
to
create
a
new
entity
in
the
metadata
model.
The
current
subtypes
as
well
means
the
extensibility
of
an
entity
type.
For
example,
we
can
model
table
view
topic
or
as
data
sets
in
beta
Hub.
However,
there
are
occasions
when
some
entities
are
very
bespoke
to
internal
data
platforms.
B
Such
we
create
a
generic
entity
called
others
in
data
Hub
to
cater
to
all
these
unique
groups
of
use
cases.
It
has
an
aspect
for
generic
properties
that
support
different
rendering
types
so
that
a
property
can
be
rendered
as
table
string
code
or
Json
in
a
UI
in
a
screenshot
here.
What
you
see
is
another
entity
with
a
subtype
name
called
Skywalker
MLG
notes
are
entities
for
graphs,
contain
ranking
platform
in
short
data
hubs,
plug-in-based
ingestion
framework
and
schema
first
metadata
model
are
what
impulses
to
make
data
Discovery
democratize
across
all
sorts
of
entities.
C: Thanks, Harvey. I will now be going through data governance at Grab. Data at Grab is extremely valuable, as it forms the foundation of many of our critical systems and operations, so we really need to establish our data as something governed while still delivering the greatest value and agility from that data. At Grab, data governance anchors on three main points. Firstly, data is immensely valuable to Grab: great data unlocks great growth, encouraging innovation, improving operations and the quality of our products.
C: Secondly, data management is democratized. At Grab, there's no single role or department that owns every aspect of the data. It cannot be seen or treated in isolation; it's something that everyone in the company may be consuming from to make important decisions. Everybody in the company should be able to work with the data comfortably and confidently. And lastly, data must be agile.
C: Information classification at Grab is a process where owners identify the sensitivity level of the company's information, so as to apply an appropriate level of protection to that information as it's created, updated, stored, transmitted, or deleted. The level of sensitivity affects a user's access to the corresponding data. Prior to DataHub, we were relying on an enterprise software product to help manage sensitivity levels, and customizations were extremely limited, so all we could do was guide users through documentation and rely on them to infer and conclude the data sensitivity level.
C
This
manual
work
led
to
a
lot
of
invalid
classifications
against
the
organizational
rules.
For
instance,
the
table
was
intact
as
containing
Pi
when
the
column
is
tagged
as
containing
pii,
which
then
led
to
several
rounds
of
organizational
white
campaigns
from
our
governance
and
cyber
security
teams
to
manually
validate
these
rules
across
hundreds
of
schemas
and
100
000
tables
within
this
was
incredibly
taxing
for
the
team
taking
up
huge
number
of
men
hours
to
go
through
them.
Data
Hub
enabled
grep
to
pick
this
number
down
to
zero.
C
So
how
do
we
take
that
number
down
to
zero?
We
modeled
such
metadata
into
glossary
terms
and
nodes.
We
added
validations
on
the
react
application
and
on
the
entity
service
within
GMS.
These
allowed
the
validation
errors
to
be
displayed
both
to
the
users
using
the
user
interface
to
update
and
to
any
external
Services
ingesting
these
rules
directly
into
Data
hub.
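As a simplified illustration of the kind of rule these validations enforce (the actual checks live in the React app and the GMS entity service; the exact rule and the URNs below are assumptions), a consistency check between table-level and column-level PII terms might look like this.

```python
# A simplified, hypothetical rule: a table must be tagged as containing PII
# exactly when at least one of its columns is. Grab's real validations run in
# the React app and the GMS entity service; the URN here is a placeholder.
from typing import Dict, List

PII_TERM = "urn:li:glossaryTerm:PII"  # placeholder glossary term URN

def validate_pii_consistency(
    table_terms: List[str], column_terms: Dict[str, List[str]]
) -> List[str]:
    """Return validation errors; an empty list means the tagging is consistent."""
    errors: List[str] = []
    pii_columns = [col for col, terms in column_terms.items() if PII_TERM in terms]
    table_has_pii = PII_TERM in table_terms

    if pii_columns and not table_has_pii:
        errors.append(f"Columns {pii_columns} are tagged PII but the table is not.")
    if table_has_pii and not pii_columns:
        errors.append("Table is tagged PII but none of its columns are.")
    return errors

# Example: one PII column but no table-level term -> one validation error.
print(validate_pii_consistency([], {"email": [PII_TERM], "order_id": []}))
```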
C: So we have talked about creating these information classifications, and the next topic is how we determine who should own these access controls. We define these decision makers as technical data owners: individuals who are accountable for the security access controls of the datasets they manage, at a schema or database level. And since the role of technical data owner for a schema isn't trivial, like granting access to possibly hundreds of tables at one go, we have to enforce that this person meets the minimum requirements set up by our data governance office.
C: Considering that this is a highly specific and customized piece of logic, we decided to use the DataHub Actions framework. The Actions framework allowed us to read only the events we need, in this case ownership assignments for schemas, and to reassign the owner if he or she doesn't meet the minimum requirements.
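A minimal sketch of what such an action can look like with the open-source DataHub Actions framework follows; the class name, the eligibility check, and the reassignment helper are stand-ins for Grab's internal logic rather than their actual code.

```python
# Sketch of a custom DataHub Actions framework action that vets new owners.
from datahub_actions.action.action import Action
from datahub_actions.event.event_envelope import EventEnvelope
from datahub_actions.pipeline.pipeline_context import PipelineContext


def meets_minimum_requirements(owner_urn: str) -> bool:
    # Stub: Grab checks this against rules from their data governance office.
    return False


def reassign_owner(entity_urn: str, rejected_owner_urn: str) -> None:
    # Stub: revert the ownership aspect and notify the user (e.g. via chat).
    pass


class TechnicalDataOwnerAction(Action):
    """Reacts only to ownership additions and vets the new owner."""

    @classmethod
    def create(cls, config_dict: dict, ctx: PipelineContext) -> "Action":
        return cls(ctx)

    def __init__(self, ctx: PipelineContext):
        self.ctx = ctx

    def act(self, event: EventEnvelope) -> None:
        if event.event_type != "EntityChangeEvent_v1":
            return
        change = event.event
        # Only ownership additions are of interest here.
        if change.category != "OWNER" or change.operation != "ADD":
            return
        if not meets_minimum_requirements(change.modifier):
            reassign_owner(change.entityUrn, change.modifier)

    def close(self) -> None:
        pass
```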
C: In this snippet, you can see that adding me, Amanda, as a technical data owner results in a Slack message being sent, and when the page gets refreshed, the ownership gets reassigned to Harvey. With these two customizations in place, we have ensured that the flow for data access is governed. And with that, we've come to the end of our presentation. All of this is possible thanks to the amazing team working on Hubble and the support from the DataHub community.