From YouTube: How the Hurb Team adopted DataHub
Description
Patrick Braz (Hurb) shares how he and his team successfully adopted DataHub within their organization during the February'23 Town Hall.
Presentation Deck:
https://docs.google.com/presentation/d/1cFZ1hhMtuXM1yU5ZHvT2aTGT89pUg33dR5ckWEzskQY/edit?usp=sharing
Learn more about DataHub: https://datahubproject.io
Join us on Slack: http://slack.datahubproject.io
Follow us on Twitter: https://twitter.com/datahubproject
Hello everyone, my name is Patrick. I'm an engineer at Hurb, and today I want to present to you the DataHub implementation journey at Hurb. I separated this presentation into four sections: in the first one I want to present a little about Hurb; in the second section I will talk about the challenges that we had and why we decided to use DataHub; in the third section I want to talk about the adoption steps; and finally I want to show you how we are today and what plans we have, where we want to go.
So, talking a little about Hurb: we are a Brazilian online travel agency, and we offer travel deals that include hotel accommodations, activities, and travel packages.
But how is that possible, and what are the challenges involved? Hurb counts on a solid data-driven culture across the company, and on a team of data analysts, data scientists, machine learning engineers, and data engineers that is growing and working hard to build a secure framework for the company's decision making. With that, we have to deal with different issues in our daily routine.
The first is resource cataloging and data asset discovery. The question is: what's the value of having many data assets if you cannot find them or discover their purpose? This was a usual problem that we had. The second is traceability of data origin, which plays an important role when strategic decisions rely on information or data.
With traceability of data origin, we can discover how data was transformed, whether it was transformed correctly, and whether the data has flowed to a specific location. And finally, the most important one is building a single source of truth. At Hurb we deal with different services, and our primary services are Metabase and BigQuery. We can catalog our assets within these two different services, but cataloging different assets using distinct services can cause metadata inconsistencies.
A
So
with
these
problems
with
this
challenges
in
mind,
we
started
to
think
about
a
data
governance
project,
so
the
first
step
was
to
create
a
project
requirements
documentation,
so
this
documentation
consolidates
all
the
requirements
in
a
clear
and
concise
manner.
So
the
idea
was
to
move
to
quit.
How
can
you
say:
what's
the
map
map
problems
and
expectations
from
the
two
like
how
users
will
use
the
platform?
How
can
we
re-engage
collaborators
to
use
the
platform,
how
applications
will
communicate
and
other
questions?
After we created the requirements document, we started to search for data catalog tools, and we found DataHub. Why did we decide to implement DataHub inside our company? I can cite four main points that drove our decision to implement DataHub. First of all is the user-friendly interface.
We have a solid self-service culture inside the company, so we want to allow any collaborator to access and navigate our services in a single platform, where they can find the assets they want and build analyses that can help them in their daily routine.
Another point is the contribution opportunity: we have a strong open-source culture inside the company, and we want to position ourselves as a solid Brazilian company. And finally, the built-in ingestion sources, which cover our primary services, Metabase and BigQuery.
So I have summarized this journey in these steps. First, we started with the POC phase; after that, we started to host our own DataHub instance, and with the DataHub instance inside our Kubernetes cluster we began the customization phase. Finally, I will present what our integration stack looks like today.
In the POC phase we tested all the features available at the moment, and the integrations with our primary services, BigQuery and Metabase. It's important to note that we used VMs to deploy DataHub with custom Docker Compose files, so we could change environment variables to test different behaviors of DataHub.
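As a rough illustration of this kind of testing (not Hurb's actual setup), a DataHub ingestion recipe can also be exercised programmatically with the `Pipeline` API; the hosts and credentials below are placeholders:

```python
from datahub.ingestion.run.pipeline import Pipeline

# Minimal sketch: run a Metabase -> DataHub ingestion recipe in-process.
# connect_uri, username, password, and the GMS server are placeholders.
pipeline = Pipeline.create(
    {
        "source": {
            "type": "metabase",
            "config": {
                "connect_uri": "https://metabase.example.com",
                "username": "svc-datahub@example.com",
                "password": "...",
            },
        },
        "sink": {
            "type": "datahub-rest",
            "config": {"server": "http://datahub-gms.example.com:8080"},
        },
    }
)
pipeline.run()
pipeline.raise_from_status()  # raise if the run reported failures
```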
A
This
phase
was
important
because
of
decisions
that
we
make
for
the
future,
so
one
of
the
most
important
was
to
disable
fronti
the
front
end
injection.
So
we
want
to
use
data
Hub
as
a
source
of
Truth,
so
the
ingestion
through
UI
was
not
an
exciting
feature
for
a
science.
We
could
not.
We
could
control
the
ingestion
through
back-ends
and
that's
why
we
decide
to
orchestrate
our
injection
with
a
partial
flow
so
to
use
our
flow
is
our
injection
orchestrated?
We separate the dependencies with the KubernetesPodOperator, so Airflow just needs to start pods and provide the parameters for the execution. For that, we created a DAG factory that gathers the different ingestion recipes and builds the ingestion DAGs.
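A minimal sketch of that DAG factory pattern, assuming Airflow 2.x; the recipe list, container image, and mount path are illustrative, not Hurb's actual configuration:

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.cncf.kubernetes.operators.kubernetes_pod import (
    KubernetesPodOperator,
)

# Illustrative recipe list; in practice these could be discovered
# from a folder or a configuration repository.
RECIPES = ["bigquery.yml", "metabase.yml"]

def build_ingestion_dag(recipe: str) -> DAG:
    dag_id = f"datahub_ingest_{recipe.removesuffix('.yml')}"
    with DAG(dag_id=dag_id, start_date=datetime(2023, 1, 1),
             schedule="@daily", catchup=False) as dag:
        KubernetesPodOperator(
            task_id="run_ingestion",
            name=dag_id,
            image="acryldata/datahub-ingestion:head",  # official ingestion image
            cmds=["datahub", "ingest", "-c", f"/recipes/{recipe}"],
            # Airflow only starts the pod and passes parameters; all
            # ingestion dependencies live inside the container image,
            # not in the Airflow workers.
        )
    return dag

for _recipe in RECIPES:
    _dag = build_ingestion_dag(_recipe)
    globals()[_dag.dag_id] = _dag  # register each DAG with the scheduler
```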
Talking about the Kubernetes deployment phase, we were faced with some issues in the development experience; how can I say, managing environment variables across multiple deployments and multiple values files is costly. So, although we know subcharts are recommended as a good practice, we decided to refactor the DataHub community Helm chart into a flattened version. Besides that, we started to test the usage of ConfigMaps in separated scopes to manage environment variables across different applications. Those were the most important decisions, and one thing that is important to note is that we are planning to open source our Helm charts in the future, I think this year.
Talking about the customization phase: we know that there is already an Airflow integration with DataHub, but we had problems with its dependencies, so we decided to implement a new integration on our side, based on the community integrations, and to use the new Airflow concept of Dataset objects, so data engineers can enrich metadata during Airflow DAG development. With this integration we can not only take advantage of triggering the execution of DAGs by dataset changes, but we can also automatically build lineage with the lineage backend.
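Hurb's integration is in-house, but the underlying Airflow Dataset mechanism (available since Airflow 2.4) looks roughly like this; the URIs and DAG names are made up for illustration:

```python
from datetime import datetime

from airflow import DAG
from airflow.datasets import Dataset
from airflow.operators.python import PythonOperator

# Illustrative dataset URI; the producing task declares it as an outlet.
orders = Dataset("bigquery://example-project/analytics/orders")

with DAG("produce_orders", start_date=datetime(2023, 1, 1),
         schedule="@daily", catchup=False) as producer:
    PythonOperator(
        task_id="transform",
        python_callable=lambda: print("orders table rebuilt"),
        outlets=[orders],  # metadata an integration can also turn into lineage
    )

# This DAG runs whenever the dataset above changes, instead of on a cron.
with DAG("consume_orders", start_date=datetime(2023, 1, 1),
         schedule=[orders], catchup=False) as consumer:
    PythonOperator(task_id="refresh", python_callable=lambda: print("refreshed"))
```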
Okay, so what does our stack look like today? I created this diagram to show the integrations that we have today. And I forgot to talk about Anomalo: Anomalo is a data quality platform that we use to build our data validations. Airflow manages all the ingestion into DataHub.
With this integration, we have visibility of quality control from source to destination. And one thing that is very important: we are frequently using the impact analysis feature to see who is impacted by some change, or by a data issue that Anomalo finds.
talking
about
our
roadmap.
A
One
important
thing
that
we
are
currently
working
in
is
customizing
our
front
end,
so
we
find
that
it's
very
important
to
use
the
company
visual
identity,
and
this
will
help
employers
to
identify
with
the
two
and
increase
engagement,
and
this
is
very
important
because
we
want
to
adapt
data
Hub
or
so-called
data
Urban
internally,
as
a
data
product
for
the
company
and
other
things
that
we
are
planning
to
do
is
customize
the
metadata
model
to
Inc
to
include
apis
and
Metric
entities.
Most of our data sources are our APIs, and now we are working to build our semantic layer, so these two entities will be very important for us. Another thing is to integrate our machine learning models and services into our stack; this will help us build our full data lineage, from the sources to the services that use them. And finally, we want to use the Actions framework.
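For context, the Actions framework reacts to metadata change events with pluggable actions. A minimal sketch of a custom action, following the shape of the `datahub-actions` Action interface (the class and what it does are illustrative):

```python
import json

from datahub_actions.action.action import Action
from datahub_actions.event.event_envelope import EventEnvelope
from datahub_actions.pipeline.pipeline_context import PipelineContext

class LogEventAction(Action):
    """Illustrative action: print every metadata change event it receives."""

    @classmethod
    def create(cls, config_dict: dict, ctx: PipelineContext) -> "Action":
        # No configuration needed for this sketch.
        return cls()

    def act(self, event: EventEnvelope) -> None:
        # React here instead: notify a channel, open a ticket, etc.
        print(json.dumps(json.loads(event.as_json()), indent=2))

    def close(self) -> None:
        pass
```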