From YouTube: DataHub at DefinedCrowd: Apr 23 2021 Community TownHall
Description
Pedro Silva shares the journey of adopting DataHub at DefinedCrowd and their plans.
Pedro Silva: So, first of all, thank you so much, Shirshanka, for giving me the opportunity to present to everyone the work we have been doing at DefinedCrowd with DataHub. My name is Pedro Silva; I'm a data engineer at DefinedCrowd.
First, let me very briefly explain what the company is. Essentially, we are a marketplace for AI data, and our objective is to make your AI use cases smarter. The way we do that is by crowdsourcing data assets specific to your use case and your industry, as a service, with certain quality guarantees.
These can be things like speech-to-text, translation, audio, or even image-recognition data assets. Regarding the company: we were founded in 2015, we have over 300 employees, and through our Series A and Series B funding we have already raised over 63 million dollars. Important to mention for this talk is that the way we crowdsource datasets is through our Neevo platform.
Regarding the architecture itself: Neevo generates certain metadata events. Let's suppose a certain unit of work has been assigned to a user and they have been working on it — you have progress over time, it can be cancelled, it can be completed, these sorts of things — so it's a sort of state machine of events, covering multiple aspects of the platform, and that is the input, the source of data, for our platform.
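To make that event flow concrete, here is a minimal sketch of what emitting such state-machine events to Kafka might look like. The broker address, topic name, and event fields are hypothetical and not taken from the talk.

```python
import json
from confluent_kafka import Producer

# Hypothetical broker address and topic name -- placeholders only.
producer = Producer({"bootstrap.servers": "kafka:9092"})

def emit_work_unit_event(work_unit_id: str, user_id: str, state: str, progress: float) -> None:
    """Emit one state-machine event (e.g. assigned -> in_progress -> completed / cancelled)."""
    event = {
        "work_unit_id": work_unit_id,
        "user_id": user_id,
        "state": state,        # current state in the work-unit state machine
        "progress": progress,  # fraction of the unit of work done so far
    }
    producer.produce("work-unit-events", key=work_unit_id, value=json.dumps(event))

emit_work_unit_event("wu-123", "user-42", "in_progress", 0.4)
producer.flush()  # make sure the event actually reaches the broker
```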
C
This
is
then
retrieved
from
a
kafka
log
storage,
where
we
can
perform
a
number
of
things,
one
of
which
is
through
k-sequel,
perform
streaming
joins
and
the
reason
that
we
want
this
is
to
have
a
real-time
computed
view
over
certain
data
views.
If
you
will
on
our
data
and
that
is
stored
back
into
kafka,
which,
through
kafka,
connects,
we
store
in
htfs
and
internally
at
the
fine
crowd
we
use
or
our
cloud
is
azure
and
we
do
use
certain
managed
services,
including,
for
instance,
azure
data
factory,
which
is
azure's
managed
offering
of
something
like.
A
C
It's
their
own
interpretation
of
that
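As an illustration of the kind of streaming join described here, this is a minimal sketch of submitting a persistent query over ksqlDB's REST API. The server address, stream and table names, and columns are assumptions, not details from the talk.

```python
import requests

# Hypothetical ksqlDB server address.
KSQL_SERVER = "http://ksql-server:8088"

# Join the raw work-unit event stream with a (hypothetical) projects table to
# maintain a real-time, pre-computed view that is written back to a Kafka topic.
statement = """
CREATE STREAM work_unit_progress_enriched AS
  SELECT e.work_unit_id, e.state, e.progress, p.project_name
  FROM work_unit_events e
  JOIN projects p ON e.project_id = p.project_id;
"""

resp = requests.post(
    f"{KSQL_SERVER}/ksql",
    json={"ksql": statement, "streamsProperties": {}},
)
resp.raise_for_status()
print(resp.json())
```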
We also have the typical ETL, batch-like processing, in this case through Spark, where we take the raw data from our Kafka topics and process it into views that are usually more consumable for our stakeholders. These views can be consumed either via Druid or via Hive.
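A minimal sketch of that kind of batch ETL — reading raw events landed on HDFS by the Kafka Connect sink and writing a more consumable Hive table — might look like the following; all paths and table names are hypothetical.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("work-unit-etl").enableHiveSupport().getOrCreate()

# Raw events as landed on HDFS (hypothetical path).
raw = spark.read.json("hdfs:///data/raw/work-unit-events/")

# Aggregate the event log into a per-work-unit view that stakeholders can query.
summary = (
    raw.groupBy("work_unit_id")
       .agg(
           F.max("progress").alias("latest_progress"),
           F.max(F.when(F.col("state") == "completed", 1).otherwise(0)).alias("is_completed"),
       )
)

# Expose the curated view through Hive (and, downstream, Druid or JupyterHub).
summary.write.mode("overwrite").saveAsTable("curated.work_unit_summary")
```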
The split between those use cases is: in Druid we want to work with more real-time queries through an SQL-like approach, while Hive, through JupyterHub, serves those cases — data scientists, for example — where people want to perform actual manipulations on the data, perhaps some data cleaning and feature engineering for our internal machine learning models. So, at a very high level, this is what we have as our day-to-day ecosystem.
It's not everything, but it's what I feel is relevant to this conversation right now. As you can tell, this is a very centralized approach, and the whole architecture is owned and managed by the data engineering team. Our vision is to move towards a more data-mesh-like approach, which certainly has benefits for us. Concretely, we want to achieve three things. The first is data democratization: allowing data-driven decision-making by our stakeholders, without bottlenecks or external dependencies.
They should be able to make those decisions based on data without being dependent on the data team. The reason is the scale of the team: the people who manage this infrastructure number around six. That changes over time, because people get allocated to different projects, but our fan-out ratio, if you will, is six to eighty — if you reduce that, I think it's something like one to twelve — and that's sort of how it works. Given this scale, we naturally want our users to be more self-serviceable.
Speaking of intuitive tooling, that's where DataHub itself comes in. Self-serviceability is not really possible without data discovery and data lineage over the assets that we have, and even for the data team we increasingly have a harder time keeping up with the growth of data assets: given that the company itself is a data provider, our asset catalog is continuously growing, and for six people that's a lot of information to handle.
So we needed a data catalog. Exploring the current ecosystem, we looked at options like WeWork's Marquez, Apache Atlas, and Lyft's Amundsen, mainly because they are open-source solutions and we wanted to be able to contribute back in case their approach was not a perfect fit for the DefinedCrowd use case. We did also look for inspiration at companies like Netflix and Intuit, which had other approaches. In the end, though, we decided to go with DataHub, the reason being its extremely active community.
Possibly some pieces will be created by us internally, and having that flexibility is very important to us. To give you a sense of where we are right now: we've gone through the exploration and proof-of-concept deployments, and we are now at a production-level deployment, though with a very basic use case.
C
They
are
directly
interacting
with
data
that's
available
in
variable
in
druid
and
hive,
and
this
was
a
work
that
involved
three
people
and
overall,
it
was
an
extremely
positive
experience
and
our
initial
rollout
had
over
20
plus
data
users
and
their
feedback
has
been
quite
good,
though
at
this
time
because
of
the
lack
of
metrics.
I
can't
really
tell
you
if
this
number
has
changed
over
time,
but
it
is
the
information
that
we
have
right
now
and
but
finally,
data
how
about
as
a
system.
It
is
relatively
complex
right.
C
It's
a
it's
a
large
database
with
a
lot
of
moving
parts,
but
the
community
support
has
been
exceptional.
So
in
that
sense
I
feel
it
has
been
an
excellent
choice
on
our
part
regarding
contributions-
and
I
know
shashank
already.
we did contribute a few things: in particular, monitoring metrics and cron-based crawling support for metadata, all of it done in Kubernetes because that is our default deployment mode, and finally support for Druid. That is not the end of our contributions, I hope, but we will see as time moves on. So, just to give you a sense, this is the ecosystem that we had before, and this is what we have additionally with DataHub. You have your Druid and Hive installations;
this is all done in Kubernetes, and then, through the DataHub metadata crawlers and the cron jobs, we crawl Druid and Hive.
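For readers who want a feel for what such a cron-driven crawl looks like, here is a minimal sketch using DataHub's Python ingestion framework (the same recipe can be expressed as YAML and run from a Kubernetes CronJob). The Hive and DataHub endpoints and the database name are assumptions, not details from the talk.

```python
from datahub.ingestion.run.pipeline import Pipeline

# Hypothetical Hive and DataHub GMS endpoints -- adjust for a real deployment.
pipeline = Pipeline.create(
    {
        "source": {
            "type": "hive",
            "config": {
                "host_port": "hive-server:10000",
                "database": "curated",
            },
        },
        "sink": {
            "type": "datahub-rest",
            "config": {"server": "http://datahub-gms:8080"},
        },
    }
)

pipeline.run()                # crawl Hive and push metadata to DataHub
pipeline.raise_from_status()  # fail loudly so the CronJob pod reports an error
```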
And finally, with regard to opportunities: naturally, we feel there are things to improve — there always are, even in good projects — and in our case, for our use case, it's dynamic metadata models.
C
It
is
true
that
these
models
are
flexible
and
that
you
can
change.
However,
they
are
hard-coded
into
the
database
in
the
sense
that
if
we
wanted
to
change,
we
needed
to
maintain
a
fork
of
the
project
continuously
and
from
time
to
time,
merge
things
from
the
original
repository
to
game
all
the
niceties
and
all
the
features
that
have
been
released
and
the
reason
we
want
to
do.
This
is
not
all
our
data.
Stakeholders
are
tech,
savvy
right.
Some
of
them
come
from
linguistics
backgrounds,
with
little
to
no
cs
training
and
for
them
semantics
matters.
Other things include role-based access control: having granular, entity-level access definitions is something I feel is quite important — being able to say that a given data user has access to datasets A, B, and C, but not D, E, and F. That is very relevant to us. And finally, field-level lineage based on jobs and pipelines, because we have certain datasets, or certain views, where a subset of the columns are generated by job A and the others are written by job B.
C
This
is
the
sort
of
information
that
we
want
to
make
sure
we
surface
correctly,
though
I
do
have
to
mention
that
all
of
this
is
already
in
data
hub's
backlog.
So
I
don't
think
that
I'm
saying
anything
new
here
sure
shankar,
but
it's
just
what
we
had
to
find
crowd
and
would
like
to
see
in
the
future
and
yeah.
I
guess
that's
it.
I
don't
know.
If
anyone
has
any
questions,
I
do
apologize.
If
I
sped
through
this
presentation,
but
give
me
your
feedback,
if
you
have
any
thank
you
so
much.
Shirshanka: Cool, thanks a lot, Pedro. Definitely plus one on all the pain you felt with some of those things, especially the metadata models — we feel it every time we add one small thing to them — so no-code metadata models are absolutely top of mind for us, and RBAC is also on the roadmap. Field-level lineage as well: we've got the RFC in, so we'll figure out the implementation pretty soon.
A question for you, Pedro, and then we can do some quick questions from the community. You currently have DataHub crawling your Druid and Hive clusters, and you also have some Kafka Connect in your ecosystem, yes? Do you anticipate pushing metadata from that streaming ecosystem into DataHub in the future?
Pedro Silva: Ideally, yes. I feel that a Kafka Connect integration sits more at the level of lineage than of generating metadata assets themselves. Right now, as I understand it, the ingestion framework will generate dataset snapshots and possibly even user information from LDAP systems, and I feel it needs to be adapted to be able to provide these sorts of connections, or updates to aspects of metadata. Kafka Connect feels very much like a source of that type of metadata, and at the ingestion framework level we don't yet have that.
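As a rough illustration of what pushing such lineage edges could look like, here is a minimal sketch using DataHub's Python REST emitter to declare an upstream Kafka topic for an HDFS dataset. The URNs and the GMS address are hypothetical, and this expresses dataset-level (not field-level) lineage under those assumptions.

```python
from datahub.emitter.mce_builder import make_dataset_urn
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import (
    ChangeTypeClass,
    DatasetLineageTypeClass,
    UpstreamClass,
    UpstreamLineageClass,
)

# Hypothetical URNs: a Kafka topic feeding an HDFS dataset via Kafka Connect.
upstream_urn = make_dataset_urn(platform="kafka", name="work-unit-events", env="PROD")
downstream_urn = make_dataset_urn(platform="hdfs", name="raw.work_unit_events", env="PROD")

lineage = UpstreamLineageClass(
    upstreams=[UpstreamClass(dataset=upstream_urn, type=DatasetLineageTypeClass.TRANSFORMED)]
)

mcp = MetadataChangeProposalWrapper(
    entityType="dataset",
    changeType=ChangeTypeClass.UPSERT,
    entityUrn=downstream_urn,
    aspectName="upstreamLineage",
    aspect=lineage,
)

# Hypothetical DataHub GMS address.
DatahubRestEmitter("http://datahub-gms:8080").emit_mcp(mcp)
```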
Shirshanka: Agreed. Especially given the way we've done the Airflow lineage integration, I think we can probably use the same strategy for Kafka Connect: framework-level integration where, when those pipelines start up, they emit metadata events that then connect the edges together. That, I think, would be great for the community as well. Are there any other questions? We have a minute.
A
You
can
always
drop
questions
if
you
have
them
offline
on
the
chat
and
I'll
try
to
get
back
to
them
towards
the
end
cool
thanks.
A
lot
pedro
and
best
of
luck.