From YouTube: DataHub adoption journey at Geotab : Feb 19, 2021
Description
John Yoon describes the adoption journey of DataHub at Geotab.
Recorded at: DataHub Community Meeting : Feb 19, 2021
B
So Geotab is a global leader in telematics, with about 2.1 million subscribed vehicles using our products and services. We're one of the very few telematics companies that make both hardware and software. For anyone who's not familiar with the term, telematics means that we use IoT devices and OEM software to collect data from vehicles and provide various products and services to help our customers. Below are some examples of how we help our customers improve their fleet productivity and optimization, enhance driver safety, and achieve stronger compliance with regulatory changes.
B
But it didn't really take too long for Geotab to realize that the commercial route wasn't for us. Although the outputs from these commercial products were fancy and shiny, the actual value-add from using them simply didn't outweigh the direct and indirect costs, like vendor lock-in, limited customizability, implementation and service costs, and licensing fees.
B
I think most community members, judging from the use cases they shared at previous town halls, had a very similar list of open-source products to evaluate. From those we shortlisted Atlas, Amundsen, and DataHub for our evaluation.
B
Functional and non-functional requirements were very important, but one of the key evaluation metrics (I wouldn't say we were unique; I'm sure someone else also looked into it) that made us select DataHub was the approachability and technical capability of the leading dev team.
B
The leading dev team, LinkedIn, as most of us know, has a solid portfolio of open-source projects that they designed and donated to the Apache Foundation, and the DataHub team has been very approachable, responsive, and open during our evaluation phase. For a very small team at Geotab trying to tackle the problem, that mattered to us.
B
So for our first crack at DataHub, we onboarded a small number of datasets, just over 250, and had 60 users from one department try out DataHub. The result was somewhat disappointing: the adoption rate was very poor, and the feedback was discouraging. In the users' eyes, DataHub wasn't any better than how they already searched for datasets in Google BigQuery.
B
For some it was useful, but there weren't enough occasions when they needed to find something on DataHub. So I asked myself: I was told that data discovery was a problem at Geotab, but it turns out the scope of the PoC was poorly established. I made a very naive decision to blindly accept what someone else had said, and took the scope from the Collibra PoC, which was also an unsuccessful PoC.
B
So for the past few months I took it on myself to learn what's really going on behind the scenes. Just to give you some overview of what the data journey was like from 50,000 feet: Geotab grew very fast, about 500% growth in revenue and size in five years. Within that time, Geotab acquired five different companies, which contributed not only to growth in revenue and size, but also to the complexity of our data architecture and our data management and governance structure.
B
Teams aren't so big; they work with relatively small sets of data and have strong tribal knowledge of what data to use, or who to reach out to with questions within their domain. This was one of the key reasons why users from the first PoC didn't have the need to search for the data they needed. And many teams don't have a data management or governance structure, while the ones that do are using different tools and processes to integrate, store, and derive data.
B
So over the past few months I spent most of my time talking to people from other departments to understand where we are in terms of data management, and then made a proposal on what we would need to change from an architectural, integration, security, compliance, operations, and metadata management perspective.
B
So in 2021, one of our goals is to productionize DataHub. We're currently working closely with Shirshanka's team, John and Gabe, to learn more about their React app and assisting them bit by bit in building the React application. Once we're comfortable with the app in the testing environment, we're planning on productionizing DataHub at Geotab internally.
B
DataHub's generalized metadata model allowed us to start conversations with other departments at Geotab to model the custom entities they want to catalog, while capturing meaningful relationships with other DataHub entities. So, basically, we are discussing this, and we'll be treating DataHub as an internal open-source project, so that other departments' dev teams can also contribute internal features and custom entities.
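To make the idea concrete, a custom cataloged entity can be thought of as a URN plus a set of metadata "aspects" and typed relationships to other entities. This is only a conceptual sketch, not DataHub's actual schema language (DataHub defines its model in Pegasus/PDL), and all the names below are hypothetical:

```python
from dataclasses import dataclass, field

@dataclass
class Relationship:
    kind: str          # e.g. "OwnedBy", "ProducesDataset" (illustrative)
    target_urn: str    # URN of the related entity

@dataclass
class CustomEntity:
    urn: str
    aspects: dict = field(default_factory=dict)       # aspect name -> payload
    relationships: list = field(default_factory=list)

    def add_aspect(self, name: str, payload: dict) -> None:
        self.aspects[name] = payload

    def relate(self, kind: str, target_urn: str) -> None:
        self.relationships.append(Relationship(kind, target_urn))

# A hypothetical "telemetry feed" entity linked to a dataset it produces.
feed = CustomEntity(urn="urn:example:telemetryFeed:gps")
feed.add_aspect("ownership", {"owners": ["data-platform@example.com"]})
feed.relate("ProducesDataset", "urn:example:dataset:bigquery.gps_raw")
```

The point of the generalized model is exactly this shape: departments define their own entity types and aspects, while the relationships tie them back into the existing graph of DataHub entities.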
B
We're still very new in our open-source journey, but our plan is to make meaningful contributions to the community as much as possible. We just started to contribute to the open-source React application; we made a couple of contributions over the past couple of weeks, and hopefully the numbers will grow over time. We're not adding too much value at this point, but we're slowly shifting toward an open-source-first mindset: generalizing our use case as much as possible to find opportunities to contribute back to the community while solving our internal problems.
B
So these are some of the wish-list items before I close. I think I mentioned in the Slack channel that hopefully we can have the roadmap timelines updated on the open-source repo. One of the pain points when we were having discussions internally with other departments was that there wasn't really an easy way for us to quickly understand what entities, aspects, and properties are currently available in DataHub.
B
That way we can minimize redundant effort when we create new custom entities. So a metadata-model graph with a graphical visualization, to help community members quickly see what entities, aspects, and properties are available and what the relationships among them are, would be very helpful in my opinion. And column-level lineage is something we've been tackling internally, asking ourselves what would be the most efficient and automated way to first capture the column-level relationships.
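Once captured, column-level relationships can be stored as a simple edge list and resolved transitively. This is a generic sketch of the idea, not Geotab's or DataHub's implementation, and the table and column names are made up:

```python
from collections import defaultdict

# Each edge maps a derived column to one source column it depends on.
edges = [
    ("reports.daily_trips.distance_km", "raw.gps_points.lat"),
    ("reports.daily_trips.distance_km", "raw.gps_points.lon"),
    ("reports.daily_trips.driver_id",   "raw.trips.driver_id"),
]

upstream = defaultdict(set)
for target, source in edges:
    upstream[target].add(source)

def upstream_columns(column: str) -> set:
    """Transitively resolve every source column feeding `column`."""
    seen, stack = set(), [column]
    while stack:
        for src in upstream.get(stack.pop(), ()):
            if src not in seen:
                seen.add(src)
                stack.append(src)
    return seen
```

The hard part in practice is producing those edges automatically (e.g. from SQL parsing); once they exist, surfacing them in a catalog is the easy half.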
B
So when the feature is available in DataHub, we can readily surface it. And a social feature has been one of the hot discussions internally; I know most of the commercial products have this feature. It's not the highest-priority item on our backlog, but I think it would be very valuable for the DataHub community as well. And that's about it.
A
Questions

Cool, that was great, John, thanks for sharing the journey. I can definitely relate to a lot of those challenges and concerns. The one thing that we've had quite a lot of debates about with a lot of teams, especially central teams, is exactly this:
A
Rationalizing whether we only put the clean data in DataHub, meaning the clean metadata, or whether we actually put everything in there, have the clean data rise to the top, and use that as a way to drive data governance. So that's something that's definitely on my mind; it's a big topic of debate in lots of communities as well.
C
If I can just quickly jump in: my team built the data portal at Airbnb, and we went through a similar decision-making process. There's something magical that happens when you have, you know, more than 200 weekly active users of your product: you'll find the right blend of trusted datasets and datasets that people want to be productive with. So I believe it's just about growing usage; the data-set quality questions will settle themselves once you get the experts using the tool.
D
Yeah, we encountered the same challenge here at Amazon with our clients, and what we found works better is to make it the responsibility of the publisher of the dataset to tell whether the data is reliable, et cetera. It also relates to what we see with a lot of customers, and even our internal teams, building a feature store, right? Is this dataset something you can rely on for your reporting or BI?
D
So we push it to the publishers and the subscribers, and we just create, you know, JSON that defines the contract between the publisher and the subscriber about the dataset. So we try to use technology to enforce it, but what I've seen is that you always need a man in the middle, like a data steward or someone from legal, to say: okay, can you actually publish this data, et cetera?
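Such a publisher/subscriber contract could look something like the following. This is a hedged sketch only; the field names are hypothetical, not Amazon's actual schema, and a real contract would carry far more detail (SLAs, PII flags, schema versions, approvers):

```python
import json

# Hypothetical JSON data contract between a publisher and its subscribers.
contract = json.loads("""
{
  "dataset": "sales.daily_orders",
  "publisher": "order-platform-team",
  "subscribers": ["bi-reporting"],
  "reliability": "production",
  "schema": {"order_id": "string", "amount": "float"},
  "steward_approved": true
}
""")

def can_consume(contract: dict, team: str) -> bool:
    """A subscriber may rely on the dataset only if the publisher marks it
    production-grade, a steward has approved it, and the team is listed."""
    return (
        contract["reliability"] == "production"
        and contract["steward_approved"]
        and team in contract["subscribers"]
    )
```

The `steward_approved` field is where the "man in the middle" shows up: technology checks the contract, but a human still has to sign off before the flag flips.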
D
So this is why, and I agree with you about the debate, we said: let's bring in the publishers. They are the owners of the data, so they have the responsibility. I'm happy to share, maybe at the next meetup, some of the architecture and how we solved it in several use cases. And again, like you mentioned Collibra and Alation, we looked at all these third parties with some customers, and we always run into this: there isn't a man-in-the-middle process that can be enforced.