From YouTube: DataHub Community Meeting (Full): Feb 19 2021
Description
Full version of the DataHub Community Meeting on Feb 19th 2021
Agenda:
Welcome - 5 mins
Latest React App Demo + Tags Preview by John Joyce and Gabe Lyons - 10 mins
Use-Case: DataHub at Geotab by John Yoon - 15 mins
Tech Deep Dive: Tour of new pull-based Airflow + Python Ingestion scripts by Harshal Sheth - 15 mins
General Q&A from sign up sheet, slack, and participants - 15 mins
Closing remarks - 5 mins
C: Awesome, you might see something you like. We have some prototypes of tags.
C: Hello, welcome to the community; we haven't seen you before. Do you mind quickly introducing yourself?
A: Yes, my name is Juwan Petrini. I work as a senior data engineer at Klarna, together with Thomas Larson, who should be in this meeting as well. We are building a data catalog and we are basing it on DataHub; that's the reason I'm here.
C: Well, welcome. And John... John?
F: Hi, I'm the lead data engineer for a fashion marketplace company called Depop, and I'm here for similar reasons to everybody else: we're looking at starting a data catalog and investigating DataHub. I'm here with my colleague Maria, who's here somewhere as well.
C: Hi Maria, welcome. I see Thomas, also from Klarna, I think.
D: Yeah, hi. I don't know, maybe Juwan has already introduced us... yeah, yeah.
C: Cool. Oh, that's awesome, John from Geotab is here. I had heard that he might be a little delayed, but I'm glad to see that he could make it. (Yeah, I'm here.) Awesome. Suresh, hello!
H: Hi everyone, I'm Suresh. I'm heading a data engineering team for the Children's Hospital of Philadelphia. My interest today is obviously to learn more from you all; we would currently like to evaluate DataHub as an enterprise data catalog. I'm hoping to have another team member, Joe Marzio, here with me today, and we're very, very interested; I think there's a lot of stuff going on here. And thank you for forwarding that invite, Shirshanka. Thank you.
I: Hello, this is Nagashina. I'm working for LinkedIn on the DataHub project.
C: All right, welcome. This is our second community meeting of the year. We have settled on doing it every third Friday of the month; that's how we pick the dates.
C: This time we also have John from Geotab, who's going to talk to us about how the adoption journey with DataHub at Geotab has been, and Harshal joins us to give us a quick tour of the new ingestion framework that he contributed and to describe some of the design decisions he made. Then we'll take some of the Q&A from the spreadsheets and figure out logistics for the next meeting.
C
So
that's
the
plan,
a
quick
thing
that
I
was
trying
to
categorize
and
I
think
we
should
make
it
a
bit
more
formal
from
the
next
time
onwards.
I
was
just
looking
at
kind
of
the
developments
that
have
happened
in
the
last
month.
We've
had
close
to
50
prs
marched,
that's
probably
a
record.
I
do
need
to
go
back
and
look
at
kind
of
the
historical
trends
in
terms
of
big
things
that
have
landed
the
graphql
support
that
we
talked
about
last
month.
C: Another big thing that was merged was the ML models backend implementation. So thanks to Ryan Holstein for pushing that through, and to the LinkedIn team for actually shepherding and reviewing it very carefully and thoroughly. Then we have the Python ingestion rewrite, which we'll have Harshal talk about. And finally, we're also seeing Elasticsearch 7 support getting merged in slowly. It's still on a branch, but at least it's there for people to look at, and John Plaisted has been leading the charge on that with Na and Jyoti.
C
We
should
hopefully
see
that
getting
merged
into
masters
soon
or
maine.
That's
another
thing.
We
have
to
fix
all
right.
So
those
were
the
developments
in
the
last
month
and
I
would
like
to
have
john
joyce
take
us
through
a
quick
demo
of
the
react
app
and
let's
see
where
it's
going.
K: I'd say we're at probably 99% feature parity. There are a few gaps, which you'll be able to see in the app, but we've also extended some of the functionality that the Ember app has, or does not have, rather. So we'll start by just logging in. You're greeted with this kind of home search screen; this is the jump-off point for both search and the browse experience, which you're probably familiar with. We'll just start by searching some of the data; we have kind of a new, fresh search. One notable addition is this "All" tab, which allows you to see different entities grouped by type: you can see we have dashboards, charts, users and datasets. From here you can go right to the details pages, pretty similar to what the Ember app has. We have the schema and lineage; one of the notable exceptions in terms of functional parity I'll show in a moment.
K: Let's see... the biggest difference, I think, is that we've recently added dashboards and charts, so now you can actually see them. It's pretty minimal; we're looking to add new types of metadata to the model, but right now we just have, say, for a chart, the source datasets, and for a dashboard, the related charts, so we have this kind of parent-child relationship here. The last functional piece that I think we're really missing, that's big, is the datasets-you-own, charts-you-own and dashboards-you-own tabs in the user profile page. But that's pretty much it. Browse works as you would expect; you just walk through the file-system hierarchy. And then we just had a new addition, the logout button; thanks to the folks at Geotab for getting that done. And that's pretty much the demo. It's ready for deployment.
K: ...if you want to do a side-by-side feature comparison, if you're considering moving over. But one other thing I want to call out before handing it back: we're going to start doing some React office hours that I'll be leading, just dedicated time for people to jump in and ask questions about any part of the React app: deployment, features, roadmap, how to actually make changes. Again, this is an ongoing process; I think in the future we have some interesting action items we're trying to get to.
K
One
of
those
things
is
doing
kind
of
a
comprehensive
ui
revamp
of
the
different
entity.
Details
pages
you
saw
there
as
well
as
the
browse
experience
revisiting
that
so
yeah
I'll
kind
of
send
out
a
link.
It'll
probably
be
fridays
for
about
a
two
hour
block,
but
excited
to
continue
to
work
with
the
community
excited
to
take
more
feedback,
so
feel
free
to
reach
out.
If
you
have
any
feedback
concerns
considerations
around
the
react
app
and
with
that
I'll
hand
it
back
to
tisha
shanka.
C: Awesome, thanks a lot, John. I do know that people wanted to contribute stuff back to the UI, and the Ember application was something that was very hard for us to work with the community on. So I'm really glad we were able to make this push and finally get to the point where I think we can start getting contributions back from the community. I'm really glad that we got our first logout button; even small things like that make us very delighted. So I would love to see the office hours actually start making a difference. We'd love to not have to do all the pages ourselves; we would like to have people start contributing new pages, so that'll be great.
C
I
know
that
gabe
had
something
that
he
wanted
to
quickly
demo
in
terms
of
things
that
he's
working
on
almost
like
a
sneak
preview
of
tags.
So,
okay,
do
you
have
a
little
bit
of
time
to
do
that.
N: So the feature that I wanted to give a sneak peek of today is adding a little bit of richness to the existing tagging system. I'm working with Frederick and Mati from Wolt on developing this richer tagging system, and they're doing an amazing job driving forward the RFC that's hammering out the nitty-gritty details. I think that RFC is still a work in progress, so exactly how this would be modeled in the backend is still being developed, but this is an exploration of what the richer tagging system might look like in the UI.
N: So you can see this is a sensitive dataset that has some tags applied to it: it has this PII tag applied, and we're also able to tag this field, the contact field, by saying that it's an email. In this richer tagging system we might also want to associate some additional context with a tag besides just the name "email". So if we look at this email tag and click into it, it brings us to the tag's page, where we can describe it as, you know, personally identifiable information. The exciting thing here is that, with the ability to associate tags with other tags, we can start to explore this hierarchy of tags: we can click through to the PII tag and then see what tags it is associated with. On this PII page you can see, again, exactly what properties and what metadata we might want to associate with a tag; this is something we're still trying to hammer out, but it's an example of another piece of metadata we could associate. The tag lists demonstrate what tags are associated with what other tags internal to DataHub, but say you also want to capture the relationship between a tag and some definition external to DataHub: we can say here is our PII label inside of DataHub, and this is the external definition of it outside of the DataHub system. And from there we can continue to explore this hierarchy. So this is just a sneak preview of what this tagging system might look like in the future.
C: Awesome, that was pretty cool. I'm sure Madhu is getting excited, because he had submitted the business glossary RFC, and I think he'll see a lot of similarities to how we're thinking about tags. So there's definitely something here where we would basically want to make it as close as possible to how business tags or business glossaries should look, while still allowing a little bit of freeform capability for people to come up with lightweight tags that don't necessarily have to go through a ton of review.
C
So
we'll
we'll
search
for
kind
of
a
the
right
balance
and
so
love
to
work
with
the
community
on
figuring.
What
that
is
thanks
a
lot
yeah.
N: The hope is that the tagging system we end up with (and Frederick and Mati are doing a great job driving that RFC process) is something that could encapsulate the business glossary term, so it becomes somewhat of a more extensible version of exactly the RFC that you drove.
G: Hey, I'm curious: with the tagging system, what's the thought process in terms of how those tags are generated? Is it done via the UI, or would it be similar to the other metadata? How are tags generated and associated with one another?
N: So again, Mati and Frederick are working on an RFC that covers these technical details, but I think that having the capability to edit tags in the UI is something that we would eventually want to support, for sure.
C: There's also the governance-team persona that says: I want to control the number of tags and this kind of explosion, and there are some specific tag taxonomies that we want to make sure are known and understood, business terms that everyone understands. When I was at LinkedIn, we had a compliance taxonomy that we literally hard-coded for the company, and we said: these are the terms; you don't get to invent your own new term for what an email is.
J: I can quickly comment here. I'm from Wolt also, and Frederick has been mostly working on this RFC. But our use case also requires maybe two taxonomies, or two types of adding tags to entities: some types of tags can be added through the UI, but some tags we need to be able to control and audit, for example these PII tags, so we can't hide them by accident, or we have to have some kind of audit log there for those types of tags.
C: Cool, yeah. I think the RFC is open, so please add comments there, and we can make it the right fit for pretty much all the businesses that are trying this out. Maggie, since we have you: does the dashboard and charts page at least look like what you would expect as a first cut?
G: It looks great; that new UI is slick. Looks good.
C: Oh awesome, okay! So now, if the Geotab folks are ready, we can have John actually take over and drive.
O: So Geotab is a global leader in telematics, with about 2.1 million subscribed vehicles using our products and services; we're one of the very few telematics companies that make both hardware and software. For anyone who's not familiar with the term telematics: it means that we use IoT devices and OEM software to collect data from vehicles to provide various products and services that help our customers. Here are some examples of how we help our customers improve their fleet productivity and optimization, enhance driver safety, and achieve stronger compliance with regulatory changes.
O: Geotab spent quite a bit of time in 2019 on evaluations and PoCs with commercial products like Collibra, Alation and Talend, which all had a robust set of features from a data management and governance perspective. But it didn't take too long for Geotab to realize that they weren't for us.
O: Judging from the use cases that community members shared in previous town halls, most of us had a very similar list of open source products to evaluate, and from those we shortlisted Atlas, Amundsen and DataHub for our evaluation.
O: Functional and non-functional requirements were very important, but one key evaluation metric (I wouldn't say unique; I'm sure someone else also looked into it) that made us select DataHub was the approachability and technical capabilities of the leading dev team.
O: The leading dev team, LinkedIn, as most of us know, has a solid portfolio of open source projects that they designed and donated to the Apache Foundation, and the DataHub team has been very approachable, responsive and open during our evaluation phase, which mattered for a very small team at Geotab trying to tackle this problem.
O: For our first crack at DataHub, we onboarded a small number of datasets, just over 250, and had 60 users from one department try out DataHub. The result was somewhat disappointing: the adoption rate was very poor and the feedback was discouraging. In users' eyes, DataHub wasn't any better than how they searched for datasets in Google BigQuery.
O: For some it was useful, but there weren't enough datasets when they needed to find something on DataHub. So I asked myself: I was told that data discovery was a problem at Geotab, but it turns out the scope of the PoC was poorly established, and I made a very naive decision to blindly accept what someone else said and took the scope from the Collibra PoC, which was also an unsuccessful PoC.
O: So for the past few months I took it on myself to learn what's really going on behind the scenes. Just to give you some overview of what the data journey was like: Geotab grew very fast, 500% growth in revenue and size over five years.
O: Over the past few months I spent most of my time talking to people from other departments to understand where we are in terms of data management, and then made a proposal on what we would need to change from architectural, integration, security, compliance, operations and metadata management perspectives.
O: In 2021, one of our goals is to productionize DataHub. We're currently working closely with Shirshanka's team, John and Gabe, to learn more about their React app, assisting them bit by bit in building the React application, and once we're comfortable with the app in the testing environment, we're planning on productionizing DataHub at Geotab.
O: Internally, we had a debate on whether to allow anyone to push any datasets within Geotab, but the decision was to make only the production datasets available on DataHub. There isn't a right answer for choosing one over the other, but we decided to put more emphasis on production-level datasets, which follow our internal data office processes to ensure relevant metadata is captured on data integrity, ownership, security and compliance.
O: DataHub's generalized metadata model allowed us to start conversations with other departments at Geotab about modeling custom entities that they want to catalog, while capturing meaningful relationships with other DataHub entities. So basically we are discussing, and will be treating, DataHub as an internal open source project, so other departments' dev teams can also contribute internal features and custom entities.
O: We just started to contribute to the open source React application; we made a couple of contributions over the past couple of weeks, and hopefully the numbers will grow over time. We're not adding too much value at this point in time, but we're slowly shifting towards an open-source-first mindset: generalizing our use cases as much as possible to find opportunities to contribute back to the community while solving our internal problems at the same time.
O: These are some of my wish-list items before I close. I think I mentioned in the Slack channel that hopefully we can have the roadmap timelines updated on the open source Git repo. And one of the pain points when we were having discussions internally with other departments was that there wasn't really an easy way for us to quickly understand what entities, aspects and properties are currently available in DataHub.
O: That would let us minimize redundant effort when we create new custom entities. So a metadata-model-to-graph visualization, to help community members quickly see what entities, aspects and properties are available and what the relationships among them are, would be very helpful in my opinion. And column-level lineage is something we've been tackling internally, asking ourselves what would be the most efficient and automated way to first capture the column-level relationships.
O: That way, when the feature is available in DataHub, we can readily surface it. And a social feature has been one of the hot discussions internally; I know most of the commercial products have this feature. It's not the highest-priority item on our backlog, but I think it would be very valuable for the DataHub community as well. And that's about it.
C: Cool, that was great, John; thanks for sharing the journey. I can definitely relate to a lot of those challenges and concerns. The one thing that we've had quite a lot of debates about with a lot of teams, especially central teams, is exactly this rationalizing of: do we only put the clean data in DataHub, meaning the clean metadata in DataHub, or do we actually put everything in there, have the clean data rise to the top, and use that as a way to drive data governance? That's definitely on my mind; it's a big topic of debate in lots of communities as well.
I: Okay, if I can just quickly jump in: my team built the Dataportal at Airbnb, and we went through a similar decision-making process. There's something magical that happens when you have more than 200 weekly active users of your product: you'll find the right blend of trusted datasets and datasets that people want to be productive with. So I believe it's just about growing usage, and the dataset-quality questions will settle themselves once you get the experts using the tool.
M: Yeah, we encounter the same challenge here at Amazon with our clients, and what we found works better is to put the responsibility on the publisher of the datasets: they need to say whether the data is reliable, etc. We see a lot of customers, and even our internal teams, building things like a feature store, so the question becomes: is this dataset something that you can rely on for your reporting or BI?
M
So
we
push
it
to
the
publishers
and
the
subscriber,
and
we
just
create,
like
you
know,
json,
that
that
define
the
contract
between
the
publisher
and
the
subscriber
about
the
data
set.
So
we
try
to
you
use
technology
to
enforce
it,
but
what
I've
seen
that
you
always
need
the
men
in
the
middle,
like
the
data
steward
or
or
someone
from
legal
to
tell
okay,
can
you
actually
publish
this
data
et
cetera?
M: I'm happy to share, maybe in the next meetup, some of the architecture and how we solved it in several use cases. And again, like you mentioned Collibra and Alation: we looked at all these third parties with some customers, and we always get to this point where there is a man in the middle, or processes that need to be enforced somehow. So I agree with you on that.
C: Absolutely, we'll take you up on that offer, Roy. Yeah, great. Okay, all right: if there are no more questions, we'll go over to the next item on the agenda, which is having our newest contributor, Harshal, give us a tour of the new ingestion framework that he contributed.
P: Let me share my screen.
M: Yeah, sure. I'm Roy Ben-Alta; I'm based in New York. I've been with Amazon seven-plus years, and I've had several roles with Amazon, like product and SA; today I'm a director at AWS. What we do: we work with all the AWS customers and partners, helping them with their journey with data and machine learning on the cloud.
M: Another thing that my team does, because we cannot meet the demand of all the customers that need help, is a lot of open source as well. One of our most popular open source projects, which reached two million downloads, is AWS Data Wrangler.
M
It's
a
pandas
library
that
we
build
in
python
to
for
folks
that
don't
like
to
use
spark
and
they
love
pandas,
especially
the
data
scientist
community,
so
and,
and
we're
using
a
lot
of
open
source
in
our
tools,
of
course,
aws
as
the
service
team
that
are
building
products,
etc.
M: Some of our customers are Amazon themselves; Amazon has different business units, from Alexa to Amazon Go, and we actually help them as well, because they are using AWS. I joined Amazon in 2013, when Amazon Web Services had 15 services; today it's almost 200, so the platform has evolved over the years. I've been in data analytics for probably most of my career, and my background is distributed computing, so I'm a big fan of what LinkedIn did with Kafka and others.
M
So
you
know
jay
and
ania
from
past
life
where
before
confluence
so
always
happy
to
join
back
to
the
meetups
and
teams.
So
now,
in
my
role,
I
think
there
is
a
tight
correlation
of
things
that
we
can
do
together
on
the
open
source
so
happy
to
help
and
again
I'm
not
selling
any
aws
services
in
here.
So,
just
from
from
sharing
knowledge
and
and
learn
from
you
as
well
so
nice
to
meet
you
cool.
P: Awesome. Yep, so for the past couple of weeks I've been working on a new Python ingestion framework for DataHub. First off: why did we do this? The status quo was that the Python ingestion framework we had previously was really just a set of scripts, and people were already using that to ingest metadata into DataHub.
P
But
you
know
there
were
a
couple
shortcomings
there.
Specifically.
You
know
it
was
hard
to
ingest
via
both
kafka
and
the
rest
api.
If
you
wanted
to
get
instantaneous
feedback,
you'd
want
the
rest
api,
but
using
that
with
those
scripts
wasn't
possible,
and
another
thing
was
that
we
had
these
opaque
json
blobs,
that
you
would
ingest
into
data
hub
and
it
was
based
on
avro,
which
is
a
serialization
format.
Similarly,
protobuf
the
issue
is
that
it's
it's
pretty
difficult
to
use.
There
are
a
lot
of
sharp
edges.
P
People
run
into
issues,
not
know
what
the
schema
was
for
the
or
you
know
how
to
format
their
data,
and
so
we
get
a
lot
of
questions
around
what
what
is
even
possible.
P
With
that,
and
specifically
you
know
not
having
type
annotations
around
that
was-
was
something
that
a
lot
of
people
struggled
with
and
they'd
run
into
a
bunch
of
runtime
errors
when
they
try
to
try
to
execute
their
code,
but
they
wouldn't
actually
have
any
prior
warning
and
then
the
other
thing
was
that,
in
order
to
configure
your
metadata
ingestion,
you
need
to
go
and
modify
a
bunch
of
code,
which
was
not
the
ideal
situation.
Ideally,
you
just
modify
some
configuration
and
then
the
code
remains
the
same.
P
What
we
found
is
that
people
wanted
to
stick
with
python
because
of
the
the
ecosystem
around
it
all
the
open
source
projects.
You
know
things
like
airflow
and
numpy
and
tensorflow
and
so
forth,
and
so
they
were
used
to
this.
They
wanted
to
continue
to
use
it
to
ingest
data
into
data
hub,
and
we
wanted
a
principled
way
to
make
that
happen.
P
You
write
and
then
forget-
and
you
know
you
don't
necessarily
know
that
it
was
processed
correctly
until
you,
you
receive
the
audit
event
or
the
failed
metadata
event
back,
whereas
with
the
rest
api,
it's
a
little
bit
more
instantaneous
and
so
for
different
use
cases.
I
wanted
to
use
different
things.
P: We wanted to enable that. We were inspired by Apache Gobblin for the architecture of this, and I'll go into a little more detail on what that means. And the final thing we made sure to do when architecting this was file-based configuration: you write configuration in a YAML or TOML file, and then you can just run the DataHub ingestion framework against that config.
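For illustration, a recipe file along these lines pairs a source with a sink. This is a minimal sketch based on what the demo shows; the exact option names are assumptions rather than quotes from the repo:

```yaml
# recipe.yml -- illustrative ingestion recipe (option names are assumptions)
source:
  type: mysql
  config:
    host_port: localhost:3306
    username: datahub
    password: datahub

sink:
  type: console   # or a DataHub REST / Kafka sink, as he describes next
```

Running the framework against this file is then a single CLI invocation, and retargeting the sink is a config change rather than a code change.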
P: So how did we architect this? As I mentioned, it was inspired by Apache Gobblin. Right now we have two main abstractions: one is the source and the other is the sink. The sinks are the pieces that take an event of some sort and write it into DataHub, and that can happen over Kafka, over REST, or, for debugging purposes, you can just dump it to the console or write it to a file.
P
This
is
a
central
concept
in
data
hub
every
single
event,
or
every
single
change
in
metadata
is
modeled
as
a
metadata
change
event,
and
you
can
update
you
know
basically
do
all
of
the
operations
that
you
want
to
do
by
emitting
a
number
of
mcds
or
metadata
change
events,
and
then,
finally,
we
had
we
have
the
sources,
and
these
are
wide
and
varied
everything
from
databases
to
a
file
to
even
like
ingesting
the
metadata
of
kafka
itself,
and
as
long
as
the
source
can
create
a
metadata
change
event.
P
So
I
will
talk
a
little
bit
more
about
what
it
actually
takes
to
add
a
source
as
well.
P
So
you
know,
as
with
all
live
demos,
I'm
gonna
take
a
stab
at
it,
but
you
know
things
go
wrong
in
live
demos,
so
bear
with
me.
Hopefully
you
can
all
still
see
my
screen
here.
P
So
all
of
this
is
in
the
metadata
ingestion
directory
the
easiest
way
to
install
it.
We
can
build
the
schemas
from
the
rest
of
data
and
then
we
have
a
relatively
simple
set
of
commands.
You
can
just
copy
and
paste
this
into
your
terminal
and
get
set
up
immediately.
P: The first thing you might want to do is ingest some sample data, and the way to do that is "datahub ingest"; we included a number of examples.
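A minimal sketch of that setup-and-run flow; the paths and recipe names here are assumptions, not the repo's actual ones:

```bash
# From a checkout of the datahub repo (paths are illustrative):
cd metadata-ingestion
python3 -m venv venv && source venv/bin/activate
pip install -e .    # installs the `datahub` CLI into the virtualenv

# Ingest one of the bundled sample recipes:
datahub ingest -c examples/recipes/example_recipe.yml
```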
P
So
if
we
run
this,
we'll
see
oops
that
that's
not
supposed
to
happen,
as
with
all
demos
bear
with
me
one
sec,
I
will
worry
about
debugging
that
after
theoretically
it
will
work.
So
we
will.
We
will
fix
that
later.
P
But
what
what
you'll
see
is
you
know,
you'll,
go
to
data
hub
and
go
under
data
sets
and
we'll
see
some
stuff
here,
we'll
fix
the
ingest.
The
other
thing
that
that
I
did,
I
I
used
to
be
a
student
at
yale
and
I
ran
a
course
selection
tool
there
with
a
mysql
database,
and
so
I
actually
ingested
this
real
world
database
into
datahub.
P: This is the configuration; there's a username and password above where I've scrolled. We have the host and port, and then we have a filter rule, so you can filter out certain MySQL tables and allow the rest. Initially I just printed the output to the console, but we can actually write the events to DataHub, over Kafka this time, let's say. So (oops, got to change the recipe) we can run it.
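The recipe he describes would look roughly like this; the filter syntax and sink options are a sketch inferred from the demo, not copied from the repo:

```yaml
# mysql_to_kafka.yml -- illustrative recipe for the demo above
source:
  type: mysql
  config:
    host_port: localhost:3306
    username: reader
    password: example
    table_pattern:
      deny: ["mysql.*"]   # filter out MySQL's internal tables
      allow: [".*"]       # allow the rest

sink:
  type: datahub-kafka     # swap in `console` to just print the events
  config:
    connection:
      bootstrap: localhost:9092
```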
P: It will configure all of the ingestion, then go through and ingest a number of tables, and then you get a nice summary of what happened: it filtered out all of the MySQL internal tables, and here are all of the tables it actually fetched. And so, if we go into DataHub, we can take a look at, let's say, the students table, and we get the full schema.
P
We
get.
You
know
a
bunch
of
other
information
related
to
the
tables
that
we
just
ingested
yeah.
So
that's
the
that's
the
usage
side
of
things.
Let's
talk
a
little
bit
about
what
it
takes
to
add
a
source,
so
the
simplest
source
that
we
have,
let's
start
with
the
source,
is
py.
P: These are the three operations that a source needs to support, and if it supports those three, we can integrate it into the rest of the framework. As an example, here's a very simple source that just reads from a file.
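In spirit, the contract looks something like the following standalone sketch; the method names (`get_workunits`, `get_report`, `close`) are assumptions about the interface rather than quotes from the code:

```python
# Illustrative sketch of a file-backed source; names are assumptions.
import json
from typing import Iterable


class FileSource:
    """Reads newline-delimited JSON metadata events from a file."""

    def __init__(self, filename: str) -> None:
        self.filename = filename
        self.events_produced = 0

    def get_workunits(self) -> Iterable[dict]:
        # Yield one metadata change event (MCE) per line; the framework
        # hands each one to the configured sink (console, REST, Kafka).
        with open(self.filename) as f:
            for line in f:
                self.events_produced += 1
                yield json.loads(line)

    def get_report(self) -> dict:
        # Feeds the end-of-run summary shown in the demo.
        return {"events_produced": self.events_produced}

    def close(self) -> None:
        pass  # nothing to clean up for a plain file
```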
P
So
it
reads
that
and
then
you
know,
constructs
a
metadata
change
event
out
of
it
using
the
gen
classes
and
then
sends
them
into
the
rest
of
the
framework,
and
you
know
that's
that's
how
simple
it
is
to
add
a
source,
a
slightly
more
complex
one.
Let's
take
a
look
at
the
source
for
kafka,
so
this
ingests
metadata
about
the
topics
and
partitions
and
so
forth
in
kafka,
and
it
sends
them
into
data
hub,
and
so
here
it's
a
little
bit
more
complex.
You
connect
you
construct
your
your
consumer
and
then
similar
thing.
P
You
know
using
things
like
aspects
and
snapshots
and
all
of
these
things
that
we're
relatively
familiar
with
as
long
as
you
can
do
that,
then
you
know
the
source
plugs
into
the
rest
of
the
framework.
Accordingly,
all
right,
so
last
thing
I
wanted
to
talk
about
touch
upon
is:
where
are
we
headed
with
this.
C: Can I ask a question? Can you walk us through the codegen that you're doing from the Avro schemas? I thought that was one of the hard things about this project.

P: Yep, sure. Let's take a look here. The .avsc file, the Avro schema file, looks like a big JSON blob, with all of the fields and everything. Accordingly, we have a codegen system that I took from an open source project and modified, and I'm also working to contribute that back.
P: So take, say, the ML properties class; actually, let's do the dataset snapshot. When you construct it, the type system is attached automatically and it all associates correctly. This way, if you run mypy on this code, it completely checks every single assignment and every single constructor, so you know for sure that the code is at least semantically correct in terms of types. So that's a little bit on the codegen; and obviously it generates a truly massive file, about 5,000 lines.
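As a rough illustration of what the codegen buys you; the class names below are stand-ins for the generated ones, not quotes from the generated file:

```python
# Stand-ins for codegen'd classes, to show the typing benefit.
from dataclasses import dataclass, field
from typing import List


@dataclass
class DatasetPropertiesClass:    # stand-in for a generated aspect class
    description: str
    tags: List[str] = field(default_factory=list)


@dataclass
class DatasetSnapshotClass:      # stand-in for a generated snapshot class
    urn: str
    aspects: List[DatasetPropertiesClass]


snapshot = DatasetSnapshotClass(
    urn="urn:li:dataset:(urn:li:dataPlatform:mysql,db.students,PROD)",
    aspects=[DatasetPropertiesClass(description="Course enrollment data")],
)

# Because every field is annotated, `mypy` rejects mistakes such as
# DatasetPropertiesClass(description=123) before the code ever runs.
```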
P
So
you
know
glad
we
aren't
writing
this
by
hand
and
updating
it
did
that
answer
your
question.
P
Yeah,
so
let's
talk
a
little
bit
about
where
we're
headed
with
with
ingestion,
so
we're
going
to
do
a
more
formal,
rfc
process
on
this
relatively
soon
and
then
hopefully
also
publish
a
package
to
pi
pi.
So
you
can
pip
install
you,
know,
data
hub
or
something
along
those
lines
and
then
start
executing
this,
and
you
don't
need
to
do
the
code
gen
yourself
and
all
of
those
other
steps.
P: The other things are more functional improvements. For example, detecting when metadata is stale: if you've deleted a table in your source, let's say MySQL, we want to be able to detect that it disappeared and perform the associated deletion in DataHub. Right now it's a purely additive process; we can do updates, but deletes are not something that we support. Another thing is validating that metadata was actually ingested correctly: if you run with Kafka and you just send, let's say, 30 events to a Kafka topic or to the broker, you don't necessarily know that those all got accepted correctly, so doing that validation step would also be very helpful. And then there's the Java ingestion side: we have Python and we also have Java. If people really love the Python, we'll invest more in that; if people still want to be able to use the Java, we'll do that accordingly as well. And the last thing is standardizing a testing harness between the two, so we can test functional parity between the Java and the Python.
K: I have one, Harshal: I was hoping you could talk a little bit about how someone would productionize this, say in Airflow or something.
P: Yeah, sure. We include a couple of sample DAGs in our repo, so you can use this directly within Airflow. It's actually quite simple: let's say for MySQL, you just create a pipeline, give it a configuration and call run. As long as you can do this within a PythonOperator in Airflow, you can run it, and you'll get all the standard error reporting out of Airflow as you would expect.
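A minimal sketch of that pattern; the `Pipeline` import path and the config keys are assumptions, so check the sample DAGs in the repo for the real ones:

```python
# Illustrative Airflow DAG; module paths and config keys are assumptions.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def ingest_mysql() -> None:
    # Assumed import path for the framework's pipeline runner.
    from datahub.ingestion.run.pipeline import Pipeline

    pipeline = Pipeline.create(
        {
            "source": {
                "type": "mysql",
                "config": {"host_port": "localhost:3306", "username": "reader"},
            },
            "sink": {
                "type": "datahub-rest",
                "config": {"server": "http://localhost:8080"},
            },
        }
    )
    pipeline.run()


with DAG(
    dag_id="datahub_mysql_ingest",
    start_date=datetime(2021, 2, 1),
    schedule_interval="@daily",  # recurring ingestion, not a one-time import
    catchup=False,
) as dag:
    PythonOperator(task_id="ingest", python_callable=ingest_mysql)
```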
P
So
it
is,
it
is
production
ready
already,
and
you
know
you
can
use
it.
However,
you
like,
I
think
it
does
make
sense
to
run
it
within
an
orchestration
framework
like
airflow
or
even
cron.
If
you
want
something
simple
so
that
you
can
continually
get
updated
metadata
from
a
from
a
given
source,
instead
of
just
a
one-time
import.
C
Yeah
this
is
cool,
because
I
know
that
data
hub
has
this
positioning
almost
as
push-based
right
and
so
a
lot
of
people
think
that
oh,
it's
push-based.
So
I
cannot
pull
metadata
into
this
system,
but
I
think
this
kind
of
shows
the
people
like
how
easy
it
is
to
once
you
have
a
push
based
system.
You
can
always
add
on
a
pull
based
system
upstream
of
it
to
essentially
essentially
pull
metadata
into
your
system,
and
you
don't
have
to
really
choose
between
the
two
yeah
you
can.
C
You
can
do
both
if
you
want
all
right,
so
I
had
just
very
small
logistical
things
to
go
over
very
quickly.
C: Okay, a quick thing on the next meeting: we got a lot of folks from the EU who have struggled to join because it's a bit late for them. So for the next one we are going to move up the slot and make it 7 a.m. That still allows folks on the west coast to join, as long as they set their alarms correctly, and folks in the central and eastern time zones will be doing just fine.
C
I
think
it
also
works,
so
our
next
meeting
is
going
to
be
third
friday
of
march,
so
that's
turns
out
to
be
19th
again,
thanks
to
february
being
exactly
four
weeks
right,
so
19th
march,
we're
gonna
be
back
7
a.m.
That's
the
next
meeting!
C
That's
all
I
had
on
logistics.
There
were
a
couple
of
questions
on
the
town
hall
meeting
that
I
wanted
to
quickly
go
over
maggie,
hopefully
got
her
answer
on
dashboard
elements.
G: Just one clarifying thing there: is the idea that we should migrate over to the React app for dashboard elements?
C
Yes,
thanks
for
asking
we
essentially
it's
very
hard
to
continue
supporting
the
community
on
the
amber
app.
It's
always
been
hard,
so
the
react
app
is
where
we
will
be
doing
new
development
and
supporting
the
open
source
community.
So
that's
where
you
should
plan
to
move
towards.
C: That was the reason we were pushing to get to feature parity with the Ember app as well.
G: Got it, that sounds great. I just wanted to signal that SpotHero, the SpotHero data engineering team, is still planning on contributing back our Looker integration to the community. That's something we've had to punt on a little bit for some other work, but we should have that out maybe end of this quarter, more likely Q2.
C
That
would
be
great,
and
you
know
hopefully
it
it.
The
new
ingestion
framework
helps
you
with
rewrite,
like
writing
the
least
amount
of
code
for
it,
and
and
finally,
you
can
turn
off
that
hack
to
make
looker
dashboards.
Pretend
to
be
data
sets
so
yeah.
A: Yes, it helped me a bit, but I'm really still a DataHub newbie. I saw in your presentation before that you actually had this sort of schema for a dataset that was a table; it looked like a relational table, with the column descriptions and all that. So that would answer my question.
C
Yeah,
I
think
I
think,
for
those
classic
metadata
that
is
just
on
the
columns.
We,
the
the
standard
injection
script,
does
pull
it
in.
C: If you do something custom, where you have very custom table properties or column properties that you're squirreling away rich metadata in, then talk to us about how to convert and extract that metadata and make it more structured; we have some ideas around that. Harshal did not talk about the extractor concept: inside the source-to-sink pipeline, the extractor is basically a pluggable thing that allows you to extract more metadata than the default metadata extracted by the out-of-the-box source.
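In outline, the idea is something like this hypothetical sketch; the hook's actual name and signature in the framework may differ:

```python
# Hypothetical pluggable extractor; names and shapes are assumptions.
from typing import Iterable


class CustomPropertiesExtractor:
    """Turns raw source records into metadata events, promoting extra
    metadata (e.g. rich table properties) that a default extractor skips."""

    def get_records(self, workunit: dict) -> Iterable[dict]:
        event = dict(workunit)  # start from the default metadata
        # Promote metadata squirreled away in table properties into
        # structured fields on the event.
        props = workunit.get("table_properties", {})
        if "owner_email" in props:
            event["owner"] = props["owner_email"]
        yield event
```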
C: So those are ways in which you can customize the extraction of metadata on top of the default pipeline. (Cool, thanks, will do.) Awesome, we are at time, so I will let you all go. Thanks to the EU folks for staying up late for this, and we'll see you in four weeks.