Description
Maggie Hays and Shirshanka Das from Acryl Data give an update on the DataHub Community and Project for the month of October:
- Community growth + DataHub Swag preview
- Q4 Roadmap Update
- Contributor shout-outs
- Release highlights
- Improvements to User & Group Management
- Nested field support for Hive + Trino
Join us at our next Town Hall - RSVP here: https://forms.gle/g8EpCLnohtPLLtdg6
A
Let's take a look at what the community's been up to. To date, we're nearing 600 members. This is bonkers to me; I remember when there were maybe 150 members in the community, so the fact that we grew by 200 in one month is just mind-boggling. For those of you who are new or have not joined us yet, we have office hours every Tuesday and Thursday, so we welcome you to bring any and all ideas, questions, and troubleshooting; we're there for you. And also a reminder: we're all about collaboration.
A
So if you want to help contribute, if you're looking for ideas, or you're looking for support, head over to the contribute channel and we'll get you set up. And if you're just really proud of something you've done with DataHub, I personally would love to hear about it. I geek out about this stuff, so join us in show-and-tell and let us know. The other thing is, we have a very silly but wonderful way of showing appreciation for one another: if someone's made your day or gone out of their way to help you, please say thanks by giving them a virtual taco with our Slack bot, HeyTaco.
A
Is it silly to send a virtual taco? Absolutely. But it's okay to be silly and also show thanks. Coming up in future months, we're going to start building out some redemption programs for HeyTaco: if you accumulate tacos, you can trade them in for swag and other little perks and bonuses. All you have to do is tag someone's username, tell them what you appreciate about what they did for you, add in the taco emoji, and bada bing bada boom, you've given a taco. Also, here's an exclusive peek at some upcoming swag. Is that a fanny pack? Yes, we are absolutely going to have a DataHub-branded fanny pack.
A
We are well into Q4, so we have a little bit of work to do to update our published roadmap; more to come there. But I just wanted to give a heads-up that the core DataHub team is going to be working on building out additional support within the metadata model, specifically for schema history and column-level lineage, which is a widely requested feature for us, and for data quality, specifically targeting Great Expectations while also building out a more generalized data quality metadata model. We're also going to be building out support for multiple data platform instances.
A
We're also going to be focusing on improved support for dbt, and figuring out a better way to organize its entities. Another widely requested feature from the community is better handling of stale metadata: if something is actually deleted or removed upstream or in production, it will also be removed or soft-deleted from DataHub, to minimize confusion about whether those assets are even still available for you.
The last one is Spark dataset lineage; that's on the horizon as well. And then just a general call for community support: we're still hearing a lot of interest in building out a Tableau connector, and ClickHouse is also a big one. I think they just raised something like 250 million dollars this week, so we're seeing a lot more activity there. So, Tableau and ClickHouse: if you are interested in contributing, either in the build or the design, please reach out to me and I can help facilitate the community collaboration. That's all from me. Shirshanka, I'm going to pass it over to you and we'll talk about project updates for October.
B
Awesome. So, first off, there was a lot of activity in the last couple of releases, and three new companies officially joined our community as adopters. Peloton: I think everyone knows who they are, so I'm really excited that we're all going to be getting Pelotons from them shortly.
B
All right. People might know Arun from being a long-time DataHub contributor; he was at Expedia, and it looks like DataHub jobs are a thing. So Arun, why don't you share a little bit about your journey from Expedia to Peloton?
C
I was one of the initial contributors for DataHub within Expedia Group, and we were able to get some good traction there. I recently moved to Peloton. Peloton is in a stage of rapid growth; they're trying to get a data discovery platform, and DataHub suited what they're looking for. So the community definitely helps in getting you visibility and getting your job search around; that is something I can guarantee.
B
Awesome. There are a few other companies that have added their logos. DFDS: if you don't know who they are, they're actually a really large Danish shipping company, the busiest shipping company according to Wikipedia.
B
The next one is actually a crypto company, and they're obviously much earlier in their journey as a company than DFDS or Peloton. But it's very nice to see companies at very different stages of their journey deciding that they need something like DataHub and then deciding to adopt it, so it's very nice to see the wide range of adoption of this project. That said, let's go into the release details.
B
We had 30-plus contributors from around 20-plus companies. It's sometimes hard to know from the GitHub handles whether people are from the same company or not, but I did my best, and I think we've got about 20 companies contributing to the project right now, which is great. That's substantial growth from around 10-plus last month, so we've almost doubled the number of companies sending contributions into the project, along with a lot of new contributors.
B
Thankfully, GitHub now gives a nice release-notes highlight showing who the new contributors are. So thank you to all of you for the new contributions you made this month. Hacktoberfest, I think, had something to do with it, but I would like to believe that this is actually going to be a trend in terms of contributors.
B
Shout-outs: a huge shout-out to Enrico for improving the testing infrastructure; he battled the CI system like crazy. Ben Marty has been working really hard on getting DataHub to work on M1; thanks for all your work and patience so far, and hopefully by the time the next town hall rolls around we'll actually have an M1 quickstart deployment working. Sim Bunsel, as usual, for always following up on small things and sending PRs. David Schmidt sent his first PR to meta-world, which is still under wraps but is a public GitHub repo where we're starting to store recipes and best-practice examples of how to work with DataHub. David contributed an example of how to write a custom source and run it on your end without having to fork DataHub, which was pretty cool; I recommend you check it out.
B
Remy Salman, of course, has been quietly improving the Looker and dbt sources, so I love all the contributions he's making. And back at you, Maggie, for contributing the Features overhaul page. I think it was long overdue, and I really like how nicely you've laid it out with GIFs and everything, so people have a very easy way of understanding how to use DataHub.
B
It's actually been quite fun, and I'd highly recommend joining them to geek out about metadata model designs, troubleshooting, and pretty much anything. Kitty Danielle, for being very patient with us, as well as sending lots of interesting problems and solutions to the community. Aseem and Remy, of course: thanks for helping people out as well. And Jared Martin, a newcomer, but with lots of great questions and engagement.
B
So thanks for all the engagement from all of you. Cool. Then, moving on to details about the release and the upcoming release: 0.8.16 is out, along with the matching Python release.
B
Support for Redshift usage landed, along with support for external tables in Redshift and a few small improvements in representing Redshift types. There's a host of other improvements; you can go over the commit history and see all of them. Trino landed, along with some improvements to Hive; MongoDB got improvements in handling large document sizes; there's BigQuery lineage, which we'll have a talk on later today; and some performance improvements that we snuck in as well. I don't know how many of you have been running Looker ingestion.
B
Looker ingestion usually takes a while, because most people have a lot of dashboards and a lot of charts. We added parallelism to the Looker source, so it's now ingesting metadata much faster. Similarly, on the DataHub REST sink side, we added a max_threads config variable.
B
It currently defaults to one, just because we didn't want to randomly surprise the community with a lot of parallelism. But if you know what you're doing, you can go into the config and change max_threads to, say, 10, 20, or 30, and see a real (not exponential, but multiplicative) improvement in your ingestion throughput. In some cases we were able to get Looker ingestion times down from 50 minutes to about five minutes, so it has a huge bang for the buck, because you're basically I/O-bound calling Looker and then calling DataHub REST to get the metadata in. On the DataHub Kafka sink you obviously won't see those throughput problems, because the Kafka sink automatically batches and sends data off asynchronously; it's the DataHub REST sink that we improved. Very excited about that. Cool, moving forward.
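As a rough sketch of what enabling both of these could look like, here is a hypothetical ingestion recipe pairing the Looker source with the datahub-rest sink. The URLs, secrets, and thread count are placeholders, and option names can vary by release, so check the docs for your DataHub version:

```yaml
# Hypothetical recipe: Looker source feeding the DataHub REST sink.
# All values below are placeholders, not real endpoints or credentials.
source:
  type: looker
  config:
    base_url: https://company.looker.com       # placeholder Looker instance
    client_id: ${LOOKER_CLIENT_ID}             # from your Looker API3 key
    client_secret: ${LOOKER_CLIENT_SECRET}
sink:
  type: datahub-rest
  config:
    server: http://localhost:8080              # placeholder GMS endpoint
    max_threads: 10   # defaults to 1; raise only if your server can absorb it
```

Since the REST sink issues one request per metadata event, raising max_threads mainly helps when ingestion is I/O-bound, as described above.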
D
You can see all the users and groups that you've ingested into DataHub, as well as those that are active. In some cases you may batch-ingest users from, say, AD, and then you can actually see whether a user has logged in via SSO, using a little "active" badge, which is pretty handy. You can remove users and groups; there are a lot of cases where you're manipulating how users are ingested and changing things.
D
So we've tried to make it a little bit easier to see where you're at and then fix things up on the fly. You can remove users and groups, you can create new groups through the UI (as opposed to ingesting them from a third-party source), and you can add and remove members from groups. Of course, this is all integrated with the policy system, ownership, and so on, so you can do the entire workflow of creating a group, adding members, and then adding responsibilities to that group.
B
Yeah,
I
blocked
my
screen
accidentally
back
here,
so
so,
quick
updates
on
what
happened
on
the
ingestion
side,
nested
field
support,
landed
for
high
ventrino.
It
was
a
long-standing
community
request,
so
we're
happy
it's
available
in
zero,
eight,
sixteen
one
and
beyond,
I
would
recommend
moving
to
zero
eight.
Sixteen
two,
it
just
looks
like
how
it
looks
on
the
screen
very
simple.
B
Maybe I want to transform metadata in flight and set owners. One of the things people often run into is that someone has gone into the UI and added more owners or made some changes, but if the ingestion system runs, it can override that ownership metadata. We were thinking about how to improve the situation, and one of the things we came up with, which is very simple but can actually help quite a bit, is using DataHub itself during the transformation process to assist with the transformation. We're calling it server-assisted transformers, and what that leads to is the transformer having access to the DataHub graph while it is making the transformation happen. If you click the next animation: transformers now have access to a context object. This is coming soon.
B
It's not in the code yet, but transformers will have access to a context object which allows them to get at the graph, which lets them first see who the existing owners are and then decide what they want to do. This allows transformers to do sophisticated things like patching owners, or figuring out the new set of owners that they need to write, so they're not deleting owners that should not be deleted. Here are a couple of screenshots of what that might look like: you get a pipeline context in your transformer which has access to the graph, and then in your transformer, if you've got a configuration called semantics, you can look at the semantic you want and apply that change. It's pretty simple, but I'm very excited about it.
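Since this feature isn't in the code yet, here is a purely illustrative sketch, in plain Python rather than the real DataHub API, of the patch-versus-overwrite decision a server-assisted transformer could make. The function name `merge_owners` and the two semantics labels are hypothetical:

```python
def merge_owners(existing, incoming, semantics="PATCH"):
    """Combine owners already on the entity (as read from the DataHub graph)
    with owners supplied by the ingestion recipe.

    semantics="OVERWRITE" reproduces today's behavior: the recipe wins.
    semantics="PATCH" keeps UI-added owners and only appends new ones.
    """
    if semantics == "OVERWRITE":
        # Ignore what the server has; write exactly what the recipe says.
        return list(incoming)
    # PATCH: start from the owners the server already knows about,
    # then append recipe owners that are not already present.
    merged = list(existing)
    for owner in incoming:
        if owner not in merged:
            merged.append(owner)
    return merged
```

The point of the graph access is the `existing` argument: without it, a transformer only ever sees `incoming` and has no choice but to overwrite.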
B
Cool. Next is an interesting thing we did, which is to use DataHub itself to represent the metadata model. This has been a common ask: people have looked at the DataHub metadata model, which is scattered across the code base in many, many PDL files, and it can often be challenging to understand how everything relates to everything. So we built a bit of an automated system that can process all of these PDL and Avro files and produce a metadata model, and we used dot to render it.
B
Obviously it looks kind of messy and complicated, like real metadata models do, but it's a very easy way to look at how everything relates to everything, and at all the relationships that have already been created by the metadata model. If you actually want to browse it live, you can click on the link right here, which is datahubproject.io, and then navigate to the entities page. So Maggie, do you mind clicking on that?
B
You'll be able to see pretty much all of the entities we have modeled, represented just as datasets for now. If you, for example, click into the dataset entity, you'll be able to see the entire schema of the dataset model, including the URN and the dataset key. It's all modeled as structs, so you can expand them out and look at them in all their glory.
B
So Maggie, if you don't mind, take a look at ownership.
B
Oh, there it is: ownership. If you expand out owners, you'll be able to see that the relationships that exist between datasets and owners are actually modeled as foreign keys. If you click on that foreign key link, you'll see that it's a link to CorpUser, and that there's another foreign key relationship to CorpGroup. All of this is generated automatically off of the metadata model itself, so I'm very excited that we're able to automatically generate a bunch of stuff like this.
B
If you scroll down further, you'll also see the time-series aspects we're able to capture; they're marked as temporal. You can see the dataset profile aspect as well as the dataset usage statistics aspect in there. So have fun with this: it's loaded up into the demo, and we'll ship it into the code base as well, as something you can build locally to browse the metadata model on your own local DataHub.