From YouTube: Mar 19, 2021: DataHub Community Meeting (Full)
Description
Full version of the DataHub Community Meeting on Mar 19th, 2021
Welcome - 00:00
Project Updates by Shirshanka - 01:49
- 0.7.0 Release
- Project Roadmap
Demo Time: Themes and Tags in the React App! by Gabe Lyons - 09:36
Use-Case: DataHub at Wolt by Fredrik and Matti-Pekka - 19:28
Poll Time: Observability Mocks! - 37:08
General Q&A from sign-up sheet, Slack, and participants - 51:12
Shirshanka: Can everyone see my screen all right? Great. I had an amazing talk this week at DataOps Unleashed, where I presented for 30 minutes, and at the end of the 30 minutes I found out that I was speaking to an audience that wasn't there. So I'm starting to learn how to ask for feedback before starting. All right.
Welcome, everyone, to the, I guess, the third community meeting of the year, and as you notice, things are... We are going to go through a pretty packed agenda. We have project updates, a demo... Oh, that's showing up. Okay.
Awesome, all right. So we're going to go over the project updates, go through a quick demo of themes and tags, a case study of DataHub at Wolt, and then a quick poll with the community on some observability mocks that we've been working on, and then Q&A based on questions that have come in and questions that come up during the session.
I think DataHub has always been an awesome project, but we never really put it together to explain to people what it's all about, so I'm really happy with how this turned out. We've also done some interesting work in creating a live demo environment, and Dexter did a lot of this work, so thanks for that, Dexter. We have essentially a job that picks up the latest bits from master, from the main branch on GitHub, and deploys it out to a demo environment. So you can go in there and check it out. Does anyone know if recording is on? Because otherwise I can just turn it on.
It's on? It's on, okay, cool. So we've got the demo environment going; check it out. It gets refreshed every morning, but we also have the ability to push a button and deploy it at any point. Since the last time we talked, I think, five...
Since the last time we've spoken, I think five-plus new POCs are being done at different companies, so this is great news. We published the roadmap: we changed from kind of a visionary, two-year roadmap to a very targeted six-month roadmap. Go take a look at it. It's again on datahubproject.io; scroll down to the bottom and you'll see a link for the roadmap. It includes pretty much everything that the community had asked for from the developers.
The big things are going to be the no-code metadata model changes. It's going to be a big, hard project, and we're going to spend a fair bit of time working with the community on getting it out, and there's a bunch of features that I won't get into in detail here. And big news: we have a new release. Finally! It's been a long time coming, about three months since we last put out a release. It's 0.7.0, and there are a ton of new commits in it, about 200.
I promise we won't go this long before creating a release the next time, and as with most major releases, you'll probably see a minor release coming out in the next week or so as we catch up on any small bugs or small feature edits that we missed.
I was looking at the number of contributors, and it was quite nice to see that we have 24 unique contributors to the project for this time period. That gives us a lot of happiness, because it means that the project is actually getting the kind of community of not just adopters and users, but also people contributing changes back. So this is great, and we want to see more of that. All right, so what was in the release?
The first and most important thing was, of course, the new React application. If you haven't been paying attention, we completely rewrote the application from the ground up using React. This will help us move to a more modern stack and allow the community to give us many more contributions.
Another big thing was the metadata ingestion framework. We were always amazing at getting data in through Kafka, but a lot of people were not quite sure how to get all this metadata into Kafka in the first place. So we built a metadata ingestion framework in Python. It's actually some of the most beautiful Python code I've seen, so please go check it out. I love what Harshal has done with it, and a lot of folks are giving contributions back, so this is amazing. We already have quite a few sources: Athena and Druid were contributed very recently, and I know that dbt is cooking, so I'm super excited about that.
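For a sense of how the ingestion framework is driven, a small recipe file declares a source and a sink. The exact config keys vary by connector and version, so treat the values below as an illustrative sketch rather than an exact configuration:

```yaml
# Illustrative recipe: pull metadata from Athena and push it to DataHub's REST endpoint.
source:
  type: athena
  config:
    aws_region: us-east-1   # illustrative values
    work_group: primary
sink:
  type: datahub-rest
  config:
    server: http://localhost:8080
```

A recipe like this would typically be run with the framework's CLI, e.g. `datahub ingest -c recipe.yml`.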
On product features, the big enterprisey one was SSO. A lot of people have been asking for this. John put in a ton of effort and got the SSO integration done on the React app. He tested it out with OIDC, and unfortunately John is not here today, but we're going to do a deeper dive on this at a future date.
We had contributions from Expedia on the ML model ecosystem, so machine learning models: they're actually using it to store metadata about all of their models, and hopefully it'll become kind of a thing and we'll see DataHub becoming the AI metadata store of choice. We're also getting contributions on the data flow and data job ecosystem, and that's something the Wolt folks have been doing; we'll hear more from them later. And the big breaking change is that we have finally shed our Elasticsearch 5 ghost and we have moved to Elasticsearch 7. The LinkedIn team has actually been moving on that, and John has contributed one of the scripts that helped them migrate from Elasticsearch 5 to 7. So hopefully you don't need to do these big migrations, but it's better to get them done early, before we start getting into a lot more exciting features. So get on with it, and hopefully we'll see you all on 0.7 soon. Awesome.
I did put in a couple of slides on SSO. We're not going to get into detail, but we've checked it and tested it out with Google SSO and Okta, and there are documents on the hosted docs, so go read up on them, and we will do an SSO office hours next week.
Gabe: Thanks for sharing; it's really exciting hearing about everything that's included in this release, a lot of really, really cool stuff. So I'm going to dive into two of the things that are included in the release in the React app; they're available to use now, and they've been merged in.
So I'm going to give you a brief overview of how the features work and show you how you can get started using both tags and themes. Although I am also on the West Coast, I actually wake up around this time to go biking quite often, so it feels very natural to be awake right now.
Can everyone see this? Good. So both of these features, we know people have been requesting from the community for a long time, and I'm really excited to have finished them up. For tags, I want to give a special shout-out to Fredrik and Matti-Pekka from Wolt for helping clarify the spec and the needs for this, and for driving that RFC. It's really awesome collaborating with both of you.
Tags will have shared definitions, so that when you apply these labels to disparate entities, you can be sure that everyone knows they're talking about the same thing. They can be applied at the entity level, but for datasets you can also apply them at the schema field level. In addition, we index these tags in Elasticsearch, so that if you apply a tag to a dataset, you can then recover it by searching for the tag or by filtering on that tag. I'm going to give you a brief overview of how you can use that now.
So if we go over to my DataHub here, I'm at an airport traffic dataset, and you can see that I've already applied a tag to this entity. This tag could have been applied either by the ingestion pipeline or from the UI, and this is showing a direction that we want to... actually, let me see, there's some typing issue out here.
I don't know if we need that anyway. So this tag can be applied from the ingestion pipeline or from the UI, and this is showing the direction that we want to take DataHub moving forward, which is that DataHub is going to start becoming more of an interactive surface as well as a read-only surface. If you look at the MCE example JSON file in the metadata-ingestion directory, you'll see some example MCE events that are ingesting tags, but a much easier and more intuitive way is just applying them through the UI.
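For reference, the tag-bearing part of such an MCE boils down to a GlobalTags aspect on the entity's snapshot. A minimal sketch (the tag URN here is made up for illustration) looks like:

```json
{
  "com.linkedin.common.GlobalTags": {
    "tags": [
      { "tag": "urn:li:tag:Legacy" }
    ]
  }
}
```

Ingesting an MCE with this aspect produces the same result as adding the tag by hand in the UI.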
So you can see, here's this tag, "Legacy," that we've applied to the airport traffic dataset; we're saying, you know, this is a legacy dataset. But say I wanted to indicate that this dataset also needed an owner assigned to it. I could go into this add-tag flow and type in "owner" or "ownership," and this is going to search our repository of tags, and in the typeahead any tag related to ownership would actually come up right here.
Is Siri talking to me? Since no tag exists, we can go ahead and create one. So we can say, okay, this guy is going to need ownership. I seem to be having trouble hearing. We'll create this tag and give it a description, and when we create this tag, it's going to be generated as its own entity, so that if we wanted to reference this tag from other entities, we'd be able to say this element needs ownership.
Once I've created that, it appears on the dataset, and if I click on it, we get brought to the tag's page. We can see who created it, read that description again, and then see statistics on how many other entities this tag has been applied to. And when I click on this, it actually brings us to a search filtering for datasets that have this tag, so you can see that it's already been indexed: we can already start searching for it, and we can use tags to filter.
And, as I said, you can also apply tags at the schema level. So if we go into our schema and into the tags column, a little "Add tag" button will pop up. We add that, and we get practically the same flow. So if we want to say that this field needs better documentation, I can add this "Needs Documentation" tag, which had already been created on another dataset; it's discoverable here. We click on that and add it, and there, the tag has been applied to the schema.
So the second feature that I was going to demo is themes. We know that this has also been something the community has been requesting for a while, and this lets you customize your DataHub instance so that it has a little more of the look and feel that you prefer, or you could customize it to look a little more like your internal organization's themes. Things that we allow you to customize are styling, like background color, line color, font color, the font, things like that.
You can also customize assets, so you can insert a personal logo, and you can customize things like a welcome message as well, and we're going to start expanding what other things are customizable over time. Saxo Bank is currently contributing a change that will let you customize menu items using the same configuration.
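The customizations Gabe describes live in a single theme config file. The keys below are only illustrative of the kind of settings he mentions (colors, a logo asset, a welcome message), not an exact schema:

```json
{
  "styles": {
    "primary-color": "#FF5A5F",
    "background-color": "#FFFFFF"
  },
  "assets": {
    "logoUrl": "/assets/my-org-logo.png"
  },
  "content": {
    "title": "Welcome back to DataHub"
  }
}
```

Pointing the React app at a config like this is how the stock dark and light themes, and custom ones, are produced.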
I think dark and light are a little predictable and easy, but just to show you the flexibility that themes allow, I went ahead and created a theme of my own. I used to work at Airbnb, and there I made contributions to Airbnb's internal version of DataHub, which we call Dataportal, and I wanted to see: okay, can I create a DataHub theme that would fit in at Airbnb? And again, this is just using the same theme config that created the dark theme and the light theme, and the instructions are linked here.
If I search for an element like airport traffic, you can see the banner is also able to be styled, and labels get their own highlights as well. So this hopefully gives you a sense of the flexibility that theming allows. Have fun, go check it out, try a theme of your own, and again, look forward to chatting about this in the Slack channel.
Shirshanka: Wow, that was amazing, Gabe. Are you planning to replace Airbnb's Dataportal with this?
Awesome, that looked amazing. Cool, so a couple of things: the tags are also up on the demo site. Please go there and add some tags; be gentle, we have not added bad-word filters or anything like that; this is a friendly community. Every morning we are going to replace the tags with the fresh, default set of tags, so you're going to see your tags get wiped away.
So please don't get too attached to them on the demo website, but feel free to go in and play around. And for themes, I think we just have the default light theme up on the website, but yeah.
I think one of the great things that happened with the React office hours that we ran a couple of weeks ago was that we had a lot of conversations about React, but then also a lot of conversations about other things about DataHub. So we're actually thinking about how to run community office hours where people can just drop in and talk about anything data, so stay tuned for that as well.
Fredrik, do you want me to present the slides?
Fredrik: All right, yes. Thank you for the opportunity to come and share some of our learnings from working with DataHub. My name is Fredrik, and I'm joined here by my colleague Matti-Pekka; we're from a company called Wolt.
Wolt is a technology company that operates a food ordering and food delivery platform, not unlike DoorDash, for example. We were founded in 2014 in Helsinki, Finland, and we currently operate in 23 countries and over 150 cities: mostly the Nordics, the Baltics, Eastern Europe, some Mediterranean countries, Asian countries like Japan, for example, and lately we've also entered Germany.
Matti-Pekka: All right, thank you. Hello, I am Matti-Pekka. I met some of you in the last town hall. To give you a bit of context on the data pipeline we are running: we have now been moving most of our basic ETL workloads over to Kafka, so we have Kafka connectors on top of our operational databases.
Then we use Kafka to store and distribute the data, and we have an in-house-developed streaming framework, kind of a model repository, ingesting the data and then uploading it to Snowflake. Alongside that, we utilize Airflow heavily, especially for interacting with external APIs and third-party systems, but also for other of our internal, more batch-based systems. Snowflake is our main data lake slash warehouse, kind of an in-between model there.
From DataHub's perspective, we have this luxury of storing only quite well-structured data. We don't really have a data lake where we dump everything; we do some sensible pre-processing of the data before we land it in Snowflake, so it's in a usable state with our other datasets. To make sense of all of this, we started last year to evaluate different open-source and other options for managing all of the metadata we have, as we've seen previously in these presentations.
Shirshanka: Matti, I had one question, which I can of course ask: the circle that says Metamorphosis, is that the Confluent term, or is that your own system?
Matti-Pekka: That's our own system. In it we have these SQLAlchemy-model-based stream-parsing classes that we can register automatically as soon as we create them, and then start consuming the data from Kafka.
Shirshanka: Great. I actually didn't know that the term was overloaded at the moment. Got it.
Matti-Pekka: So we created this service and SDK called Celeste for our internal use, and this is the main way we want to interact with DataHub, and also to expose the whole storing-and-interacting part of the metadata service to our internal developer users, like engineers who want to publish their metadata in DataHub. The main benefit we saw from this approach was the easier integration with DataHub, especially when we started working on this.
E
There
were
no
existing
framing
works
like
currently,
so
you
need
to
parse
the
kafka
messages
and
the
overall
schemas
by
hand.
So
this
helps
then
the
end
user
a
bit
and
also
allows
us
not
to
have
kafka
dependencies
on
each
of
the
project
that
wants
to
actually
use
this,
and
then
this
allows
us
to
reduce
the
complexity
of
the
metadata
models.
The service itself is simply a REST API written in Python, which accepts simple POST requests and then, based on the payload, sends updates to Kafka to be processed by the MCE consumer.
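Celeste is Wolt-internal, so its real API isn't public; as a hedged sketch of the shape described here (a Python client that builds a simplified dataset payload and POSTs it to the metadata service, which relays it to Kafka), the endpoint, field names, and helper below are all illustrative assumptions:

```python
import json

def build_dataset_payload(platform, name, fields):
    """Build a simplified dataset description with schema fields inlined.

    Hypothetical payload shape: Celeste's actual format is internal to Wolt.
    """
    return {
        "entity": "dataset",
        "platform": platform,
        "name": name,
        "schema": [{"field": f, "type": t} for f, t in fields],
    }

def submit(payload, post=None):
    """Serialize and send the payload to the metadata service.

    The service would forward it to Kafka for the MCE consumer. `post` is
    injectable so the sketch can be exercised offline; real usage might be
    something like: post = lambda body: requests.post(url, data=body).
    """
    body = json.dumps(payload)
    if post is None:
        return body  # offline mode: just return the serialized request body
    return post(body)

payload = build_dataset_payload(
    "snowflake", "orders", [("id", "NUMBER"), ("ts", "TIMESTAMP")]
)
```

The point of the indirection is that producing teams never touch Kafka directly; they only build a small dict and hand it to the service.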
But for our end users, say our business users, the UI would of course be the DataHub UI. This service is for storing the data and, for example, fetching downstream lineage information for some entity. Here's a small example of how it looks in practice: we have a simple Python SDK that you can then use in, for example, the Metamorphosis project I mentioned.
We simply run a small Python scraper script that iterates over the models, creates the required payloads, and then sends them to the service. For example, it looks like this: we have a simplified dataset format that includes the schema information directly, instead of having the schema fields as different entities, which might in some cases require intermediate entities to link them together.
Shirshanka: I think we've been making progress on making the Python SDK easier and easier to use, and this is probably something that we will motivate Matti to contribute back as well. It's amazing, right, to have this easy way to create metadata right from inside Python.
Matti-Pekka: All right, yeah, I think we can just go on.
Fredrik: Yes, thank you. Just generally, as stated already... oh yeah, sorry. So our vision for our metadata ingestion is basically to be able to cover the whole data realm: starting at the operational databases and third-party APIs, to the warehouse, Snowflake, and all the way to individual machine learning models and Looker dashboards.
And as stated, we have this schema repository that we utilize heavily in this. We can easily add metadata to that and then use our Celeste SDK to push updates, and as already stated, we've been proposing this global tags feature addition to DataHub, and now we're exploring how to utilize those most efficiently: playing with the idea of using them as sensitivity classes on, for example, schema fields or on higher-level entities. And as you saw in the diagram earlier, we're heavy users of Airflow, and we use it to transform and move data around, so support for Airflow was a must for us. That's why we participated in the addition of that feature.
Let's go to the next slide; I'll try to be quick. So one aspect of us taking a metadata store or metadata catalog into use is that we're moving towards a sort of data mesh architecture. I hope most of the people in this crowd know what that means, but basically we wanted individual teams to take more ownership of their own data production, the quality of that data, and the flow of that data, with the core team that we're part of then just providing the tools, the infrastructure, and the monitoring solutions around those pipelines, while keeping some kind of control over the whole data realm, or the data platform, and still having it in a manageable state. It's crucial for us to bring this sort of data discovery and data catalog into the mix, and the use cases for us, I mean, are pretty clear.
D
I
guess
most
of
these
are
quite
common
for
all
of
you,
but
you're,
obviously,
gonna
use
it
for
for
data
catalog
and
for
data
discovery
for
our
end
users,
compliance
use
cases,
governance,
ml,
lineage,
as
said
and
and
yeah
keeping
track
of
ownership
of
the
datasets
or
within
the
different
teams,
and
then
sort
of
one
one
thing
that
we're
probably
gonna
start
looking
into
more
more
extensively.
D
Quite
soon
is
sort
of
utilizing
the
downstream
lineage
and
sort
of
ownership
and
and
sort
of
stakeholder
relationships
for
alerting
sort
of
taking
downstream
actions.
I'm
thinking
about,
like
stopping
machine
learning
training,
runs
if,
if
we
notice
that
a
data
set
just
is
going
stale
or
things
like
this
all
right,
let's
go
to
the
last
slide.
I
think
we
have
a
maybe
a
few
minutes
so
yeah.
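The alerting idea Fredrik describes reduces to a small graph walk: given lineage edges and a dataset flagged stale, collect everything downstream (an ML training run, a dashboard) that should be alerted or paused. The entity names and edges below are made up purely for illustration:

```python
from collections import deque

# Hypothetical lineage: upstream entity -> list of downstream entities.
LINEAGE = {
    "db.orders": ["dwh.orders_clean"],
    "dwh.orders_clean": ["ml.demand_model_training", "looker.orders_dashboard"],
}

def downstream_of(entity, lineage):
    """Breadth-first walk collecting every entity downstream of `entity`."""
    seen, queue = set(), deque([entity])
    while queue:
        for child in lineage.get(queue.popleft(), []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen

# If db.orders goes stale, both the training run and the dashboard are affected,
# so a training run could be paused before it consumes stale data.
affected = downstream_of("db.orders", LINEAGE)
```

In a real deployment the edges would come from the metadata service's lineage API rather than a hard-coded dict.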
I think we're beyond the stage of a proof of concept at Wolt; we're heavily invested in this product and in the community already. The experience so far has been super great, and we see a very bright future for this product.
We're currently ingesting, or working on ingesting, the warehouse tables, and now we're extending that outwards to our Kafka topics and operational databases, and downstream as well, into Looker dashboards and so on. And maybe one of the cooler features that we want to look into is this downstream side: utilizing downstream lineage for alerts and monitoring. And then at a later phase we'll start onboarding our business users to use it as a data discovery tool.
Shirshanka: I think this is great, and we would love to get some of these user avatars as part of our team.
Cool. We have about 20 minutes left in the session, and I am going to try something interesting this time.
Which is observability, and Fredrik and Matti talked about it a little bit as well. One of the next steps, after setting up a data discovery catalog or platform or whatever you want to call it, is that people can find things, and they can look at some lineage and figure out who to talk to and which dataset is related to which dashboard.
But then often we've seen that the next question that comes up is being able to really trust that this dataset is indeed the right one for you to depend on for the rest of your life... I mean, the rest of your, you know, next project. And the question is: why would you depend on this dataset? What makes this dataset so amazing that you actually want to use it? That's really the difference between a dataset that seems like the most likely candidate for you to use versus a dataset that clearly seems like a trusted one. And we feel, and we've seen this at multiple iterations of this journey at different companies, that the operational signals coming from the actual active part of the data ecosystem give you that important signal that allows you to understand whether you should trust this dataset or not. So that's something that's also on our roadmap.
If you look, we've got integration with data quality tools and things like that coming up, and instead of working on it, finally releasing it, and then realizing that the community wanted something slightly different, we wanted to flip it around: do a few designs, do a few mocks, share them with the community, and get some feedback from you early on, so that we get the chance to actually build the product in the right way for the community.
For the Hive dataset in question, you're able to see a quick summary of what that dataset is all about. It's not just that it's a Hive dataset, but the fact that it's, like, a fact table, or maybe we want to call it a log table or something like that: it's basically an immutable set of facts being appended to the dataset over time, and there are some annotations about whether it's a daily-partitioned dataset or not.
What's the status of this dataset? When was it last updated? What were the checks that recently ran? These checks might be operational checks, like, hey, it was recently updated, it's a daily dataset, and it seems to have landed on time, so that looks good; but also that it has passed validation: there were data quality checks that were run, and all of them seem to have passed successfully, whatever your rules are, and there are no active issues.
The next, more expanded view of that same page is: you know, you might search for the dataset, so this is not the lineage way of getting to the dataset, but just the standard search, and then you click and you're into the dataset detail page. Very similar: you can see the same card below. Most people are familiar with the little array of tabs that shows up below the dataset entity page, right? And most of us are familiar with the schema tab and, you know, the ownership tab. But imagine that we put the ownership up to the right, so it's always there and available, and schema can actually stay on as a tab in this array of tabs, but we add on a summary which includes this kind of operational health.
So that's kind of what the landing page is, how we're imagining it, and then you can click into one of these tabs, let's say the events tab.
Actually, this one is, yeah, scrolling down a little bit. So you're up in the entity detail page and you scroll down from the summary into the events section, and you can see a timeline of events plotted on a graph that shows when this dataset landed. Now, for streaming datasets there might be a different view that makes sense; this is a fact-oriented warehouse table, right, so you might want to look at it in that way, you know.
Typically, you know, the standard stuff: it typically lands at 5:00 a.m., but the SLA is that it should be ready by 6 a.m. But, you know, a few days ago there was a problem and it actually showed up at 7:00 a.m. or 8 a.m., so it's late; but the good news is that the validation checks were actually run on that partition, and it was actually validated as being a correct partition of the dataset.
A
You
have
the
overall
statistics
about
the
data
set,
including
a
quick
histogram
or
a
quick
trend
line
that
shows
you
how
many
rows
are
being
added
per
day
and
whether
that
is
in
line
with
what
you
usually
get
and
things
like
that.
The
goal
again
is
to
have
a
very
simple
and
easy
way
to
understand.
If
this
data
set
is
behaving
as
expected,
because
every
data
set
is
going
to
be
different,
there
will
be
some
data.
Sets
that
are
weekly
data
sets
be
hourly
and
then
for
streaming.
A
Data
sets
it's
a
whole
new
world
right,
so
that's
entity,
detail
page.
We expect to allow customization of what you think are important events, and to allow you to plug in different producers that can say different things about the dataset, so that it's all combined in one place, and multiple tools can probably be emitting important updates about the dataset. Up on the right, you'll see this "report an issue" button, and that is meant for humans to essentially say, hey, I think there's a problem with this dataset. You can imagine how that workflow will look: you report a problem, the problem gets logged and routed to the right owner; maybe that's an integration with your ticketing provider, whether that's ServiceNow or Jira or what have you; and then that reflects back in the active issues that are going on with the dataset. So the next time someone goes and finds the dataset, they can actually see, oh, looks like there's some problem that someone else has already filed, and they can go check on how that's going. So that's the events tab. Stats is exactly what it says.
Column counts, like the width of the dataset itself; partitions; trend lines looking back one month, three months, whatever makes sense, at a row level or at a dataset level; and then, going deeper at the column level, understanding histograms for each individual column, and understanding if there are nulls being found and whether they are expected or not. That naturally leads to a data-quality-related question, but we're looking for feedback on how people would like to see this evolve, and I'll drop in a Google form.
The next tab over is validation. A lot of people in the community are using Great Expectations, so that's the one we're showing over here; we have on our roadmap integration with data quality tools like Great Expectations.
So once you're looking at a dataset, you can drop into the validation tab and it shows you all of the assertions that have been run: assertions that are running, failing, succeeding, what those assertions even are, and you're able to see at a glance how things are going. This is sort of like your CI/CD view of the dataset, right?
So I'll give you just a little bit of time to click into it, bring it up, and don't context-switch; I'll go back to the slide deck so you can see it really quick.
Okay, awkward silence done. Hopefully everyone at least has the form open. We're going to drop it on the Slack channel as well, so you can always get back to it, and obviously the slides are going to be public, so you can go back and look at them and tell us what you think. This work will be done over the next, you know, month or two, and as you can see from the roadmap, we do have quite a bit of work ahead of us, so the earlier we get feedback in, the more we can make sure that this thing is going to look just right for your team and for your company.
All right, so that's pretty much it from a scheduled-programming perspective. We are not yet announcing the next town hall, but I'll drop it in; you know, it's the standard third Friday of every month, but we'll figure out the exact timing, whether we move it by one hour; we'll get some feedback from the community around how they felt the timing was for them. For me, the coffee was great, so I'm feeling fine right now.
I took a quick look at the Q&A section, and there were a couple of questions that popped up. One was around business glossary, and the second was around moving from Confluence to DataHub. I can address the first one really quick: business glossary is something that we've been working on with the community for a while.
Now, I think Saxo Bank and Thoughtworks have been collaborating with the community on it. Recently, with the addition of the tags feature, we had a renewed discussion around it, because we had this big debate about whether tags are the same thing as business glossary terms or different things. Where we netted out, at least design-wise, was that we will keep tags similar to global tags, or hashtags.
But with glossary terms you can browse into a specific section of the taxonomy, attach it, and give it a different kind of semantic meaning. So we think the application process, like how you apply terms to datasets or to fields, is going to look very similar, but the way you manage the taxonomy itself is probably going to be much more interesting; there will be hierarchies.
One of the things on the roadmap is role-based access control, or fine-grained access control on metadata elements themselves. That will allow for tag governance and for business glossary governance, so that you can essentially demarcate owners of a certain taxonomy, and that then allows this feature to actually make sense for a lot of companies.
Okay, since we don't have the person who asked the question, I can imagine what it was about. There's probably a ton of documentation about datasets being written in wikis, and people are basically asking, hey, how do we get that documentation into DataHub, or vice versa?
One of the easiest features that we can think of is to just allow editing Markdown directly in the dataset description page, so that you can have a living document that you're maintaining about the dataset, just like we have our READMEs and other things in our GitHub repositories. That is definitely something that is on our short-term roadmap, and that should be coming; but deeper integration, maybe back and forth between Confluence and DataHub, is not something that we have planned for.
Audience member: A question: when is the office hour, the coming one, please? Because we're very interested in the SSO support.
Shirshanka: Cool, absolutely. Just watch out on the Slack channel; do not mute notifications. I don't like doing @here, so if everyone is paying attention to the Slack channel, I'll stop doing it, but a lot of times people tell me, oh, I didn't know there was a town hall. So, okay, I guess I'll have to do an @here just to remind people. So we will announce town halls, and sorry, office hours, in the Slack channel. Thank you; just pay attention to it, and we'll give a wide range of times so that people can hop in and out.
Fredrik: Yeah, yeah, sort of, and they include documentation there as well.
Shirshanka: Yeah, yeah, that would be interesting. Keep the ideas coming and keep the contributions coming. I think it's been just amazing to see how the community has taken this project and taken it to places that we ourselves are not able to; small teams, right, everyone busy with many other things. So this is amazing, and I'm really excited about dbt and Looker.