From YouTube: June 25 2021: DataHub Community Meeting (Full)
Description
Full version of the DataHub Community Meeting on June 25th 2021
00:00 Welcome
01:30 Project Updates by Shirshanka
- Release notes
- RBAC update
- Roadmap for H2 2021
19:01 Demo: Table Popularity powered by Query Activity by Harshal Sheth
34:14 Case Study: Business Glossary in production at Saxo Bank by Sheetal Pratik (Saxo Bank), Madhu Podila (ThoughtWorks)
50:00 Developer Session: Simplified Deployment for DataHub by John Joyce, Gabe Lyons
1:00:00 Closing Remarks
A: All right, looks like we have all our speakers, so welcome everyone to the June edition. This will be our last town hall for this quarter. I hope everyone is starting off a good summer, or winter, depending on where you are. We have a packed agenda as usual. We'll quickly go through project and community updates; a bunch of stuff happened in June. John is going to give a quick update on RBAC, and I'll do a preview of the roadmap for the rest of 2021, which I know was a big ask from a bunch of folks in the community. Then we have three big blocks of talks today.
The first is recent work by Harshal on adding query-activity-based popularity to datasets and tables, essentially anything that can be derived from usage logs. Then the Saxo Bank team and ThoughtWorks are presenting how they have implemented the business glossary using DataHub and what they're doing with it at Saxo Bank; there's some very interesting stuff there around schemas and protobufs. Finally, John and Gabe are going to walk us through improvements they've made to DataHub deployments, both single-node and multi-node.
On the community update: Acryl Data has launched. We are the company driving the open source project forward, collaborating with LinkedIn and all of you. There have been some press articles; TechCrunch and a few other outlets covered us, so do read about us. More importantly, we are hiring exceptional engineers and community builders, so tell your friends and anyone excited about joining this community and building the world's best metadata platform together. We're looking forward to the ride and the journey ahead.
The engagement on Slack continues to blow me away, both in terms of the quality of people coming in, the kinds of discussions we're having, and the responsiveness of the community. So thank you to everyone. It's not just us; I know there are so many other people who help out when questions come up, whether they are design questions or troubleshooting questions. Thanks for making this one of the most vibrant communities I've seen, and let's keep the bar that high. All right, let's move into more specific project updates. I wanted to hand the floor over to Young to talk about some plans LinkedIn has for the project in terms of Ember.
B: Thank you, Shirshanka. For those I haven't met, my name is Young; I lead our metadata and reporting teams at LinkedIn, so I'm responsible for DataHub at LinkedIn. I want to make a quick announcement. I know there's been a lot of activity on the React client going forward, and on the LinkedIn side we also want to announce that we're going to deprecate the Ember code base. We want to put out a call to action to the community: if any folks are still using it, please reach out to us. We'll start a Slack thread to see if anybody is still using it and help folks migrate over to the React side, as we'll be doing the same on the LinkedIn side and contributing as well. The tentative timeline is probably the next month or so. If we don't hear from a lot of folks, we'll probably do it a little sooner, but we do want to give people enough time in case they are still relying on it.
A: Cool, that's awesome. I know there are so many amazing features that have been built on the Ember side at LinkedIn, and it's always been a bit of a challenge to build those same features in the React app, so I can't wait to join forces and build them together in the future. This is going to be great.
Talking about releases: I held off on minting the release just before the town hall because there are one or two commits still coming in, but it's going to be called 0.8.4. It is not a backwards-incompatible release, so we're staying with a minor release. Looking at the activity since the last town hall: after that we published 0.8.0, which was the big no-code metadata release. Since then, in the past three weeks we've had about 100 commits, so we're still keeping up our rate of 120 to 130 commits per month, which is pretty good news. There are a couple of RFCs in flight, and I wanted John to walk us through the high level of what they're about and also give a little preview of what is to come on that front.
C: Yeah, sure. Thanks, Shirshanka. We have two RFCs that are currently open. The first is about adding the ability to collect user feedback inside the DataHub UI and actually read that feedback afterwards to understand how people are using the product. That's being driven by Melinda at The New York Times, and I'm really excited to see what comes out of it; I think this is going to be a feature that can benefit a lot of companies deploying DataHub. The idea is to be able to trigger in-experience pop-ups and small micro-surveys and collect feedback on how people are using the app, why they're using it, whether they got the information they were looking for, and so on. I think this will be useful not only for operators of DataHub but also as a source of feedback for the DataHub project more broadly, helping us collect aggregate insights and drive the roadmap forward.
The second RFC is around access control. The main idea is that we want to provide a way to manage access to the metadata itself that's stored in DataHub. This is not access control for the actual underlying data assets; it's access control for the metadata that DataHub has collected and aggregated on its graph. We have a design out. We just got some feedback this morning that was super insightful and useful, so we're still working through some use cases, requirements, and issues. I'd really suggest that anyone who's interested take a look at the design and comment on that PR, and we'll try to incorporate those suggestions. All the feedback we can get here is very useful, because we understand that every company has a different set of requirements and operates in a different environment, so we want to collect that aggregate pool of requirements before really implementing this. As a heads-up, the target is to start implementation in mid-July.
A: That's great, and I'm actually really looking forward to Melinda's work on the user survey as well; it's something we've talked about internally quite a bit. So a call-out for the community: please comment on the RBAC design, make sure it addresses, or will address, the kind of complexity you probably have in your enterprise, and make sure that by the time we get into implementation we've got all the requirements locked in. Awesome.
Quick overall highlights on the releases that happened in June, broken down by the usual categorization: product and platform features, developer and operations improvements, and integrations. On the product and platform side, we're going to see a talk from Harshal later about how usage stats can really drive the product and what we can do with them.
A third one, on the API side: we now have a versioned API for metadata gets. It was honestly a little embarrassing that DataHub keeps everything nicely versioned but we didn't expose an API to get those versioned metadata pieces back, so we added it. This is going to be the foundation for how we build schema history, as well as other kinds of visualizations, in the future.
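As a rough sketch of what that looks like from a client, the snippet below fetches one version of an aspect over HTTP. The endpoint path, port, URN, aspect name, and version number are assumptions based on a default local GMS quickstart; check the REST API docs for your release before relying on them.

```python
# Hypothetical example: read a specific version of the schemaMetadata aspect
# from a locally running GMS. Adjust the URN, aspect name, and version for
# your own deployment.
import urllib.parse

import requests

GMS = "http://localhost:8080"
dataset_urn = "urn:li:dataset:(urn:li:dataPlatform:snowflake,demo.public.orders,PROD)"

resp = requests.get(
    f"{GMS}/aspects/{urllib.parse.quote(dataset_urn, safe='')}",
    params={"aspect": "schemaMetadata", "version": 2},  # version 0 points at the latest
)
resp.raise_for_status()
print(resp.json())
```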
No-code metadata has been hardening. We released it in 0.8.0, small issues keep coming in, and we're fixing them. So, John, what do you think: is it a seven out of ten, eight out of ten, ten out of ten? Where are we? How hardened is it right now?
I think there are a couple of issues still open that we're looking at, but I would give it an eight out of ten right now in terms of how good I feel about it. I want to see a few more folks saying they've tried it out and it looks amazing; we've got a few people reporting small issues. So I'm looking forward to getting more feedback: are you trying out no-code, how is it working for you, how many new entities have you added, things like that.
Moving on to developer and operations improvements: a lot of contributions have come in around hardening auth, which is great, so we've got new improvements for OIDC and improvements for Elasticsearch and Kafka. There's a GCP guide that Dexter wrote yesterday, which completes one of our commitments to the community. We already had the AWS guide and now we've got a GCP guide, so you should be able to deploy on pretty much any cloud you want, and I think we'll have an Azure guide soon. The other big thing: Neo4j was a sticking point for a lot of folks, and we were considering moving to Neptune. But as we looked at the details of how DataHub uses the API and how the graph is built, we realized that Elasticsearch can actually do just fine, especially for one-hop queries.
I know LinkedIn was about to move to LIquid when I was leaving, so that might have happened already, but on the default side we're going to say you can just run with Elasticsearch. That simplifies your deployment, and in many cases even your production installation becomes much simpler. A lot of work has also happened on hardening our Docker images. We were running on a really old base image (thanks to Grant for pointing that out), and we've done quite a bit of work; I think we're now all clear on the vulnerability side for all our Docker images. John and Gabe are going to talk about how much the single-node install has improved.
I think they might be able to run it on a Raspberry Pi now, or at least it's one of our goals for this year, so that's definitely on the list. Then, moving to integrations: a lot of work happened to integrate with Glue, and Kevin has been doing a ton of work there. We now support Glue for S3 as well, so if you've got S3 there's a nice recipe for how to integrate your S3 datasets using the Glue pathway, but we've also got support for Glue jobs. That was another thing that was asked right after we did the Airflow integration: "Hey, we've got Glue ETL jobs, can we get those ingested as well?" So we've got that done. On dbt, lots of features have been added, so we're getting better and better at covering the entire dbt graph, if you will. And finally, our first foray into ML: we've got an integration with Feast now, so for people who are considering using Feast for their feature store, this is a perfect time to try it out.
So that's pretty much it for the release highlights. The other big thing is the roadmap for the rest of the year. Some of this is carryover from things we had promised to do in the first half of the year, but a lot of it is new stuff we're taking on for the rest of the year.
I feel like we did a decent job of hitting our first-half milestones; it was a pretty ambitious plan. There were a few things we couldn't get to, like data profiling, dataset previews, and data quality integration, but we prioritized more of the foundational work, like no-code metadata and simplifying the single-node install, because we feel that gives us the right base to build from as we add a lot of new features on top of the platform. So we feel good about the trade-offs we made. Big things coming up: RBAC, as John said; the implementation of that will land very soon, and we'll start working on it in early July as mentioned. Business glossary: Saxo Bank will talk about it a bit later, but the way they have done it supports all of the modeling and all of the visualization, though it doesn't have edits yet, so we're going to add edits in Q3. Column-level lineage; data profiling and SQL-based dataset previews carrying over; and data quality integration with a few systems, not just Great Expectations but also AWS Deequ and dbt tests, since these are the systems we find in our community.
The way we envision that integration working is that you can come to DataHub and see all of the data quality rules that have run and their status, across all of these different tools. It shouldn't matter whether your tests are written in one system or another; you should be able to see them in the same way. Leading into that is another foundational improvement we're planning to make: building a metadata trigger framework. So what does that mean?
Of course, we'll build integrations with email, Slack, GitHub Actions, and things like that, which are the common ways people hook things together. I'm very excited about that: just like the ingestion framework made it super easy to get metadata into DataHub, the trigger framework will make it very easy for everyone to react to changes happening in the metadata ecosystem.
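The trigger framework itself is still roadmap, but you can already get a feel for the idea by tailing DataHub's own Kafka change stream. The sketch below is a rough illustration of that, not the planned framework; the broker address and topic name assume a default local quickstart, and real events are Avro-encoded, so a production consumer would go through the schema registry instead of printing raw bytes.

```python
# Rough sketch: react to DataHub metadata changes by consuming its Kafka stream.
from confluent_kafka import Consumer

consumer = Consumer(
    {
        "bootstrap.servers": "localhost:9092",
        "group.id": "metadata-trigger-demo",
        "auto.offset.reset": "latest",
    }
)
consumer.subscribe(["MetadataAuditEvent_v4"])  # assumed default change topic

while True:
    msg = consumer.poll(1.0)
    if msg is None or msg.error():
        continue
    # A real trigger would deserialize the Avro payload, match it against a rule
    # (e.g. "owner removed from a tier-1 dataset"), and then notify Slack,
    # send an email, or kick off a GitHub Action.
    print(f"metadata change event received: {len(msg.value())} bytes")
```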
And one final thing: there's a lot of interest in the community in adding a metrics entity.
So that's what we're committing to for the Q3 roadmap from our side. Moving on to Q4, there's a lot of stuff we actually wanted to do in Q3 that we're going to do in Q4, so if you want to accelerate this roadmap, let us know and we can partner with you on it. There's the ML ecosystem: getting features, models, and notebooks all nicely modeled and visualized in the UI. Support for operational metadata: really supporting partition metadata, completeness, freshness, those kinds of signals. And support for the data lake ecosystem.
A
So
you
know
support
for
the
common
formats
out
there,
delta
lake
iceberg,
hoodie
hive
already
supported,
so
I'm
not
including
that
here
and
then
I
don't
know
how
many
of
you
attended
data
mesh
or
metadata
day.
There's
a
lot
of
interest
in
supporting
data
mesh
oriented
features,
and
I
know
a
lot
of
our
companies
in
the
community
are
actually
implementing
data
meshes
so
being
able
to
support
those
kind
of
features
in
the
product
like
being
able
to
see
a
data
product
on
its
own,
separate
from
a
data
set
being
able
to
see
analytics
on.
how my data mesh journey is going and improving over time: what percentage of my data products are driven by high-quality datasets, or vice versa. And then finally, collaboration features: being able to share knowledge across all the data professionals, and having conversations inline in the product as well as off-platform. So those are our Q4 roadmap items.
D: Okay, I hope everyone can see this. I'm going to be talking about dataset popularity: how we use it in DataHub, and how we get it from query logs and query activity.
First off, why do we care about this? Data popularity means different things to different people. For the data platform owner, it enables them to understand what's going on within the enterprise: how is data being used, and in which systems? If you've got, say, both Snowflake and BigQuery, you may want to understand which one is actually being used by more people and is more popular. For data producers, it helps you understand how people are using the things you produce, kind of like an impact analysis within the company. It also helps you prioritize among the data assets you produce: which ones are most important, which ones are actually getting used, and which ones could use more documentation to improve usage, because it's a great dataset that you produced.
It can also help you streamline the deprecation process. Say a dataset you're trying to deprecate still gets 100 queries a week: you probably don't want to deprecate it just yet. Instead, you want to look at the popularity and usage, figure out exactly who those users are, reach out, and help them migrate to a different solution.
For data scientists, popularity and usage is a major trust signal. It helps you understand whether this is something someone put out a year ago and hasn't touched since, or something that is regularly updated and regularly used, something you can rely on given that other people are also relying on it. The other thing you can do is look at the other queries people are issuing against that dataset and figure out, say, what other tables are relevant here:
what it is commonly joined with, on which keys, and so forth. So you can determine not just whether to query that dataset but also how to. And then, helping everyone: we can use usage and popularity data to improve search rankings and improve the ordering of things in the lineage visualization and so forth. So lots of product improvements for DataHub can also come out of the usage statistics.
Let's look at what we're collecting and how we're doing it. Right now we support BigQuery and Snowflake for usage stats. For BigQuery we're using the BigQuery logs and parsing those out; for Snowflake we're using the access history and query history views, joining them together and getting our popularity and usage data that way. For each dataset we can collect per-user usage frequencies: person A is using it this much, person B is using it this much. We can also collect how they're using it and what queries they issued. There's a lot of granularity here, even down to which columns they frequently query versus which ones are not being used. And once again, we roll this up so we can also get frequent queries across all the people using a dataset.
Some might only care about 30 days of history, some might care about many years. And the last requirement is that we want to avoid re-fetching the same data from the same source system repeatedly: if we're collecting data, we only want to pull a given piece of the usage log or query log once, and not have to pull it again and again.
So,
given
this,
we
are
some
of
the
decisions
we
made.
The
first
is
we're
going
to
start
with
a
batch
based
system,
so
you
know
you
can
configure
to
run
hourly
or
daily,
whatever
you'd
like
and
we'll
pull
kind
of
the
most
recent
queries
in
history.
We have a memory-efficient algorithm for pulling this in, to keep the memory usage of ingestion from blowing up. Then we do some pre-aggregation at the per-dataset level and roll it up, so that we get frequent users of the dataset, frequent columns used, and frequent queries of the dataset. We take that information and push it through GMS into Elasticsearch, where we store these aggregate statistics, do additional aggregations on them, and then surface them in the UI as you might expect.
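As a toy illustration of the per-dataset pre-aggregation described here (not DataHub's actual implementation), the snippet below rolls raw query-log rows up into top users, top columns, and top queries before anything is emitted downstream.

```python
# Toy sketch of per-dataset pre-aggregation of a query log.
from collections import Counter, defaultdict

# each row: (dataset, user, query_text, columns_referenced)
query_log = [
    ("analytics.orders", "alice", "SELECT entity, count(*) FROM orders GROUP BY entity", ["entity"]),
    ("analytics.orders", "bob", "SELECT urn FROM orders", ["urn"]),
    ("analytics.orders", "alice", "SELECT entity FROM orders", ["entity"]),
]

users, columns, queries = defaultdict(Counter), defaultdict(Counter), defaultdict(Counter)
for dataset, user, sql, cols in query_log:
    users[dataset][user] += 1
    queries[dataset][sql] += 1
    columns[dataset].update(cols)

for dataset in users:
    print(dataset)
    print("  top users:", users[dataset].most_common(2))
    print("  top columns:", columns[dataset].most_common(2))
    print("  top queries:", queries[dataset].most_common(1))
```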
A: I guess one more interesting constraint in the design you probably had was not adding one more moving part to DataHub, like "oh, you've got to go run a Spark job or some other big data processing job to compute this stuff," right?
D: Yeah, absolutely, and that's actually a good segue into the demo. I wanted to show how BigQuery and Snowflake usage work. For the Snowflake one I'm going to show how it looks when scheduled with Airflow, because that's the common use case here: you schedule it on a daily basis. And then we'll see how it all looks in the UI.
We can start with how BigQuery usage works. Right now I have a little recipe configuration; it works the same way as most other sources. You just have a new plugin type called bigquery-usage, and you can put in the project id for BigQuery. I just have a playground instance that I'm using, and unfortunately I haven't queried it in a few days.
So what is it doing here? It's pulling the BigQuery usage logs from Google's Cloud Logging product, doing a little bit of pre-aggregation, and then dumping that into a file.
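A minimal sketch of the kind of recipe used in this demo, expressed through the DataHub Python ingestion API rather than a YAML file, is below. The project id and config keys are illustrative; check the bigquery-usage source documentation for the exact options your CLI version supports.

```python
# Hypothetical recipe: pull BigQuery usage logs, pre-aggregate, dump to a file.
from datahub.ingestion.run.pipeline import Pipeline

pipeline = Pipeline.create(
    {
        "source": {
            "type": "bigquery-usage",
            "config": {"projects": ["my-playground-project"]},  # illustrative key
        },
        # The demo writes the aggregated usage events to a local file for inspection.
        "sink": {"type": "file", "config": {"filename": "./bigquery_usage.json"}},
    }
)
pipeline.run()
pipeline.raise_from_status()
```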
There it is. You can see a couple of instances here: the general datasets that I was using, and then, if we want to, we can take a more detailed look into the actual usage data that was produced. We have emails, frequent queries, and then the fields and each of their usage counts, and we have this on a per-day (per-bucket) and per-dataset basis. Cool. Snowflake works pretty similarly.
I actually added it to our demo instance, and it's pretty straightforward. This time we're running the ingestion using direct code, because we want to do it inside Airflow, and it's remarkably similar: we first ingest Snowflake and then add Snowflake usage as a pipeline, so you get both of them at once. Once again you set your configuration, and I wanted to get a bunch of historical data.
So I set the start time manually, but beyond this I might just leave it blank and it will automatically cover the current day.
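A sketch of that "direct code" pattern, suitable for wrapping in an Airflow task, is shown below. This is not the exact DAG from the demo; the connection details, config keys, and the one-off start_time backfill are placeholders to illustrate the snowflake-then-snowflake-usage sequence.

```python
# Hypothetical Airflow-callable function: ingest Snowflake metadata, then usage.
import os

from datahub.ingestion.run.pipeline import Pipeline

SINK = {"type": "datahub-rest", "config": {"server": "http://localhost:8080"}}
SNOWFLAKE = {
    "host_port": "myaccount.snowflakecomputing.com",  # placeholder account
    "username": "datahub_reader",
    "password": os.environ.get("SNOWFLAKE_PASSWORD", ""),
}

def ingest_snowflake_with_usage():
    # 1) regular metadata: databases, schemas, tables
    Pipeline.create({"source": {"type": "snowflake", "config": SNOWFLAKE}, "sink": SINK}).run()
    # 2) usage statistics; start_time set once to backfill history, then left
    #    unset so each scheduled run covers the current day
    usage_config = {**SNOWFLAKE, "start_time": "2021-06-01T00:00:00Z"}
    Pipeline.create({"source": {"type": "snowflake-usage", "config": usage_config}, "sink": SINK}).run()
```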
Now that we've run this successfully, we can see the little green box there. We can head over to the demo instance and see where this is surfaced, in a couple of places.
The first is that we immediately see the number of queries that have been issued against this dataset; we can see that it's 78 in this time period. We can also see who has issued queries against it, so we get the top users, ordered by frequency: you can see who has queried it the most and the second most. Beyond that, we can also look per column. For the two columns here, entity and urn, we can see that the entity column has had 78 queries per month while the urn field has had only 43, so people seem to use the entity field more than the urn field, for whatever reason. And what might that reason be?
Well, we can hop over to the queries view and take a look at some recent queries that reference this table. You see your standard SELECT count(*) ... GROUP BY entity, which is where we might guess that the entity field is being used more frequently than the urn field or other fields. We can also see that people are creating other generated tables that reference this all-entities table, so we can start to understand how people are using it.
In terms of future UI improvements: we've got a bunch of time-series data for the usage statistics, and right now we're only showing something like queries per month. We could also add line charts, so that if you're expecting a certain dataset to be deprecated, you can watch the usage per day taper off as you migrate people over. And finally, we want to expand our time-series metadata piece to add mechanisms for extending it using a similar no-code approach. So, yeah.
A: Awesome, thank you, Harshal. If people have questions about how to use it, please ask. My own first question was: why are snowflake and snowflake-usage two different sources? There's actually a good reason for it: in some of these sources, the place you get usage data from is different from the place you get the metadata from, and in some cases you actually need elevated privileges to get usage data out, so it makes sense to separate those two pathways.
Once we add time-series metadata support to no-code, I think it will be pretty cool to see different kinds of systems being able to push usage metadata into DataHub. Awesome. Our next talk today is from Sheetal and Madhu, who are going to talk about the business glossary and their implementation of it in DataHub, as well as at Saxo Bank. Sheetal and Madhu, do you want to take it away?
E: Yeah, Shirshanka, I'll start quickly. Thank you, Shirshanka. A quick bit of context: Saxo Bank, in partnership with ThoughtWorks (Madhu is a data strategist from ThoughtWorks), were working together in an engagement, and we have contributed the business glossary work back to open source. We worked closely with Acryl and LinkedIn. Oh, I thought I'd already shared my screen. I'm sorry.
Okay, a brief theoretical context on the business glossary. A business glossary is a list of business terms with their definitions; it lays down the business concepts for an organization or an industry and is not specific to any one database or data store.
For us, there is a Saxo-specific glossary as well. How does it help us? Once it is out, it helps us identify the relationships between different terms. This is an example from FIBO. Eventually we want to target graphical representations, but we'll stick to tabular views in DataHub and our data workbench for now. Let me just lay out the pain point which led us to come up with this solution.
Why did we develop this? A couple of years back, when we started on this journey at Saxo Bank, I was conducting interviews across the organization to understand the pain points around data. The common problem raised by different system owners and system SMEs was data quality issues and inconsistencies, where they were spending a lot of time resolving tickets because of data flowing across systems. A quick, very common example:
I have dataset A in system A, dataset B in system B, and dataset C in system C, and a few data elements that name the same concept differently: in system A's dataset it's called 'account', in dataset B it's 'account number', and in system C it's 'account id'; they are the same thing. The ETL that flows from dataset A to B depends on a mapping sheet that has been created by systems A and B, and similarly the ETL into system C depends on mapping sheets too. If the SME leaves, or some knowledge is lost here and there and another version of the mapping sheet is created, the ETL process gets broken, validations fail, and 'account id' and 'account' are no longer consistent, which leads to a lot of issues. Now, how can we resolve this?
If you can point all of these (account name, account id, and account number) across these systems to the same business term and embed that in the schema, then, with it ingrained in the schema, the dependence on the mapping sheet goes away, as does the dependency on SMEs who may leave, and the data flowing across systems can stay consistent and correct. Quickly, how have we enabled it? Here is what we have done; this should look familiar.
This is the DataHub page where we have added tags and terms. These are the business terms which expand the metadata for data elements, and, as we'll show later, each one actually points to the FIBO URL.
It could be FIBO or anything else. Next, the design principles we have stuck to: we started with DataOps principles, which are based on communication, collaboration, integration, automation, and measurement. We believe the business glossary can be evolved while staying agile, iteratively taking care of business needs in the digitization journey. We will also show at the end, if we get time, how we wanted to make sure that technology is involved right from the start, when a business function is introduced into the organization, to enhance the metadata.
So apart from the data elements, we now also have industry-standard ontologies defined at the metadata layer. This obviously brings schema maturity, because the business terms are now engraved in the schema, and schema versioning: any change to the metadata regarding data elements, data types, or business terms will cause the schema to be versioned. It also enforces ownership on the producers, not only of the metadata but also of the business terms and their appropriateness and validity. Quickly, here is how we have actually realized the physical implementation, both for datasets and for business terms.
Our schema definitions are in protobuf, so the messages and fields that carry business terms use protobuf options to declare the ontology source they're using and its URL. With this, I'll quickly stop sharing and hand over to Madhu for the next set of slides and the demo.
F: Yes. Okay, thank you. Connecting back to where Sheetal talked about business terms: they define the business concepts and enable a common vocabulary within the organization.
I wanted to talk about how we relate datasets to business terms, which enhances the value of the elements and makes the datasets more meaningful. I've taken a simple example with a purchase order, which has elements such as id, revision number, status, employee id, vendor id, and a number of others like order line item.
Now, the vendor id can be mapped to the 'supplier identifier' business term; in another table the same concept may be called 'product supplier id' and also mapped to 'supplier identifier'. This enhances the value of the dataset, and a by-product is that if you define a certain business rule at the supplier identifier level, you can drive that business rule against all of these datasets. Beyond the association with the data, which enriches the value of the datasets, the business concepts or terms are themselves interrelated and hierarchical.
Some of them can be composed to create a new term altogether. Say we have purchase order date, value, and ship date: these are composed to create the 'purchase order' term. That kind of relationship helps you discover the datasets you're interested in and get to the right dataset.
In DataHub you already have aspects like ownership and schema metadata; now we are bringing in the business glossary with two new entities: one is the glossary node and the other is the glossary term. The glossary node is introduced to define the hierarchy of the ontology, so we can achieve a hierarchy very similar to the FIBO one.
To use an analogy: a glossary node is like a package, and you can have any number of hierarchical levels, while a glossary term is like a class definition that describes the business term. Then there is GlossaryTermInfo, which captures the definition of the glossary term and the source of the term: it can come from an internal organization or be borrowed from an external one, and you can even have a link to the external source so that people can navigate to it.
So the first thing we did was onboard these entities. Then we expanded the dataset by adding a new aspect to it, so that glossary terms can be related to the dataset; and since a dataset has schema metadata, which is an array of schema fields, we enhanced the schema field to associate business terms at the attribute level. With this you are able to attach a term to the dataset and to its schema fields, which helps the business user navigate to the dataset from the business concepts themselves.
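At the metadata layer, that association boils down to writing a glossary-terms aspect against the dataset (or a schema field). Saxo drives this from protobuf options through their GitOps pipeline; the standalone sketch below just shows the end result using the DataHub Python emitter, assuming a recent acryl-datahub client, with the dataset name, term name, and GMS address as placeholders.

```python
# Hypothetical example: attach a glossary term to a dataset via the REST emitter.
from datahub.emitter.mce_builder import make_dataset_urn, make_term_urn
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import (
    AuditStampClass,
    GlossaryTermAssociationClass,
    GlossaryTermsClass,
)

emitter = DatahubRestEmitter("http://localhost:8080")

dataset_urn = make_dataset_urn(platform="kafka", name="trading.purchase_order", env="PROD")
terms_aspect = GlossaryTermsClass(
    terms=[GlossaryTermAssociationClass(urn=make_term_urn("Common.SupplierIdentifier"))],
    auditStamp=AuditStampClass(time=0, actor="urn:li:corpuser:ingestion"),
)

emitter.emit_mcp(MetadataChangeProposalWrapper(entityUrn=dataset_urn, aspect=terms_aspect))
```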
One other thing we're currently working on in the design is that the terms themselves are related, as we saw in the previous example; beyond the hierarchy we are targeting, there are other relationships established between terms, such as 'is-a' and 'has-a' relationships. With that, I'll move to the Saxo implementation so that you have better context on how we actually implemented it.
If you look here, there is a type name, and there are attributes like account number and balance; the balance itself is of another type, balance amount. You can also see that the savings account is linked to a customer account; here we're trying to define that relationship.
F
This
is
of
a
term
or
type
customer
account
so
that
you
are
able
to
relate
things
so
that
you
can
okay,
even
though,
let's
say
example
like
in
organization,
your
independently
terms
can
evolve
over
them
and
realize
that
these
are
common.
You
can
relate
it
back
and
proto,
given
a
very
flexibility
so
that
you
can
actually
expand
the
definition
or
metadata
of
a
schema.
We
are
using
an
options
to
do
that,
and
there
are
other
cases
like
okay.
F
So
next
thing
is:
I
wanted
to
give
a
little
a
little
bro
overview
of
how
the
metadata
is.
Onboarded
saxo
is
adopted.
The
database
approach
to
the
new
data
platform,
where
domain
teams
are
response
for
building
the
data
products
and
also
annotate
about
their
metadata.
F
So
the
response
for,
like
you,
come
up
with
the
self-service
capabilities
where
users
are
can
be
declaratively
defined.
The
data
set
and
we
have
a
githubs
process
which
takes
this
thing
and
create
the
topics
in
the
kafka
and
register
schema
and
extract
the
this
metadata
templates
parse
the
files
and
pushing
into
the
linkedin
data
by
converting
into
a
snapshot
which
is
required
by
the
mc
schemas
with
that
I'll
quickly
move
to
the
demo
in
the
interest
of
time.
Here you can see what we at Saxo call the Data Workbench, a one-stop shop for data; this is its home page. Let me take you to the business glossary. We have a domain hierarchy: party domain, market domain, trading, common, and so on. Let me take a simple example: we saw the customer account example earlier. I can navigate to it either by searching or by going directly through the hierarchy.
If I go to the business term, I can see the definition of the term and its source, and I can navigate to that source, whether internal or external; here it points to FIBO, but it could be something else. You can also see the related datasets and additional properties. Under related datasets there are two datasets; you can navigate to one of them and land on the dataset home page, which is very familiar to everybody, where you have all the information.
A: We did the same thing at LinkedIn, but it's really nice to see a lot of companies doing something similar, and it's great to see these recipes emerging for how to manage schemas and metadata in Git, along with this kind of push-based architecture, to get metadata out and integrate it into a common base. So I highly encourage reaching out to them.
We will probably have similar support in the open source code base for having protobuf schemas and applying annotations to them. So do talk to them about how they've done it and try to implement similar practices at your organization; I think it's definitely a game changer.
Cool. Next up we have Gabe and John, who are going to talk about all of the hard work that has gone into simplifying the DataHub deployment, single-node as well as multi-node. John, do you want to share the deck on your end, or do you want me to walk through it?
It's just a couple of slides; do you mind just sharing them? Okay, no problem.
C: All right, thank you, Shirshanka. In the interest of time we'll keep it pretty short, to leave some time for questions at the end. In the past few weeks, Gabe and folks from Acryl have been working on simplifying the DataHub deployment.
We've heard from the community that it's very heavyweight and, surprise, we agree. So we're trying to do everything we can to simplify it along multiple dimensions, including resource consumption, overall complexity, and beyond. If you wouldn't mind going to the next slide, Sri, I'll start talking about what we've done. There are two broad buckets we improved over the last few weeks. The first one is really just the general experience of deploying DataHub for the first time.
We believe you can get a lot of value out of DataHub just from the default models, the boilerplate implementation, without changing anything. As part of that belief, we decided to let you deploy DataHub without actually having to check out the code. Previously you had to git clone, cd into the docker quickstart directory, and run that script.
Now we've changed it so that you can just install the DataHub CLI and run "datahub docker quickstart", and to ingest sample metadata you can run "datahub docker ingest-sample-data". It's really nice. What we do behind the scenes is fetch a few files dynamically: a full docker-compose file that includes all the default environment variables you need to spin everything up, as well as some other resources, and then we go ahead and deploy it.
We've updated the quickstart guide to make this the default mechanism for deploying DataHub. We've heard some feedback, fixed some issues, and I think recently things have been looking really good. Again, we're trying to push down the barrier to entry for deploying DataHub for the first time and minimize the time to value DataHub can provide. The second thing on the experience side is that we worked on improving the logging coverage.
We also set up default log-rotation settings, so all of the info, warning, and error messages go to a normal, daily-rotated file that can be found under /tmp/datahub/gms or /tmp/datahub/datahub-frontend. I think this is going to be really useful, especially for us when providing support to the community.
Now we can just ask folks to create a zipped log file and send it over when they have issues, which greatly improves our ability to help you. The second thing is that we've added a ton of debug logs. They are not part of those default logs, but we've added a short-lived, capped debug log as well that rotates every day, and that one is a lot richer.
C
It
has
information,
that's
really
specific
to
data
hubs
application,
so
we
kind
of
filter
out
the
other,
the
other
logs
and
only
log
data
hub
stuff.
So
hoping
that
will
help
us.
You
know
debug
those
critical
issues,
much
quicker
in
the
future,
just
note
that
that
pr
is
in
review.
So
if
you
guys
want
to
take
a
take
a
look
at
it,
that
would
be
greatly
appreciated.
The second area is the resource side. Previously it required roughly two CPUs, eight gigs of RAM, and two gigs of swap to deploy DataHub on Docker Desktop or with the Docker engine. We were able to get that down to one CPU, three gigabytes of RAM, and one GB of swap. So, to answer Shirshanka's earlier question: I think we can now run it on the second-tier model of the Raspberry Pi, the one with four gigs of RAM. But again, we're continuing;
this is going to be an ongoing effort, and we're trying to push that as low as we possibly can. This is something we've heard from the community: it's just too heavyweight, and it's annoying to have to go and change those Docker settings all the time and debug that. So, big progress there. The last thing is the actual container count.
I think we have a few areas we can improve, specifically on the DataHub containers themselves. We're hoping we can merge GMS and the frontend into one container in the default case, while of course still allowing you to deploy them as separate containers. One other big thing we recently did, which some of you have already discovered, is that we merged the two Kafka consumer jobs we had, the MAE consumer and the MCE consumer, into GMS.
So that's all now one deployable by default, which gets rid of two containers; we think that has been a great improvement, at least for us operating DataHub, so we're excited about it and will continue working on it. One last thing I'll call out is that the DataHub CLI quickstart is a lot more resilient than just running quickstart.sh, mainly because we've provided a wrapper that actually checks that all of the required containers are up and healthy, as well as pinging them.
I'll conclude with the little message we added there, which gives you a green "DataHub is now running" when you deploy it. It's very satisfying to see, especially if you're coming from the quickstart.sh world. With that, I'll pass it off to Gabe, who's going to talk about simplifying our persistence layer.
A: John, can I ask one question that I've heard from the community as well, and that has come up a few times? The docker-compose files that are sitting in the Git repo: are they still usable? Can I still go into them and just do docker-compose up, and will that still work?
C: Yeah, it'll still work. The big change we made there is actually splitting some of those tools I mentioned into a separate docker-compose file, so you can deploy the thin version as well as add that additional tooling if you want it, pretty flexibly. We haven't really regressed in functionality or in what we're supporting; we've just made the default much slimmer. But yes, all of those docker-compose files should still work.
C: No, we are actually still going to be publishing independent containers for the MAE consumer and the MCE consumer, and you can configure your deployment using environment variables. So you can switch off the MAE and MCE consumers inside GMS and then deploy the dedicated MAE and MCE consumers, if you are operating in an environment where you need to scale those services independently.
G: Awesome, thanks, John, for sharing all of that. I'm really excited about how much easier it is to start DataHub now, and I think folks are going to find that it's just so much easier to get things going. Shirshanka, if you go to the next slide, we'll talk about one other way we've made it even easier to deploy DataHub.
One thing we've heard from the community is that Neo4j can add extra complexity when running DataHub, and some folks want to be able to run without it. So we've provided the option to run DataHub using just Elasticsearch as the backend for our graph service. We've essentially abstracted the different graph methods into an interface, and we allow you to back it with either Neo4j or Elasticsearch. Right now DataHub only uses single-hop queries to power the frontend, so Elasticsearch and Neo4j are going to be about equally performant.
G
In
fact,
elasticsearch
might
be
slightly
more
performant.
In
some
cases,
some
folks
are
going
to
want
to
do
more
advanced
graph
queries
and
also
we
intend
to
add
more
advanced
graph
queries
to
data
hub
down.
The
line
in
that
case
for
very
large
graphs
that
are
running
very
complex
graph
queries.
Frequently,
you
may
still
want
to
use
neo4j,
however,
for
many
deployments
elasticsearch
will
continue
to
be
just
as
good
of
a
solution.
Switching just requires flipping a variable for which graph service implementation you're using: change it from neo4j to elasticsearch, disable the dependency on Neo4j in the Helm prerequisites, and you'll start using Elasticsearch as your graph backend. Soon we'll be providing migration scripts so that you can re-index your Elasticsearch graph backend with the existing data that's in Neo4j; for now, if you want those relationships, you'll just need to re-ingest the data.
And yes, we are running it that way on the demo. If you want to see how performant things are, you can go to the DataHub demo instance and you'll see that relationships load essentially instantly. I think you'll find that in many cases Elasticsearch is going to be just as good as a graph service backend, but moving forward we're going to keep supporting both of them. And if you ever want to add your own implementation for a different graph service, you can implement the graph service interface with any graph database and contribute it back; it doesn't have to be just Elasticsearch and Neo4j.
A: Awesome, thanks to both of you. I know we're at time, so just some closing remarks: we're super excited to be building with you all. DataHub is moving really fast, so hold on and hang on; things are getting more and more exciting, and we're really excited about the roadmap ahead. Do give us feedback on Slack and offline.
Try out the new features we've launched, and let's keep building. See you in another four weeks; I'll send out the announcements for when the July town hall is going to be. All right, with that, bye.