From YouTube: Metadata Use Cases at LinkedIn: DataHub Community
Description: Lightning Talk by Shirshanka Das at DataHub Community, Nov 6, 2020
Actually, what are the different things you can do with metadata, or what are the different directions we've taken in terms of use cases to build on top of metadata? I thought it might be good to tell the community how we look at metadata and what we are doing with it at LinkedIn, so it gives people an idea of where they can take this at their own companies.
So, I'm Shirshanka. I'm a principal staff software engineer at LinkedIn. I have been accused of many things, including being a godfather of three projects: LinkedIn DataHub, Apache Gobblin, and Dali. There are a lot of other projects I've worked on, but these are the three that are probably closest to my heart. And I have promised Naga that I'll keep this to five minutes, so I'm going to go forward really quick. All right.
So this is what we use to build a set of metadata services that actually talk to each other and form what I would call a metadata mesh or a metadata fabric. And on top of that is DataHub the app. This is what most people see: the application that actually enables productivity and governance use cases on top of this mesh.
We have lots and lots of types of data stores. We have streams coming out of them, and we have dumps coming out of them, into a warehouse that holds streams as well as batch data. There's a bunch of standardization and reporting, and then derived data going back into these stores, sometimes going back out into the services, sometimes going out externally into third-party APIs. It's complex. Hopefully everyone has similar problems; that's why we are here.
So the first use case, of course, is search and discovery. That's the bread and butter of what we do, and everyone gets it. We take the metadata platform, connect the entire ecosystem to it, and then we build an app on top of this platform that gives us search and discovery. Everyone understands what that looks like: you get search, you get faceted browse across a bunch of different kinds of entities, and then you can explore relationships among them.
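To make "faceted browse across entities" concrete, here is a minimal in-memory sketch; the entity fields and the `facet_search` helper are invented for illustration, not DataHub's actual API (real deployments back this with a search index):

```python
from collections import Counter

# Hypothetical entity records; a real system indexes these in a search engine.
ENTITIES = [
    {"name": "pageviews", "type": "dataset", "platform": "kafka", "owner": "ads"},
    {"name": "members", "type": "dataset", "platform": "mysql", "owner": "growth"},
    {"name": "ctr_model", "type": "ml_model", "platform": "mysql", "owner": "ads"},
]

def facet_search(query, facets=None):
    """Filter entities by name substring and facet values, then count buckets."""
    facets = facets or {}
    hits = [e for e in ENTITIES
            if query in e["name"]
            and all(e.get(k) == v for k, v in facets.items())]
    counts = {f: Counter(e[f] for e in hits) for f in ("type", "platform", "owner")}
    return hits, counts

hits, counts = facet_search("", facets={"owner": "ads"})
print([e["name"] for e in hits])  # everything owned by the "ads" team
print(counts["type"])             # remaining facet counts by entity type
```

The facet counts are what drive the sidebar checkboxes in a typical discovery UI: each filter you apply narrows the hit set and recomputes the buckets.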
Then something interesting happened. The AI team started saying: oh, we need a bunch of things. We need explainability, we need reproducibility, we've got all this model training going on and we need a place to store this stuff. And I said: well, we've built this thing called GMA. It allows you to store metadata; you might want to use it. And they actually went all in on it. So we have the metadata platform extending to support metadata around experiments, metadata around model training, and metadata around features.
A
There
are
a
few
other
concepts
that
have
shown
up
and
the
goals
have
always
been
reproducibility,
auditability
visibility
and
then
consistency
of
concepts,
and
also
this
thing
where
you
know
you
want
things
to
be
integrated
with
the
dev
workflow.
If
I'm
checking
in
my
stuff
in
git,
my
metadata
should
be
right
there
with
it.
So
we've
added
a
bunch
of
new
concepts,
like
you
know,
what's
the
problem
statement
that
the
ai
team
that
the
the
experiment
is
about
what
are
the
pipeline
and
run
infos?
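One way to picture those concepts (problem statement, pipeline, run info) living next to the code is a small record serialized to JSON and checked into Git alongside the experiment. The field names below are purely illustrative, not GMA's actual schema:

```python
import json
from dataclasses import dataclass, field, asdict

@dataclass
class RunInfo:
    run_id: str
    git_commit: str            # ties the run to the exact code version
    metrics: dict = field(default_factory=dict)

@dataclass
class ExperimentMetadata:
    problem_statement: str     # what the experiment is about
    pipeline: str              # which training pipeline produced it
    runs: list = field(default_factory=list)

exp = ExperimentMetadata(
    problem_statement="Improve feed ranking CTR",
    pipeline="feed-ranking-train-v2",
    runs=[RunInfo("run-001", "abc1234", {"auc": 0.81})],
)
print(json.dumps(asdict(exp), indent=2))  # lands in the repo next to the code
```

Because the record carries the Git commit, anyone can rebuild the exact training setup later, which is what reproducibility and auditability demand.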
The other big thing that we've done is compliant data management. We've got a ton of data, obviously, on-prem and in the cloud, 100-plus petabytes of it and growing every day, and we have a ton of APIs, some internal, some external, that we're integrating. And we had similar problems: hey, do we know where all the data is? What is the compliance status of those things? And we have retention policies and fine-grained data deletion, stuff that we need to do.
So: can I attach purge policies to every single dataset that I need to care about, to every single API that we need to be exchanging data with? We've managed to do that. The single metadata platform has tentacles into every single entity in this ecosystem, and you can attach tags as well as policies to these things, and then we use this.
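As a toy sketch of what "attach tags as well as policies" can look like, here is a minimal metadata store keyed by entity URN; the `MetadataStore` class and aspect names are made up for illustration (LinkedIn's platform models these as versioned aspects on entities):

```python
class MetadataStore:
    """Minimal metadata store: entity URN -> named aspects."""
    def __init__(self):
        self._aspects = {}

    def attach(self, urn, aspect_name, value):
        self._aspects.setdefault(urn, {})[aspect_name] = value

    def get(self, urn, aspect_name):
        return self._aspects.get(urn, {}).get(aspect_name)

store = MetadataStore()
urn = "urn:li:dataset:(hdfs,member_profiles)"
store.attach(urn, "tags", ["pii", "member-data"])
store.attach(urn, "purgePolicy", {"type": "auto-purge", "retention_days": 180})

print(store.get(urn, "purgePolicy"))
# {'type': 'auto-purge', 'retention_days': 180}
```

Once every dataset and API carries a policy aspect like this, downstream systems can enforce retention and deletion mechanically instead of by tribal knowledge.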
This big thing called Apache Gobblin, which is our massive data ingestion and lifecycle management system, does all of these things as operators on top of the metadata. So you can ingest data from external APIs, and you can then manage its lifecycle inside: you can do limited retention, you can do fine-grained data deletion, stuff like the right to be forgotten, and you can automatically create obfuscated data to create PII-free zones. And then, finally, you can actually export and manage this data in external APIs using this metadata. So if I've got some data in Salesforce, I can export that data there, or to Dynamics, export it there, and then, when the customer deletes their data or wants their data to be deleted, we can actually fire off deletes against that external endpoint.
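That export-and-delete flow can be sketched as follows: the metadata records which external endpoints hold a copy, so a deletion request fans out to all of them. Names like `EXPORTS`, `fire_delete`, and `propagate_deletion` are invented for this sketch:

```python
# Hypothetical registry: dataset -> external endpoints it has been exported to.
EXPORTS = {
    "crm_contacts": ["salesforce", "dynamics"],
}

def fire_delete(endpoint, member_id):
    # Stand-in for a real delete call against the external system's API.
    return f"DELETE {member_id} @ {endpoint}"

def propagate_deletion(dataset, member_id):
    """Fan a member's deletion request out to every endpoint that got a copy."""
    return [fire_delete(ep, member_id) for ep in EXPORTS.get(dataset, [])]

print(propagate_deletion("crm_contacts", "member-42"))
# ['DELETE member-42 @ salesforce', 'DELETE member-42 @ dynamics']
```

The key point is that the fan-out is driven entirely by metadata: if the export registry is complete, no copy of the member's data is missed.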
The other interesting thing that happened was governance workflows. So we've got the metadata platform, but metadata is changing. Along with the core metadata, the schemas, and the compliance tags, there are logs: when data is getting ingested, when access is being granted. And we had a few scenarios that we wanted to stay on top of: ownership of an asset must be locked down and be good, schema changes must be sane, deletion must happen in time.
A
Access
must
be
granted
in
accordance
with
our
policies,
and
so
what
we
did
was
we
took
the
mediator
platform,
it's
stream
first
and
supports
batch
integration,
so
you
can
do
change
processing
on
it.
So
you
can
write
a
deletion
monitor.
You
can
write
a
schema
change,
monitor.
You
can
write
an
ownership
monitor.
You
can
write
an
access
monitor
and,
as
these
things
change,
you
can
assert
on
these
things
that
you
want
to
keep
happening
in
your
ecosystem
and
those
things
can
then
fire
off
issues
or
alerts
or
actually
lock
down
data
sets.
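In spirit, each such monitor is just a consumer of the metadata change stream that asserts an invariant and raises an alert when it breaks. A minimal sketch, where the event shape and handler names are assumptions rather than the real API:

```python
ALERTS = []

def ownership_monitor(event):
    """Invariant: every dataset keeps at least one owner."""
    if event["aspect"] == "ownership" and not event["new_value"]["owners"]:
        ALERTS.append(f"{event['urn']}: ownership dropped to zero")

def schema_change_monitor(event):
    """Invariant: schema changes must not silently remove fields."""
    if event["aspect"] == "schema":
        removed = set(event["old_value"]["fields"]) - set(event["new_value"]["fields"])
        if removed:
            ALERTS.append(f"{event['urn']}: fields removed {sorted(removed)}")

def process(stream):
    # In production this would be a stream-processing job over change events.
    for event in stream:
        ownership_monitor(event)
        schema_change_monitor(event)

process([
    {"urn": "urn:li:dataset:tracking", "aspect": "ownership",
     "old_value": {"owners": ["alice"]}, "new_value": {"owners": []}},
    {"urn": "urn:li:dataset:profiles", "aspect": "schema",
     "old_value": {"fields": ["id", "email"]}, "new_value": {"fields": ["id"]}},
])
print(ALERTS)
```

Because the platform is stream-first, the same pattern covers deletion and access monitors too: each one subscribes to the relevant aspect's changes and either alerts or triggers a lockdown action.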
A
Orc
chart
that's
been
pretty
exciting
as
well,
so
I
talked
about
three
things:
search
and
discovery,
ai
model,
reproducibility
feature
reproducibility
compliant
data
management
and
governance
workflows,
and
there
are
actually
a
few
and
I
just
ran
out
of
time,
so
I'm
not
going
to
get
into
them
data
quality
operations,
monitoring
and
we're
really
just
getting
started
with
what
we
can
do
for
the
whole
company
using
this
one
metadata
platform.