DataHub Tech Deep Dives, 27 May 2021

Previous Meeting

⏯

youtube image

►

From YouTube: No Code Metadata: May 27 2021 DataHub Community Townhall

Description

John Joyce (Acryl Data) presents a tech deep dive on the new support for no code metadata modeling in DataHub that is going to be released as part of release 0.8.0

A

Wanted to make sure that we had time to get into the no no code metadata piece from john as well. uh Sharat is on the call. So if you have any questions about kind of his deployment on gcp and how he's setting it up at confluent definitely ping him either here or on slack, but thanks a lot chad for sending this over ahead of time and being able to attend, even though you were remote all right with that, I will move things over to john.

A

You have 14 minutes to tell us about all the magic of no code.

B

Hey guys, um can everyone see my screen: yep, okay, awesome, yeah, thanks trishanka um and thanks dexter and trout. Those were these are really informative and great. uh Today, I'm going to talk a little bit about a project that the akron team has been working on in the past few weeks that we're calling no code metadata, I'm going to go through sort of the problem, we're looking to solve uh the solution.

B

We came up with and a little bit more deep technical details around that, as well as a demo fit in there a little bit as well, so with that, let's get right into it. um So what are the what's? The problem we're trying to solve with no code metadata? Well, the problem is that adding an entity to data hub today is pretty hard. So these are three pr's. They all add two entities each and you can see that the number of files that needs to change just to add them to the gms.

B

So the backend layer which we're going to focus on here today um is about 50 files right. uh So we'll talk about two things, one is. The first is just the sheer complexity of adding that entity today. It requires greater than 25 files for entity in the complex case, in the average case, it's about 25 files and those files consist of models.

B

So these are snapshots and aspects that we're all familiar with probably search documents, relationship models, these wrestling resource values, burns and I'm sure, I'm missing something here.

B

Endpoints we have entity and optionally aspect, endpoint files, which are all separate. We have clients which are just wrappers to actually talk to those wrestling resources. You've created the endpoints.

B

We have these things called data access, object factories, so each entity has its own search, local browse and graph dial, which allow you to talk to the persistence layer and as such, we need each entity to have a configured factory which is uh to create these. Like strongly typed factories, we have index filters which are effectively just lambdas that take metadata change events and turn them into updates against the search index and the graph index.

B

And then we have some configurations like the elastic mappings, json file, the settings.json file etc, and that's kind of what composes that you know. 25 file per entity so again very difficult, and this is sort of a visual representation of how complex adding an entity really is.

B

This is the actual onboarding, a new entity, doc thing visualization, so you can see 27 steps listed here in seven different subcategories, it's just very difficult, and because of all of that complexity, it takes a long time to to actually add entities based on conversations with folks who have raised pr's data entities.

B

We found that on average it would take one to two weeks for a new data hub contributor to actually get up to speed with all the abstractions and add entities, and that's not even counting the back and forth that occurs on the pr after you've added 50 files, which can span up to a month, we've seen so it's just too hard right. That's the problem we're trying to solve with the no code movement. That's where the no code movement comes into play. We started to think about.

B

How can we make the process of adding an entity much simpler? And specifically, we started to rally around this goal that it should take no more than 15 minutes to add or extend a data hub entity at the backend gms layer and what we wanted to be in scope. Are the ability to read and write the new entity using a rest api, the ability to define searchable fields that are indexed in the search index and to define sort of outward graph edges coming coming from that entity.

B

So foreign key relationships- and this is quite ambitious- we thought, but we tried it. We had some non-functional requirements. We wanted to make sure was, was present in the solution. The first one is more of a requirement based on the 15-minute primary constraint.

B

We wanted it to be declarative, which means you know you shouldn't have to write any a java code or imperative code at all, ideally or at least minimize that requirement. The second thing is we wanted this to be extensible, because we are going to hopefully move towards a more declarative world. We'll have a dsl.

B

We wanted things to be extensible horizontally, as new requirements popped up and then the third thing is: we wanted to just be very usable right, intuitive hard to make mistakes well validated up front such that runtime bugs are sort of hard to come by based on these changes, and so now I'm going to go right into a demo of what adding an entity looks like in the new no code world after the work over the last few weeks and then we'll go into sort of how it all works under the hood.

B

So I'm going to get out of the slides here and go over to this town hall, demo dock. I have here and we're going to start with just modeling an entity so we're going to imagine. We want to model a an entity representing a service right. So we've talked about this before a service is maybe like an online microservice that you want to represent in datahub, and the first thing we're going to do in the new world is we're going to actually model the aspects and aspects are just.

B

You know metadata about the service that can be updated independently and we're gonna have two aspects, we'll start with the first, which is a key aspect. So what the key aspect is is basically, um it consists of the fields that kind of uniquely identify the service.

B

Traditionally, this would have been in a strongly typed urn, but there's a new concept, we're introducing here, which is a key aspect allowing you to access those fields that uniquely identify a service or an entity as though you were accessing any other aspect right, as opposed to cracking open a strongly typed term.

B

So here we have a an aspect called service key and we come to our first kind of new piece of metadata that we're adding in the no code world, and that is an aspect annotation and this annotation basically allows us to assign a common name that has to be kind of globally unique across the data hub ecosystem to this aspect and we're calling it service key in this case, and then we have one field that we're going to use to primarily identify the service, and that is its name.

B

We have the second annotation, which is pretty interesting here on top of the name field, and that is the searchable annotation, and so what this allows us to do is mark the name field as searchful, and this allows us to sort of index that field based on these configurations.

B

So we're saying this should be a partially a search, a field that can be partially matched right. That's what we're saying here and then we're saying we also want to support autocomplete queries against this field. So, if you're searching in the search box on the ui, you should be able to see auto complete results based on the service name.

B

The second aspect- we're defining, is just a set of properties. In this case, we just have two simple properties: a description and an owner, and you can see again we have a searchable annotation on description. In this case, we don't provide any configurations, because we are okay with the default searchable configurations which will simply make this a space, delimited search index field and then the second. The second field has the next kind of annotation.

B

We want to talk about the third annotation, which is this relationship annotation, and what this basically does is it marks this field as representing a foreign key relationship, an edge that extends out of the service entity and into a different entity.

B

In this case, we don't put any bounds on what can be on the other side, but we do support the ability to add something like this, where you can say, entity types is corp user, which will then restrict that edge to only have a user on the other side. So for this purpose, I think that that makes sense here.

B

The second thing we'll need to do is just add the aspect to union. This is exactly the same. Nothing has changed from what happens today. We add a service aspect which basically pulls together all of these individual pieces of metadata, so it pulls together the service key, the info aspect we just defined, and then there's this third thing- that I'll quickly talk about which is pretty fun, uh which is this new browse paths aspect, and this is something that the no code initiative has allowed us to address.

B

This ask: that's come up repeatedly over the past few months, which is the ability to customize browse paths. So browse paths are what you see when you're navigating the explorer hierarchy right, where you're seeing prod snowflake something else right now. All of those are generated based on this hard-coded logic that sits in gms. We've actually changed that such that you can provide custom browse paths as a normal aspect, as you would any other metadata, and so here we're actually adding that aspect.

B

So we can provide browse paths and we'll demo exactly how you would query for that in just a moment and the final thing, the final big thing we have to do is define sort of the entity model. This is the snapshot model that everyone's used to. We have one final new annotation that we put on the snapshot model, which is the entity annotation in the same way that the aspect annotation allows us to define a common name for an aspect.

B

The entity annotation allows us to define a common name for an entity which is globally unique. It allows us to get away from using the fully qualified model name as the de facto name for the entity, which we can talk about the benefits of that in a little while, but the second piece of metadata we specify here is that key aspect right so here.

B

What we're saying is that the key for the service is represented by the service key aspect which we've defined above, and what this allows us to do behind the scenes is translate between that service, key struct and a generic urn right.

B

So we have this ability to sort of serialize the service key and deserialize the service key into a struct that you can use such that when you're querying for an entity, you will always get back both in urn, which you shouldn't have to look into and a service key aspect which you can then pull fields. Out of so it's sort of this idea of a virtual aspect here and then the final thing is is just adding that service snapshot to the list of all snapshots.

B

This is again nothing different from from what we do today and that's pretty much it right. So four steps entity is now added. We redeploy gms and we redeploy mae consumer, a few other containers and we should be able to now interact with that entity, and so that's what I'm going to show now. I've already actually modeled this entity and redeployed, my own local versions to save you guys some of that awkward silence time, but we're going to go ahead and try to write an entity.

B

So the first thing we're doing is we're using a newly created generic entities endpoint, which allows you to read and write any entity in data hub and we're going to write into it. This service snapshot, so some things to call out are the description. Is my demo service? There's an owner marked here. So remember: that's the foreign key relationship.

B

This is going to be indexed in search, and then we have these custom browse pads right. So my custom browse path. 1, my custom browse path. 2. An interesting thing to note is: we've also made it such that you can specify multiple browse pads, so you can actually access this entity from multiple explorer traversals, which I think is a pretty cool feature.

B

So we're going to go ahead into this terminal and I'm going to go ahead and paste this in, and hopefully everything works. Okay, so nothing came back, which is good means. There was no exceptions and we can validate that by reading that entity using the exact same resource. So this is the same end set of endpoints.

B

All generic we'll go ahead and curl that, and you can see okay here we are so we've got our new our new entity back. It has all the data that we we think it should, and we also can call out that we have that service key coming back right. So now you can actually ask it: what's your name in a much more clean, clean way and we're going to actually search right? So here we have a search endpoint again, it's generic!

B

You can search across any entity in this case we're going to search across the service entity in particular, and we're going to pass in my demo. So if you remember, our description was my demo service and we'll get back a my demo service. So this is saying yes, this matched your search. Query based on this this field, my demo service, then we're going to go ahead and and run the auto complete test.

B

So autocomplete is available on that service name within the service key aspect, so we're going to go ahead and query my dem and it should auto complete into my demo service and there you have it completes into my demo service.

B

Finally, we're going to go into this browse endpoint, so this is going to allow us to say give me all the paths under my custom browse and we're going to try that out and see if we got any results so again, this will enable that explorer experience that you guys are used to on the ui.

B

Okay, we've got one back. My demo service is the browse entity there, so this seems to be working and then, finally, um we have this. This new endpoint called relationships which allows you to fetch arbitrary relationships between data hub entities, and in this case, what we're going to fetch is an owned by relationship that is actually incoming into a corp user right. So basically we're trying to test the inverse relationship.

B

So we're saying get me anything that I own as user one right, so we're going to go ahead and run that and you can see okay, we got the service back, so the edge has been indexed it's available via this generic relationship's endpoint now, and so that's basically the demo. That's the process of adding a new entity takes, you know less than 15 minutes. I think I was talking through it, so maybe 15.

B

um and we're going to go back to talking about how we did it so quickly I'll go over sort of what the architecture used to look like you know in a nutshell, the theme is that at every one of these segments you had to define components on a per entity basis right so at the client layer you had individual classes or components for each entity. Data sets users charts dashboards, that propagates over to the resource. The wrestling endpoint layer as well.

B

You had a different set of endpoints for data sets users, charts dashboards right and inside of those components you had another layer. You had individual dowels for searching for writing and reading to the key value store and to getting relationships, and these are each uh specific to the individual entity. So you can see this just scales with the number of entities.

B

If you head over to the mae consumer side, which is responsible for updating the search index and the graph store as metadata changes come through the system, you'll see that we have the same exact pattern right. So you have a search builder, that's specific! For a data set. You have a user search builder, you have a user graph builder, you have a ds graph builder, um and so it's just kind of scaling with the number of entities.

B

It's the common theme across all of this, and now I'll show you kind of what the after looks like.

B

So this is the after um we've revised that such that we have two sort of generic sets of endpoints one is about entities, so it provides the ability to read, write, search, browse any entity in data hub. We have the relationship endpoints, which allows you to effectively do the same, but for edges right and then we have these service classes that sit behind those and those are the the key they're kind of the generic uh read and write abstractions over the persistence layer right.

B

So we have the entity service, we have the search service, we have the graph service and then heading over to the mae consumer side. We followed a very similar pattern where we have a generic entity search index builder. We have a generic relationship graph builder and this is all driven based on those annotations in the model right. So that's how we compute what the updates need to be, and so I'm going to go quickly over this because we already talked through it. There's a few slides just going deeper.

B

You can feel free to look through these in your own time, but in a nutshell, here we have four new annotations: the entity aspect, relationship searchable, annotation that we talked about it's very easy to just define them alongside the model as additional metadata I'll talk quickly about one of the key abstractions that unlocked sort of all of this ability, which is the entity registry.

B

The entity registry is really a runtime source of truth for both models and metadata, and it's what all of the services and the index builders now depend on to get information about the model and information about the metadata. So you can think about the storage configurations, the annotations. How should I build the search index? How should I build the relationship index and a key part about this is that we decouple all of the service and index builder layer from the metadata models and configuration itself such that in the future.

B

You know, maybe we aren't using pegasus, maybe we're using protobuf, maybe we're using something else, or maybe we're having a completely dynamic entity registry where you can curl in a new schema um like a database right and from the graph services perspective or from the services perspective. In the indexability perspective, nothing would change, which is which is pretty exciting, so again, services. They all allow you to do generic things to each entity and relationship.

B

One key point here is they're decoupled from storage technologies, so they're all based on data hub specific abstractions. They are interfaces or abstract classes. That can be very slim, they're very slim and can be implemented for a multitude of different persistent stores, which provided the default implementations for ebean, which is all the sql stores elastic on the search side and neo4j on the graph store side. So far, endpoints again, we've seen them generic endpoints for both fetching entities and relationships and then index builders.

B

So one interesting thing here is that these service classes, we've introduced, are actually used now, both on the read path so from gms, but also on the right path from the index builders. So these search service and graph service are kind of common across multiple parts of the stack which makes kind of changing and updating things much easier, and we have one central abstraction that everything else depends on so just revisiting the the non-functional requirements we talked about earlier talking about how we may have achieved them start with the declarative one.

B

So we provide a dsl for defining models as well as storage configurations without any java required right, no coding, we defined an extensible model where it's easy to add new indexed field types. So you remember text partial. We saw earlier it's very easy to add something new there. It's easy to add new relationships, it's as easy as defining a new relationship, annotation, it's easy to plug in new storage, implementations and then finally, it's usable.

B

We have build and runtime model validation, provided such that, if you, when you define your entity hierarchy, you should know at build time if something is wrong right. So if you have a conflicting aspect, name or entity name or maybe you didn't define a key for some reason. You'll know a build time so that we can avoid these runtime exceptions and then a couple fun features.

B

We think, which are configurable browse paths as well as the moving away from the requirement to have these strongly typed earn pdls, as well as java classes, which is the current requirement, and so just quickly going to talk about the impact of this initiative. So I think firstly, the biggest impact is just the reduction in complexity across gms.

B

That's unlocked from this, so I don't know uh if, if you guys want to call out or or write in the chat, how many files you think may have been made redundant or can be removed with the no code initiative, if anyone has any guesses, let me know before I.

B

Okay, it's 271 files right, so this is 271 files that we can remove from data hubs code base tomorrow, given no code right, so huge reduction in complexity, we've also boosted developer productivity uh from 25 or more boilerplate code files to one set of declarative, pdl files and from one to two weeks of development time down to less than 15 minutes to add a new entity.

B

Now I'm going to talk a little bit about where no code goes from here, so we're going to look to actually move up the stack right now. All of the work that you've seen is really limited to the metadata platform layer so gms and beyond, but we want to actually auto generate the graphql api at the datahub frontend layer, as well as exploring the ability to sort of gener dynamically generate sort of ui configurations.

B

um There's a couple other things we want to do as quick follow-ups, so we want to clean up all of that legacy code, so 271 files we want to get rid of if we can- and then we want to you know, continue to expand, expand on those apis. We've developed they're, very minimal they're slim.

B

They do exactly what data hubs ui needs them to do, but we can certainly foresee the ability to add sort of new uh capabilities to those apis as well, and then I'm just going to close by talking briefly about sort of the vision of datahub. As we see it, we really want datahub to become this sort of true metadata platform, where you can do things like dynamic model, registration and storage configurations.

B

Perhaps perhaps even at run time as we touched on before, we want to provide these rich multi-language client sdks, which allow you to both read and write from the async and sync pathways into data hub from any language.

B

We want to have rich access, control roles and policies at the metadata platform layer and again we want to have those async and sync read pathways available easily available, and this is sort of the visual representation of that.

B

So we kind of think of the this having multiple kind of layers, where we have the metadata storage engine responsible for access controls, index, building the commit log, maintaining that entity registry, and then we have a set of apis on top both synchronous and asynchronous, so kafka based and rest apis, and then on top of that, the client sdks, which can then interact with all of those and then finally I'll just conclude by saying uh talking a little bit about you know how we get to no code. So how is this going to be released?

B

The code itself is coming next week early next week, and it includes a few different things. One is a newly introduced data hub upgrade container in cli that allows you to actually perform the upgrade against a running instance of data hub. That would be required to move to no code right. So there's two ways to do it. You can either sort of restart the data hub instance from a clean slate and that'll just work, or you can run this data hub upgrade cli against a running instance.

B

If you have a lot of data and we've tested this against a few hundred thousand rows in a sql store and things look good, so the second part is, we have all these guides. Basically that allow you to run the snow code upgrade against docker, compose deployments, helm deployments or manually, if you want to, if you have a different setup, but basically in a nutshell, it'll be deploy. New data hub containers, run this migration script and then verify and validate. You should be done and with that I think I will hand it back to shashanka.

B

I may have gone over time, yup 10 minutes over time. Sorry about that guys, but I'll hand it back.

A

That's fine, john. I think we can all deal with the productivity events. You've gotten us, so you've saved us a lot of time as well. um A couple of things I wanted to point out, as we did the design exercise for this. um We didn't want to boil the ocean too much so a few things we actually kept them the way they are. For example, the mce schema stays the same for now and even for strong types we made them stay as a constraint.

A

We wanted actually so these models that you're using to define your entities and your aspects they're actually used to create a serializable structs that you can actually use to send metadata over. So you don't lose strong types. As a result of this, we just created a generic entity. Endpoints so really excited about this. I hope everyone is.

A

I saw a lot of good feedback on the chat, so we're looking forward to rolling this out next week and then uh helping you as you upgrade your uh ecosystems, and you know we'll be there uh online to help you out with any of this. We've tested it out internally, quite heavily, and we've done lots of backwards and forwards compatibility testing, so we're comfortable that this works. But you know you only know when you finally run it so looking forward to helping you all get over the hurdle.

A

We've actually introduced storage formats now as a concept, so the next upgrade is actually going to be much more seamless, because now we have versioned metadata around. So that's pretty much all we had. I.