From YouTube: Aug 27 2021: DataHub Community Meeting (Full version)
Description
Full version of the DataHub Community Meeting on Aug 27th 2021
00:00 Welcome
01:13 Project Updates and Callouts by Shirshanka
- Accomplishments in Aug
04:55 Business Glossary Demo by Shirshanka
- 0.8.12 Upcoming Release Highlights
15:40 Users and Groups Management (Okta, Azure AD)
21:48 Demo: Fine Grained Access Control by John Joyce (Acryl Data)
38:41 Case-Study: DataHub @ Warung Pintar and Redash integration by Taufiq Ibrahim (Bizzy Group)
56:48 New User Experience by John Joyce (Acryl Data)
A: Welcome everyone to the August edition of the DataHub community meeting. Let's get right to it; we have a packed agenda as usual. First off I'll go through the project updates and, as usual, talk about what we've accomplished. In August we've been busy building toward the upcoming release, 0.8.12.
Overall, from a commit perspective, I was just looking at the commits since the last town hall and we've actually got more than 150, so we're continuing our 100-plus commits per month rate. We've got more than 20, almost 24, committers from 13 different companies and six new contributors. So welcome to all of you; we're looking forward to more contributions from each one of you, and we'll keep building interesting features together. In terms of the biggest highlights, we have Business Glossary phase one.
John will talk you through how it has been built and what capabilities you're going to have. Then, similarly, on the users and groups track we've got a bunch of work done on the integrations with Okta and Azure AD, as well as just-in-time provisioning, which we'll walk through as well. Typically we go over product improvements, integrations and developer experience improvements; we've had work in all three tracks and I'll cover them next, but first some community call-outs.
A
A
Call
out
frederick
for
continuing
to
improve
our
injection
code
base
recently
contributed
the
ability
for
you
to
extend
and
bring
in
your
own
sql
parser
when
analyzing
local
queries.
I
think
it's
going
to
be
quite
interesting
to
do
that,
then
we
have
toffiki
tawfiq
ibrahim
and
chris
colson
also
known
as
data
science
chris
for
collaborating
on
the
redash
contribution,
and
then
we
have
simon
orimus,
walior
and
serif,
who
have
been
consistently
asking
great
questions.
A
You
know
we
love
stack
traces
and
we
love
troubleshooting
issues,
but
we
also
like
talking
about
things
like
how
should
data
meshes
be
modeled
and
what
does
mlaps
look
like,
and
so
it's
been
great
having
high
quality
conversations
as
well
in
our
community
talking
about
the
community
dan
vestobe
excel
david
schmidt
and
chris
coulson
have
been
helping
the
community
out
generously
with
their
time
when
people
have
issues
helping
them
out
with
solutions.
So
thank
you
all
of
you
for
doing
that.
A
It
takes
a
village
to
keep
the
community
growing
and
thanks
a
lot
for
continuing
to
do
that
and
last,
but
not
the
least
dimitri
boykin,
for
giving
great
feedback
on
our
last
town
hall.
Mlaps
integration
we're
going
to
continue
building
out
better
lineage
integration
between
features
and
data
sets,
as
well
as
other
systems
in
the
ecosystem,
so
stay
tuned.
For
that.
A
Moving
on
the
first
product
improvement
that
we
would
like
to
share
is
a
business
glossary
phase.
One
and
I'll
do
a
quick
demo,
but
before
that,
just
a
quick
intro
to
business
glossary
itself,
it's
really
a
way
of
representing
a
tree
of
concepts
that
are
useful
for
attaching
to
existing
data
sets
or
fields.
So,
for
example,
at
your
company,
you
may
decide
to
have
a
taxonomy
that
says:
classification
as
a
top
level
node
and
within
that
terms,
like
confidential
or
highly
confidential
or
sensitive,
that
live
within
the
classification
node.
Similarly, you can have another node called Clients and Accounts, and all client and account terms live under there, like Account; and an Account can also contain a Balance, which itself might be a term under the Clients and Accounts node. You can also have relationships across these taxonomies. For example, a Balance or an Account might be confidential or might hold highly confidential data, and so you can have a relationship between the Account term and the Highly Confidential term.
So first, let's look at the recipe. If you're familiar with recipes, this is what it looks like: you have a source and a sink. The sink looks exactly the same (the destination is datahub-rest), and the source in this case is of type datahub-business-glossary. This is a new source that I've created, and it takes a config file, which is the business glossary itself.
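For reference, a recipe along those lines might look like the sketch below; the glossary file path and the server URL are placeholders, so verify the exact option names against the ingestion docs for your version.

```yaml
# Hypothetical sketch: ingest a business glossary file into DataHub.
source:
  type: datahub-business-glossary
  config:
    file: ./business_glossary.yml      # placeholder path to the glossary definition

sink:
  type: datahub-rest
  config:
    server: http://localhost:8080      # placeholder DataHub endpoint
```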
A
All
right
very:
firstly,
we
have
a
version
identifier
and
then
a
source
which
kind
of
describes
who
is
even
specifying
this
glossary.
So
in
my
case
I
decided
that
datahub
is
the
author
of
the
glossary
in
your
company's
case.
You
might
have
your
company's
name
here
and
then
default
owners
for
all
things
in
the
glossary.
These
could
be
users,
they
could
also
be
groups.
A
You
can
have
a
url,
for
example,
this
could
point
to
the
github
location
of
where
this
business
glossary
file
is
stored
and
then
below
that
you
have
nodes
and
then
contained
within
nodes.
You
have
terms
so,
let's
look
at
the
nodes
I
created,
I
created
a
node
called
classification,
just
like
I
showed
you
in
the
slide
deck
before
it's
got
a
description
and
it
has
terms
within
it
like
names,
the
terms
have
names
like
sensitive
confidential,
highly
confidential,
with
some
descriptions
attached
to
them.
Inherits is really an is-a relationship, so what I'm basically saying is that emails are classified as confidential. The owner in this case is a group, and as we go further down you see Gender inheriting the Sensitive classification. Further down we have another node called Clients and Accounts. I actually copied this from FIBO, and that's why, when I define the term, I say that the term source is external and the source ref is FIBO. I even give a link to where this term is defined in the FIBO glossary, and then I have some specializations of the term: it inherits Highly Confidential, and it contains another term, the client account Balance term, which I define next.
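Putting those pieces together, a glossary file in that spirit could look roughly like the sketch below. The property names (version, source, owners, url, nodes, terms, inherits, contains, term_source, source_ref, source_url) are recalled from the docs of that era, and all of the values are placeholders, so treat this as approximate and check the business-glossary source documentation.

```yaml
# Hypothetical sketch of a business glossary definition file.
version: 1
source: DataHub                      # who authored this glossary
owners:
  users:
    - datahub                        # placeholder default owner
url: "https://github.com/example/glossary.yml"    # placeholder location of this file
nodes:
  - name: Classification
    description: Terms describing data classification
    terms:
      - name: Sensitive
        description: Sensitive data
      - name: Confidential
        description: Confidential data
      - name: HighlyConfidential
        description: Highly confidential data
  - name: ClientsAndAccounts
    description: Client and account concepts (adapted from FIBO)
    terms:
      - name: Account
        description: A financial account
        term_source: "EXTERNAL"
        source_ref: FIBO
        source_url: "https://spec.edmcouncil.org/fibo/"   # placeholder external link
        inherits:
          - Classification.HighlyConfidential
        contains:
          - ClientsAndAccounts.Balance
      - name: Balance
        description: The balance of an account
```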
Hold a deep breath while ingestion runs... there you go, we ingested 11 terms. Now let's go into the business glossary and see if we can find them. There you go: we have our business glossary terms. The top-level nodes are in here, we can go into them and see the terms within them, and we see that ownership has been ingested as well. We can go into each one of these; remember Email? It's got a source, and you can go view the source.
A
Similarly,
we
can
go
into
accounts
and
go
to
the
account
term,
and
when
we
go
to
related
terms,
we
see
that
it
contains
a
balance
just
like
we
had
described
as
well
as
inherits
the
highly
confidential
term.
So
that,
in
a
nutshell,
is
how
you
can
produce
and
load
an
entire
business
glossary
into
data
hub.
This dataset seems to have a field called email, but we don't see any terms attached to it, and it would be nice to know that the user account dataset is actually of type Account and that the email field in here is actually of type Email. So let's ingest those terms. For that I'll go back to the terminal; in this case I have a nicely prepared datasets.json file. If you're familiar with how MCEs look, you should be right at home here.
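For readers who haven't seen one, a metadata change event (MCE) that attaches a glossary term to a dataset might look roughly like the fragment below; the dataset URN, term URN and actor are placeholders, and the layout is a simplified sketch rather than the exact file used in the demo.

```json
{
  "auditHeader": null,
  "proposedSnapshot": {
    "com.linkedin.pegasus2avro.metadata.snapshot.DatasetSnapshot": {
      "urn": "urn:li:dataset:(urn:li:dataPlatform:hive,example.user_account,PROD)",
      "aspects": [
        {
          "com.linkedin.pegasus2avro.common.GlossaryTerms": {
            "terms": [
              { "urn": "urn:li:glossaryTerm:ClientsAndAccounts.Account" }
            ],
            "auditStamp": { "time": 0, "actor": "urn:li:corpuser:datahub" }
          }
        }
      ]
    }
  }
}
```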
The demo works: we see tags and terms attached. The Account term has been attached to the user account dataset, and the Email term has been attached to the email field. If I go into the Email term now, I'll see that there are related entities, datasets that are already attached to this term, and if I go to the Account term, I'll see that the same dataset is also attached to it.
A
It
also
you
can
also
search
for
these
things,
so
I
can
just
type
email
and
I'll
get
a
helpful
drop
down.
Where
autocomplete
shows
me
that
I
can
search
for
personal
information.email
and
if
I
search
for
it,
I
not
only
get
the
glossary
term,
but
I
also
get
the
data
sets
that
have
that
term
attached
to
it.
So
that,
in
a
nutshell,
is
the
business
glossary
demo?
On the Kafka source, we now support both keys as well as values, so if you use the new Kafka source and upgrade your DataHub libraries, you'll be able to see both keys and values showing up in a toggle at the very top right. In addition, we also went in and improved our representation of highly structured, nested schemas, so now you can finally ingest DataHub's own Kafka topics and actually explore the metadata schema itself. We've done a lot of work on representing structs as well as unions.
A
Well,
so
you
should
be
able
to
actually
browse
the
schema
pretty
nicely
and
understand
what
this
very
complicated
schema
looks
like
all
right.
So
that's
pretty
much
what
I
had
to
share
and
now
we're
going
to
go
over
to
john
who's,
going
to
give
us
a
quick
update
on
what
has
been
cooking
on
users
and
groups.
B: So we've had some recent developments on the ingesting-users-and-groups front. This is something we've actually gotten quite a few questions about recently, so we're putting some effort into making sure our guidance is clear around how to ingest your users, as well as your groups, into DataHub.
One of those recent developments is just-in-time provisioning. This means that when people log in, we will provision the users and their corresponding groups at login time if they do not already exist in DataHub. We've also made groups searchable via the UI, and we've added group members on the groups page itself, so you can easily understand who is in a particular group. Now I want to get into the details of just-in-time provisioning.
First, I just want to quickly give a high-level overview of user and group management in DataHub. There are basically two paths to seeding users and groups into DataHub. The first is what we call proactive, which is batch ingesting users and groups from some third-party, external system like Okta or Azure AD, and we now also provide the ability to validate at login time that a user has already been ingested.
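As a reference point for the proactive path, a batch-ingestion recipe could look like the sketch below; the Okta domain and API token are placeholders, and the option names are recalled from the Okta source docs, so confirm them for your version.

```yaml
# Hypothetical sketch: batch-ingest users and groups from Okta into DataHub.
source:
  type: okta
  config:
    okta_domain: dev-12345.okta.com      # placeholder Okta domain
    okta_api_token: "${OKTA_API_TOKEN}"  # placeholder API token
    ingest_users: true
    ingest_groups: true

sink:
  type: datahub-rest
  config:
    server: http://localhost:8080
```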
So, basically, you can go into Okta and maybe only ingest the 20 users that you want as your beta users, and when they log in they will either be allowed or denied based on whether they're already in the system. So that's the proactive approach, and then the reactive approach is what I just talked about: just-in-time ingestion at login time over OIDC. Actually, both of these today require OpenID Connect for the authentication piece. If you'll hit next, Shirshanka.
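For context, both paths hang off the frontend's OIDC configuration; a sketch of the relevant environment variables is below. The variable names reflect the docs from around that time and the identity-provider values are placeholders, so double-check them against the current authentication guide.

```yaml
# Hypothetical datahub-frontend environment for OIDC login with JIT provisioning.
AUTH_OIDC_ENABLED: "true"
AUTH_OIDC_CLIENT_ID: "<your-client-id>"            # placeholder, from your IdP
AUTH_OIDC_CLIENT_SECRET: "<your-client-secret>"    # placeholder, from your IdP
AUTH_OIDC_DISCOVERY_URI: "https://idp.example.com/.well-known/openid-configuration"
AUTH_OIDC_BASE_URL: "https://datahub.example.com"  # placeholder DataHub URL
# Reactive path: provision users (and extracted groups) on first login
AUTH_OIDC_JIT_PROVISIONING_ENABLED: "true"
AUTH_OIDC_EXTRACT_GROUPS_ENABLED: "true"
# Proactive path: require that users were already batch-ingested before login
AUTH_OIDC_PRE_PROVISIONING_REQUIRED: "false"
```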
So if this doesn't work for your organization, i.e. you're using SAML, or LDAP would work better, or something else, please do let us know. We're always trying to get feedback about this particular thing, because it's a domain that differs quite a bit between organizations. We do it one way, we use OIDC, but everyone does it slightly differently, so we definitely want feedback to understand whether there's a better way to pull your organization's users and groups into DataHub and seed them.
Okay, next slide. So what's on the horizon? We'd like to add an admin console in the UI that allows you to manage users and groups: doing things like creating new groups through the UI, removing groups that you may have ingested or that may have been provisioned, and managing group membership, so actually being able to add users to groups and remove users from groups. And then, finally, we'd like fine-grained user state management: the ability to activate and deactivate users that you may have ingested from a third-party source or who may have been provisioned at login time. So that's what we're working on now.
A: Awesome. So, as usual, we have a lot of integration improvements, pretty much across the board. A few call-outs would be Redash, which we'll talk about later; Kafka Connect, where we've added support for JDBC sources as well, not just the Debezium one that was there before; and for MongoDB we added some small tweaks to handle really large schemas coming out of the schema inference system. So now, you know, DataHub is not going to crash on you if you have 13,000 fields in your schema, like some people do.
All right, moving on to the developer track: we were going to talk about performance metrics, but I'm not going to discuss that too much here. We've added a lot of improved documentation for ingestion sources, so if you go check out our ingestion docs, our source docs are much improved; thanks to John and Kevin for doing that. As new sources come on board, we now have a pretty nice way of adding them to the documentation. All right, so with that, back to John to start off the first session of the day, which is fine-grained access control.
B: All right, thank you, Shirshanka. Let me just take over the screen here. Yeah, so I'm going to do a quick overview of where we are on fine-grained access control. This is something we started thinking about at the beginning of the summer, based on a lot of feedback from the community around wanting this capability to control who has access to what metadata on DataHub.
So I'm going to start by just talking about what access control is. The way we think about it is that access control is a way to declare who can perform what action against which resources, and we model this with three sub-concepts: an actor, which determines the "who" portion of the policy; a privilege, which is the action they can perform; and, finally, a resource, or object.
So I'm just going to talk through a few policies in English. On DataHub you may want to restrict who can do certain things. Number one: maybe dataset owners should be able to add documentation, but they shouldn't be able to add tags, because we want a controlled vocabulary of tags. Another example: maybe the data platform team should be able to edit anything about a dataset, because they manage the platform and they're sort of the admins of DataHub. Maybe Ted, our data steward, should be able to edit any dataset's tags, because maybe that's his job, but he shouldn't be able to edit the description or the ownership or anything else. And finally, maybe the administrative group should be able to manage the policies themselves, and so be able to dictate who can do what on the platform.
We wanted to apply these policies to resources at two levels. One is the resource-type level, so imagine dataset assets or dashboards or charts; the other is the resource-identity level, so being able to call out a particular dataset or a particular chart and apply fine-grained access control against that asset individually. And, finally, we wanted to model the concept of actors using the concepts of users and groups that already exist.
So we wanted to be able to say that John should be able to do something to a particular dataset, or that a group should be able to do something to a particular dashboard. We also wanted the ability to support a wildcard predicate and say that all users, or all groups, should be able to do something to a particular asset.
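To make the actor, privilege and resource shape concrete, a policy along the lines of the examples above could be written roughly like the sketch below. This is an illustrative JSON shape rather than the exact format DataHub stores, and the privilege name and dataset URN are placeholders.

```json
{
  "name": "Dataset owners documentation policy",
  "description": "Owners should be allowed to edit docs",
  "type": "METADATA",
  "state": "ACTIVE",
  "actors": {
    "users": [],
    "groups": [],
    "resourceOwners": true,
    "allUsers": false,
    "allGroups": false
  },
  "privileges": ["EDIT_ENTITY_DOCS"],
  "resources": {
    "type": "dataset",
    "resources": ["urn:li:dataset:(urn:li:dataPlatform:hive,SampleHiveDataset,PROD)"],
    "allResources": false
  }
}
```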
So now I'm going to go into a demo of the milestone-one implementation of policies, based on what we just talked about. I'm going to go over here to DataHub; right off the bat, this is the default deployment of the new policies world. I'm going to go ahead and search; I've just got some of the basic sample metadata in here that you're probably all familiar with.
B
Probably
I'm
going
to
go
to
this
first
data
set
and
I'm
going
to
try
to
add
a
tag
right
so
let's
say
new
tag.
Okay,
I
already
have
one
my
new
tag
and
what
you'll
notice
right
away
is
that
we've
got
a
warning
here,
which
says
looks
like
you're
unauthorized
to
perform
that
action.
So
why
would
that
be?
Well?
That's
because
we
haven't
defined
any
policies
yet
so
I,
by
default.
I
am
not
able
to
do
anything
to
this
data
set
right.
B
You
know
actor
object
privilege,
so
I'm
going
to
start
by
giving
my
policy
a
name
and
I'm
going
to
actually
use
the
example
from
the
the
slides,
I'm
going
to
say,
data
sets
owner's
documentation
policy
right.
So
basically,
I
want
to
say
that
owners
can
update
documentation,
but
that's
it
about
a
particular
data
set.
So
next
I'm
going
to
choose
the
type
of
the
policy.
There
are
two
types
today.
B
Finally,
I'm
going
to
give
it
a
description
say
only
owners
should
be
allowed,
we're
sorry,
let's
actually
say
owners
should
be
allowed
to
edit
docs.
That's
it
I'm
going
to
hit
next
and
I'm
going
to
choose
the
asset
type
that
I
want
to
apply
the
policy
to
so
in
this
case,
it's
going
to
be
data.
Sets
I'm
going
to
choose
that
and
then
I'm
going
to
choose
the
asset
that
I
want
this
policy
to
apply
to.
So
I
can
either
search
for
a
particular
asset
right
or
I
can
just
say
all.
Finally, we get to the third and final screen, where we can say who can actually do this, and you'll see right away that there are three kinds of options here. We can call out users specifically, so I can say the datahub user, or John Doe, or whoever; we can call out groups; or we can say owners. Owners is that edge-based predicate.
B
Finally,
I'm
just
going
to
save
this,
and
now
you
see
I
have
a
new
policy
right.
You
can
see
it's
in
an
active
state,
which
means
it
should
apply.
So
I'm
going
to
go
ahead
and
go
back
to
the
data
sets
as
you'll
notice
like
this
actually
isn't
owned
by
me.
I'm
logged
in
as
data
hub,
so
I'm
going
to
go
to
the
second
data
set
which
is
owned
by
me
and
I'm
going
to
attempt
to
update
the
documentation.
B
And
you'll
see
I
was
able
to
update
it
great
awesome,
so,
let's
actually
back
out
here
and
let's
try
to
update
a
data,
sets
documentation
that
I
don't
own
right.
So
I
don't
own
this
one,
I'm
going
to
come
in
here
and
say:
hey!
I
want
to
update
oops
looks
like
I'm
unauthorized
to
perform
that
and
that's
because
the
policy
doesn't
allow
me
to
do
that.
B
So
I'm
going
to
go
back
and
I'm
going
to
open
up
this
policy
again,
I'm
going
to
take
a
look
at
what
it
says
and
I'm
actually
just
going
to
deactivate
it
because
you
know.
Actually
I
want
to
revoke
this
policy,
so
I'm
going
to
go
ahead
and
click
deactivate
and
you'll
see
that
this
policy
is
now
in
an
inactive
state.
B
B
B
Now, creating a second policy: I'm going to again choose datasets, and in this case I'm going to look up a particular dataset. I want to say that I should be able to update the HDFS dataset, and maybe the Kafka one as well, so I'll select two of them, and then finally I'll select a privilege, in this case editing tags.
And then I will just find myself, datahub, and I'll save it, and you can see we've got the new policy and it's in the active state. So now I should be able to update the tags for this HDFS dataset, the one I wasn't able to update in the initial case. Let's say "my new tag" again and see if I can add it... looks like I was able to add it. I can remove tags as well, of course, because I have full control over editing the tags.
I can probably add a tag here as well. Awesome, okay, so we've correctly created two policies, and now there's the final thing I want to demo, which is just cleaning up policies. There are cases in which you may have created a policy by mistake; what you can do there is just come in and delete the policy, and we're back to state zero. So this, in a nutshell, is what policy management and fine-grained access control will look like on DataHub.
This is the MVP. All of the privileges and assets you saw will be supported, both the metadata privileges and the platform privileges: basic platform privileges including managing policies, managing analytics, things like that. Eventually that'll be extended to include things like managing users and groups, so adding groups, deleting groups, things like that.
I'm pretty happy about how this turned out and looking for feedback from the community. We will have a global on/off switch here, which I'll talk about shortly when I get back into the slides. But let's go ahead and continue.
A: John, there's one question about who can even edit policies, like who has admin privileges on even the ability to add or create policies.
B: Yeah, so we model the ability to manage policies as a platform privilege, and by default DataHub will ship, or launch, with a set of immutable policies. Those immutable policies will grant the ability to manage policies and to manage analytics to the core super user, which is the datahub user today. So when you launch a fresh instance of DataHub, that datahub user will have all privileges on the platform, and that'll be the jump-off point from which you can create additional policies.
Let me quickly talk about the implementation, what's going on here. Recently we've moved our GraphQL API to the metadata service, so that's actually where a lot of this occurs. So what happens when a request comes in?
The policies are kept in a cache that gets refreshed in two ways. One is on a cadence: you can configure it to sync every two minutes, five minutes, ten minutes, whatever you'd like; by default it's two minutes. The other is when the cache becomes stale: if you add a policy or edit a policy's state, as you saw in this demo, we will actually go and refetch and rebuild the cache. And that gets us into the authorizer itself.
B
This
key
component,
which
basically
maintains
that
cache
always
keeps
kind
of
the
latest
view
of
the
policies
as
well
as
makes
a
determination
at
you
know,
request
time
whether
to
allow
or
deny
a
particular
action,
and
it
does
so
by
exposing
an
api
that
takes
those
three
pieces
of
the
policy
that
we
had
talked
about
prior.
So
at
request
time,
the
invoking
code
will
pass.
You
know
an
actor
which
is
basically
the
user
principle
behind
the
request.
It'll
pass
the
groups
that
that
user
is
associated
with,
as
well
as
a
privilege.
So it's pretty awesome. Policies in practice: we want policies to be able to be enabled or disabled globally at deploy time. What this means is that you can continue to use DataHub as you're using it today, where there are no policies and anyone on the platform can do anything.
B
We
wouldn't
recommend
that
we
recommend
you
actually
do
start
using
the
policies,
because
they,
I
think
they'll
be
very,
very
helpful
to
make
sure
that
metadata
stays
clean,
but
by
default
again,
datahub
will
be
that
super
user,
which
will
be
seated
with
irrevocable
kind
of
immutable
policies
that
say
that
it
can
do
anything
and
so
it'll
be
on
the
operator
to
go
and
spawn
off
additional
policies
on
a
per.
You
know,
policy
basis
from
that
core
admin
account.
B
Finally,
I'll
just
talk
about
a
little
bit
about
you
know.
What's
on
the
horizon,
for
policies,
so
after
we
get
this
kind
of
first
code
pass
done,
we
want
to
release
a
policies.
V1
usage
guide,
that'll
talk
about
how
you
create
policies,
how
you
manage
them,
hopefully
it's
self-explanatory,
but
I
think
it
will
still
be
pretty
helpful
to
have
something
accompanying
a
feature.
This
big
we'll
also
look
at
supporting
additional
predicate
types,
especially
on
the
resource
itself.
B
So,
as
you
saw,
there's
mainly
users
and
groups
which
are
able
to
do
different
things,
we
have
had
some
requests
from
a
few
folks
that
this
layer
of
indirection,
which
is
commonly
called
a
role,
would
be
perhaps
useful,
so
we're
actually
looking
for
feedback
from
the
community
and
direction
from
the
community
to
understand
whether
that's
a
requirement.
That
really
is
something
we
need
to
take
into
account
here
with
this
system.
So that's pretty much it. Thanks, guys. I will hand it back to Shirshanka.
A: Thank you. We are running a little bit late, but I'll stay with the policy of allowing everyone to speak. There are a bunch of questions on the chat, and we will take the rest in #general, because I don't think we can get to all of them. Really good questions; thanks for handling most of them, but I think there are a few others that are still open. All right.
So let's move on to our community speaker, Taufiq, who is coming to us from Indonesia. Thanks, Taufiq, for staying up so late and giving us your time. Take it over.
I will share the screen, and then, as we get into the demo, we can...
C: All right, okay, thank you, Shirshanka. Good morning, everyone. I'm Taufiq, from Indonesia. Right now I'm working at Bizzy, now part of Warung Pintar Group, and I'm going to share our case study with DataHub and how we developed the Redash source connector. Next.
We are now serving around 600 FMCG brands and around 230K retailers across Indonesia. We actually have two kinds of business here: one is the supply side, which works with the distributors and the FMCG brands, and the other is the retail part, working with grocery retailers. Next: this is the data ecosystem at Warung Pintar Group and Bizzy.
We have several legacy stacks coming from existing platforms at the corporate enterprises, like SAP, but we also have more modern architecture, like cloud-based applications, so we have a mix of technology stacks: you can see that we have Airflow, and we still have SSIS here. We break this stack into an operational part and an analytical part, and we also have the operational domain, which is the ERP and the application databases.
C
We
also
touch
the
production
database,
like
updating
data,
synchronized
data
from
multiple
sources,
and
we
also
captured
changed
the
capture
from
the
application
database,
using
kafka,
connect
and
sync
it
into
multiple
things
like
operational
reporting,
dbs
and
then
also
right
into
the
bigquery,
which
is.
C
Processed
by
airflow
to
be
served
by
several
bi
and
reporting
tools,
you
can
see
that
we
have
multiple
reporting
services
like
we
have
the
the
old
stack
like
legacy:
sql
server,
reporting
services.
We
have
metabase,
we
have
redash
and
also
we
have
jupiter
why
we
have
so
much
stack
here,
because
we
we've
been
through
multiple
merch
and
sales
and
we
need
to
maintain
most
of
it,
because
the
users
still
need
to
use
it.
That's
why
metadata
and
then
the
lineage
things
is
really
important
here.
so we can understand all the data more easily. Next: why do we need a data catalog at Bizzy? The first reason is that we have endless repeated questions from anyone: where the data is, how it is produced, who owns it. The questions are repeated almost every day by different people, and we keep answering them. It's also difficult to do lineage and impact analysis, because we have lots of data sources and a lot of reporting that uses the data; if we want to change or modify data, it's quite difficult to find what the impact is on the other applications and on the reporting, something like that.
So this is our journey with data catalogs. At the beginning of 2020 we just created a simple manual data lineage. Then we moved on to do a PoC with Apache Atlas, but we found that it was too complex and too Hadoop-centric at the time, so we stopped that PoC. We also did a PoC with Amundsen, but at that time it wasn't really answering what we needed. Then, at the end of 2020, we found DataHub, and we started doing a PoC and then development with DataHub. Next.
Here are some of the reasons why we chose DataHub: mostly because DataHub pretty much matches our data stack, like Kafka Connect, BigQuery and Kafka (DataHub itself uses Kafka a lot), so it really matched our requirements. The no-code ingestion with YAML recipes is really helpful for us, and so is the development experience for source and sink connectors; the documentation was really helpful. Another feature that we really love is that we can show the dashboard link from the app, click through to it, and be brought right into the dashboard itself. And now we have role-based access and we can limit what users can do, which is really awesome. Next.
This is our DataHub integration usage at our group. We have databases, mostly RDBMSes like MySQL, SQL Server and Postgres; we also have BigQuery and Kafka, and there are two source integrations that we contributed: Kafka Connect and Redash.
C
We
just
push
it
directly
to
the
data
hub
mcs
yeah
thanks,
so
why
we
developed
read
this
integration,
because,
after
the
merge
we
found
that
one
quinta
group
used
redact
a
lot
from
that
analyst
to
product
teams
to
hr
teams.
They
use
redact
a
lot,
they
practically
love
to
learn
sql
and
they
can
use
redash
quite
good
and
it's
actually
a
develop
based
on
the
superset
source
and
then
the
other
reason
is
actually
it
helped
the
plc
to
be
approved
internally
and
right
now.
Yeah, this is an example of the recipe for our Redash source; you can find it in the documentation on GitHub. Basically, what you need is the connection URL of the Redash server itself (this is not the hosted Redash, but the open-source one) and the API key. Then you can limit the page size, for example for testing purposes; by default it's not limited, so it will ingest all the dashboards and charts. We also have a skip-draft option, which defaults to true; if you want to ingest draft or unpublished dashboards and charts, you can set it to false.
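A recipe along the lines he describes might look like the sketch below; the URL and API key are placeholders, and the page-size and skip-draft option names are recalled from the Redash source docs, so confirm them against the current documentation.

```yaml
# Hypothetical sketch of a Redash ingestion recipe.
source:
  type: redash
  config:
    connect_uri: http://localhost:5000    # placeholder Redash server URL
    api_key: "${REDASH_API_KEY}"          # placeholder API key
    page_size: 25                         # limit page size, e.g. for testing
    skip_draft: true                      # skip draft/unpublished dashboards and charts

sink:
  type: datahub-rest
  config:
    server: http://localhost:8080
```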
This is our DataHub. For example, I will search for a dashboard called the personal tracker.
But currently we haven't done things like mapping charts to the actual tables, because that needs SQL parsing, like the LookML source does, and we haven't developed that yet.
C: If you want, yeah, I can do that right now; I already prepared it for you. So, one of the few things that we hope DataHub can address is lineage visualization. I think for now most data catalog tools actually have the same problem. If we look here, this is the lineage that came from our legacy lineage Google Sheets, which I just pushed into DataHub, and you can see that it is quite a large lineage graph. When you have a large lineage graph like this, it becomes quite difficult to read. This is one of the things we hope the DataHub team can find a solution for.
A: Yeah, I think we will ship people Oculus glasses so they can fly through these kinds of lineage graphs soon.
C: For example, like this one.
A: All right, but yeah, point completely taken. I think lineage graphs look beautiful until they become incomprehensible, and I think that's something we, as an entire industry, have to actually tackle.
C: On the DataHub development experience, coming from me and our team: the contribution of the Kafka Connect source was actually my first open-source contribution ever, and I found that the community is very welcoming and very supportive. I even got some private messages, and Shirshanka asked me whether I still want to contribute something; it's very supportive. And the documentation is very helpful, like how to add a new ingestion source; it's really helpful and laid out in a standard way.
We are still in an internal testing phase, but we will socialize it and get user feedback starting next week, and I hope this will have an impact for our organization, and hopefully for DataHub as well. We already talked about lineage for large graphs; we're also interested in operational data quality metrics, things like lag metrics and row counts, just to check for anomalies on a daily basis. That's all for me, thank you so much.
A: Cool, thanks a lot. What time is it right now?
C: It's around almost midnight, actually.
A: All right, all right.
C: No problem, I'm quite used to it; that's natural for me, actually.
A: Cool, cool. Best of luck for today, then. All right, thanks a lot, Taufiq, and we will move ahead. I'm thinking, in the interest of time, since we only have five more minutes, we're going to skip the performance metrics demo that Dexter had and go straight to the surprise session.
What we will do is set up a one-off developer session for people who are really interested in doing a deep dive into DataHub performance measurement; we can do an office-hours session with Dexter on that, and he can show us how he's doing load testing and measuring performance, or we can do it as a follow-on session at a future town hall.
B: Yeah, do you want me to just share my screen again? Sure. All right, guys, one more time you'll have to see me today. So I'm going to reveal a little surprise we've been working on, and that is Extreme Makeover: DataHub Edition.
So if you've been tuned into DataHub for a long time, since the beginning of the year actually, you'll know that our first pass on the React UI was really to get it to parity with the legacy Ember app, and what we've begun to do in the last month is actually start to improve the UI design and get it into the envisioned state we had when we began the migration to React.
So I'm going to come over here to DataHub and just go through the login experience, and immediately you'll notice a fresh new appearance on the home page. Now, this redesign is actually not complete, so I ask that you suspend disbelief here, but we're going to go ahead and search for some datasets, and I'm going to go into this first one, and you'll notice the new design. A couple of things to call out: we've greatly improved the schema visualization.
We now have this nested-schema expansion, we have colored tags, which is exciting, we have these fun owner bubbles, and we have the side panel on the right side, which gives an at-a-glance view of the entity: the documentation, the statistics about the dataset, the tags and the owners.
Obviously we have these little toolbars where you can switch between different things. This is an example of the documentation tab; one change we've made is that editing documentation is now an inline process, so I can make my update and just hit save. Okay, we saved that. Properties: nothing here. Some lineage: you can go ahead and see a redesigned visualization experience, a little bit softer on the eyes and nicer around the edges.
Queries got a little bit of a redesign here, where we're actually highlighting the SQL syntax a bit better than before (if you'll recall, we just had some gray text previously), so I think this will be super useful. And then we have the classic stats, both the latest view and the historical view; not too much has changed there.
So basically, this is to announce that we're going to be going through the app piece by piece and redesigning it with this fresher new look, which we're really excited about, and what I'll talk about next is what that journey will hopefully look like.
Okay, let me just present. So, yeah, just this final slide. The next thing we're going to be doing is expanding beyond the dataset page to the other entities: charts, dashboards, tasks and pipelines. Then we're going to move on from there to redesign and rethink the search and browse experience. There's a sneak preview on the left here, where we actually have a richer, faceted search experience, which I think will hopefully be a big improvement over what we have today.
Finally, we'll be enriching the home page, which will be the next thing, with richer recommendations as well as richer jump-off points to both search and the more classic hierarchical discovery experience. And, finally, we have dark mode coming as well, so we'll be transposing this slick new look into the dark-mode world too. Super excited; we'd love any feedback from the community. We'll be pushing out the first milestone here in the next week: the datasets page.
All the pages are generally going to have this branding, but some of the dashboard, chart and other pages will still mostly look like they did previously for the time being, until we get them all migrated. So that's pretty much it. I'm really excited about this one and looking forward to feedback and contributions from the community as we progress with this.
A: Awesome, and that kind of concludes our session for today. I'm really sorry we couldn't get to you, Dexter; we'll catch you up in either a deep-dive session with the developers or as a follow-on. But thanks, everyone, for all of the love and all of the, you know, star eyes. I was similarly excited when I first saw the demo.
A
It
actually
makes
such
a
big
difference
when
the
there's
an
amazing
platform
beneath,
but
you
can
finally
see
the
light
at
the
very
top
as
well,
so
really
excited
to
see
what
we
can
do
with
the
product
going
forward
thanks
to
everyone
for
all
the
contributions
and
for
staying
up,
and
let's
keep
the
momentum
going.
There
were
a
lot
of
questions
even
offline,
we'll
take
them
on
slack.
B: Bye, see you guys. And before I hop off, one last thing: I want to give a huge shout-out to my colleague Gabe, who drove the first part of this redesign. So thanks to Gabe; if you're looking for someone to thank, don't thank me, thank Gabe for the redesign.