DataHub Community Talks, 25 Jun 2021

From YouTube: Business Glossary using DataHub at Saxo Bank: Jun 25 2021 Community Meeting

Description

Saxo Bank and ThoughtWorks present how they implemented Business Glossary using Data Ops and Data Mesh principles.

A

Thank you, Shirshanka. So, a quick context: Saxo Bank is in partnership with ThoughtWorks, with Madhu as a data strategist from ThoughtWorks. We were working together in an engagement, and we have contributed the business glossary work back to open source, working closely with Acryl and LinkedIn.

A

I head the data integration and data governance tech platform for Saxo Bank, and we are developing an in-house data management tool which is based on LinkedIn DataHub and Great Expectations. So, quickly, an update on what we have done.

A

So, what is a business glossary? It is a list of business terms which...

B

Do you want to share this?

A

Oh, I thought I had already shared. I'm sorry.

A

Okay, let me know if you can see my screen.

B

Yes, we can see it. Okay.

A

Okay, so a brief theoretical context on the business glossary. A business glossary is a list of business terms with their definitions, which lays down business concepts for an organization or an industry and is specific to a database or a data store.

A

It is a glossary of business terms that enables organizations to define a common vocabulary.

A

We are in the financial industry, so we are inspired by FIBO, and we'll often take examples from FIBO. But the glossary that we have created is like an aggregator, which can expand beyond FIBO to other ontologies specific to the industry, and which also provides the ability to create an organization-specific glossary, for us a Saxo-specific glossary. How does it help us? Once it is laid out, it helps us to identify the relationships between different terms.

A

This is an example from FIBO. Eventually we want to target graphical representations, but for now we'll stick to tabular ones on DataHub or other workbench products. I'm just laying down the pain point which led us to come up with this solution, and why we developed this. A couple of years back, when we started on this journey at Saxo Bank, I was conducting interviews across the organization to understand the pain points around data.

A

The common problem that came from different system owners and system SMEs was data quality issues and inconsistencies, where they were spending a lot of time solving tickets caused by data flowing across systems. A quick and very common example: I have data set A in system A, data set B in system B, and data set C in system C. Take a few data elements: in data set A of system A the field is account name, in data set B it is named differently, account number, although they are the same thing, and in system C it is account ID.

A

So the ETL that flows from data set A to B depends on the mapping sheet that has been created by systems A and B, and similarly the ETL to data set C depends on mapping sheets too. If the SME leaves, or some knowledge gets lost here and there, and another version of the mapping sheet is created, the ETL process breaks, validations fail, and account ID and account are no longer consistent, and this leads to a lot of issues. Now, how can we resolve it?

A

Can we remove the dependency on the mapping sheet and expand the maturity of the schema so that we can use a common, business-standard ontology? In this case, this is the FIBO account definition.

A

If you can point all of these, account name, account ID and account number across these systems, to that definition and expand the schema, then, with the term ingrained in the schema, the dependence on the mapping sheet goes away, the dependency on SMEs goes away, and the data flows across systems can be consistent and correct. Quickly, how have we enabled it? What we have done is this, if this looks familiar.
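
Editor's sketch of the idea above, not Saxo Bank's implementation: if each system annotates its own field with the same business term, a field-to-field mapping can be derived from the annotations instead of being maintained in a mapping sheet. The term identifier `fibo:Account` and the data set keys are hypothetical names chosen for illustration.

```python
from typing import Dict

# Hypothetical identifier for the shared term; the talk only says the
# fields point at the FIBO account definition.
ACCOUNT_TERM = "fibo:Account"

# Each system keeps its own field name but annotates it with the shared term.
SCHEMAS: Dict[str, Dict[str, str]] = {
    "system_a.dataset_a": {"account_name": ACCOUNT_TERM},
    "system_b.dataset_b": {"account_number": ACCOUNT_TERM},
    "system_c.dataset_c": {"account_id": ACCOUNT_TERM},
}


def derive_mapping(source: str, target: str) -> Dict[str, str]:
    """Map source fields to the target fields that carry the same business term."""
    target_by_term = {term: name for name, term in SCHEMAS[target].items()}
    return {
        name: target_by_term[term]
        for name, term in SCHEMAS[source].items()
        if term in target_by_term
    }


if __name__ == "__main__":
    print(derive_mapping("system_a.dataset_a", "system_b.dataset_b"))
```

The mapping is recomputed from the schemas each time, so renaming a field in one system only requires keeping its term annotation in place.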

A

This is the DataHub page, where we have added tags and terms, and these are the business terms which expand the metadata for the data elements. And, as we'll show later, this actually points to the FIBO URL.

C

Or whatever you have chosen.

A

It could be FIBO or anything else. Then, the design principles that we have stuck to are the DataOps principles, which are based on communication, collaboration, integration, automation and measurement. We believe that the business glossary can be evolved iteratively, staying agile and taking care of business needs in the digitization journey. We will also show at the end, if we get time, how we wanted to make sure that technology is involved right from the start, when a business function is introduced into the organization. Next, enhanced metadata.

A

So, apart from the data elements, we will now also have industry-standard ontologies defined at the metadata layer. Then schema maturity, because the business terms are now engraved in the schema. Schema versioning: any change to the metadata regarding the data elements, data types or business terms will cause the schema to be versioned. And it will enforce ownership on the producers, not only of the metadata but also of the business terms, their appropriateness and validity. Quickly, how we have actually realized the physical implementation, both for data sets and for business terms.
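
Editor's sketch of the versioning rule described above, under the assumption that a schema can be reduced to a list of field records with `name`, `type` and `term` keys (a hypothetical shape, not the speakers' format). A fingerprint over all three means a change to a business term bumps the version just as a data type change does.

```python
import hashlib
import json
from typing import Dict, List

Field = Dict[str, str]


def schema_fingerprint(fields: List[Field]) -> str:
    """Hash field names, data types and business terms in a stable order."""
    canonical = json.dumps(sorted(fields, key=lambda f: f["name"]), sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def next_version(current: int, old: List[Field], new: List[Field]) -> int:
    """Bump the version only when the fingerprint of the schema changes."""
    if schema_fingerprint(old) == schema_fingerprint(new):
        return current
    return current + 1


if __name__ == "__main__":
    v1 = [{"name": "account_number", "type": "string", "term": "fibo:Account"}]
    v2 = [{"name": "account_number", "type": "string", "term": "CustomerAccount"}]
    print(next_version(1, v1, v2))
```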

A

Our schema definition is in protobuf, so the messages which define business terms use options to define the type of the ontology source that we are using and its URL. So, with this, I'll quickly stop sharing and hand it over to Madhu for the next set of slides and the demo.

B

Is the screen visible?

B

Yes? Okay, thank you. Connecting back to where Sheetal talked about business terms, which define the business concepts and enable a common vocabulary within organizations.

B

So I wanted to talk about how we are trying to relate the data sets with the business terms, which can enhance the value of the elements and give better meaning to the data sets. I have taken a simple example with a purchase order, which has these elements: ID, revision number, status, employee ID, vendor ID and a number of other elements. We also have elements like the order line item.

B

Now, take vendor ID, which maps to supplier identifier. Maybe in another table we call it product supplier ID, and it also maps to supplier identifier. This actually enhances the value of this data set. The other by-product is that if you define certain business rules at the supplier identifier level, you can drive those business rules against these data sets. Other than the association with the data set, which enriches the value of the data sets, the business concepts or terms themselves are interrelated and hierarchical.

B

So you can compose business terms and create a new term altogether. Let's say we have purchase order date, value and ship date. These are composed to create the purchase order. That kind of relationship will help you to discover your data sets of interest and get to the right data set.

B

So with this I'll move on to the next step: how do we bring this business glossary into LinkedIn DataHub? I think the left side is the one which is very familiar to everybody, which is the data set, one of the first and foremost entities.

B

So you have these aspects: ownership, schema metadata and all those things. Now we are trying to bring in the business term, or business glossary, with these two entities. One is the glossary node, the other is the glossary term.

B

The glossary node is introduced to define the hierarchy of the ontology, so we can achieve something very similar to the FIBO kind of hierarchy. If I want to draw an analogy: the glossary node is a kind of package, and you can have a number of hierarchical levels, and the glossary term is a class definition which describes the business term. Then there is the glossary term info. This holds the definition of the glossary term and the source of the term, which can be internal to the organization.

B

It can also be borrowed from an external organization, and you can even have a link to the external source so that people can navigate to it. So this is the first thing: we onboarded these entities. Then we expanded the data set by adding a new aspect to it, so that a glossary term can be related to the data set. The data set has a schema metadata, which is an array of schema fields. We enhanced the schema field to associate the business term at the attribute level.

B

With this, you are able to attach the term to the data set and to the schema field, which helps the business user to navigate to the data set from the business concepts themselves. The one other thing whose design we are currently working on is that the terms themselves are related, which we have seen in the previous example as well. Given that, we are adding another relation, which is established between the terms.
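
Editor's sketch of the model described above, in plain Python dataclasses rather than DataHub's actual metadata schemas. The package and class analogy from the talk is kept: a glossary node gives the hierarchy, a glossary term carries an info record with its definition and source, and terms attach to a data set or to individual schema fields. All field names, the node names and the example URL are assumptions for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class GlossaryNode:
    """A level in the glossary hierarchy, comparable to a package."""
    name: str
    parent: Optional["GlossaryNode"] = None

    def path(self) -> str:
        return self.name if self.parent is None else f"{self.parent.path()}.{self.name}"


@dataclass
class GlossaryTermInfo:
    definition: str
    term_source: str                  # "INTERNAL" or "EXTERNAL"
    source_url: Optional[str] = None  # link to the external definition, if any


@dataclass
class GlossaryTerm:
    """A business term, comparable to a class definition inside a package."""
    name: str
    node: GlossaryNode
    info: GlossaryTermInfo

    def qualified_name(self) -> str:
        return f"{self.node.path()}.{self.name}"


@dataclass
class SchemaField:
    name: str
    glossary_terms: List[str] = field(default_factory=list)


@dataclass
class Dataset:
    name: str
    glossary_terms: List[str] = field(default_factory=list)
    schema_fields: List[SchemaField] = field(default_factory=list)


def datasets_for_term(datasets: List[Dataset], term_name: str) -> List[str]:
    """Navigate from a business term to data sets that use it at either level."""
    hits = []
    for ds in datasets:
        field_terms = {t for f in ds.schema_fields for t in f.glossary_terms}
        if term_name in ds.glossary_terms or term_name in field_terms:
            hits.append(ds.name)
    return hits


# Hypothetical hierarchy and placeholder URL, for illustration only.
accounts = GlossaryNode("Accounts", parent=GlossaryNode("ExampleDomain"))
customer_account = GlossaryTerm(
    "CustomerAccount",
    accounts,
    GlossaryTermInfo("An account held by a customer.", "EXTERNAL", "https://example.org/term"),
)
savings = Dataset(
    "savings_account_topic",
    glossary_terms=[customer_account.qualified_name()],
    schema_fields=[SchemaField("balance", ["ExampleDomain.Accounts.BalanceAmount"])],
)
```

For example, `datasets_for_term([savings], customer_account.qualified_name())` walks from the term to the data set, which is the navigation direction the speaker describes for business users.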

B

It can be "is a" relations and "has a" relationships. So with this I'll move to the Saxo implementation, so that you'll have better context on how we actually implemented it.

B

These are the simple templates we use. The first one is a data set definition. In this case it is a Kafka data set, a Kafka topic, that we are onboarding, so you can see the name of the topic, and there is a schema associated. This is one of the enforcements.

B

For it to be successful, schemas are mandatory, and there is ownership: business, technical and data steward. On the right side you can see the schema definition. Here the schema is SavingsAccount.proto; this again is a fictitious example.

B

If you see, there is a type name, and there are attributes like account number and balance. If you look at the balance, it is itself another type, which is a balance amount. You can also see that the savings account is linked to a customer account.

B

Here we're trying to define that the savings account is of the term or type customer account, so that you are able to relate things. For example, in an organization terms can evolve independently over time, and you then realize that these are common; you can relate them back. And proto gives a lot of flexibility, so that you can expand the definition or metadata of a schema. We are using options to do that, and there are other cases.

B

If you want to attribute an information classification, such as personal or confidential, you can do that with proto much more easily, and the same thing can be used to drive other business rules.
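
Editor's sketch of a business rule driven by classification annotations, as mentioned above. The annotations are modelled as a Python dictionary standing in for the protobuf options the speakers use; the `FieldAnnotation` type, the masking rule, and the choice to mark `account_number` as confidential are all hypothetical.

```python
from dataclasses import dataclass
from typing import Any, Dict, FrozenSet, Optional


@dataclass(frozen=True)
class FieldAnnotation:
    term: Optional[str] = None
    classification: Optional[str] = None  # e.g. "personal" or "confidential"


# Stand-in for the annotated fields of the fictitious savings account schema.
SAVINGS_ACCOUNT: Dict[str, FieldAnnotation] = {
    "account_number": FieldAnnotation(term="AccountNumber", classification="confidential"),
    "balance": FieldAnnotation(term="BalanceAmount"),
}

RESTRICTED: FrozenSet[str] = frozenset({"personal", "confidential"})


def mask_record(
    record: Dict[str, Any],
    annotations: Dict[str, FieldAnnotation],
    restricted: FrozenSet[str] = RESTRICTED,
) -> Dict[str, Any]:
    """Replace values of fields whose classification is restricted."""
    masked = {}
    for name, value in record.items():
        annotation = annotations.get(name, FieldAnnotation())
        masked[name] = "***" if annotation.classification in restricted else value
    return masked


if __name__ == "__main__":
    print(mask_record({"account_number": "12345", "balance": 100.0}, SAVINGS_ACCOUNT))
```

The rule reads only the annotations, so the same function applies to any schema that carries them.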

B

So the next thing is, I wanted to give a little overview of how the metadata is onboarded. Saxo has adopted the data mesh approach for the new data platform, where domain teams are responsible for building the data products and also for annotating their metadata.

B

So the response was to come up with self-service capabilities where users can declaratively define the data set, and we have a GitHub process which takes this and creates the topics in Kafka, registers the schema, extracts the metadata templates, parses the files and pushes them into LinkedIn DataHub by converting them into a snapshot, which is required by the MCE schemas. With that I'll quickly move to the demo in the interest of time.
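
Editor's sketch of the push step described above: parse a declarative data set template, enforce that a schema is present, convert it into a snapshot-like record and hand it to an emitter. The template keys, the simplified snapshot shape and the `emit` callback are assumptions; this is not the real MCE schema or a DataHub client API. Only the data set URN layout follows DataHub's convention.

```python
import json
from typing import Any, Callable, Dict

# Hypothetical template; JSON is used here only to keep the sketch dependency-free.
TEMPLATE = """
{
  "name": "savings-account",
  "platform": "kafka",
  "schema": "SavingsAccount.proto",
  "owners": {"business": "owner-a", "technical": "owner-b", "dataSteward": "owner-c"}
}
"""


def parse_template(text: str) -> Dict[str, Any]:
    """Load a template and enforce that a schema is declared."""
    template = json.loads(text)
    if not template.get("schema"):
        raise ValueError("a schema is mandatory for onboarding")
    return template


def to_snapshot(template: Dict[str, Any]) -> Dict[str, Any]:
    """Convert the template into a simplified, snapshot-like record."""
    urn = "urn:li:dataset:(urn:li:dataPlatform:{platform},{name},PROD)".format(
        platform=template.get("platform", "kafka"), name=template["name"]
    )
    return {
        "urn": urn,
        "aspects": [
            {"ownership": template.get("owners", {})},
            {"schemaMetadata": {"schemaName": template["schema"]}},
        ],
    }


def publish(text: str, emit: Callable[[Dict[str, Any]], None]) -> Dict[str, Any]:
    """Parse, convert and push one template through the given emitter."""
    snapshot = to_snapshot(parse_template(text))
    emit(snapshot)
    return snapshot


if __name__ == "__main__":
    publish(TEMPLATE, lambda snapshot: print(json.dumps(snapshot, indent=2)))
```

Passing the emitter in as a function keeps the conversion testable without a running metadata service.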

B

Here you can see the Saxo tool; we call it the Data Workbench. It is a one-stop shop for data.

B

So you see the home page of the Data Workbench. Let me take you to the business glossary. Here we have a domain hierarchy: party domain, market domain, trading and common. Let me take a simple example here. We can see customer account, which we saw in the earlier example. I can either search for it or navigate to it directly here.

B

If I go to the business term, I can see the definition of the term and what the source is, and I can navigate to the external source. It points to FIBO here; it can be other things. And you can see the related data sets and additional properties. Under related data sets, you have two data sets. You can navigate to one of the data sets and see it. This is very familiar to everybody: it is the data set home page, where you have the information.

B

Now you can see that this is mapped to customer account as a business term, with an "is a" relationship. You can say this is a customer account and has the terms UID and balance amount, and one can navigate to these definitions and get into a lot more detail.

B

So with that I will hand back over to Sheetal.

A

Shirshanka, I just wanted to check: do we have time, or do you want to...

C

I think I would like to let John and Gabe also go through, so maybe...

A

Yeah yeah yeah.

C

Cool, thanks. Thanks, Sheetal and Madhu. I think we've heard multiple times from the community that a lot of folks are implementing similar practices, having their schemas checked into source code along with metadata annotations on top.

C

We did the same thing even at LinkedIn, but it's really nice to see that a lot of companies are doing something similar, and it's great to see some of these recipes emerging for how to connect schemas in Git, along with metadata in Git, with this kind of push-based architecture, to get metadata out and integrate it into a common base. So I highly encourage reaching out to them.

C

We are probably going to have similar support in the open source code base for having protobuf schemas and applying annotations on them. So talk to them about how they've done it, and try to implement similar practices at your organization. I think it's definitely a game changer.