From YouTube: Managing Data Governance via Protobuf + DataHub
Description
Graham Stirling, Head of Data Platforms at Saxo Bank, shares how he and his team are managing Data Governance via Protobuf during Metadata Day 2022.
Learn more about DataHub: https://datahubproject.io
Join us on Slack: http://slack.datahubproject.io
Follow us on Twitter: https://twitter.com/datahubproject
Awesome, okay! Well, it's great to be here, so good morning, good afternoon, good evening. My name is Graham Stirling, from Saxo Bank, here in sunny but still somewhat chilly Denmark. I'm going to be talking about how DataHub is playing a key role in Saxo's data revolution.
If I can advance the slide... and here we are. Specifically, I'll cover how we're trying to break down barriers and make our data products accessible to a wide range of people across the organization. From our perspective, and as far as this talk is concerned, data products are described using Protocol Buffers and published on Kafka. If this is a data revolution, it's about bringing protobuf to the people, comrade. So let's see how we get on. At Saxo, we placed a bet on data mesh back in 2019.
We knew at the time that our target architecture had to be federated but governed, and I guess governed with a small g. We wanted to break down silos and empower domain teams to be masters of their own destiny, whilst at the same time uplifting our technical capability. Central to this idea, of course, is thinking about data as a product designed for usage, both within a domain and beyond it.
So domains host and serve their data to the organization using self-service data infrastructure that my team is ultimately responsible for. Now, beyond enabling teams to get on with their use of Kafka, the goal of the platform is to reduce the cognitive load on our development community, whilst also raising the data management bar.
It's a difficult balance to strike, and we certainly haven't got it right all the time, but in doing so we place a lot of emphasis on shifting left on data governance: treating governance as a platform requirement, a non-functional set of requirements, rather than an afterthought. I did notice there was a great paper on the LinkedIn engineering blog the other day on this very same subject. For us, shifting left certainly means thinking about data governance as code.
For example, we annotate schemas with their information classification at source, rather than as an afterthought. And of course, for this data mesh idea to be worth it, we need to see the utility of the data amplified; at the end of the day, we need more consumers and producers, otherwise what's the point, quite frankly? Key to this, of course, is metadata.
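To make governance-as-code concrete, here is a minimal sketch of what a field-level information classification annotation might look like. The meta package, the option name, and the field numbers are hypothetical illustrations, not Saxo's actual definitions.

    // meta/options.proto (hypothetical): a custom field option so that
    // every field carries its information classification at source.
    syntax = "proto3";

    package meta;

    import "google/protobuf/descriptor.proto";

    enum Classification {
      CLASSIFICATION_UNSPECIFIED = 0;
      PUBLIC = 1;
      INTERNAL = 2;
      CONFIDENTIAL = 3;
      RESTRICTED = 4; // e.g. personally identifiable information
    }

    extend google.protobuf.FieldOptions {
      // 50000-99999 is the extension number range reserved for
      // in-house options within an organization.
      Classification classification = 50001;
    }

    // Example usage in a domain schema (hypothetical fields):
    //
    //   string account_owner = 1 [(meta.classification) = RESTRICTED];
    //   double balance       = 2 [(meta.classification) = CONFIDENTIAL];

A schema whose fields are missing a classification could then be rejected at build time, which is one practical reading of treating governance as a platform requirement rather than an afterthought.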
We see metadata as the glue that ties these different data domains together, the thing that ensures the sum of the parts is greater than the whole. And schemas are, of course, a key part of our metadata story. For example, through the use of strong typing, we believe it'll be easier for engineers to find the data they need and to use it safely, without introducing tight coupling between services or teams.
So within the trading domain we have a topic, essentially a log of positions, which is represented by this OpenPosition schema here. Then within the instrument domain (financial instruments, obviously), we have a compacted topic carrying a bunch of commonly used instrument attributes, as represented by the InstrumentBase schema. Now, both of these schemas make reference to this instrument identifier, and we think of these identifiers very much as the join keys of our data mesh.
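As an illustration, the shape of those schemas might be something like the following. The packages, field names, and field numbers are hypothetical reconstructions of what was on the slide, not the actual Saxo definitions.

    // instrument/v1/identifiers.proto: the shared reference type,
    // i.e. the join key of the mesh.
    syntax = "proto3";
    package instrument.v1;

    message InstrumentIdentifier {
      string code       = 1; // hypothetical instrument code
      string asset_type = 2;
    }

    // trading/v1/open_position.proto
    syntax = "proto3";
    package trading.v1;

    import "instrument/v1/identifiers.proto";

    message OpenPosition {
      instrument.v1.InstrumentIdentifier instrument = 1; // join key
      double quantity   = 2;
      string account_id = 3;
    }

    // instrument/v1/instrument_base.proto
    syntax = "proto3";
    package instrument.v1;

    import "instrument/v1/identifiers.proto";

    message InstrumentBase {
      InstrumentIdentifier id = 1; // the same join key
      string symbol           = 2;
      string description      = 3;
    }

Because both messages embed the same strongly typed identifier, a consumer can see at the schema level that OpenPosition and InstrumentBase join on InstrumentIdentifier.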
It makes it very clear to potential consumers how they might go about joining the two data sets together to create their own data products. That's certainly the idea, and that's all great. But navigating what's turning into thousands of schema fragments and git repositories isn't particularly accessible, and this is where the fantastic DataHub project comes into play from our perspective: a discovery platform that can surface this rich metadata and make it accessible to a much wider audience.
There are, of course, Kafka-specific solutions which address this; I think we're going to see something from Confluent later on in the lightning talks. But from our perspective, Kafka doesn't live in a bubble. We need complete visibility of upstream and downstream dependencies.
We want to get to a situation whereby we can ask questions such as "where's my PII, my personally identifiable information?" and get an answer in seconds rather than weeks, which is realistically where that question stands just now.
This OpenPosition is then mapped, in DataHub terminology, to a dataset, and these schema fragments, these reference types, are mapped to glossary terms. So we have a glossary term, in this particular case, for the instrument identifier, and similarly we also create one for the open position, on the assumption, of course, that the same schema could be used by multiple topics. So again, nothing particularly earth-shattering.
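Staying with the shift-left theme, one could even imagine that mapping from schema fragment to glossary term being declared in the proto itself via a message-level custom option. This is purely a hypothetical sketch, not an annotation DataHub or Saxo actually defines:

    // Added to meta/options.proto (hypothetical):
    extend google.protobuf.MessageOptions {
      // Name of the glossary term this message surfaces as.
      string glossary_term = 50002;
    }

    // instrument/v1/identifiers.proto, revisited:
    message InstrumentIdentifier {
      option (meta.glossary_term) = "Instrument Identifier";
      string code = 1; // as before
    }

An ingestion pipeline could then read the option off the message descriptor and create or update the corresponding glossary term, keeping the proto as the single place where the mapping is maintained.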
So, putting this all into context: we have schemas authored by some lucky individual in the role of domain data modeler. For our complex domains, trading for example, that might be someone who spends a lot of their time in that particular space, and sometimes the role sits in our analytical, consumer-aligned domains.
Now, over in DataHub land, we can see our dataset nicely represented, and similarly the complex types manifest as glossary terms which the user can drill into, so we maintain this relationship.
We can see which datasets are using which terms, without losing the fidelity of the underlying data contracts. As I say, it's all about trying to bring the power of proto to the people.

Okay, so with that foundation in place, I really just wanted to talk about future directions, some ideas in terms of how we'd like to take this forward. There's absolutely nothing set in stone here; from our perspective this is much more of a Q3 deliverable, so we're very much in the ideation phase, and super keen to get any feedback from the community. But certainly, as we get more eyes on the metadata, the descriptions originally provided by developers, quite often under duress when a pipeline fails because there's no documentation, are quite often no longer fit for purpose. That MVP got us off the ground, but as good corporate citizens we of course want to continuously improve. We want to be curating that metadata, and that's a process that is ongoing and long-lived.
So it's not unreasonable that our data product owner wants the ability to tweak this documentation without having to edit the files himself or raise a ticket for the dev team to address. You can just imagine what that workflow might look like: the data product owner raises a ticket, a developer picks it up, creates a branch, makes a change, raises a PR. It's a lot of effort to fix a typo, and of course it's friction in the process, which means that ultimately it won't get addressed.
So thankfully, DataHub already has an edit-description capability on the dataset. In this particular case, the product owner might be thinking: well, we can come up with a better description. Is it the currency? Is it the account currency? Is it the currency of the exchange? Whatever it might be, I guess there's room for improvement, and we have a long-overdue contribution to the glossary term to reflect the same. So what's the problem? The problem, of course, is the proto.
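To make the tension concrete: that description originates as a comment in the .proto source, something like this hypothetical fragment.

    // trading/v1/open_position.proto (hypothetical fragment)
    syntax = "proto3";
    package trading.v1;

    message OpenPosition {
      // Currency. (But which one: the account currency, or the
      // currency of the exchange? This comment is what surfaces as
      // the field description, so improving the wording means
      // editing this file and publishing a new schema version.)
      string currency = 1;
    }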
What about the proto? We started out with the code as the source of truth, pushed to DataHub on change, and if we update a description here through the UI, well, of course the two will quickly diverge, and we all know that's not going to end particularly well. So, to be consistent with our shift-left methodology, we want the proto to be the source of truth, not just for the technical schema but for the supporting metadata as well. So how might we go about that? What's the way forward?
Well, one approach that we had been thinking about was tapping into the metadata audit events generated under the hood and automatically raising a PR to update the proto. That way we've got a full audit trail, so we ultimately know why a description was changed and by whom. Having said that, I'm just coming up to speed now with the new Actions Framework, so perhaps that's a better solution.
Again, I'd love to get some feedback from anyone else who's tackling a similar set of problems. But at the end of the day, the goal here is to empower both those with a technical leaning and those much more concerned with curating their data products. Each of these personas, these individuals, brings a different perspective to the table, which we should celebrate and embrace rather than push towards different tools. I'm sure you've been in situations before whereby the business essentially has one data catalog, which sits in a conceptual cloud of its own, completely different from what's ultimately happening on the ground. We don't want to go down that particular path.
Another request that we're starting to see more often, as the complexity of our schemas increases, is the need to visualize the relationships between data sets and their constituent parts. Again, think of the community of data architects, who are, let's say, more familiar with visual representations: UML diagrams, ER diagrams, whatever it might be. They still have a role to play in this new world.
We just need to remove the friction and give them access to data exploration tools similar to the ones they might be accustomed to. Of course, under the hood we have all this information in DataHub's graph database; we just need to surface it in a user-friendly way. So that's certainly another challenge for us, and again, the goal here is to empower both those with a leaning towards the code and those used to navigating relationships, perhaps using a visual modeling tool.
We certainly don't want to be in a situation where we're modeling schemas in UML and generating code off the back of that; we want to keep this code-first approach. And again, I'd love to hear from any folks in the community who are already thinking along the same lines. But that's it for me. This was, of course, a very quick lightning talk, and I'm super happy to take any questions. I hope you found it in some shape or form useful.