Description
Iker Martinez de Apellaniz from Adevinta shares their metadata use cases and DataHub adoption journey.
Perfect. So I'll be showing the corporate slides, but since it's Friday afternoon for me, this is a bit of a fun moment, and after this I'll probably have a beer, not a coffee. Sorry for that. The rest of you on the west coast will need to wait a couple of hours for that to happen.
So first of all, my name is Iker, pronounced like the "e" in email, not like the "i" in iPhone. I'm a father of twins. I was a data engineer in the past, then I was an enabler, and now I'm a product owner at Adevinta. So now I don't know how to code or how to change the time in Docker anymore. Sorry, I've forgotten all of that.
But what is Adevinta anyway? Adevinta is a marketplace specialist. We have many marketplaces around the globe, different verticals, different tenants, and we try to create perfect matches on the world's most trusted marketplaces. That's the fancy tagline we use at Adevinta. That's what we try to do: if you need a car, go to one of Adevinta's marketplaces and you will find your car, your house, your job, a new pair of sneakers, whatever you want. Many, many brands around the world, many, many marketplaces, different teams, different offices. And since this summer, actually, we bought eBay Classifieds Group.
B
Usually
you
have
a
new
t-shirt,
but
it's
from
the
other,
even
the
classifieds
group,
which
means
we
have
even
more
and
more
marketplaces
now
more
and
more
teams,
and
you
will
see
later
how
this
is
a
challenging
scenario
for
a
data
catalogue.
If
you
are
not
the
exiting
already,
if
we
look
at
the
warm
up,
this
is
more
of
like
this.
I'm
I'm
as
as
you
mentioned,
I'm
in
barcelona
right
right
now,
so
just
in
the
middle.
This is more or less our product. We call it the Data Highway, and it's basically a composite of many, many managed Kafkas that we run for our clients, our tenants, which are all these marketplaces plus some central operations groups. We also have datasets in data lakes, and something we call FMDQ, which stands for filter, map, dispatch and quality. That is how we move data around: from one Kafka to another, from a Kafka to the data lake, or from a Kafka to anywhere you need it, actually.
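To make the filter-map-dispatch idea concrete, here is a minimal sketch of such a hop between two Kafkas. It is my own illustration with invented broker addresses, topic names and a toy filter, not Adevinta's actual FMDQ code:

```python
# Toy filter-map-dispatch hop between two Kafka clusters.
# Brokers, topics and the filter predicate are all placeholders.
from confluent_kafka import Consumer, Producer

consumer = Consumer({
    "bootstrap.servers": "tenant-kafka:9092",
    "group.id": "fmdq-hop",
    "auto.offset.reset": "earliest",
})
producer = Producer({"bootstrap.servers": "central-kafka:9092"})

consumer.subscribe(["marketplace.events"])
while True:
    msg = consumer.poll(1.0)
    if msg is None or msg.error():
        continue
    payload = msg.value()
    if b'"event_type": "ad_view"' not in payload:  # filter
        continue
    # a real map step would reshape or clean the record here
    producer.produce("central.ad_views", payload)  # dispatch
    producer.poll(0)  # serve delivery callbacks
```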
Then we have an inventory of assets that we need in order to keep all of this running, and then we have something that is called DataHub. But this is not your DataHub. This was before DataHub was public; we were already using that name. We decided on it because it was a cool one, and suddenly someone decided to copy it. So, because it's already confusing inside Adevinta, let's call it the governance UI for the purpose of this chat. Okay? All right.
With the governance UI we self-manage the authentication and the authorization of datasets. We cannot control the authorization for all this data owned by the different marketplaces ourselves; the marketplaces need to grant access to it, because we are like five or ten people in the team and there are many, many datasets, as you will see. So we have made this self-serve, custom-made, and for people to control this we need a list of datasets. We also need to comply with regulations: if anyone is in Europe, you will know GDPR.
The problem is that it has users, and these users are quite demanding. They want lineage, they want documentation, they want a glossary, they want dashboards, they want full-text search and all this fancy stuff. "I want to have scores on the dataset, and I want communities saying this is a good dataset, this is a bad dataset." Are you serious? That's like my kids coming here and asking me to play with them: I don't have time for this. Actually I do, it's part of my job, but it's a little bit overwhelming. So we said: cool, yes. But the thing is, we have a lot of data and a lot of data flows moving around, so an inventory alone is not good enough anymore. We need to change, and there is a tool for this, which is called a data catalog. So, which tools are out there?
We need to build a global data catalog and we need to do it data-mesh style. I don't know if you are doing data mesh, but no one knows how to do data mesh, yet everyone talks about it, right? So now suddenly we have Subito, the Spanish marketplaces, Belarus, Austria, all of them coming, and probably soon we will also have the eBay parts in Canada and South Africa coming to say: hey, take my data, I want in.
There are three parts to this. On one side we have the ingestion path, which is how we put the data, or rather the metadata and lineage, inside the tool. The second one is how we manage the infrastructure so it scales, so it grows, so it stays stable, and so it can be accessed from different places. And the third part is the API and integrations part, which is the acting part, so to speak.
We did three PoCs with different teams on different pieces, and in my team we did the one with LinkedIn DataHub for our product. We said: okay, let's start with some research on which alternatives are on the market. We looked at a couple of them. We already knew Atlas, and we already knew our own tool.
We had some people looking at third parties, and we were already quite interested, quite biased I have to say, towards LinkedIn DataHub. We already liked it from the media, from the blog posts and so on. So we said: okay, let's give it a try. And we found in June that it was really easy just to display data using the off-the-shelf connectors, getting Redshift data in, getting Athena data in.
We have the infrastructure production-ready, with a little bit of debt, but production-ready, and with an MVP of the UI. I will explain later what the problem with the UI is. So now, what we will do until the end of the year is try to get more data origins, as we call them, with external teams, that is, set up more connections from different marketplaces.
Some more BigQuery, more Redshifts, more Athenas, more Glues, maybe a Snowflake, maybe something else, and do some user research: do you like this? Do you find it useful? Is it easy to set up the connectors from the tenant perspective? The tenants don't own the infrastructure; they just need to send the data and take the value out on the other side of the pipeline.
On the ingestion path, we very quickly put in the off-the-shelf Redshift connector, thanks to the team which gave us the credentials for it, because we don't run Redshift ourselves. We could also add Athena; we use the Glue connector for Athena because it's more generic. But if anyone has a different opinion, please contact me on Slack and tell me your reasoning, because I think Glue is better, but maybe you think differently.
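As a flavour of how light these off-the-shelf connectors are, a Glue ingestion run with DataHub's Python pipeline API looks roughly like this. The region, the choice of the REST sink and its address are my placeholders, not Adevinta's setup:

```python
# Sketch of a programmatic DataHub ingestion run for AWS Glue.
# All config values are placeholders.
from datahub.ingestion.run.pipeline import Pipeline

pipeline = Pipeline.create({
    "source": {
        "type": "glue",
        "config": {"aws_region": "eu-west-1"},
    },
    "sink": {
        "type": "datahub-rest",
        "config": {"server": "http://datahub-gms:8080"},
    },
})
pipeline.run()
pipeline.raise_from_status()  # fail loudly if the run had errors
```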
For Kafka, there are some collisions that we found in the browse path. And what we have decided already is that we will replace the Atlas solution regardless of the result of the PoC, which is still running; we still need to validate with users. We will swap Atlas for LinkedIn DataHub, because it's, again, easier to maintain on the infrastructure side.
We tested Okta, because we need the list of users. And for what comes next, we are already talking with people about more origins, maybe Hive, maybe Snowflake, maybe BigQuery, to see who is in for testing. So we depend a little bit on our colleagues there, and things are not quite clicking yet, so yeah. And then the custom connector: again, as I said, we have an inventory of datasets that we need in order to maintain access and authentication, so we need to implement that and show it in the catalog.
This is key for us, but because it's a custom solution, we need a custom connector. And the same happens with the filter, map, dispatch and quality engine: because it's custom-built, it needs a custom connector, so we'll write that connector for it ourselves, no worries.
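For what it's worth, such a custom connector can stay very small if it leans on DataHub's Python emitter. Here is a minimal sketch; the inventory contents, the platform and every name are hypothetical:

```python
# Push entries from an internal inventory into DataHub via the REST
# emitter. The inventory list and all names here are made up.
from datahub.emitter.mce_builder import make_dataset_urn
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import DatasetPropertiesClass

emitter = DatahubRestEmitter(gms_server="http://datahub-gms:8080")

inventory = [  # in reality this would come from the inventory service
    {"name": "marketplace.ads.daily", "description": "Daily ads snapshot"},
]
for entry in inventory:
    emitter.emit(
        MetadataChangeProposalWrapper(
            entityUrn=make_dataset_urn("s3", entry["name"], env="PROD"),
            aspect=DatasetPropertiesClass(description=entry["description"]),
        )
    )
```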
On the serving part: okay, we deployed this on Kubernetes, and again kudos to the common platform team. We have the connectors orchestrated with the Helm charts, if I'm not wrong, and we are finalizing the monitoring and the alerting of all these things.
And you are also improving this, as you mentioned. The good thing that we found, that we like, is the metadata ingestion on top of Kafka, because it gives us the possibility of going back to the past: okay, reset my consumer offsets and replay all this data. And if the ingestion part is down for a while, it can catch up later. So that's our architectural pattern.
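That replay property is plain Kafka consumer-group mechanics. A sketch of rewinding a consumer to the earliest offsets with confluent_kafka follows; the group id is a placeholder and I'm assuming DataHub's default MCE topic name:

```python
# Rewind a consumer group to the beginning so previously published
# metadata events get consumed again. Names are assumptions.
from confluent_kafka import Consumer, OFFSET_BEGINNING

def rewind(consumer, partitions):
    # on_assign callback: point every assigned partition at the start
    for p in partitions:
        p.offset = OFFSET_BEGINNING
    consumer.assign(partitions)

consumer = Consumer({
    "bootstrap.servers": "kafka:9092",
    "group.id": "metadata-replay",
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["MetadataChangeEvent_v4"], on_assign=rewind)
while True:
    msg = consumer.poll(1.0)
    if msg is not None and not msg.error():
        print(msg.topic(), msg.partition(), msg.offset())
```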
We had tried this in the past and it pleasantly surprised us. So now the challenge will be to define how to make this multi-tenant, so that one tenant doesn't break the other tenants' metadata. That's a little bit of an interesting challenge to have.
the
other
one
is
the
the
mvp
the
ui.
So
I
explained
with
an
mvp
of
the
ui,
which
means
like
the
research
that
we
have
done
internally
in
the
company.
It
says
we
already
have
too
many
ui's.
We have the data catalog in a different place from the governance tool, from the machine learning platform, from the experimentation platform. It's already a big mess, so we are trying to consolidate all these tools into a more centralized UI, and we shouldn't make the problem worse by now adding the DataHub one. And this is a challenge, and a pity, because the challenge is how to build something internally that is as good as the DataHub UI without rewriting all the components. So this is an interesting challenge.
In the governance UI we can edit the metadata and it is stored in the source of truth. And there are a couple of things there, like the custom dashboards with statistics on which tenants are sending data to each dataset, that we cannot replace, and that's why we are not using the DataHub UI for the moment, or at least not fully. But we might use parts of it.
So we will investigate how to do this with our own things: with our own Kafka, with our own S3 datasets, with the PII information we manage. For the rest of the tenants, that will need to wait at least until next year, probably a little bit more, because we are simply not there yet; we are lacking the governance and the agreements to do so. I'll go very quickly now, because I'm not keeping track of the time. Don't look at me.
Some findings so far, from these three or four months we have actually been working on this. Kudos to the community: there were a couple of names on that slide of contributors that you showed, and we are super happy to be able to contribute there. Isn't it 1,302 people in the Slack? They are probably more by now.
The atmosphere in the Slack is super good and the responsiveness is great. There are a couple of hours of delay, but that's normal, because, you know, you are sleeping while we are working. So that's okay. And the development speed: new features arrive super fast, and bugs are fixed very quickly. Super.
The other finding is the architecture, which matches what we want, so that's okay, it simply fits. And then the main problem here, which we already knew about, is independent from the tool. With our inventory, with LinkedIn DataHub, with a custom solution or a commercial solution, the problem is the same: how do you fix multi-tenancy? How do you do it the data mesh way? What do you do when there is no metadata and no governance, and what do you do about data quality? So I'd describe it like this: we can build the best tool ever...