From YouTube: Apr 23 2021: DataHub Community Meeting (Full)
Description
Welcome: 00:00
Project Updates by Shirshanka : 03:03
- 0.7.1 Release and callouts (dbt by Gary Lucas)
Use-Case: DataHub at DefinedCrowd by Pedro Silva : 11:20
Deep Dive + Demo: Lineage! Airflow, Superset integration by Harshal Sheth and Gabe Lyons : 26:08
Use-Case: DataHub Hackathon at Depop by John Cragg : 38:58
Observability Feedback share out : 52:41
Product Analytics design sprint announcement by Maggie Hayes : 56:41
B: Occasionally we have half days every fortnight on Fridays, for sort of lockdown mental health kind of things, so people can chill out a little bit. It's quite good.
C: That is nice, actually. At Envision they tell us to leave at lunch every Friday, every week, and about a quarter of the people do, but I think it's good; a lot of people go and actually have family time.
D: Perfect, this is gonna be a super tight town hall. I have been warning all the speakers that I'm going to be keeping us all on time, because we have a ton of stuff to go through. So welcome, everyone. This is the April town hall; I mean, we're doing it after like five weeks, because I wanted to align it to the fourth Friday of the month going forward.
D: I think we added like another 20-25 in the last month, so that's amazing, and I think engagement has been off the charts. Really enjoying the conversations, whether it's small stuff, like someone's Docker image not really working for them, or the big stuff, like tags and taxonomies and ontologies and how do we keep it all together. So keep it coming, love the energy. Awesome. So, agenda today: quick project updates, and we're gonna have Gary talk about the dbt integration.
D
Pedro
is
talking
about
data
hub
at
define
crowd
if
harshal
and
gabe
are
awake,
they're
gonna
do
a
deep
dive
and
a
demo
on
lineage.
They
were
working
pretty
late
last
night,
getting
it
all
together
and
then
john
is
gonna.
Tell
us
about
the
hackathon
that
they
did
with
data
hub
at
depop
and
last
time
we
did
an
observability
a
share
out
with
mocks
and
got
some
community
feedback.
D
I'll
kind
of
share
out
what
happened
as
a
result
and
maggie
has
some
announcements
around
a
design
sprint
that
she's
running
so
super
exciting
awesome.
So, around midnight I cut the release, so we have a brand new release: 0.7.1. It's been like five weeks; we're trying to go to a monthly cadence, so pretty much every month you should expect an official release coming out. We had almost 140 commits in the last five weeks, I think, so our commit rate is actually going up, which is awesome to see.

I also pulled the number of committers and the diversity, and that's looking really nice. We actually had 12 different companies contributing to the project over the last five weeks, so that's great. In terms of highlights, I was trying to bucketize where all the features and contributions are coming in. Obviously product improvements is a big one, then there's operator tools, and then integrations. So, product improvements, quick highlights: we've got column-level matching going on.
If you have a description on a column and you want to just edit it, we support editing it, and we actually keep it separate from the primary schema description, so you can always have that as well. There are some discussions on how we're going to improve the UX to make it even nicer for conflict resolution. Nested schemas, for people who run Avro, Protobuf and that kind of stuff; you see this all the time.

So everyone knows the datahub ingest command by now, and we're starting to add other verbs to it, like check, so you can start checking the health of your cluster or the health of your local Docker installation using the DataHub CLI, plus a lot more metadata ops. We have a lot of ideas, and if people have ideas, definitely send them over; it's a fantastic place to hack on. What are the amazing things you want to do on the command line with DataHub? We can keep expanding that. Big news:
We finally mainlined our Helm charts, so Kubernetes Helm is production ready; we're using it for our AWS deployments, so we can support you on that. A bunch of work was done by the community in adding monitoring, I think DefinedCrowd worked on that. LinkedIn actually contributed some Neo4j writer improvements, because when they were doing backfills and stuff they found a bunch of bottlenecks in the Neo4j integration, so that Neo4j integration should be going faster now. And I think Klarna worked on adding SSL everywhere.
So you can lock things down, you know, and then start hacking on DataHub. Integrations include Airflow, dbt, Superset, Druid, Snowflake (better Snowflake: we had Snowflake last time, but we improved it this time), Glue, thanks to the Depop team for that, plus MongoDB and Oracle. Looker was actually contributed by SpotHero; it's still in the contrib folder, because we want to go back and improve it and make it official using the new ingestion framework. So lots of new integrations this time around, and we're looking to continuously add more logos.
D: Awesome. So I'll hand it over to Gary who, in a hackathon, I think at Envision, kind of popped up and said, hey, I'd like to work on dbt, and we're like, yeah.
C: Hi, my name is Gary. I'm a staff data engineer at Envision, and as Shirshanka mentioned, this is the result of a hackathon. When I was done, I decided that I wanted to contribute this work to DataHub. So, quick overview: dbt is a directed acyclic graph for SQL. It lets analysts and scientists create workflows in SQL. It's really easy to use, it's easy to learn, and it's really just SQL.

What I wanted to do is import the dbt-generated graph for lineage information, as well as some dbt-specific metadata, such as the model name and the hierarchy within the dbt models folder. I often find it's a challenge to map what I see in SQL versus what is actually executing in the dbt folder (some people can do it in their head; I find that to be a challenge), as well as the model type: whether it's a source, a model, a view, or ephemeral. Future iterations may involve pulling in model documentation, additional tags, and anything else the community finds helpful.
C: Well, I think the straightforward answer is, you know, if you're using dbt docs generate, that's a great tool if all of your assets are computed in dbt and all of those assets feed into dbt in some way; it works fine. But most organizations don't actually have that situation, so there are going to be things that exist outside of that. dbt docs shows the universe for dbt; DataHub is intended to solve the greater ecosystem. Does that make sense?
D: Yeah, cool. So this is what it looks like once you get it all integrated. I think Gary even checked in one of those files in the repo, right? Yep, cool. And this is yours again.
C: Yeah. So, in terms of how you use this integration: you run dbt in your regular pipeline, it executes and creates the SQL assets, or rather the data assets, in your data store. The output of that is going to be a manifest file, which lands in the target folder of your dbt project, and then, to pull in schema data and some other metadata, you can run dbt docs generate, which generates this catalog.json file.
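For readers who want to try this, here is a minimal sketch of wiring those two files into DataHub, written against the Python ingestion API rather than a YAML recipe. The Pipeline helper and the dbt source config keys (manifest_path, catalog_path, target_platform) reflect the ingestion framework of this era, and the file paths, platform and server address are placeholders, so treat it as illustrative rather than authoritative.

```python
# Sketch: ingest dbt's manifest.json and catalog.json into DataHub.
# Assumes the acryl-datahub Python ingestion API; paths and values are placeholders.
from datahub.ingestion.run.pipeline import Pipeline

pipeline = Pipeline.create(
    {
        "source": {
            "type": "dbt",
            "config": {
                # Produced by `dbt run` and `dbt docs generate` in the target/ folder
                "manifest_path": "./target/manifest.json",
                "catalog_path": "./target/catalog.json",
                # Platform the dbt models materialize into (e.g. snowflake)
                "target_platform": "snowflake",
            },
        },
        "sink": {
            "type": "datahub-rest",
            "config": {"server": "http://localhost:8080"},
        },
    }
)
pipeline.run()  # a real job would also inspect the pipeline's status/report
```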
D: Cool. And just this morning I think someone on the community channel was asking about how dbt connects to Snowflake: you know, doing Snowflake ingestion and dbt ingestion, and the IDs are not quite lining up. So I think there's some very interesting work left to do in stitching together the dbt graph with the rest of the catalogs, but I think it should be pretty easy to do, and then I think we can get that whole end-to-end lineage flow that we all dream about.
D: Awesome, thanks Gary. All right, the next presenter is Pedro, and he's gonna talk about DataHub at DefinedCrowd. Pedro, you wanna...
E: All right, perfect. So, first of all, thank you so much, Shirshanka, for giving me the opportunity to present the work that we have been doing at DefinedCrowd with DataHub. For everyone else, my name is Pedro Silva; I'm a data engineer at DefinedCrowd. So, first of all, let me very briefly explain what the company is.
E
Essentially,
we
are
a
marketplace
for
ai
data
and
our
objective
is
to
make
your
ai
use
cases
smarter
and
the
way
that
we
do,
that
is
by
crowdsourcing
data
assets
specific
to
your
use
case
into
your
industry
as
a
service
with
certain
quality
guarantees.
E
These
can
be
things
like
speech
to
text,
translations,
audio
or
even
image
recognition
type
data
assets
regarding
the
company
we
were
founded
in
2015,
we
have
over
300
employees
and
through
our
series
and
series
b
funding
we
have
over
63
million
dollars
already
raised
important.
To
mention
in
this
talk
is
that
one
of
the
ways
or
the
way
in
which
we
crowdsource
data
sets
is
through
our
nevo
platform.
E
This
is
essentially
a
mobile
application
for
android
or
iphone
where
regular
users
can
download
the
app
and
earn
a
little
bit
of
money
by
performing
certain
tasks.
E: Regarding the architecture itself, what we have is Neevo generating certain metadata events. So, let's suppose that a certain unit of work has been assigned to a user and he has been working on it, so you have progress over time.
E: We transform our Kafka topics into views that are more consumable for our stakeholders, and these views can either be consumed via Druid or via Hive. The use case here is: in Druid we want to work with more real-time queries and an SQL-like approach, while Hive, through JupyterHub, serves those cases, data scientists for example, where they want to perform actual manipulations on the data and perhaps do some data cleaning and feature engineering for our internal machine learning models.

So, at a very high level, this is what we have for our day-to-day ecosystem. This is not everything, but it's what I feel is relevant to this conversation right now. As you can tell, this is a very centralized approach, and the whole architecture is owned and managed by the data engineering team. Our vision is to move towards a more data-mesh-like approach, which certainly has some benefits for us. Concretely, what we want is to achieve three things. First, data democratization: allowing data-driven decision-making by our stakeholders without bottlenecks or external dependencies.
E: These can be people like project managers, business analysts, and so on, or even C-level decision makers. We want them to be able to make those decisions based on data, but without being dependent on the team, and the reason is the scale of the team and the people who manage this infrastructure. We are around six people; this changes over time because we have people being allocated to different projects, but our fan-out ratio, if you will, is six to eighty. If you reduce this, I think it's something like one to twelve, and that's sort of how it works. Given this scale, we naturally want our users to be more self-serviceable.
E: Speaking of intuitive tooling, that's where DataHub itself comes in, right? Self-serviceability is something that's not really possible without having data discovery and data lineage over the assets that we have. And even for the data team, for us, we increasingly have a harder time keeping up with the growth of data assets. Given that the company itself is a data provider, our asset catalog is continuously growing, and for six people it's a lot of information to handle.
E: Their approach was not a perfect fit for the DefinedCrowd use case, though; we did also look for inspiration at companies like Netflix and Intuit that had other approaches. In the end, though, we did decide to go with DataHub, the reason being its extremely active community.
E: Also the ability to have strongly typed, dynamic metadata models, so being able to define certain entities and relationships and being able to modify that, if possible, to make it match our use case; and finally a push and pull ingestion model for the metadata, because we do have certain components of our architecture which are streaming-based and others that are batch-based, and possibly some will be created by us internally, and having that flexibility is very important to us. To give you a sense of where we are right now:
E: We've gone through the exploration and proof-of-concept deployment, and we are now at a production-level deployment, though with a very basic use case: in this scenario, dataset catalogs only, and only on our downstream databases, Druid and Hive. The reason we wanted to do this is to ensure that our direct stakeholders had access to information that they could understand.
E: If they work with Superset and JupyterHub, they are directly interacting with data that's available in Druid and Hive. This was work that involved three people, and overall it was an extremely positive experience; our initial rollout had over 20-plus data users, and their feedback has been quite good, though at this time, because of the lack of metrics, I can't really tell you if this number has changed over time; it is just the information that we have right now. And finally, DataHub as a system is relatively complex, right?
E: It's a large system with a lot of moving parts, but the community support has been exceptional, so in that sense I feel it has been an excellent choice on our part. Regarding contributions, and I know Shirshanka already mentioned this, we did contribute some things, particularly monitoring metrics and cron-based crawling support for metadata, all of this done in Kubernetes, because that is our default deployment mode, and finally support for Druid. That is not the end of our contributions, I hope, but we will see as time moves on. So, just to give you a sense:
E: This is the ecosystem that we had before, and this is what we have additionally with DataHub. You have your Hive and Druid installations, all running in Kubernetes, and then, through the DataHub metadata crawlers and the cron jobs, we crawl Hive and Druid into it. And finally, with regards to opportunities: naturally, we feel there are things to improve, as there always are in good projects, and in our case, for our use case, it's dynamic metadata models.
E: It is true that these models are flexible and that you can change them. However, they are hard-coded into the codebase, in the sense that if we wanted to change them we would need to maintain a fork of the project continuously and, from time to time, merge things from the original repository to gain all the niceties and all the features that have been released. The reason we want to do this is that not all our data stakeholders are tech-savvy, right? Some of them come from linguistics backgrounds, with little to no CS training, and for them semantics matters.
E: Other things include role-based access control: having granular, entity-level access definitions is something that I feel is quite important, so being able to say that a given data user has access to datasets A, B and C, but not D, E and F; this is something that is very relevant to us. And finally, field-level lineage based on jobs and pipelines, because we have certain datasets, or in a sense certain views, where a subset of the columns are generated by job A and the others by job B.
E: This is the sort of information that we want to make sure we surface correctly. Though I do have to mention that all of this is already in DataHub's backlog, so I don't think I'm saying anything new here, Shirshanka; it's just what we at DefinedCrowd would like to see in the future. And yeah, I guess that's it. I don't know if anyone has any questions. I do apologize if I sped through this presentation, but give me your feedback if you have any. Thank you so much.
D: Cool, thanks a lot, Pedro. Definitely plus one on all the pain you felt with some of those things, especially the metadata models; we feel it every time we add one small thing to them, so no-code metadata models are absolutely top of mind for us. RBAC is also on the roadmap. Field-level lineage as well: we've got the RFC in, so we'll figure out the implementation pretty soon.
D: One quick question for you, Pedro, and then we can do some quick questions from the community. You currently have DataHub crawling your Druid and Hive clusters, and you also have some Kafka Connect in your ecosystem, yes? Do you anticipate pushing metadata from that streaming ecosystem into DataHub in the future?
E: Ideally, yes. I do feel that the Kafka Connect integration is more at the level of lineage than of generating metadata assets themselves. Right now, as I understand it, the ingestion framework will generate dataset snapshots and possibly even user information from LDAP systems, and I feel it needs to be adapted to be able to provide these sorts of connections, or updates to aspects of metadata. Kafka Connect feels very much like a source of that type of metadata, and at the ingestion framework level we don't yet have that.
D: Agreed. I think, especially given the way in which we've done the Airflow lineage integration, we can probably use the same strategy for Kafka Connect, where there's framework-level integration with Kafka Connect and, when those pipelines start up, they emit metadata events that then connect the edges together. That, I think, would be great for the community as well. Are there any other questions? We have a minute.
F: The first method is, you know, pretty simple: using Airflow essentially as a cron system to just run ingestion on a schedule. Similar to how you define a recipe with the DataHub CLI, you can create a pipeline, give it your source, tell it to push to GMS via the datahub-rest sink, and just run it every day. It's pretty simple to do this and set it up with Airflow.
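As a concrete illustration of that first method, here is a rough sketch of a daily Airflow DAG that runs a DataHub ingestion pipeline and pushes to GMS over the datahub-rest sink. It uses Airflow 2.x import paths and the Python Pipeline API from the ingestion framework; the MySQL source and the server address are placeholders, so adjust everything to your own setup.

```python
# Sketch: Airflow as a cron for DataHub ingestion (source and server are placeholders).
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator

from datahub.ingestion.run.pipeline import Pipeline


def run_ingestion():
    # Equivalent of a CLI recipe: one source plus the datahub-rest sink (GMS).
    pipeline = Pipeline.create(
        {
            "source": {
                "type": "mysql",
                "config": {"host_port": "localhost:3306", "database": "app_db"},
            },
            "sink": {
                "type": "datahub-rest",
                "config": {"server": "http://localhost:8080"},
            },
        }
    )
    pipeline.run()


with DAG(
    dag_id="datahub_ingestion_daily",
    start_date=datetime(2021, 4, 1),
    schedule_interval=timedelta(days=1),
    catchup=False,
) as dag:
    PythonOperator(task_id="ingest_to_datahub", python_callable=run_ingestion)
```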
F
The
second
method,
if
you
hit
next
slide,
the
second
method,
is
to
emit
mces
via
a
data
hub
operator
directly
within
your
dag.
So
the
reason
you
might
want
to
use
this
is,
if
you've
got
say,
you're
generating
a
dag,
and
you
know
exactly
what
lineage
or
you
know
some
extra
information
about
a
given
data
set.
You
can
just
create
that
that
mce
here
construct
that
object
and
push
it
up
to
datahub
to
tell
datahub
about
whatever
you
know
within
that
air
flow
dag.
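A rough sketch of what that second method can look like is below. The operator's module path and parameter names (datahub_conn_id, mces) are assumptions based on the Airflow integration shipped with acryl-datahub in this era, the URN and MCE builder helpers come from the same package, and the dataset names and connection id are placeholders.

```python
# Sketch: emit a hand-built lineage MCE from a task inside a DAG
# (this would sit inside a `with DAG(...)` block like the previous sketch).
import datahub.emitter.mce_builder as builder
from datahub_provider.operators.datahub import DatahubEmitterOperator

emit_lineage = DatahubEmitterOperator(
    task_id="emit_lineage",
    datahub_conn_id="datahub_rest_default",  # Airflow connection pointing at GMS
    mces=[
        builder.make_lineage_mce(
            upstream_urns=[builder.make_dataset_urn("snowflake", "raw.events")],
            downstream_urn=builder.make_dataset_urn("snowflake", "marts.daily_events"),
        )
    ],
)
```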
F
Now,
in
order
to
use
this,
you
have
to
set
up
an
airflow
connection.
This
is
a
pretty
standard
thing
and
then
you
just
put
the
connection
id
in
the
operator
as
a
parameter
and
airflow
will
figure
out
the
rest
of
how
to
pass
the
credentials
into
the
the
emitter
operator
and
then
push
that
information
all
to
data
hubs
now,
a
third
way
to
to
integrate
air
flow
and
data
hub,
and
you
know
the
one
that
I'm
most
excited
about
is
via
the
lineage
back
end.
F
So
the
way
this
works
is
you
set
up
a
little
bit
in
your
airflow
config.
If
you
see
that
second
screen
shot,
you
configure
the
data
hub,
airflow
lineage
backend,
as
deleting
you
back
in
with
an
airflow
and
give
it
the
connection
id
similar
to
how
we
did
it
in
the
operator
case
and
then
in
your
operators
within
your
dag,
you
pass
inlets
and
outlets,
and
this
is
a
airflow
native
integration.
F
Every
single
operator
supports
inputs
and
outlets
in
the
right
version
of
airflow,
and
you
just
declare
your
data
sets
that
are
consumed
and
produced
by
a
given
job
and
datahub
is
able
to
view
and
visualize
all
of
that
metadata,
plus
it
fetches
a
bunch
of
extra
metadata
about
the
dag
and
the
tasks
itself.
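To make the lineage-backend approach concrete, here is a minimal sketch. The backend class path, the datahub_kwargs key and the Dataset entity helper are assumptions drawn from the integration docs of this era (verify against the current guide); the connection id, platforms and table names are placeholders, and the operator would live inside a normal DAG definition.

```python
# airflow.cfg (shown as comments to keep this a single Python snippet):
#
#   [lineage]
#   backend = datahub_provider.lineage.datahub.DatahubLineageBackend
#   datahub_kwargs = {"datahub_conn_id": "datahub_rest_default"}
#
# Operators then declare what they read and write via Airflow-native
# inlets/outlets, and DataHub stitches these into the lineage graph.
from airflow.operators.bash import BashOperator
from datahub_provider.entities import Dataset

transform = BashOperator(
    task_id="build_daily_events",
    bash_command="echo 'run the real transformation here'",
    inlets=[Dataset("snowflake", "raw.events")],
    outlets=[Dataset("snowflake", "marts.daily_events")],
)
```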
F: So you can view properties about, you know, what parameters were passed into the task, or when this thing was last run. Now, the one caveat here is that it requires a little bit of config, and it only works with Airflow 1.10.15 or newer, or 2.0.2 or newer, because the lineage backend was not supported prior to those versions. And so there you have it. If you want to learn more, I'm sure, I'll give you the next slide.
F: I've got a link to the docs where you can read about it, and then I'm on Slack if you have any other questions. Awesome, and thanks.
G: Sweet, awesome. So next up, I'm going to talk about how users of DataHub can take advantage of these connections that we have between entities to get a better understanding of the data that they have. I know that Harshal, in his spare time, made some analytics pipelines and dashboards about the demo data that we have for the demo DataHub project.
G: We have lots of datasets there, and as a little pet project he built some pipelines, and some dashboards too, to visualize metadata about them. I'm going to go through and explore this, understand what he's built, and use lineage to get a better understanding of, and more trust in, the data. So say I wanted to understand what the documentation coverage is for our demo data.
G: Well, which platforms do I need to improve the documentation for? I can go into the search bar and search for documentation, and I see a Superset chart that has been created that gets the completeness of documentation for some datasets. I can click into that, via our Superset integration that's been included in the new release, and go and see some basic properties like the metrics and the dimensions of the chart, as well as the sources that feed into that chart.
G: But how do I know that the datasets being talked about in the chart are the datasets that I'm interested in, and actually are our demo data?
G: This is when I'm going to need to go into the lineage view to see the whole picture of how this chart was created. So when I click this button in the top right of my entity, I get taken to a graphical view that shows, in the center, the chart that we're talking about. It also shows the upstream table dependency that the chart is reading from, and, downstream, what we saw before.
G: So I might make a mental note that, if this chart is what I'm looking for, I might want to go investigate this dashboard and see what other charts it has. And if I double-click on this dashboard, I can re-center the graph around it and see here all the other charts that are contained in that dashboard, charts that I might want to investigate later.
G
Going
back
to
this
upstream
table,
I
can
click
on
it,
get
the
full
name.
If
there
was
description
or
other
metadata
like
tags,
I
would
be
able
to
see
that
as
well.
But
all
I
know
is
this:
is
some
generated
snowflake
table
in
harshal's
pipeline?
How
do
I
actually
know
that
this
table
is
generated
off
of
the
data
that
I'm
expecting
and
and
the
dimensions
are
constructed
in
a
way
that
I
want
at
this
point,
I'm
going
to
hit
the
plus
this
plus
icon
to
further
expand
out
the
lineage
graph.
G: Exactly. So what we've done here is build a pipeline around the metadata included in our demo DataHub project. As you can see from the flow from left to right, we're starting with the raw data of the aspects, and then Harshal has created pipelines to produce derived tables off of these aspects so that they're easier to consume, until finally we have the table that all the Superset charts read from, which is a much more consolidated version of our metadata that charts can easily be built off of.
G: So now, after looking at this lineage visualization, I feel much more confident that the data feeding into the charts is in fact the data that I'm looking for, because I can see from the beginning that it is built off of these core aspect tables that I understand. And if I want to zoom in and look, I can then inspect an Airflow task. Say I might want to know, okay, so I know now that the source data and the flow of data seem as expected.
G: When I click on that, I'm brought to Airflow, and I can actually go in, inspect the code, and verify that this code is what I'm expecting and the transformations that Harshal's done are the transformations that I expect. Going back to our lineage graph: now that I've done my due diligence on the lineage flow, understanding from the beginning all the way down to my chart how this data is transformed, I can now finally go back to the chart.
G: On the chart profile, I click out via my 'view in Superset' button, and now I finally have confidence in this chart I'm looking at. It doesn't just say it's showing dataset documentation completeness; I understand, from the beginning, through the transformations, all the way to the chart, that this is what I expect.
G: Right, and it looks like the Snowflake documentation is lacking and S3 is barely there, but our BigQuery documentation and our Kafka documentation are looking pretty good. So now I know that next week I've got my work cut out for me, and I can go into my other charts, do investigations there, and draw more conclusions. So now, with the new release, now that we have the Airflow integration, dbt, Superset, Looker and other sources that are helping tie your different entities together, these lineage visualizations will help.
D: Awesome. I think people are just blown away, so huge kudos to the two of you for cranking it out yesterday and getting it to this polished state. This is really cool.
D: Awesome, cool. Let's move to the next section. One second while I go find my tab.
D: Yep, here we are. And, oh, another handoff: we have John, who's going to talk to us about the DataHub hackathon that the Depop folks did. And you know, it was pretty cool to see them pop in, literally pop in, to the community channel and say, hey, we're doing a hackathon, and in a few days they were contributing the Glue integration back to us, and the Klarna folks helped them out. So thanks for that collab; I think it was really nice to see that happen.
B: Can you see my screen? Well, yep, awesome. Yeah, a bit of a tough one to follow, that one; that looked really good, so I hope I don't disappoint you here. But anyway, my name's John, I'm the lead data engineer here at Depop, and yeah, I'm here to talk about the hackathon that we did with DataHub.
B: So, just a quick intro about Depop and who we are: we're a fashion marketplace for the next generation to buy, sell and discover unique fashion. We're an app, basically; we provide the ability to sell predominantly secondhand fashion and sustainable fashion. You can think of it as a bit like eBay mixed with Instagram.
B: That's what my mum said anyway when I joined. But yeah, lots of people in the UK are using Depop, and it's growing around the world as well, the US too. And we're growing very fast, and our data needs are growing as well.
B: So why did we look at DataHub? Well, we need to enable the business to use data in a self-service fashion, and we need a single location for all of our data needs. Shout out to the design crew who did that slide; it certainly wasn't me.
B: So I'm going to walk us through some of the problems that we're trying to solve here, which we can see through various Slack messages that we've had across the company. We've got issues with data discovery: somebody new joins and they want to know about data for our CRM, and they don't really know where to find it, which is a bit of a shame.
B: We want to know data about recently viewed items, or any data about banning people in the trust platform, and we don't have that single location for search, so DataHub would be pretty useful there. Then the data lineage aspect.
B: I don't really need to speak about this, as you've just seen a perfect demonstration of how that works, but generally: producers and consumers, and seeing where data starts and where it ends up, all the way through to our Looker instance. That would be very useful for our business users. And Depop's a startup, or a scale-up, and we've got lots of knowledge in our heads rather than in, sort of, documentation.
B: So the tribal knowledge is pretty rife, and this table has a column called active status which, over the years, has baffled many people in the business, including this guy, who said active status could just about mean anything. So documentation is pretty important for our users.
B: So what we did is we had a hackathon in the data engineering team, and the BI team as well, and we split up and tried to have a sort of head-to-head between Amundsen and DataHub. So what did we try and do? Well, both of them have local setups that use Docker, and we tried to go from zero knowledge of these products to getting as much production data into them as we could, inside two days.
B: So I will just change the screen, and I'm going to only show the demo part for DataHub, obviously; this is what we managed to do, and then I'll slip back in afterwards. So I'll stop now; I might need to re-share my screen, actually. Two seconds.
I: [DataHub] didn't have any Glue support at the time, so we spent the last two days figuring out how we could ingest data from Glue into DataHub, and we managed to do it, so that's good.
So if you look in Datasets, we have the browse view: we have Glue here, and then this goes down to a database level.
I: So, for example, if we just click at random on, like, daily compacted, here are all of the tables that are in there. If you search for product create and then go to the one in compacted, so yeah, this has the search as well. So, for example, you can see we added a description here.
Most of our data isn't really well documented; it doesn't have descriptions. But in the schemas, for example in Glue, you can have descriptions for each field. I know there's often confusion about, like, what a user ID actually is: is it the seller? So all of that can be documented. You've got the name of each field, the type on the left, and the descriptions, so that's the schemas. There's also ownership that we could pull out; all of ours are apparently owned by 'owner', so that's not that helpful, but that can be changed. And then Properties just has properties about the table that get pulled out, so just extra information in there.
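For anyone wanting to reproduce this kind of Glue ingestion, a rough sketch using the Python ingestion API is below. The source type name ("glue") and the aws_region/env config keys are assumptions based on the integration the team contributed around this time, the region and server address are placeholders, and AWS credentials are assumed to come from the usual environment or instance role.

```python
# Sketch: crawl the AWS Glue catalog into DataHub (values are placeholders).
from datahub.ingestion.run.pipeline import Pipeline

pipeline = Pipeline.create(
    {
        "source": {
            "type": "glue",
            "config": {"aws_region": "eu-west-1", "env": "PROD"},
        },
        "sink": {
            "type": "datahub-rest",
            "config": {"server": "http://localhost:8080"},
        },
    }
)
pipeline.run()
```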
J: Sure. So, basically, our team, Rob, Abby and myself, worked on integrating the Redshift tables and Looker. Similar to the Glue schemas, for the Redshift tables we basically pulled the tables from Redshift, and again it has the schema, the types, the names of fields.
J: If I search for a keyword here, I am able to see all the entities that have this particular tag, and I can even do that the opposite way, by going to the tag and then looking up everything that has this tag. Similarly, if I look for another keyword here, I want to see, well, in this case, for example, it appears in the table name, but in this case there is a match for a column.
J
So
it's
really
interesting
to
see
that
the
search
is
very
inclusive
for
the
looker
implementation.
So
there's
another
area
here
that
we've
been
able
to
integrate
a
particular
dashboard
here
with
a
description
to
scroll
in
I'm
able
to
see
the
obviously
tags
and
owners
and
everything
can
be
added,
I'm
able
to
see
the
actually
the
actual
looks
that
are
part
of
this
dashboard.
J: In this case we just provided a few examples, but if I scroll to one of them, I'm able to, first of all, see tags; I'm able to see the actual table, so the data source for this particular Look in Looker, and obviously, scrolling on, see that information. I don't want to click out, but yeah, there's a direct link to the Look, so that's really nice. I think I've covered most of it.
K
Here,
if
you
check
confirm
signups,
you
can
see
documentation
online
lineage.
J
Yes,
so
yeah,
this
is
a
redshift
table
and
any
documentation.
Let's
say
it's
an
etl-based
story,
any
logic
that
is
part
of
that
creation
of
the
table.
We
were
able
to
see
that
and
each
entity
here,
you're
able
to
see
the
upstream
national
dependency.
So
that's
very
useful
for
later
lineage.
B
Cool,
so
that
that's
the
majority
of
our
demo
there's
some
faqs
afterwards,
but
we
were
presenting
to
the
business.
So
I
wouldn't
show
you
those
so
yeah
what
we
achieved
during
the
hackathon
is
we
ingested
all
of
our
production
data
into
the
local
instances,
so
that
was
redshift
glue
and
kafka.
They
all
came
into
our
local
instance
of
the
data
hub.
We
also
linked
that
chart
of
looker
in
and
we
created
some.
B
We
used
the
metadata
change,
events
to
create
lineage
and
tags
and
documentation
and
owners,
and
we
created
a
merge,
the
pull
request,
which
was
pretty
nice.
So
I
think
that
the
most
important
thing
for
us,
and
probably
sir,
for
any
advice
I
could
give
people
who
here
who
haven't
decided
yet,
is
why
we
actually
picked
the
data
hub.
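For context on the metadata-change-event part, here is a minimal sketch of how hand-written metadata such as lineage (tags, owners and docs follow the same emit pattern) can be pushed straight to DataHub over REST. The emitter class and the URN/MCE builder helpers come from the acryl-datahub package of this era, and the server address and dataset names are placeholders.

```python
# Sketch: push a hand-built lineage MCE to DataHub's REST endpoint.
import datahub.emitter.mce_builder as builder
from datahub.emitter.rest_emitter import DatahubRestEmitter

emitter = DatahubRestEmitter("http://localhost:8080")  # GMS endpoint

lineage_mce = builder.make_lineage_mce(
    upstream_urns=[builder.make_dataset_urn("glue", "events_db.product_create")],
    downstream_urn=builder.make_dataset_urn("redshift", "analytics.confirmed_signups"),
)
emitter.emit_mce(lineage_mce)
```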
B: Most of the problems we had with Amundsen were around the lack of Kafka support, and when we tried to integrate that with DataHub, it just worked straight away. And, as you can see, we added the Glue integration, which was really easy; the process for adding a new ingestion type was super easy, it was very straightforward, the docs were set up nicely, and, as I think Pedro said earlier, the support from the team was just immense.
B: It was amazing; we were messaging at all times and we were getting responses, and pushing that PR was really trivial, and thanks to Klarna for helping us out there as well. Yeah, the aspect of data lineage is really important for us, because we have several layers of transformations at a business level, and Amundsen didn't really support that very well.
B
Looker
was
a
work
in
progress,
and
I
know
you
said
it's
in
the
in
the
contra
folder
at
the
minute,
but
we're
really
excited
to
see
that,
and
just
all
of
the
other
bits
that
you've
seen
already.
It
was
just
super
good
and
we
had
a
really
good
time
doing
it
and
contributing
back
and
we're
looking
forward
to
integrating
into
our
production
stack
in
in
the
next
couple
of
months,
so
yeah.
Thank
you
very
much
for
your
help.
B
I'm
really
pleased
to
be
working
with
you
and
it's
been.
It's
been
absolute
pleasure,
yeah
and
shout
out
to
the
team.
I
think
marie
is
here,
hey
maria.
Thank
you
very
much.
D
Thanks
john
yeah,
we
really
enjoyed
all
the
energy
that
the
pub
team
brought
into
the
project
so
keep
that
coming
awesome.
So
now
that
we
have
just
a
few
minutes
left,
I
wanted
to
do
one
of
the
things
that
we
had
promised.
We
would
do
for
the
community
and
that's
a
share
out
of
the
observability
marks
and
the
poll
that
we
ran
last
time.
D
Everyone
can
see
my
screen
right.
Okay,
so
you
know.
In
the
last
town
hall,
we
went
through
a
couple
of
screens
showing
hey.
This
is
what
datahub
could
look
like
if
we
added
observability
metadata
to
it
and
then
started
building
kind
of
data,
quality
style
visualization,
as
well
as
data
ops,
style
visualization
to
the
screens,
and
we
got
a
ton
of
feedback
thanks
to
everyone
who
participated
about
what
they
want
to
see
and
what
direction
they
want
the
project
to
go.
D
D: We actually got more than 25 responses, so really happy to get that kind of high-quality feedback from the community.
D
The
first
thing
we
asked
people
was
what
is
their
role,
and
so
what
is
very
nice
to
see
is
that
data
hub
seems
to
be
very
aligned
with
the
interests
of
the
data
platform
leads
and
the
data
engineers
who
are
trying
to
ensure
that
they
have
like
a
modern
data
catalog
with
the
architectural
strengths
of
data
hub.
D
But
then,
on
top
of
this
lineage
graph,
now
adding
observability
or
operational
metrics,
so
that
you
can
understand
end
end-to-end
quality
integrations
with
tools
like
great
expectations
and
enabling
alerting
based
on
these
metrics.
These
definitely
seem
to
be
popping
up
at
the
very
top
of
the
people
we
surveyed
from
the
community.
D
The
next
thing
we
asked
was
feedback
like
should
we
do
this?
Or
should
we
not?
You
know
and
overwhelmingly?
People
said
this
looks
amazing
and
we
should
just
work
on
it.
So
it
seems
like
the
community,
definitely
is
voting
for
these
things
to
be
in
the
product,
and
the
nice
thing
was
right.
After
that
we
asked
people.
How
will
you
help
and
I'm
so
glad
to
say
that
the
number
one
and
two
results
on
those
are?
D
You
know
I
will
storyboard
the
use
cases
and
I'll
contribute
to
building
out
the
backend.
So
that's
exactly
what
we
want.
We
will
set
up
time
with
all
of
you
to
storyboard
together
and
figure
out
how
to
even
run
the
project
if
people
are
interested
in
actually
working
on
this
together.
D
So
last
night
I
actually
created
this
slack
channel.
It's
called
design
data
quality,
it's
empty,
but
everyone
who's
interested,
please
jump
in,
and
we
can
use
that
as
a
way
to
take
the
conversation
forward
and
then
set
up
follow-on,
chats
we're
doing
something
similar
for
tags
and
taxonomies,
and
you
know
async
works
well
pretty
much.
Everyone
is
busy,
so
it's
good
to
get
kind
of
async
feedback
from
around
the
world
and
then
come
up
with
some
sort
of
a
global
picture
for
where
we
want
to
take
the
project.
D
So
thanks
for
all
of
that,
we
are
setting
up
the
slack
channel
and
we
will
set
up
in-person
discussions
with
people
who
would
like
to
go
deeper.
So
look
out
for
those
announcements.
D
It
is
an
opt-in
channel,
so
no
no
requirements
to
auto-join,
but
if
you're
interested
in
shaping
the
future
of
data
observability
on
data
hub
just
join
that
channel
awesome
now-
and
this
is
kind
of
a
very
interesting
segue,
because
all
of
the
discussions
that
happened
today
were
about
data
platform
teams
who
have
taken
kind
of
a
first
step
or
a
second
step
with
data
hub
at
their
company
and
they
have
rolled
it
out.
Some
people
just
finished
a
hackathon.
Some
people
have
actually
rolled
it
out
to
production
and
have
20
people
on
it.
D
But
what
we've
noticed
with
a
lot
of
data
platform
teams-
and
I
remember
even
at
linkedin-
you
know
over
the
last
six
years
as
we
kind
of
built
out
this
product.
We
had
these
repeated
moments
of
feeling
like
okay,
we
did
something,
but
did
it
actually
make
a
difference.
D
So
how
do
you
get
from
deploying
what
you
think
is
the
right
solution
for
your
company
to
actually
making
sure
that
your
entire
company
actually
loves
this
solution
so
that
that's
kind
of
the
challenge
that
we're
setting
ourselves
and
I
think,
as
a
community.
I
think
we
have
the
same
challenge.
We
pick
tools
and
then
now
it's
our
job
to
make
the
entire
company
love
the
tool.
D
L: Awesome. So hello, everybody, I'm Maggie, a senior product manager at SpotHero, based out of Chicago, focused on data services, so everything from data engineering to data science, data analytics and all of the complexities there. I've been working closely with folks from the DataHub community over the past year or so; I'm a huge proponent of the tool, and I just think, you know, you guys are really just kicking ass. I'm thoroughly impressed by all the great work that's coming through.
L: I wanted to provide some PM support as much as I possibly can, and so one approach that we're taking is running and facilitating a design sprint next week. For those of you who are unfamiliar with design sprints, it's really just a dedicated three-to-five-day session; we're going to focus it on three days, to identify a big problem, map out solutions, decide on a prototype, and then rapidly iterate and build that prototype out.
L
If
we
can
kind
of
come
to
a
consensus
of
how
of
how
to
solve
kind
of
like
this
big
gritty
problem,
so
you
know
shashanka
called
it
out
that
you
know
knowing
understanding
how
these
cut
well,
really
meta
tools
solve
these
bigger
problems
can
be
difficult,
so
we're
focusing
on
understanding
product
analytics,
so
really
understanding.
You
know
how
are
users,
interacting
with
the
tool,
what
are
kind
of
like
the
core
user
funnels?
L: How would we think about user adoption, or power users, or really a successful user flow or user journey throughout the product? So we'll be tackling that next Tuesday.
L
If
you
are
interested
in
participating
in
this,
there
are
two
ways
one
we
are
looking
for
for
folks
to
volunteer
for
30,
minute,
expert
interviews
and
really
the
the
task
there
is
is
we'll
just
be
asking
you
about.
You
know
how?
L
How
do
you
think
about
successful
adoption
or
meaningful
ways
to
track
or
understand
user
adoption
within
data
hub
at
your
at
your
company?
So
we're
looking
for
folks
who
have
either
you
know
implemented
the
tool
are
thinking
about
implementing
the
tool,
are
admins
of
it
and
just
have
your
perspective
on
product
analytics
there
and
then
otherwise.
L
We'll
be
scheduling,
and
I'm
a
little
bit
behind
here
and
getting
these
things
scheduled,
but
we'll
be
scheduling
kind
of
like
a
final
review
or
prototype
review
of
what
we
build
out
and
we'll
we'll
make
that
available
to
everyone
to
join
that
that
presentation
towards
the
end
of
the
week
so
feel.
D: And we'll figure it out; it doesn't have to be Tuesday morning, it could be Monday night, we'll figure it out. Awesome, thanks so much, and we're right on time. This is it: the release is out, check it out, kick the tires; if there are bugs, we'll fix them, and keep hacking. We'll take some more questions offline on Slack. I want to give you back your minus one minute. Thanks, everyone.