Description
Maggie Hays (SpotHero) and Dexter Lee (Acryl Data) describe how they implemented Product Analytics in DataHub using the design sprint philosophy.
A: All right, and I will hand it back to you. So the next thing on our agenda is the first big item: the data analytics design sprint that Maggie led for us. Maggie, do you want to take over the screen share and drive from there?
B: Yeah, that'll be fine, sounds good. Give me one second. So, hello everybody, I'm Maggie Hays, a senior PM of data services at SpotHero, based out of Chicago. So earlier in April... I think it was April; time is weird right now, who knows. It was within the last month or so. I teamed up with the folks over at Acryl Data to run what's called a design sprint. I'll walk you through: what is that, what does that mean, what did we do, what was the point of it, and then we'll move into a live demo.
B: Can y'all see my screen? Looks good? All right. If you've never heard of a design sprint, it's something that came out of GV, or Google Ventures, and it's basically a framework to rapidly move through discovery, ideation, solution prototyping, and testing: solving hard problems with technology in five days. Granted, we did it in three days; there are a bunch of truncated ways you can do it, but the original one was a five-day sprint, and I'll walk you through this.
B: If y'all are interested in learning more about this, there's a ton of information online. This is the book, kind of the main source of record, I guess, of what a sprint framework looks like; you can find it on Amazon and all that. Also, on YouTube there's a channel called AJ&Smart, where they have videos that break down every single session. They call it Design Sprint 2.0, so that gives you a refresh of it. So there's ample context and resources online if you want to run similar things in your own companies. The role that I played in this was really facilitator: moving the team through the different steps of this process.
B: On the first day, we tackled identifying and understanding our problem at hand, so that we could ultimately build a strong prototype around it. We asserted that our problem was that the owners and admins of DataHub do not understand how users are interacting with the tool. That's a big problem, right? There are a lot of technical approaches you could take to solving that, so what we started doing was taking a step back and understanding how to contextualize that problem into the bigger picture of the DataHub strategy. We talked about how this fits into the long-term vision of DataHub, and we rallied around this vision: in 12 to 18 months, data platform owners will want to deploy DataHub at their organization because it gives them superpowers.
B: Then we identified what question or questions we would be asking at the end of this process to understand whether it was a success, and we rallied around: are we providing data platform owners with actionable insight? Just because you have usage analytics doesn't mean it's meaningful, so we wanted to make sure that we would be able to ask that question concretely.
B: Do you now have actionable insights, so that you can move towards this future value of DataHub in the long run? The next thing we did was break down all of the potential pain points in solving this problem within the current stack, and we reframed those into what's called a "How might we". Really, it's just a way to flip a problem on its head and turn it into an opportunity. So we talked about: how might we make the analytics infrastructure easy to manage?
B: How might we give clear insights where there's poor data quality coverage but heavily used assets? That way, we're trying to solve this without adding too much burden on the owners or operators of the platform, while also giving insight into where you're seeing a lot of activity and where there's actually opportunity to enrich that metadata.
B: That gives folks more power there. Another thing we did was talk to our experts within the DataHub community, to make sure that we had a well-rounded understanding of this problem set and of how folks even thought about how product analytics would fit into their management of DataHub. Sample questions in these user interviews were: what are the top questions you'd like to be able to answer around user activity, and what decisions would that inform?
B: The idea is that everyone in this design sprint is included in every single stage of the process, so that we have all of their perspectives, all of their joint knowledge of how to solve this problem. Again, we're still on day one; it was a busy day.
B: We then mapped out the user experience within DataHub, so that we had a very concrete understanding of where this solution fit into that workflow. We talked about how you would install DataHub as a PoC, ingest metadata, share it with your users, gather feedback, maybe do some iteration cycles there, and from there move into feature development and improving metadata, feeding back into this flow. And we really targeted this idea: we are assuming that the PoC exists and there is metadata.
B: There are active users, we are gathering feedback, and we are making decisions based on user activity to inform future development areas, ways to improve metadata, and ways to drive adoption. So again, this really helps us keep a laser focus on where this problem fits into the vision of DataHub, the user life cycle, etc. Then we moved into sketching solutions. You can see that these came in a variety of ways: some folks were writing with pencil and paper, some folks were whiteboarding or mocking things up with the UI.
B: Once we had all the solutions up here, since we were doing this remotely, there are a bunch of little emojis or thumbs-ups to show areas where we think there are good ideas, and it's really just rallying around: how are we actually going to solve this?
B: You can look through this in more detail, but basically this user test flow then moves us into having our storyboard, so that everyone who's contributing to this project knows what exact steps are going to be taken, how they fit into a user test flow, and how we can asynchronously begin building together.
B: So this is day two. By the time we moved into day three, we started moving towards our prototype, and I think here, Dexter, you can take it from there.
C: So the first thing was to standardize the way usage events are produced in the React app, so please check out the event schemas there. We standardized the page view events, search events, browse events, and so on, and we put enough information in them for us to understand where these usage events are coming from and what they actually mean.
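To make the idea of a standardized event concrete, here is a rough illustration of what such events might look like. The field names below are assumptions for illustration only; the event schemas in the DataHub repo are the source of truth.

```python
# Illustrative shapes for standardized usage events (hypothetical field
# names; see the actual event schemas in the DataHub repo for the real ones).
import json
import time

def base_event(event_type: str, actor_urn: str) -> dict:
    """Fields common to every usage event: who did what, and when."""
    return {
        "type": event_type,
        "actorUrn": actor_urn,
        "timestamp": int(time.time() * 1000),  # epoch millis
    }

def page_view_event(actor_urn: str, page: str) -> dict:
    return {**base_event("PageViewEvent", actor_urn), "page": page}

def search_event(actor_urn: str, query: str, total: int) -> dict:
    return {**base_event("SearchEvent", actor_urn),
            "query": query, "totalResults": total}

event = search_event("urn:li:corpuser:maggie", "sample", 42)
print(json.dumps(event, indent=2))
```

The point is simply that every event shares a common envelope (type, actor, timestamp) plus type-specific detail, which is what makes downstream aggregation possible.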
C: Second was to utilize existing components of DataHub. As Maggie mentioned before, we don't want to make operators' lives even harder by adding even more components to deploy, so we wanted to use whatever components we've already deployed to support an initial prototype of the analytics product. Third, while we wanted to have this default way of using existing components, we wanted everyone to be able to plug in their own architecture for consuming these usage events.
C
So
usage
events
are
actually
posted
to
a
kafka
stream,
so
anybody
can
just
plug
in
any
consumer
of
choice
for
data
collection
and
analytics
operators
can
also
wire
third-party
analytics
tools,
like
google
analytics
and
fix
panel
to
the
react
app.
So
please
check
out
this
doc
for
more
details
on
how
to
do
that.
Unfortunately,
for
now
you
have
to
fork
the
repo,
but
we
are
going
to
work
on
making
that
through
config
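Because the events land on a plain Kafka topic, a custom consumer is conceptually just a loop that deserializes each message and routes it wherever you like. A minimal sketch of that routing logic, simulated here with an in-memory list instead of a real Kafka client (the topic name comes from the talk; everything else is illustrative):

```python
# Sketch of a custom consumer for the DataHubUsageEvent_v1 topic.
# A real deployment would read records from a Kafka consumer subscribed
# to the topic; here we fake the messages to show the routing logic.
import json
from collections import Counter

TOPIC = "DataHubUsageEvent_v1"

def consume(messages, sink: Counter) -> None:
    """Route each JSON-encoded usage event to a simple counting sink."""
    for raw in messages:
        event = json.loads(raw)
        sink[event["type"]] += 1  # e.g. count events per type

fake_messages = [
    json.dumps({"type": "SearchEvent", "query": "sample"}),
    json.dumps({"type": "PageViewEvent", "page": "/datasets"}),
]
counts = Counter()
consume(fake_messages, counts)
print(counts)  # Counter({'SearchEvent': 1, 'PageViewEvent': 1})
```

Swapping the counting sink for a warehouse writer, a metrics emitter, or a third-party analytics client is exactly the pluggability being described.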
C: All right, so moving on, let's go to the end-to-end flow.
C: You can see each component here is an existing component in our DataHub stack, with the user marked here. As the user interacts with the React app, it sends the events over through the track endpoint in the frontend. The frontend collects these events and posts them to a Kafka topic; we created a new topic called DataHubUsageEvent_v1, and that is where all the events go through.
C: We added a consumer in the MAE consumer job, which already had a connection to Elasticsearch, which is why we chose it. It listens to DataHubUsageEvent_v1 and processes the events that come through. Note that these events are not hydrated: when, for example, a user URN comes in, we want to know the details about that user.
C: To get those details, we go back to GMS: we call the remote DAO or local DAO to get the details about the entities. We hydrate the entity features and package everything into a single document, which we send over to the DataHub usage event data stream on Elasticsearch. So Elasticsearch collects all of these usage events.
C: On the frontend side, we created a new analytics controller, which sends filter and aggregate queries to the Elasticsearch DataHub usage event data stream: it does counts and a bunch of time-series analytics and things like that to build some bare-bones charts and tables that power our analytics service, and that is fed back into our React app at the end.
C: Okay, one sec. Can you guys see the data? All right. So what I did here was modify the MAE consumer job a little bit so that it prints out the usage events as they come in. We are in the usual DataHub app, and as we click, you can see the events coming in: we have a page view event, and you can see browse events and browse-result click events as well. Let's try searching.
C: You can see the search event that came in: it queried with the query "sample". There are also search view events that record how many results were on the search page. As we click through, each action you take inside the entity page, the search page, or the browse page translates to a certain usage event. These usage events are all sent over; you can see the Elasticsearch connector sending the bulk request to our data stream there.
C: Once we go to the analytics (beta) page: each of these components is configurable inside the code. In the datahub-frontend, we have highlight cards, time series charts, tables, and stacked bar charts. We created these four main visual card types that we wanted to support, and then we implemented all of them. So you can see here: this is searches last week, and then the top search queries that come in; you can see "sample".
C: There were the top five searches, as well as section views across different entity pages: we have lineage, ownership, schema, and so on. There are also actions by entity type: you can see we have tag updates here; I updated a few yesterday just to show you. And then we have top viewed datasets. Of course, we will be continuing to add more charts here, so it would be great if we could get feedback. I also wanted to go over the charts that we see for our own demo DataHub page.
C: You can see we have, amazingly, 421 weekly active users. Crazy! Thanks for using the demo. You can see the searches that are happening, as well as the various search queries. So we can gather a lot of signals about what users are doing on this platform just by looking at these few charts.
B: One thing I'll add here. Dexter, could you actually hide the terminal there, so you can see a full view of the dashboard? Perfect, thank you. So what we're trying to do is find ways to contextualize not only activity, but also where there's opportunity to really leverage the power of DataHub.
B: If we're thinking about the number of datasets: we have 92 datasets, and half of them have owners assigned. That's great; what that means is that we're halfway towards having fully documented datasets within DataHub, right? So it's not even just what people are looking at, but what people are looking at that's specific to the value DataHub is driving. The other part is my perspective as a product manager managing this type of tool.
B: I want to understand how I decide where to invest my team's energy. Are people only looking at datasets? Are they looking at pipelines now, and maybe our pipelines aren't well documented? In the actions that they're taking, are they adding tags? Are they changing owners? Are they looking at ownership detail, lineage, etc.?
B: That way, I can start to narrow down where to have my team and my stakeholders invest in more robust and more meaningful metadata. The other thing that we're thinking about is looking at the top search queries to understand what people are even looking for in here.
B: Is it specific terms? One thing we were talking about: as the set of data platforms expands, do we have people coming in and searching for something like Salesforce data or Braze data, some of these other tools that maybe aren't in there? That can be a leading indicator of other ingestion mechanisms or pipelines that we need to pull in.
B: So I think we can also start to leverage this idea of finding the gap: what are people searching for and trying to do where we're not actually meeting that demand? And, like Dexter said, if you have ideas or questions about how to make this more impactful or meaningful, I will route you over to the actual team. But we're definitely excited to see where this heads.
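That gap-finding idea can be made concrete by ranking queries by how often they return nothing. A sketch over search events (field names are assumptions matching the earlier illustrative schema):

```python
# Find the most common searches that returned zero results: a leading
# indicator of sources (e.g. Salesforce, Braze) that should be ingested.
from collections import Counter

def search_gaps(events, top_n=3):
    """Count queries among search events whose result count was zero."""
    misses = Counter(
        e["query"] for e in events
        if e.get("type") == "SearchEvent" and e.get("totalResults") == 0
    )
    return misses.most_common(top_n)

events = [
    {"type": "SearchEvent", "query": "salesforce", "totalResults": 0},
    {"type": "SearchEvent", "query": "salesforce", "totalResults": 0},
    {"type": "SearchEvent", "query": "orders", "totalResults": 12},
]
print(search_gaps(events))  # [('salesforce', 2)]
```

In practice this would run as an Elasticsearch aggregation over the usage-event data stream rather than in application code, but the logic is the same.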
A: Cool, thanks a lot, Maggie and Dexter. It was a great experience, and I was talking to Young and, you know, Nick and Ben over on the LinkedIn side as well; they've actually built a very complex and very extensive analytics capability on the product stream, so at a future date we can get into that too. That includes sessions and a lot of deeper analytics. It's pretty cool what people are doing with it.