From YouTube: July 23, 2021: DataHub Community Meeting (Full version)
Description
Full version of the DataHub Community Meeting on July 23rd, 2021
00:00 Welcome
01:08 Project Updates by Shirshanka
- July Release highlights
08:58 Operability improvements: rollbacks by Gabe Lyons (Acryl Data)
13:19 Demo: Data Observability: Phase 1 by John Joyce, Harshal Sheth, Dexter Lee (Acryl Data)
31:59 Case Study: Building User Feedback into DataHub by Melinda Cardenas (NY Times)
41:31 Demo: AWS SageMaker integration for Models and Features by Kevin Hu, Dexter Lee (Acryl Data)
53:42 Closing Remarks
Shirshanka: Cool, welcome everyone. This is the July edition of the DataHub community meeting. We have a packed agenda as usual. Let's see who we have. First I'll do the usual: a few project updates covering everything we've done in July and the upcoming release highlights, and then we will have three sessions. The first one is on Data Observability Phase 1 — the Acryl Data team will discuss what they've done and what's coming.
So let's get started with a few quick updates on the community. A lot of people don't know this, so I'm just putting it out there: we put out updates over Medium. We have a project newsletter that typically runs on a monthly cadence, so please follow it and share it. That's the best way to get long-form content about the improvements and changes that are happening in the project.

Also, if you want to write for the DataHub project, definitely let us know — we're always looking for high-quality content, and we're happy to guide folks in writing their first DataHub blog post. The second thing — again, you probably know this, but we have a YouTube channel. We take all the community town halls, create videos for them, and share them out. If there are other talks that you're giving at other conferences, or if you want to just record a quick demo of how you are setting up DataHub,
A
Let
us
know-
and
we
we
will
be
happy
to
put
it
up
on
the
channel
as
well,
so
follow
these
two
channels,
long-form
content,
video
content
and
create
all
right,
so
project
updates.
The
0.8.7
release
is
coming.
We
like
to
not
release
releases
just
before
the
weekend,
so
we'll
we'll
share.
What's
coming,
we'll
share
the
stats
and
then
we'll
probably
cut
the
release
right
up
to
the
weekend.
On contributions: 16 contributors from nine different companies, including six new contributors, which is great. The biggest highlights in the last month: dataset stats and profiles — Data Observability Phase 1, as we're calling it — is going to be part of this release, along with the SageMaker integration for models and features. Both of these are things we're talking about today, so I won't take up time on them here. And then there are a few other highlights, across three tracks as usual: first, product improvements; second, integrations; and third, some developer experience improvements.
So let's get into it. But before that, some quick call-outs. We have kind of an interesting first commit from Wei at LinkedIn. Remember, last time we discussed deprecating the Ember code base and officially announced that we're all on React moving forward. So Wei recently sent over his first commit, and it's amazing to see when a lot of files get deleted, because that's the best way to keep code clean.

He's really giving the DataHub product a run through the wringer, finding so many interesting issues. So thanks for doing that and giving us great suggestions — and also for sending PRs over GitHub to improve the product; that's the best kind of contribution. So thank you for doing that.
James Lamb recently joined SpotHero, I think, and then started diving into Looker right away and started improving the code as well. So thank you for doing that. And then there are some veterans like Remy and Frederik who are continuing to help and coach newcomers, both on Slack as well as on GitHub — so thanks for doing that. Some new collabs are coming up: Taufiq Ibrahim and Chris Colson are starting to collaborate on a Redash integration, which is amazing to watch. I know Redash was a commonly requested integration, so definitely reach out to them if you want to make sure that it's going to work for your environment.
We also had a contribution adding functional testing to the front end, which was great. We had always wanted to add these kinds of tests but never got enough time, so it was great to collaborate with van54 — I think it's Van Shew from Australia — who added functional testing to the front end. I think we're using Cypress for that — right, Gabe?
Shirshanka: Cool. And then, of course, last but not least, the Saxo Bank team continues to improve the business glossary feature, and we really like the collaboration there — thanks for sending all those PRs and mocks over. Very soon you'll start seeing some of these features show up in the main app as well. And for anyone that I forgot to call out — really sorry. Love all of the amazing engagement on Slack, both in terms of feedback as well as helping each other out.
Let's continue to keep that going. All right, a couple of quick call-outs on the product improvements track. We've got the business glossary feature that's been incubating for a while, and there have been some improvements over the past month. It's still not GA yet — there are still a few things that we're collaborating with the Saxo Bank team on — so there will most likely be a full reveal of what it does in the next town hall. Similar story with schema history.
All right, on the integrations side, it was kind of interesting: as I looked at all of the different improvements that came in, it's pretty much across the board. Every single integration got better, which is great, because that means none of these connectors are stagnating or going unused. So we've actually had tons of improvements — a lot on the dbt and Looker integrations in particular, but also small improvements all over the stack.
There's also documentation now on restoring indexes. It's not a turnkey recipe, but at least it tells you how to get it done — especially if you're running on AWS and you've got an RDS backup, how you can restore your Elasticsearch indexes and your Neo4j indexes from those backups. And there is one more thing that Gabe was dying to share with you all, so I'll let him go for it.
Gabe: Maybe you ingested some data that you don't want to expose to users, or maybe you overwrote some good metadata. In these cases you might want to roll back an ingestion run and either delete some updates that you made or delete entities entirely. So we've added some CLI commands to help make this a little bit easier for you, and I can give a brief demo of that now, so you can see what it's like to roll back ingestion runs. All right, let's do that.
So in this demo we're going to do some live deleting of data, and hopefully I don't delete my whole computer. Looking at my DataHub here, you can see that I have ingested a very bad field during an ingestion run, and I'm going to want to get rid of it. So now the flow is: I can go to my command line and say datahub run ls, and this command will go to DataHub and see what recent runs have been made.
Each time we do ingestion, we create a unique run ID and attach it to all the metadata we ingested. So I can see here the recent runs that I've done and a little profile of each. I think the most recent run I did was good, and I think the bad data got in on a small run where I was just trying to patch a little bit of metadata. But to make sure, I can run datahub run show, and this will provide a more detailed look at that individual run.
So I can see how many aspects were updated and also a sample of the metadata that was updated in that run. In this case, I can see that the schema metadata of this logging events entity was updated, so I can be confident now: okay, this is the run that brought that bad data in. In that case, I can then say datahub run rollback, enter that run ID, and it will go through and revert those changes.
So we reverted the schema metadata on the logging events entity, and when I go back to my DataHub and refresh the page, the schema metadata that I had ingested before now shows up and is indexed again — I get my good data back. If I were to revert the run that created this entity, the entity would instead be deleted entirely.
Again, in an ideal world no one would need to do this, but you know, accidents happen and we're here to help. So back to you, Shirshanka.
Shirshanka: We are back on the screen, right? So maybe a quick detail on exactly how this works. These commands obviously work for batch ingestions, because a batch ingestion produces a run ID that is attached to everything that was ingested in that one run. The run ID gets attached to every event that gets produced, and it is indexed in Elasticsearch for these previews. And, as Gabe described, the rollbacks revert changes at a per-aspect level. Awesome.
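For anyone who wants to script the same flow, the demoed workflow boils down to three CLI calls. A minimal sketch driving them from Python, using the command names as spoken in the demo — the --run-id flag name is an assumption (the demo supplied the ID interactively), and exact spellings may differ by CLI version:

```python
import subprocess

# Hypothetical run ID, copied from the listing output below.
RUN_ID = "my-bad-ingestion-run"

# 1. List recent ingestion runs known to DataHub.
subprocess.run(["datahub", "run", "ls"], check=True)

# 2. Inspect one run: how many aspects it touched, plus a sample of them.
subprocess.run(["datahub", "run", "show", "--run-id", RUN_ID], check=True)

# 3. Revert every aspect written by that run; entities the run created
#    are deleted entirely, as described in the demo.
subprocess.run(["datahub", "run", "rollback", "--run-id", RUN_ID], check=True)
```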
So now let's get into the first talk of the session, which is Data Observability Phase 1, and we have the team here. But before we get started —
Okay, great — so we have a few, but a lot of you are new, so a quick reminder of what happened in February. We actually did a live survey and shared out a few mocks of where we thought DataHub could go in terms of an observability perspective on top of the metadata. We came up with some mocks, shared them with the community, asked for feedback, and we even shared out the results of the survey.

We also got a lot of feedback that people couldn't wait for it — they wanted to have it, and they wanted to work with us on storyboarding these use cases and give us feedback. That was, I think, February. Then March happened, April happened, May happened, and we were busy doing a lot of other work, like building out the foundations for no-code metadata and delivering a bunch of other features.
John: Thank you, Shirshanka. I'm going to share my screen, if you don't mind. Okay — can everyone see my DataHub? Yes? Okay. So this presentation is going to be broken up into a couple of parts. The first will be the product side; then we'll go into the engineering design portion, which Dexter will talk about; and then we'll talk a little bit about how to get started using this feature, and Harshal will lead that section.
We think of observability and this set of features in two buckets: the proactive and the reactive bucket — or the explorer versus operator use cases — and I'm going to talk through a couple of different use cases using this new feature set we've built, to highlight some of the things we've done. So the first is the operator perspective. Imagine I own an Airflow pipeline that is computing some statistics.
Say I work at the NIH, so I have a lot of cancer data — I have something called "compute larynx carcinoma cases". This is a job that runs periodically, every month, and computes some information about carcinoma frequency by location, by geography, and it's something I own. Recently I've found that this job has been failing, and it's kind of interesting, because I haven't changed any of the code.
I haven't actually changed any of the configurations of the job either, so I'm going to walk through how I could go about debugging this problem using DataHub's UI. First, I'll go and actually look at the task that's causing trouble, and then maybe I'll look at the lineage view to understand where this data is coming from, or what my inputs are — and I can see that there's this upstream from Redshift called larynx carcinoma data 2020.
It has 3,742 rows and four columns, and then it has some stats about each column. We have the min and the max, if they apply; we have the null count, the null percentage, the distinct count for the column, and the distinct percentage — for ID that's obviously 100% — and then we have some sample values that we can show. And this is great and all, but it doesn't really tell me anything about the last few days, when my Airflow job has been failing. I can't see: is this healthy,
or is this not healthy? What you really need to make that determination is a picture of the dataset's history — you want to see how this dataset evolves over time, over its life cycle. So what we added is what I think is a really powerful tab called Historical, and basically what it shows you is the profiling history for this dataset. I'll start at the top here, where we can look at actual profiling runs.
What this is, is point-in-time snapshots of information about the dataset. And immediately, looking at just two days ago, the row count has a dramatic drop from 3,742 to 51, and what I can do is actually pull up that profile. So I can go back in time — sort of time-travel — and see profiling runs from history, say the last three months. Maybe I want to look at something from June, where there may have been an incident.
And this is also great, but I have to dig through this view, right? It would be nice to have an at-a-glance understanding of the dataset's health. For that, you can scroll down and you see two sections. The first is historical table stats, where I'm showing how the row count has actually evolved over time on this dataset.
You can see things look pretty good until just two days ago, where we had a massive drop. So something's up with this dataset — I should probably go talk to the maintainer or owner of this dataset to understand what's happening. Column count generally looks pretty good; we had a little anomaly here, but it's pretty steady at four.

I can also flip to some other columns and see how these have been acting in the last month, or the last three months in this case. You can see this patient age column had some weird things going on with the null percentage, corresponding to the time when we were in the 51-row range — so there are a couple of things clearly going wrong here. And then maybe I look at the gender column: its null percentage actually had an anomaly in mid-July, but that seems to have been resolved.
Now, the second use case, which I'll use the same dataset to illustrate, is the explorer case, where I'm coming in with the goal of building a new data pipeline. Perhaps I want to build a view on top of this data that breaks down carcinoma cases by, let's say, age. I can come in here and understand at a glance whether this is a reliable dataset to build on top of, right?
So maybe I see that in the last two days — holy cow — the row count has just been moving so much that maybe I shouldn't build something on top of this dataset; maybe it isn't providing the stable foundation that I need. Or maybe I'm able to go back and see that, even though the past week has been a little rough, generally this dataset has actually looked pretty good — so maybe it is a strong foundation.
We actually collect stats that aren't even displayed here yet, which we plan to display — such as column value distributions, standard deviations, and a few other miscellaneous statistics, which I'm sure Harshal will talk a little more about. So that's it for the product demo. I will hand it off to my colleagues to take you on a tour of the engineering design.
Harshal: Yeah, perfect. So the way we built this on the ingestion side is we actually built on top of an existing data quality tool called Great Expectations. This is a tool that a lot of people in our community already use, and in addition to the data quality rules, called expectations, it also supports profiling — and that's the piece we built on top of. We also support things like limit and offset and those sorts of parameters, so you can control the amount of data that you actually profile, and hence control cloud costs. And this works with all of our SQL-based sources.
So BigQuery, Snowflake, Redshift, MySQL, Postgres — any of those and many more. You can basically just set profiling enabled to true, as you can see on the right, and then we have a similar regex pattern that you can use to decide which datasets or tables you actually want to profile.
So if there's some random testing dataset that you've built, you probably don't need to profile that, and you can skip it when running profiling. And then, as you see at the bottom, when you run ingestion with profiling enabled, it will just automatically profile the tables you've got there and push all of that over to the DataHub back end, so we can collect a bunch of information, as you saw on the product side. A sketch of what such a recipe looks like follows below.
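For illustration, here is a minimal sketch of a profiling-enabled recipe using DataHub's Python ingestion API. The connection details are placeholders, and the exact profiling and pattern option names are assumptions to verify against the ingestion docs for your DataHub version:

```python
from datahub.ingestion.run.pipeline import Pipeline

# Hypothetical recipe: profile a Postgres instance and push results to DataHub.
pipeline = Pipeline.create(
    {
        "source": {
            "type": "postgres",
            "config": {
                "host_port": "localhost:5432",  # assumed connection details
                "database": "analytics",
                "username": "datahub",
                "password": "datahub",
                # The switch discussed in the talk:
                "profiling": {"enabled": True},
                # Regex deny pattern so test tables are skipped, as described.
                "profile_pattern": {"deny": [".*_test.*"]},
            },
        },
        "sink": {
            "type": "datahub-rest",
            "config": {"server": "http://localhost:8080"},
        },
    }
)
pipeline.run()
pipeline.raise_from_status()
```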
We also have a couple more things in the pipeline — things like histograms, or, for columns with low cardinality, frequency tables of all the distinct values. A couple of other things are coming on the data richness side as well, so stay tuned for that. I'll hand it over to Dexter to talk about the back end.
Dexter: All right. We've actually been adding quite a bit of time-based data to our system — starting with product analytics, then usage stats in the last town hall, and now this. Since this keeps coming up, we wanted to actually think about how to store it in a DataHub way. The first thing we thought about is: can we just ingest it in the form of an aspect into DataHub? You can see the usual path that an aspect takes.
First, an aspect is ingested; then we check with MySQL whether there's an update, and update MySQL; if there's a change, we emit an MAE (metadata audit event); and finally we update the search indices. When we tried to fit this new kind of data into this system, we realized there are some issues, specifically on the MySQL side: time-based retention is really hard to set up in MySQL, range queries and aggregations are much less efficient, and it's not optimized for high write throughput.
So, while preserving this whole system, we moved on to the next slide, where we check whether an aspect is of type time series. A time series aspect includes the timeseries aspect base, which basically just adds one field, timestampMillis.

In the search index we'll then have just rows and rows of events that have been ingested. Based on that we get easy retention setup, because now we have the timestamp field and Elasticsearch provides ways to create new indexes per day and retire indexes after a certain date. We also get fast range queries and aggregation queries, to get the most out of the statistics that we have just ingested.
This also allows us to avoid any bottlenecks with log-oriented data, where we have very high-throughput ingest coming in and we don't want to wait on MySQL to write it to its DB.

If you're familiar with no-code, this is essentially extending the no-code series of annotations to add the concept of a temporal, or time series, aspect. So now you can start emitting event-oriented aspects that are attached to your entities but never need to be looked up in a point-lookup way.
Then, on the query side, we get all the values of the same aspect according to a certain filter. In this case a filter might be an equality filter — the URN equals a certain URN, for example, URN equals dataset one's URN — or you can apply a time-range filter, like: I want all profiles that came in in the last seven days.

This lets us deal with very complex data types, because we don't have to do any aggregations, and it is what's currently powering the demo that John just showed. Basically, histograms or field values are all just simple aspects that get stored inside the index, and we just retrieve them; the front end is responsible for doing something with the retrieved aspect and showing charts or tables.
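To make the storage model concrete: a profile is just a timestamped aspect attached to a dataset URN. A minimal sketch of emitting one snapshot with the DataHub Python emitter — the URN, server address, and numbers are hypothetical, and the class names follow the current Python models, so older releases may differ:

```python
import time

from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import DatasetProfileClass

# Hypothetical dataset URN, borrowed from the demo's example table.
dataset_urn = (
    "urn:li:dataset:(urn:li:dataPlatform:redshift,"
    "public.larynx_carcinoma_data_2020,PROD)"
)

# One point-in-time snapshot. The timestampMillis field is what lets
# Elasticsearch do per-day indexes, retention, and range queries.
profile = DatasetProfileClass(
    timestampMillis=int(time.time() * 1000),
    rowCount=3742,
    columnCount=4,
)

emitter = DatahubRestEmitter("http://localhost:8080")
emitter.emit(
    MetadataChangeProposalWrapper(entityUrn=dataset_urn, aspect=profile)
)
```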
The second method is a work in progress, but we will release it as part of — not this release, but the next one. We want to create a SQL-like API to access aggregated values. There are statistics that are not like profiles — they're not snapshots — for example usage stats: we constantly get ingestion of query counts, and for those numbers to have meaning, we need to be able to aggregate across certain buckets.
Once that's implemented — awesome. But this is not the end; this is again just phase one, so we have a long way to go. First, on the platform side, short term we're going to merge in the initial PR for adding the time series aspect, then add support for aggregations, as well as move usage stats over. But more so, we want to connect more operational metadata streams — not just profiles from Great Expectations — to DataHub, and we have a lot more features that we're waiting to build. We'll continue building out the observability roadmap, and please give us feedback on any of this, so we can make this product better for you all.
Shirshanka: Thanks a lot, Dexter. So basically the initial PR for this will land early next week; it should have pretty much everything John demoed and the things that Dexter talked about. Right after that, we're going to add support for aggregations and move usage stats to that model.
Melinda: Oh, hi — and you're going to show the slides, right? Sure, okay. So first I just want to thank you for having me here, and thank you to the DataHub team for being so supportive as the New York Times begins to integrate DataHub. So yeah, I'm a new-grad engineer on the data governance engineering team, and today I'll be talking about a small but hopefully impactful component that we're going to be adding to DataHub, which is a user experience and satisfaction survey.
Just as I imagine most of the people here are using DataHub, the New York Times is using DataHub to have a centralized space for consolidating data, understanding how data is used, and discovering data. As the New York Times data ecosystem becomes more complex, having a data catalog is becoming increasingly important — which leads me to the next slide.
Some of the questions are: when should the survey pop up, if at all? We talked about maybe using a random function to decide when it would pop up, or using cookies to be strategic about the timing — maybe it would happen after certain actions, or after a certain number of visits for each user. Ultimately, we kept the implementation for the MVP pretty simple, by just having a persistent tab that's always available, so users can always take the survey whenever they want.
We also have some — oh, sorry about that — no worries. So we also have some related questions, like: how can we make questions configurable? And if we do make questions configurable, should the survey questions be kept in a table, or in a configuration file like YAML or JSON?

We considered keeping the questions in a table, but it would be pretty complicated for developers who want those questions to appear on the front end. We ended up going with — or rather, we will implement — a configuration file, so each company can decide for themselves which questions they want to use and just plug those in. We'll have a constraint on that, which you'll see in this upcoming slide.
So here's a sneak peek into what the front end of the survey looks like. On the left side you can see some specific questions, but I just want to point out that these specific questions belong to a broader category of question types. For example, there's an emoji-rating question to describe how a person's feeling — in this case we're asking "How's your experience so far with DataHub?", but we could also switch this out with "How are you feeling about this feature on DataHub?"
Another type of question is a yes/no question — in this instance we're asking whether users were able to find what they were looking for, but you could switch it out for another yes/no question. And lastly, we have a free-text response question, where right now we're just asking broadly for people to leave whatever feedback they have. So this could be a spot for people to report any bugs they found, give suggestions for future features, or leave whatever specific comments they want.
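The configuration format is still being designed, but based on the three question types just described, a hypothetical config could look like the following sketch — every key name here is illustrative, not a final schema:

```python
import json

# Hypothetical survey config: each company plugs in its own questions.
survey_config = {
    "questions": [
        {
            "type": "emoji_rating",  # broad question-type category, per the talk
            "text": "How's your experience so far with DataHub?",
        },
        {
            "type": "yes_no",
            "text": "Were you able to find what you were looking for?",
        },
        {
            "type": "free_text",
            "text": "Any other feedback, bugs, or feature suggestions?",
        },
    ]
}

print(json.dumps(survey_config, indent=2))
```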
Shirshanka: This is very nice. One thing I just thought of as I was listening: polls would probably also be very interesting to do with this, right — with multiple-choice answers or something — and then you get back a result.
Melinda: These are just some screenshots showing in more detail what the questions look like in the UI. You can go to the next slide, yeah. So — Martha was very kind and suggested saving the questions in a configuration file instead, so we were able to simplify this into just having a survey-responses table with the response text, question text, and user ID.
The only dilemma that we foresee is later down the line, when we want to show these responses in the analytics section of the website. It might be a little difficult when we want to analyze the responses and select all the ones that pertain to a certain question text. Say, for example, there are two different questions that are essentially asking the same thing — "How's your experience so far with DataHub?" versus "How are you liking DataHub?" Even though these questions are essentially asking the same thing, if we're grouping responses according to the question text, the responses wouldn't be grouped together, since the wording varies.
So some not-so-fleshed-out ideas would be maybe adding a question-type column, or using NLP in some way to group together survey-response objects that answer similar questions. And here's a high-level overview of all the parts that are going to be added for this feature — a big thank-you to John for drawing me a diagram last week, which helped me create this.
Some of this is still under construction, specifically on the back end, but just to give a quick run-through of this diagram: at the top is the survey component, with some question components inside — those would be the emoji question, the free-text response, and the yes/no. Once the user enters and submits their responses, the survey-response object PDL is used to make sure that all the things we expect — the response text, the question text — are actually there. This information is then sent to the create endpoint, where an event object is used to tell the table that we want to add a new survey-response object. The get-all endpoint would be useful eventually, when we want to have something show up in the analytics page — maybe we can use that get-all endpoint there.
And then for the future — like I said, this is still under development, but eventually you'll be able to use this survey feature by just plugging the questions that you want into the configuration file that we'll have. Some ways to contribute, since right now we're focusing on the MVP: once it's out, maybe adding cookies to be strategic about when the survey pops up and what questions are asked, or maybe figuring out how to analyze the survey responses to show in the analytics section.
Shirshanka: Thanks a lot, Melinda, for all of the great work — really looking forward to seeing this feature show up in the product soon. If you have any questions, feel free to reach out to her and give her feedback; she is always looking to improve the thinking behind it. All right, let's move on to our final big talk today, which is announcing the integration with AWS SageMaker for models and features. We have some familiar faces and one new face: Kevin is going to be talking about what he's done, but starting us off is Dexter.
Dexter: Awesome, all right. So I don't know if I've told some of you, but I was an ML engineer in my previous lives. For the last five years I worked on recommendations and search ranking, and while doing so I had this craving for a certain product to help me build ML models and succeed in my journey. So why is that?

I felt that training, deploying, and maintaining my models was very hard — and the reason it's hard is not just because it needs creative thinking or something like that; it was a lot of logistics. There's a lot of information that defines a model — features, hyperparameters, metrics, and so on — but a lot of the time when we train, we follow an ad hoc process.
We have a Python notebook, we train with a bunch of different parameters, and you have to try multiple combinations to find the model that actually works out in the end. And even after deploying that model, it requires continuous training and testing: you need to test whether it's working online, and if it's not, you have to iterate, try out another configuration, and ship it again. For me, it was often really hard to keep track of all of these configurations.
A model is not just the model itself, but often the product of a long chain of data collection — starting all the way from front-end logs, going through multiple processing steps to create fact tables, creating training data, creating testing data, and then doing the actual training itself. And oftentimes ML practitioners are given only a myopic view of the system, where you just have datasets that already exist. For me it was something called fact searches, which was a clean list of all the searches that happened; we'd create training data and testing data from it and then train the model. So we know just that part of the system, but there are often changes upstream that can cause very unexpected effects on the quality of the training datasets — which in turn often causes very unexpected model behavior.
I had one case where, for a month, we lost millions of dollars at Airbnb just because there was an issue with one of the features that we didn't expect to be there, and it took us two months to figure that out. Another issue: there was an Android bug where some of the logging in Android was not working — page views were not getting logged correctly — and that caused the efficacy of the Android model to go down, and that also took more than a month to figure out.
We want DataHub to provide this end-to-end lineage, where you can actually see where all the parts of the training data you use to train the model — as well as the data you use to do inference in real time — come from, from the very beginning. By providing this end-to-end lineage, we save practitioners all of the logistical time spent trying to figure out what's wrong with the system. So I'm going to give it up to Kevin for details on how we do this.
Kevin: Hey everybody, nice to meet you all. So, like Dexter was saying, one thing that we really want for the ML ecosystem is to have this end-to-end lineage from data, to jobs, to models, and eventually to deployments of those. We decided to start off with AWS SageMaker, which is Amazon's managed ML operations platform. The main reasons were that we had already built some integrations with other AWS things, such as Glue and Redshift, so we had some nice experience with their APIs. On top of that, SageMaker is a pretty well-fleshed-out platform that really has some deep lineage-tracking abilities, and it's already well used by some of the people that we've been talking to. So we thought it would be a nice start for us to jump into the ML space.
As for the main components of SageMaker, the ones that we focused on were, first, the feature store. Features are basically the entities that you feed into an ML model — what it's actually making its predictions on — and SageMaker has a feature store that lets you connect to, say, an S3 bucket or their built-in streams, which lets you save data both for training your model and for running inference online.

On top of that, SageMaker also has this concept of a job, which is quite similar to other workflow orchestrators, and they have a really broad definition — there are eight to nine different types of jobs, all of which we decided to support. You have different jobs for, say, training a model, for tuning the parameters that your model is using, and also for basic things like data processing and cleaning.
With jobs you also get some pretty interesting lineage features — for instance, one job can spawn another job. The canonical example: if you launch a hyperparameter tuning job, it's going to spawn a bunch of training jobs to test out each of the parameter combinations you're trying. And finally, there's the concept of an ML model, which is the result of running a training job.

So if you train a model, SageMaker has a model registry that lets you register that model and lets everybody on your team see the models you're using. On top of that, they have an abstraction layer in which you can package a model and later group models together for versioning as well. And after you've got a model that you think can be productionized, you can deploy it via something called an endpoint.
We had a couple of modeling discussions about how we want to represent these in DataHub. For feature stores, we decided to create this thing called an ML feature table, which is a group of ML features (ML features had already been added to DataHub). But we also noticed that SageMaker — as well as this other feature store that we've added support for, called Feast — has this concept of a primary key, which you can think of as a column in your feature store that acts as the unique identifier for each of your samples.
So we added that as well, and the idea is that each feature and primary key points to the source dataset — such as an S3 bucket or a BigQuery table — which can be both the offline store and the online store; the feature table is an additional layer for grouping together your features and primary keys.

On top of that, we extended the existing ML model config that we had, to link ML models to their training jobs, as is done in SageMaker. We also added the concept of an ML model group, in line with what SageMaker has, as well as an ML model deployment, which models SageMaker's model endpoints. And beyond that, we added new concepts for ML hyperparameters, as well as metrics, which we separated into training and online ones.
As for what it looks like: a feature table is pretty similar in terms of UI to a dataset, so you just have a list of all the features, their types, and whether or not they are a primary key.

For models, we made another screen showing — for now — just training metrics and hyperparameters, since SageMaker at least does not give us any online metrics. But we also have tabs for seeing which groups a model belongs to and other custom properties of a model, and in the lineage view you can see the flow from, say, the dataset that the model is using, through the training job that produces the model, to any groups or endpoints that the model is part of.
The SageMaker recipe is pretty similar to our existing ones in terms of credentials — your AWS region and your keys, the same as, say, Glue or Redshift — but we have a couple of additional configs for controlling granular access: whether you want to ingest your feature groups, your models, or your jobs. And for jobs there are additional options, which you can see in this link, for the specific types of jobs that you want to ingest.
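A minimal sketch of such a recipe through the Python ingestion API — the region, credentials handling, and the extract_* toggle names are assumptions to check against the SageMaker source docs for your DataHub version:

```python
from datahub.ingestion.run.pipeline import Pipeline

# Hypothetical SageMaker recipe: credentials work like the Glue and
# Redshift sources, plus toggles for the entity types described above.
pipeline = Pipeline.create(
    {
        "source": {
            "type": "sagemaker",
            "config": {
                "aws_region": "us-west-2",           # assumed region
                "extract_feature_groups": True,      # feature store tables
                "extract_models": True,              # model registry entries
                "extract_jobs": True,                # training/tuning/processing jobs
            },
        },
        "sink": {
            "type": "datahub-rest",
            "config": {"server": "http://localhost:8080"},
        },
    }
)
pipeline.run()
pipeline.raise_from_status()
```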
As for what's been done, just a quick summary: we ingest jobs, features, and models, as well as all of the lineage that SageMaker gives us on that end. Besides that, we have also added a connector for Feast, which is another feature store. What's coming next: we've been really looking into lineage between features and models/jobs, which would complete our lineage graph, so to speak.

We've been talking with the SageMaker team at AWS, but it seems that they don't have their own native support for that yet — it's planned for later this year, so we'll be on it as soon as it comes out. Besides that, we're also just very open to integrating more ML platforms, now that we have these models in DataHub. So if anyone has suggestions, feel free to message us on Slack.
Shirshanka: That's great — and I see that some people are already letting us know what platforms they're using in their stacks. I think there was a PR on MLflow that came in a little bit earlier, so maybe now is a good time to get back to it, refresh it, and merge it in. Right, Kevin?

Kevin: Yep.

Shirshanka: Cool — so this is basically it for the talk, correct? Awesome. As this release gets cut, if you're using SageMaker, try it out and let us know how it works for you. We'll of course have our own demo pipeline up on the demo website, so you'll be able to see what it looks like when a real team runs and trains a machine learning model.
Shirshanka: We are running low on time, so let's skip that — we'll just put it up on demo.datahubproject.io and then let you know on Slack, so you can go try it out. All right, a quick peek into the future. I don't know if everyone knows, but we have published a roadmap, and this is what the roadmap looks like for this quarter.
We just discussed dataset profiling and data previews today, and we're starting to work on data quality. There's already a lot of feedback on the RBAC RFC — fine-grained access control for metadata — and we're continuing to work through the feedback and starting to implement it. We would love some help on SQL parsing: column-level lineage is something we want to get to, so if someone has SQL-parsing creds and would love to work with us on that, we'd love to collaborate on that one. And then there's building out integration with operational metadata — now that we have support for time series data, that's going to be much easier, so we're looking forward to building out that stack over the next three months.
In addition, there are other things happening that we didn't really put on the roadmap per se. There will be continued advancements in the no-code capabilities, including better ways of serializing these events and integrating with the platform. Integrations are something we generally don't put on the roadmap, because we just try to be agile, react to what the community wants, and either facilitate collaborations or do some of them ourselves. So, a quick announcement: integrations with Redash and Metabase are actually starting up — there are folks in the community already working on them — so message me if you want to collaborate on these integrations or want to start a new one. We're also starting to go back and do a fit-and-finish run on top of the product.
So you'll see some of that fit-and-finish work happening over the next month or two — and that's pretty much it. Stay in touch, and we'll see you at the usual time, the fourth Friday of next month. Most of the questions that I saw coming in over the sign-up sheet were all about data observability, so hopefully this answers your questions around what we're doing and where we're going. But if you have other questions, let's take them on Slack — and if there are any burning ones, bring them up right now.