From YouTube: Datahub Community Meeting (Full) : Jan 15 2021
Description
Full version of the DataHub Community Meeting on Jan 15th 2021
Agenda:
Announcements - 2 mins
Community Updates - 10 mins
Use-Case: DataHub at Viasat by Anna Kepler - 15 mins
Tech Deep Dive: GraphQL + React RFCs readout and discussion by John Joyce and Arun Vasudevan - 15 mins
General Q&A from sign up sheet, slack, and participants - 15 mins
Closing remarks - 3 mins
A: Awesome, all right! Welcome everyone to the first DataHub community meeting of the year. It's been a pretty impactful year for everyone, professionally and personally, but we're looking forward to amazing things. It should be 1/15/2021, but I put in 2020; this is where everyone's mind is at, right? We feel like 2020 has not yet gone, but hopefully this year will be better for everyone. The vaccines are around the corner.
A: So, let's get to the agenda. It's pretty packed: quick announcements, some community updates, and we have Anna from Viasat, who's going to be talking about how DataHub has been deployed at Viasat and their experiences.
A: Then John will do a quick readout of the RFCs that he's recently published and also lead a discussion with the community; then we'll go into some general Q&A based on the sign-up questions and questions that might come up during the presentations, and then we'll close out. All right, announcements: a very short announcement. As some of you might know, I have left LinkedIn recently and I'm starting my own venture.
A: On the community side, we've had a few interesting events happen in the metadata community. One thing that we were participating in and leading was an industry-wide Metadata Day event that happened on December 14th last year. I have put up a few slide decks from the conference. So what happened?
A: It was quite interesting, because we got a lot of projects together for the first time. You know, often you go and read an Amundsen blog post and of course Amundsen is the only system mentioned, or you read a DataHub blog post and DataHub is the only system mentioned, or you go look at Collibra and they're the only system mentioned. We wanted to bring all these people together to talk about what the real problems and the real issues are. So it was good to get everyone together.
A: I was actually going over the survey results, and, you know, it's been a busy time; hopefully I will get to publish them. The audience did fill out a quick survey, and it was nice to see that DataHub awareness was quite high. It was actually one of the highest-awareness metadata systems, at least among the audience that attended, so that was nice to see. In terms of quick Cliff's Notes (you should go watch the whole video; there were a lot of great conversations):
A: the Cliff's Notes would be that stream-first architectures are definitely important. People agreed that lineage was an important problem; there was a lot of chat around fine-grained lineage being important. And some of the big thinkers, who have actually lived through this journey many more times than we have across other industries,
A: repeatedly kept warning us about the dangers, or the challenges, in getting higher-level understanding from metadata, and this is definitely something that we are very aware of. So I highly recommend going and checking it out, and hopefully we'll have more of these events and spread the metadata word across the industry.
A: The next thing, from a DataHub-community-specific perspective: I dropped in a poll around Christmas time. I was actually quite surprised to see how many people were still around and voting, so thanks for all the votes; we got quite a few.
A: I did a quick scan over which ones got the most votes and tried to bucket them. As expected, product features are at the top of the list, top of mind for everyone: field-level lineage, showing pipelines and flows, being able to have social features, business glossary, dashboards, and, funnily enough, visualizing the metadata model.
A: So this is just what the polls say. That doesn't mean this is exactly what the roadmap will look like; stay tuned. I think we have to work with the community and make sure that we are able to build these features across all the different companies and projects that are working on DataHub, and make sure that we are able to deliver all of this for the community.
A: And now I would like to switch over to Na, because there's been an incredible amount of work done in the last quarter itself. Na and I were going over all of the RFCs and PRs, and it was just amazing. So I'd like her to go over what has been accomplished in the last quarter and give a brief intro to the work.
C: Sure. Hello, good morning everyone, and welcome again to our monthly town hall; very excited to see the participants. This is Na from LinkedIn; I'm working more closely on the GMA side, the generalized metadata architecture.

C: So I really wanted to take the opportunity here to show our recognition and appreciation for all the contributions to DataHub, and I'll quickly go over a few that I show here in the list. In the first category, you can see that we have this long-awaited feature, which is field-level lineage.
C: This is trying to provide read-after-write consistency in our system; the other two, the secondary index for search and the graph, are eventually consistent. We also see a lot more RFCs coming. One is from, I think his name is Madhu, I'm not exactly sure, but I put his ID there, for the business glossary RFC; that's great. And we also have a recent RFC from John Joyce for GraphQL.
C: Thank you very much for all the contributions; we are looking forward to more coming this year.
A: Absolutely. And one of the things, you know, having been at LinkedIn, I know how hard it is for some of the engineers to keep both the internal DataHub going and make sure that the open-source DataHub continues to stay vibrant, so a lot of appreciation for all of the work you've done.
A: We are also, and thanks Na for some of the meme suggestions that she gave me last night, starting to give some shout-outs to specific people in the community, and this is the first edition of what I would call the DataHub awards.
A: This is for someone who has provided us something that we weren't anticipating, or something that wasn't directly on our roadmap. You know, we are often busy with our blinders on, saying this is what we want to build, and we keep working toward it, and then suddenly something comes in and says, well, how about this? And so this time the award goes to, no surprise, Saxo Bank, for contributing not just the business glossary, which Na called out, but also Kubernetes support.
A: They have actually been amazing partners, driving a lot of these advanced things that we don't directly think about. There's a FIBO financial glossary that they contributed; it was not something LinkedIn was focusing on. So thanks for all of the contributions. I don't know if anyone from Saxo Bank is on the call and would like to say a couple of words.
D: Yeah, hi. Thank you so much for calling out Saxo Bank. I think it has been an amazing partnership. In the last eight, nine months we got a lot of support from LinkedIn, and, you know, definitely, this is the third-generation architecture.
D: We didn't realize what we were picking up when we picked it up, but I think the entire bank, including the governance committee and the business stakeholders, is very, very convinced of our selection of the tool, and it has been made easier by the partnership that we've got from LinkedIn. And our vision in selecting it is now proving true, right: we're evolving the culture with the tool by contributing to the community.
A: We have a newest deployment in the community, at least that we are aware of. This is the latest company that has been brave enough to push the red button and deploy DataHub to production, and the winner is Shivam Gupta at Grofers. I checked with him last night: hey, did you deploy DataHub to production? And he said yes. And I said, is it still up? He said yes. So hopefully he is around to say how that experiment went.
A: He did say that they are having a midnight sale today in India, kind of like the Prime Day sale, and he's on call, so he may not be able to attend. Well, next time we will catch him. Moving on, all right: this is the perseverance award. I don't think we want to give it too often, and I'll tell you why. This is basically for the community member who has spent the most amount of time pushing something over the line, and this time the award goes to Arun at Expedia,
A: for being so persistent in pushing the ML models PR. He started in July, and then Jyoti and Karam, and I think Mars also, were working with him very closely, helping him make it better and better, and it took all the way to September for us to get it to a point where we were able to check it in. We would like to shorten that cycle, but we would love more people like this, who are able to actually do the right thing with us.
E: Yeah, thanks Shirshanka. With the ML models PR, I learned quite a bit from the folks who reviewed it. There were quite a few changes from what we thought internally, so it's been a great process. We have been with DataHub for about eight, nine months, and it's been a great journey. Thank you.
A: Awesome, moving on. This is another award that I came up with: the tech excellence award. You know, at LinkedIn we have three vectors of excellence: we talk about leadership, we talk about execution, and we also talk about craftsmanship. And sometimes we talk about
A: two out of three, and sometimes it's only one out of three. But this time, as I was looking at the last couple of months and at the output from the team, I had to say that the tech excellence award for impact should go to John Joyce, for the amount of impact he has had in such a short amount of time: ramping up on DataHub, producing high-quality RFCs, and also backing them up with POCs on both GraphQL and React.
F: Yeah, thanks Shirshanka. I won't spend too much time here, because you'll hear from me in a little while, but I'm glad to be part of the community and glad to share some of the work I've been doing recently.
A: Awesome. Okay, so now we can switch over to the tech part of the presentations. First up, we'd like to have Anna walk us through her journey with DataHub at Viasat. Anna, would you like to take over the screen share?
B: Great. Well, first of all, thank you so much for having me. It's been a real pleasure to pick this tool and work with the community to get it into production. We deployed DataHub in production just a couple of months ago; it's been working really well, and so we're excited to share how we are using it, what our future plans are, and how we arrived at the selection of this tool.
B: My role at Viasat is technical product manager of our analytics platform. I've been working with data for quite a bit; it's always been a passion of mine, and metadata is definitely part of that as well. Viasat as a whole is a satellite communications company: we're an ISP, and we provide internet globally, to communities, residential customers, commercial aviation, and a variety of other services. And so at Viasat, the state of our users and data is definitely very complex.
B: Our core data platform, which I'm a part of, has a lot of different data sources and microservices, data flows, and data analytics tools. But as the core data platform was evolving, in parallel we ended up with a lot of mini data lakes, databases, and data sources all over the place, so it's definitely been a complicated landscape to work with.
B: In addition to that, we have a variety of differently skilled users, from data suppliers and preparers onward; and even within the data consumers themselves there are different capabilities, different skills that users have when they interact with data, from the data analysts who work more with reporting tools to the data scientists who are ready to really dig in and explore the data sets. So there are definitely quite complex data personas as well, and the company has set the goal to really be very data-driven. Part of that, of course, is just really even getting access to data: how do we find all these small mini data silos and expose them to the users in a very approachable way? And so, the challenges that we are trying to resolve:
B: I don't think they are really new to many of you, since you are here and part of this community. The siloed knowledge about data is definitely one of the challenges we're trying to address by introducing DataHub and the data catalog: helping our users to even find data, to remove this tribal knowledge, to remove a bottleneck.
B: Our small analytics team has to constantly work with different teams to explain to them all the data that's available, what the data contains, and how best to operate on the data. So the analytics process is slow; many of our users, for example, complain about even finding what data is available to them. Can they even access the data? What's where, and what teams do they work with to get into the data? And even in the last couple of months, as we introduced the data catalog,
B: the community has already been very excited, and we see a lot of users coming to the data catalog, looking around, asking questions, and providing a lot of feedback. So it's definitely been a pleasure to introduce that within the company.
B: In addition to that, one of the interesting challenges we're trying to solve by introducing the data catalog is finding all the siloed data infrastructure and potentially integrating it into a core data platform, to help the company decrease operational costs that today are a bit inflated, to decrease operational complexity for many teams, and to really concentrate on data processing and utilization of data rather than on maintaining and securing a lot of data systems.
B: Our compliance team is very small, and as the company grows and works with global customers around the world, including Europe, Brazil, and Africa, a lot of countries, as you all know, have introduced different compliance laws and policies. So the company has definitely been looking for solutions. We had started looking at commercial solutions for a data catalog but unfortunately couldn't find anything that would really fit our needs. I was just in a meeting this week with our compliance team about DataHub, and I was sharing everything that Shirshanka and I talked about earlier this week, all the roadmap and all the features that will be introduced, and they've been very, very excited to see it and to work with us to solve a lot of their use cases, and a lot of the manual operations that they do today, and really improve the compliance posture within Viasat itself.
B: Our technology evaluation started sometime in June last year, I think. We were very excited when DataHub became open source; we'd been following the journey of that product within LinkedIn. We've always been fans of LinkedIn products; we operated Kafka and a few other systems, so it's always been a pleasure to see what LinkedIn open-sources as a product. So DataHub definitely made the list for evaluation, and among the other systems we looked at was Apache Atlas.
B: The feature richness we were looking for: lineage was really important for us, along with ease of search over the data, different metadata ingest methods, overall security of the product, and data modeling flexibility. That last one was important to us because, as I pointed out earlier, there are definitely a variety of mini data silos around the company, and we anticipated the challenge of trying to model all the data in a very flexible way, to ensure that we could onboard all the teams and not be limited in our ability to onboard them. And then ease of development.
B: What do I mean by this? We like open-source products, and we like to contribute to open-source products as well, so we did evaluate what tech stack is behind each of these products, to ensure that we are capable of submitting PRs, really understanding the code, and maybe even helping with some bug fixes with the community. That was one of our evaluation criteria. And then ease of operations:
B: our team is very small and we operate a variety of different tools and systems, so the easier the process is, the better. The stability of the product is really important for us: the upgrades, deployment and promotion from development to production, and the ability to integrate and test the tools before promoting to production. All of these components were definitely evaluated, as well as scalability. We have a lot of data and a lot of different microservices.
B: So when we start talking about lineage and the ability to really capture a lot of the different events happening with the data, we wanted to make sure that scale would not be a problem for the new tool that we eventually selected. And the roadmap: we knew that we wouldn't be able to take advantage of all the heavyweight features immediately for our customers.
B: So we wanted to do a slow rollout, a gradual addition of the various features within Viasat, and so we looked at the DataHub open-source roadmap, and multiple features aligned really well with what we were trying to do: introducing the lineage, the ML models, and some of the metrics functionality and data quality ratings.
B: The timeline looked great, and just the fact that we were seeing everything we needed on that roadmap was really exciting. And then the community rating: for the product itself, the GitHub rating, and just how well the community supports the product; we took a look at that as well. And I guess it's no surprise, since I'm here today, that DataHub was the product we selected, and so far so good; it's been a really, really good journey.
B: However, we implemented our own UI. Not really implemented it per se, but we did have an existing interface for access requests, with some basic search functionality, that held some of the metadata already, and so we reused that, because customers were already familiar with that UI and all the access requests in it were automated, and we didn't want to remove that from our customers. We also didn't want to extend and fork the DataHub UI.
B: So that's one of the reasons why we went with our own UI. We also added a feedback button to our UI, to gather as much information from our users as possible.
B: We introduced some product metrics, so we have product analytics being gathered from this UI, to really understand how users are interacting with the data and what types of features they want as we introduce new features, to make the experience as easy as possible. And then there's the flexibility of integrating with some of the tools that we have; we wanted to keep that option.
B: As an example: the global metrics store that we are working on, to express that in the same interface; some potential visualizations of the data, like sampling of the data or some simple graphs, within our UI itself; and maybe even, for our compliance team, introducing some type of reporting mechanism within the UI, and having it serve as one integrated experience.
B: I chatted with the team this week to really understand whether there were any issues in the deployments as we were doing them, and there were minor things in the beginning when we were integrating with our Kafka, the secure Kafka. I believe Javier Sotelo on my team contributed a small code change exposing some of the Kafka parameters at the time, and we were really happy to see that it was accepted and merged, I think within days, so we were able to quickly proceed with our deployments.
B: I think some of the other fixes he has contributed have been accepted as well; I think he even helped with some of the code reviews. So it's been really good to see just how responsive and welcoming the community has been. It's really been good for us.
B: So, as I mentioned, it's been operating in production really well for a few months now. We kind of bypassed the full dev setup and went straight to production, and then did the dev setup afterward, so we could iterate on all the new functionality very quickly. So today we operate both.
B: I think we haven't done much complex modeling to date, but so far we've also had a great experience getting support for a lot of the data that we maintained in some of our small mini catalogs, and it was very simple. And we were really excited to see the RFC for the business glossary; we've been looking at a compliance taxonomy and thinking
B: maybe we could contribute to a conversation around that as well, since it's on our roadmap, and we're excited to work with the community to see how that could be a joint conversation. Even just presenting here today is definitely a pleasure; thank you for having me. So that's kind of what we have today, and the future state: we do have DataHub in production, we've harvested some of our core data platform, and we integrated our UI. Right now we're starting to work with the rest of the Viasat teams, and we've already got a lot of interest from all these teams, which was very good to hear and see, because we realized that we're not the only ones who operate data infrastructure and needed this functionality. And the team has been delivering very good products and good service within the company, and so we have the trust of our teams.
B: We shared the DataHub information with them, and all the teams have been really supportive of our selection of the product, so that's making this adoption much easier for us. We're working with the rest of the teams to ingest information about their data, and working with the compliance team to see how we could introduce the features necessary for them.
B: The lineage has been really anticipated within the company; they're waiting on that, and it will probably be around the summer that we're really integrating with all our tools to introduce lineage visualizations. And then dashboards and reports: we're really excited to see the RFC and the backend implementation for dashboards.
B: We've definitely been following that very closely, as well as the ML models. We're starting to standardize a lot of our ML approaches around the company, and those will be very good to add into the data catalog. And then metrics and data quality integration would follow close behind.
A: Awesome, thanks Anna. For people who might have questions for Anna, let's hold off and have them at the end, just because we have quite a lot to get through and only 20 more minutes left. So I'm going to switch over to our second talk for the day, in which John and Arun are going to talk about GraphQL and React.
F: Yeah, sure, thanks Shirshanka; thanks Anna, that was great to hear how you guys are using it at the company. So, as many of you may know, I'm kind of a recent community member; I joined about a month ago. Previously I was working at LinkedIn on a federated GraphQL layer, and more recently I have joined Shirshanka in his venture. As part of booting up with DataHub, I set myself a goal to try to do some work on the front end.
F: Specifically, I was looking to extend the front end to add something like the dashboards or charts that have been added in the back end, and quickly that initial goal morphed into a different goal, which was to see how quickly I could spin up a parallel stack on the front end, specifically choosing React and GraphQL. And what I want to share with you guys today is what I built and why I built it.
F: So there are kind of two parts to what I've been working on. The first is actually adding a GraphQL endpoint into the existing datahub-frontend server, which is a Play server. The reason I chose to add it there was, first of all, that it was just the clearest pathway: we already had the server, and we could just add a new endpoint there that supported the GraphQL spec.
F: I didn't have to spin anything else up, but I also figured that the existing Ember app may benefit from being able to talk to a GraphQL server in the future, so I decided to place it there. Basically, what I've done is just add a little module to datahub-frontend. The second part is introducing a new React client that talks to this GraphQL endpoint to create the views on the front end. So what we essentially have here is a parallel UI stack.
F: What I'm hoping to get out of this is to actually be able to incubate both of these pieces with the help of the community over time, and to iterate on them with the community collaboratively.
F: So first I'll talk about why I thought it was a good idea to add a new application.
F: Specifically, you know, I had come to the project without much experience with Ember, and I personally faced the steep learning curve of Ember trying to get up to speed on the existing DataHub client. So I think that, technology-wise, it would make sense to add a React app, mainly because it can extend the reach and accessibility of the DataHub front end; as you guys probably know, React is quite a bit more popular and much more familiar.
F: On these aspects specifically: making sure that the levers for customization and extension by individual organizations are built into the front end from the beginning; getting a chance to clean up some of the dead code that may not be used today; and then, finally, making sure that we have very clear and documented paths to doing things like extending the front end.
F: You know, in order to understand what the endpoints from the Play server were returning, which were just JSON globs, I'd have to probe those endpoints directly. Some of them had these view models that are specific to the client, like DatasetView, and some of them were just GMS models that were passed through, like the CorpUser PDL.
F: So I found that it really helped my own personal iteration speed to take the time at the beginning to establish an explicit contract, and I think GraphQL is great for this, because it provides these self-documenting, strongly typed, and validated contracts. It provides a dependency, this intermediate API layer that the client can depend on, which also makes it very easy to switch the implementing server technology in a way that's opaque to the end client.
F: The next thing is, I think GraphQL in itself reduces the API calls, as well as the noise, by virtue of being able to ask for exactly what you want, and being able to traverse related entities with fewer API calls. So there's just a reduction in this back and forth.
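As a hypothetical illustration of that reduction (the query fields and type names below are assumptions for the sketch, not DataHub's actual GraphQL schema), a single query document can name exactly the fields the client wants and traverse from a dataset to its owners in one round trip, where a REST-style client typically pays one call per related entity:

```typescript
// A hypothetical query: the client asks for exactly the fields it
// wants, and traverses from the dataset to its owners in one request.
const DATASET_WITH_OWNERS = `
  query DatasetWithOwners($urn: String!) {
    dataset(urn: $urn) {
      name
      description
      ownership {
        owners { username fullName }
      }
    }
  }
`;

// REST-style access needs multiple round trips: one for the dataset,
// then one per related entity the client wants to expand.
function restCallCount(relatedEntities: number): number {
  return 1 + relatedEntities; // dataset + each traversal
}

// With GraphQL the traversal lives in the query document itself,
// so the round-trip count stays constant.
function graphqlCallCount(_relatedEntities: number): number {
  return 1;
}

console.log(restCallCount(2));    // 3
console.log(graphqlCallCount(2)); // 1
```

The same mechanism is what removes the "noise": fields the client never asked for are simply not serialized into the response.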
F: So if you guys want to see how I did it, I have a few RFCs open: one on GraphQL queries and one on mutations, where there are still some discussions ongoing, which I'll get to in a bit. And then I have a new proposal to incubate this parallel React app, as well as a proof of concept for all of those things. With that, I'm going to just quickly jump into a demo of the React stuff.
F: This is what I've been working on. Okay, let me share my screen; let's see. Okay, yeah. So right now this is very mocked: we are working against GraphQL, but we're using a mock GraphQL server instead of datahub-frontend to actually populate the data. I'll just show a few screens as a proof of concept. Here you can see I directed us right to a search, because this is something I've already implemented, but you can see it's pretty similar to the existing search.
F: Yeah, okay. Now I'll just quickly talk about some of the open discussions we have on these topics. The first thing is how we should model GraphQL queries.
F: Our proposal is that we essentially take the public GMS models, not the entity and aspect models, but the models that are exposed at the GMS get and batch-get API layer, like Dataset.pdl for instance, and use those to auto-generate the GraphQL schema, such that we don't have to maintain multiple type systems or schemas across different layers of the stack. Right now we kind of have this divergence.
F: In some cases there are different view models on the front end, and what that means is that it's just more difficult to extend; there are more steps to making changes. However, I recognize that there may be cases where we need those front-end-specific fields, and I think we can do both with some sort of extension system, which we can talk more about offline.
F: The second thing is modeling mutations. There are kind of two schools of thought floating around right now. One is keeping both the mutations and the queries entity-oriented on the front end, which means we don't explicitly model the concepts of entity and aspect, specifically the differentiation between the two, on the front end; the front end simply treats all of these models as entities, just single documents, which it can do updates against. That's opposed to having different routes or different mutations for each aspect, say ownership, schema, and so on, that you need to change; instead, you can just have one top-level dataset that you can update with all of that information. I think the downside to the aspect-oriented approach is that you just have more coupling throughout the entire stack, where aspects as a concept bleed across everywhere, into datahub-frontend and then eventually into the client.
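To make the two schools of thought concrete, here is a minimal sketch. All the type, field, and input names below are made up for illustration and do not reflect a settled DataHub schema. In the entity-oriented style the client sends one partial document update per entity; in the aspect-oriented style each aspect leaks into the API as its own mutation:

```typescript
// Entity-oriented: one mutation per entity; the input is a partial
// document, and the server works out which aspects actually changed.
const ENTITY_ORIENTED_SDL = `
  type Mutation {
    updateDataset(urn: String!, input: DatasetUpdateInput!): Dataset
  }
`;

// Aspect-oriented: the aspect concept is exposed to the client, so
// every aspect needs its own mutation (and the client must know them all).
const ASPECT_ORIENTED_SDL = `
  type Mutation {
    updateDatasetOwnership(urn: String!, input: OwnershipInput!): Ownership
    updateDatasetSchema(urn: String!, input: SchemaInput!): SchemaMetadata
  }
`;

// Server-side, the entity-oriented update can still be a simple shallow
// merge of the patch into the stored document, aspect by aspect.
type Doc = Record<string, unknown>;

function applyEntityUpdate(existing: Doc, patch: Doc): Doc {
  return { ...existing, ...patch };
}

const before = { name: "fct_users", ownership: { owners: ["alice"] } };
const after = applyEntityUpdate(before, { ownership: { owners: ["bob"] } });
console.log(after.ownership); // { owners: ["bob"] }
```

The trade-off the talk describes shows up directly in the two SDL strings: the first keeps the aspect concept behind the API boundary, while the second couples every layer, front end included, to the aspect list.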
F: So this is still very much an open, ongoing discussion, and I'm interested to hear what the community members think, for certain. Then the next thing is the React/Ember drift problem, and I think we're very aware that this can happen. Our straw man is that, in the long term, the React app should become the default disposition of the community, and what that means is that, in the short term, there's definitely going to be a functional difference as React
F: gets up to speed to achieve parity with the Ember app; but once that happens, we'll have to talk about how we strategize around migration.
F: I think in our proposal we'd recommend a migration of clients from the Ember app to the React app at that point, and then eventually deprecate support for the Ember client in the long term. And with that, I'm going to talk about one other open discussion we have going right now, specifically a collaboration with Expedia, who is also interested in GraphQL, albeit in a different light. So I'm going to hand it off to Arun, who's going to talk about their experience with GraphQL and DataHub.
E: Thanks John, yeah, that was really good. So I'm Arun Vasudevan; I'm an engineer at Expedia Group. Similar to what Viasat was talking about, our internal approach is like this: we have the backend DataHub, and our UI is completely internal, something we stood up ourselves, so we are using the DataHub backends to pull information, right now directly from the data stores. I'll jump into some of the motivations behind this GraphQL approach internally and what we are doing along with John. Internally, we have a React and Node.js front-end application which talks directly to the data stores, the MySQL and Neo4j, in order to read some of this data.
E
So that's the main motivation. Where it fits in the architecture is mainly: from the front end, there would be a separate GraphQL service (we're now thinking of it as a Spring Boot service) that would call the common resolvers. The common resolvers are basically where John and I would work together on coming up with something, because John is calling them from his Play server as well. So we didn't want to duplicate these resolvers; we would try to come up with something common for both of us, and these would call the GMS DAOs directly through rest.li. The rest of the architecture would be familiar to you, because it's similar to what DataHub is; these green components are the only ones that are added. Moving on, these are some of the details on how I'm planning to implement it. So the metadata GraphQL API in itself would be a standalone deployable that would call the common resolvers to get all the resolved fields, and from there would be the rest.li call to the GMS DAOs, specific to the GMS clients, say the datasets or ownerships or any of the other things, like the ML model that gets added. So all of this would be called from the common resolvers. This way, we would be using the same code across both the front-end and GMS APIs. That's all from me.
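The shared-resolver idea described here, resolver logic written once and called both from DataHub's Play frontend and from a standalone service, might look roughly like the following. This is a hedged sketch in Python rather than the actual Java code; the class and function names are assumptions for illustration, not DataHub's real interfaces.

```python
# Sketch of "common resolvers": resolver logic written once against an
# abstract GMS client, so multiple deployables (e.g. DataHub's frontend
# GraphQL endpoint and a standalone GraphQL service) can reuse it.

class GmsClient:
    """Stand-in for a rest.li client that calls the GMS DAOs."""

    def get_dataset(self, urn: str) -> dict:
        raise NotImplementedError

def resolve_dataset(client: GmsClient, urn: str) -> dict:
    # Shared resolver: fetch the entity, shape it for the GraphQL layer.
    raw = client.get_dataset(urn)
    return {"urn": raw["urn"], "name": raw.get("name", "")}

class FakeGmsClient(GmsClient):
    """In-memory stand-in, used here only to show the wiring."""

    def get_dataset(self, urn):
        return {"urn": urn, "name": "fact_table"}

# Either service wires in its own client; the resolver code is identical.
result = resolve_dataset(FakeGmsClient(), "urn:li:dataset:fact_table")
```

The point of the pattern is that only the client wiring differs per deployable, which is what avoids the resolver duplication mentioned above.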
F
Thanks, Arun. Yeah, I'll just go ahead and summarize that discussion. Although the specifics are still in flux, I think what we're thinking is that, because Arun's use case requires sort of a standalone application that will not be communicated with from a front end like React or Ember, but instead from a Node server, we'll essentially have a common library that we can both work against. That library has essentially a shared graph derived from the GMS models, as well as a shared set of resolvers that can be pulled into both of those deployables.
F
And
this
is
just
the
the
overall
picture
here
where
you
have
the
expedia
node
server
and
the
ember
and
the
react
app
in
the
picture
as
well
here
yep,
so
that
that's
pretty
much
it
thanks.
Thanks
erin
appreciate
it.
Thank
you.
F
A
All right, thanks, John and Arun, looking forward to the collaboration here. In terms of the general Q&A, I was just taking a look at the sign-up sheet and the questions over there. One question that has been asked from the community a couple of times is: we had seen some sneak peeks of the new lineage UI; when is that getting rolled out to the open source version? So, Harsh or Nacho, if you guys are on the call, maybe you can share your plans.
G
Thanks, this is Harsh. I don't think Nacho could make it today, but we are actually working actively on rolling out our new UI. It's definitely a leap in terms of the user experience, so it will make it much easier to understand the flow of relationships. The other thing that we are also actively working on currently is to have jobs as nodes within the UI, to build the lineage end to end, which should also show different transformations or movement of data happening across the ecosystem. So stay tuned for that; we should be hoping to launch that this quarter. There are a few other things coming from the lineage perspective: as you saw earlier in the talk, we checked in the RFC for Azkaban jobs and flows, so we want to launch a reference implementation for onboarding those entities, and that should also serve as an example for the more popular Airflow integration that folks have asked about. So those are a couple of things that are in the pipeline, and hopefully we'll get that out from LinkedIn as well. Yeah.
A
Awesome. Is that going to include the Spark integration as well?

A
Awesome. Were there any other questions that people wanted to get to?
H
So, hi, I'm Ryan, and I work with Arun at Expedia. We've had some performance issues with Neo4j, especially on loading. I know that LinkedIn had mentioned they were also experiencing some performance issues and were looking into different alternatives and ways to fix it. One of the things I've been looking into recently is Dgraph; based on some of the benchmarks that they've posted, it looks pretty promising. I've done some preliminary tests with some small loads as well, and it looks like there's a pretty massive speed-up. I was just wondering if y'all had done any research into some other alternatives, including maybe any progress you all have made on the Kafka Streams connector for Neo4j.
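For context on the kind of preliminary loading test mentioned here: Dgraph's live and bulk loaders accept RDF N-Quads, so a first experiment often starts by serializing graph edges into that format. The predicate name below is hypothetical (not a Dgraph or DataHub built-in), and this snippet only builds the loader input text; it does not talk to a Dgraph server.

```python
def to_nquads(edges):
    """Serialize (src, dst) lineage edges as RDF N-Quads for Dgraph's loaders.

    The <downstream_of> predicate is a made-up name for illustration.
    """
    lines = []
    for src, dst in edges:
        # Blank nodes (_:name) let the loader assign UIDs on ingest.
        lines.append(f"_:{src} <downstream_of> _:{dst} .")
    return "\n".join(lines)

nquads = to_nquads([("tableB", "tableA"), ("tableC", "tableB")])
```

A file of such lines could then be fed to `dgraph live` to get rough load-time numbers against an equivalent Neo4j import.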
C
Sure, hey Ryan. For Dgraph, we actually haven't really done a deep dive into it, but we did read the article that talks about Dgraph, and we also like the fact that it is an RDF graph and has native GraphQL support. It also supports strong consistency and full-text search. However, I think it lacks Gremlin, which we are trying to make our query language in the plan internally. As you mentioned, on the Neo4j performance issue: yes, we realized that, and we are working on it; it's in progress. We don't have a new benchmark or any performance testing out yet, because we didn't get much bandwidth last quarter, but we'll try to get onto that this quarter. Also, internally, for the graph technology, we were actually also looking at integrations with LIquid, which is an in-memory graph that was designed and developed at LinkedIn, and LIquid is also looking at open sourcing, starting to seriously put it on the roadmap.

C
So there might be some updates in the future on this; stay tuned. For Dgraph, if we have more bandwidth, we'll also give it a test some time.
H
Okay, yeah, and I'm working on doing a quick implementation as well.
C
Let's actually have some offline sync-up on this as well. Maybe you can do a demo next time.
H
Yeah, we'll see. Hopefully I get something up and running that's worth demoing.
A
That'll be great, that's the power of the community, love it. All right, so I think we're just on time. So thank you, everyone, and see you on the 19th of February; we're basically standardizing on the third Friday of every month from now onwards. So stay safe and stay healthy, and we'll see you in a few weeks.