From YouTube: Jumio's DataHub Adoption Journey
Description
Ray Suliteanu (Jumio) shares the Jumio team's DataHub adoption journey, the problems they set out to solve, and what they have learned along the way, during the March 2023 Town Hall.
DataHub Public Roadmap: https://feature-requests.datahubproject.io/roadmap
Learn more about DataHub: https://datahubproject.io
Join us on Slack: http://slack.datahubproject.io
Follow us on Twitter: https://twitter.com/datahubproject
Thanks, Maggie and the DataHub team, for giving me the opportunity to speak today. I'll just dive right in. Here's the agenda: a little bit about Jumio, sort of the obligatory kind of thing, and then mostly why we're using DataHub, how we're using it, and some discussion of adoption and future state.
So briefly about me: I've been at Jumio just about two years; it'll be two years in May. But I've been doing this quite a while, more than 30 years as a professional engineer, mostly doing back-end and enterprise software. I've done everything from a handful of individuals at a startup to multinationals like HP, with hundreds of thousands of people. And for those of you who know the University of California, Santa Cruz: I'm a Banana Slug, and my degree is in computer science.
But you also want to try and do that with a good user experience, particularly on the onboarding side of things, because conversion is important. At the same time, there's been a lot in the news about fake accounts and those kinds of things, and you don't want bad actors getting into your system either. Doing all of that is obviously pretty complex: there are a lot of different checks you can do and a lot of different data sources out there. Each one of these different kinds of sources could actually be a separate company.
Obviously, doing onboarding and fraud prevention yourself is expensive and costly, with a lot of different problems you could run into, and of course that's where Jumio comes in. We have our no-code orchestration platform that pulls together all of these data sources and combines that with our own AI and machine learning and APIs to help you address the fraud and compliance challenges that people have. To date we've processed over a billion transactions, supporting 200 different countries. Obviously, if you're talking about fraud prevention, onboarding, and identity verification, there are different types of documents and proof: in addition to just doing selfies, being able to scan passports or driver's licenses or other national identity cards. You can just imagine how many different types there are across all the different countries, and states within countries, and so being able to do that analysis and data extraction from all these different sources is the challenge.
The machine learning work that we've done is actually relatively new, given the overall history of Jumio of around 10 years; the machine learning aspects have only been around for about four years or so. It predates me, but it was sort of an add-on, if you will. And not just related to machine learning: there are a lot of productivity challenges.
Just data discovery, access to the data, understanding the data: the typical data challenges that companies have. Combine that with a global data set and workforce: as I mentioned, all the different data sources, the different countries, and the distributed nature of our teams, where the people doing machine learning are in North America, Europe, and India. That poses additional challenges.
When you combine that with the legal and regulatory challenges of where you can have the data, GDPR kinds of concerns, and not just GDPR but plenty of other regulations and laws coming up around the world, including CCPA in California and BIPA coming out of Illinois: having to deal with all these different legal and regulatory challenges as part of our data management has also been a challenge. And from a productivity perspective, some of the consequences of that have to do with duplicated work.
Because these distributed teams don't have the discovery and access that we're hoping they would have, there's a lot of duplication: people doing the same, say, feature calculations or data set generation. That obviously would, or could, be resolved if we had a way to make this information much more available. And as I mentioned, the compliance aspect is also a big challenge there as well. So what have we got today?
Our current, pre-DataHub setup: we're an AWS shop, and we use three different regions within AWS, the US, EU, and Singapore regions. We have a homegrown data set management tool and user interface that's built on Athena, and our data is primarily in S3. As I mentioned, doing identity verification and things like that, there's a lot of image and video data that we have, and so all of that's stored in S3, for example.
On top of that homegrown service, we don't have any kind of single sign-on, and along with that, only very basic role-based access control. The search is pretty simplistic: we don't have something like Elasticsearch or a similar search engine.
It's just your typical database queries, and the metadata is also limited to data sets. So you can search for data sets, but there's no other kind of metadata, whether that's models or model features or jobs, all the other kinds of metadata, and obviously none of the related linking that you might be able to get. And on top of that, essentially no governance, other than who can log in and work with the data sets.
So what did we do? We started this effort as part of a larger data initiative that I launched, probably a year and a half ago now. It was more broad than just "hey, we need a data catalog"; it was more of an overall strategic initiative, including data mesh.
Within the data mesh idea you have discovery, and discovery was one of the big gaps, so we started looking around at different open source and commercial products. We had a bunch of different factors that we were evaluating, and of course, what did we do? We ended up picking DataHub. You also see Acryl Data on the bottom there; as I'll mention in a little bit, we did actually look at the managed DataHub as well.
So yes, DataHub was our choice, and how did we end up doing that? We decided, for the time being, to use the open source DataHub and deploy it ourselves. One of the choices we made was to use the out-of-box Helm charts without any modifications. This had a couple of consequences, one of which is that we're using EKS, even though most of our services in production are actually using ECS. We do have some Kubernetes usage in Jumio, and given that the DataHub deployment was already working with EKS, we didn't want to take the additional effort to figure out how to get it working with ECS. It might have been easy, but we didn't want to give that a try, given the staffing, resources, and time that we had. And we also wanted to do serverless, so we chose EKS Fargate.
This had some interesting knock-on consequences for us. We use Datadog for logging, and with the out-of-box Helm charts, without changing them, we couldn't add the Datadog sidecar, for example. So we ended up looking at, and using, a thing called KubeMod, which allows you to more dynamically deploy, in this case, sidecars. That's what we did, so we could continue to use the out-of-box Helm charts. And as I mentioned, one of the big things we didn't have in our prior solution was single sign-on. Jumio uses Okta for its single sign-on service, and we've integrated DataHub with our corporate Okta.
This is all deployed in the EU region. This is something Jumio has done in general for most of its data-related systems: as an easy way to deal with data locality requirements, we just keep everything in the EU, or move it into the EU if it's not generated there. But all of the existing data sets are still in the other regions, in the US and Singapore.
What we wanted, obviously, is a single repository for everybody to do discovery, as opposed to the current situation where, if somebody wanted to find a data set and didn't know which region it was in, they would have to log into potentially three different locations to find it; or, if they needed data from all regions, they would have to go to every region.
We built what we're calling an adapter service, which is basically a Spring Boot application running in the EU in EKS that takes the metadata, transforms it into the DataHub format using the Java clients, and pushes it into DataHub. That's how we've deployed it. While it's not quite in production yet, we're in the final stages of testing and everything is syncing correctly; basically we just have to do our final bit of testing and back-filling of all of the existing data sets that are out there in the other regions, and then we'll actually be deploying it to all of Jumio. So, speaking of deployment and adoption: we are intending to open it up to all of Jumio.
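To make the adapter idea concrete, here is a minimal sketch of the transformation step. The actual service is a Spring Boot application using the DataHub Java client; this Python version, with hypothetical homegrown field names, just shows the shape of mapping a record into a DataHub dataset URN and a metadata-change-style payload.

```python
# Minimal sketch (not Jumio's actual adapter, which is Java/Spring Boot):
# map a homegrown data set record into a DataHub dataset URN and a
# REST-style metadata change payload. Homegrown field names are hypothetical.

def make_dataset_urn(platform: str, name: str, env: str = "PROD") -> str:
    """Build a DataHub dataset URN, e.g.
    urn:li:dataset:(urn:li:dataPlatform:s3,bucket/path,PROD)."""
    return f"urn:li:dataset:(urn:li:dataPlatform:{platform},{name},{env})"

def to_datahub_proposal(record: dict) -> dict:
    """Transform one homegrown record into a datasetProperties proposal."""
    return {
        "entityType": "dataset",
        "entityUrn": make_dataset_urn(record["platform"], record["name"]),
        "aspectName": "datasetProperties",
        "aspect": {
            "description": record.get("description", ""),
            # Record which source region the data set lives in.
            "customProperties": {"region": record.get("region", "eu")},
        },
    }

proposal = to_datahub_proposal(
    {"platform": "s3", "name": "selfie-images/us",
     "region": "us", "description": "Selfie capture data"}
)
print(proposal["entityUrn"])
# urn:li:dataset:(urn:li:dataPlatform:s3,selfie-images/us,PROD)
```

In the real service the payload would be emitted to DataHub's metadata service over REST; here it is just constructed, which is enough to show why a thin adapter can back-fill three regions into one catalog.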
We've had a lot of situations, and I've been in meetings, where people have asked, "what's this particular field or data about?", and I would have side conversations with colleagues on Slack like, "well, if we just had our data catalog, our discovery service, deployed, this product manager or support person wouldn't need to ask that question; they could just go and find out for themselves." But our initial target users are the data scientists and ML engineers within Jumio, because we are initially just deploying to get the data set information into DataHub, since, again, that's the only metadata management we have. But we have a lot of tools, as I was alluding to earlier, diverse tool chains; we use things like Databricks and Airflow and SageMaker, and getting all of that metadata into DataHub is where we're going.
Some of the enhancements we're looking for are actually exactly what Shirshanka was mentioning earlier around schemas, and that was key for us, because we have some use cases where we're interested in the schema and not so much the data sets. For example, as part of that KYX platform that I was mentioning, there's a rule engine, a typical business rule engine, as some of you might be familiar with. Those have data models that need to be defined: a fact model. "Facts" are the rule engine's lingo for the data. These fact models are backed by schemas; today they're actually JSON schemas, and the team using them has just sort of built their own little storage mechanism in S3 to load these schemas. But obviously it would be nice if they could put those into this common service that we're deploying, built on DataHub, and be able to just save and retrieve these schemas, so that the rule engine, for example, knows what the incoming data is.
So having schemas as a first-class entity is going to be great, and the schema registry support, where we can then also leverage that with other services that know about the Confluent Schema Registry, will be another great thing. One thing that recently came up as we were deploying is roles associated with groups; I submitted a feature request for that just recently, actually. It's not a showstopper, but it would certainly help us in managing the security model.
The final thing we're looking for: I mentioned at the beginning that the initiative that started the search for a metadata service was around data mesh, and so I had started a discussion around DataHub supporting this notion of data products. Maggie actually went and created a Slack channel for that, and there's been some good discussion; it's an ongoing conversation.
It's going to be interesting to see where that goes, but for Jumio that's still a roadmap item; it's been sort of slow-rolled due to a variety of situations. But where are we going from here? As I said, we need that schema repository.
The data that we have is relatively poor around schemas; it hasn't been a focus at Jumio, for a variety of reasons. So the work is just getting people to create schemas and have that documentation and field-level metadata: being able to specify things as PII, and having that strongly typed information. While we have the data sets today, the schema metadata that we have doesn't include documentation, and certainly not field-level documentation. We do know which fields are PII, but we don't have distinctions like: is this biometric data versus just address data, or something like that? So expanding on schemas is really what the focus is going to be at Jumio in the near term. Then, obviously, starting to integrate other metadata is the other top priority. As I mentioned, we're using SageMaker and Airflow and Databricks and several other tools, Kolena for model testing; hence that last bullet about data and analytics test automation.
It's one thing on the data side, with integration with things like Great Expectations, but on the model testing side, where there are tools out there like Kolena, which is one of the tools we're using, getting all that metadata, with the lineage and governance around it, is going to be key. And finally, hopefully we'll be able to actually switch to the managed DataHub from Acryl and not have to deal with all of this ourselves.
It was more of a business-level issue that kicked that can down the road, and we had been doing a parallel effort to stand up the infrastructure at the same time, just in case the can was kicked down the road, as happened. So one of the things I'm still looking for is ultimately to switch to the managed DataHub down the road. That's really all I had, given I'm trying to keep within the time frame here, so here's some contact information.
If you want to reach out, I'm happy to discuss any of this, obviously on the DataHub Slack, but feel free to reach out with anything else. Thanks a lot.