From YouTube: Tech Deep Dive: DataHub Metadata Service Authentication
Description
John Joyce (Acryl Data) gives a deep-dive into the DataHub Metadata Service Authentication during the November 2021 Community Town Hall
Referenced Links:
https://datahubproject.io/docs/how/auth/jaas
https://datahubproject.io/docs/how/auth/sso/configure-oidc-react/
https://github.com/linkedin/datahub/blob/681ed91a0006a2d20535c0d5c30f0a68afcfab9f/docs/introducing-metadata-service-authentication.md
So I think, as many of you are aware, today authentication in DataHub is handled exclusively at the datahub-frontend proxy service. Currently that happens via JAAS, which is username/password login that you can tie to an LDAP server, for example, or to a file directly, or via SSO (OIDC), which many of you are using. What we do is set and verify session cookies from the datahub-frontend proxy.
What this means is that there are limited options for adding authentication to the metadata service layer itself, which sits behind the proxy. Why is that significant? Well, it's significant because when you're ingesting metadata, the recommendation is typically to ingest against the metadata service directly, and so all of that ingestion traffic goes unauthenticated.
So there are two suggestions we've had for folks in the open-source community around how to authenticate requests today. The first option is to set up a custom proxy in front of the metadata service and perform your own authentication there. The second is to extract the session cookie that's set by the datahub-frontend proxy and use that in programmatic requests, as long as they're routed through the frontend proxy (this is actually a very new recommendation). So, just at a high level.
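That second workaround, reusing the frontend proxy's session cookie in a programmatic request, can be sketched roughly like this. Everything concrete here is an assumption for illustration: the proxy port (9002 in a default quickstart), the GraphQL path on the frontend, and the cookie name (the frontend is a Play application, which typically sets `PLAY_SESSION`); copy the real cookie value from your browser's dev tools for your deployment.

```python
import urllib.request

# Assumed values for illustration only: the DataHub frontend proxy is
# commonly exposed on port 9002, and as a Play app sets a session cookie
# named PLAY_SESSION. Check your own deployment and browser.
FRONTEND_PROXY = "http://localhost:9002"
SESSION_COOKIE = "PLAY_SESSION=<value-copied-from-browser>"

def build_proxied_request(path: str) -> urllib.request.Request:
    """Build a request routed through the frontend proxy, carrying the
    session cookie so the proxy treats it as an authenticated session."""
    req = urllib.request.Request(FRONTEND_PROXY + path)
    req.add_header("Cookie", SESSION_COOKIE)
    return req

# Not sent anywhere in this sketch; urllib.request.urlopen(req) would send it.
req = build_proxied_request("/api/v2/graphql")
print(req.get_header("Cookie"))
```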
This is roughly what the old world looks like. We have, on the left here, the frontend proxy service, and then on the right, the metadata service. The frontend proxy is where the auth happens.
There's a session cookie that's passed from the UI, and the frontend proxy also handles all communication with external identity providers to perform the OIDC or SSO flows. But all communication between the frontend proxy and the metadata service, along with communication between your metadata ingestion framework and the metadata service, is currently unauthenticated. So, to summarize the problem we've been trying to solve over the last month: we really just don't have any formal support for making authenticated requests to DataHub APIs, which live in the metadata service.
So I'm going to come over here to DataHub. As you can see in the top right, we have a new tab called Settings, and if you go in there you'll see one option today, called Access Tokens. What this tab allows you to do is generate access tokens for use with DataHub APIs.
The first type of access token we're rolling out is what we're calling a personal access token. If you've used GitHub recently, you may be familiar with the concept. On DataHub, it's a token that you can generate with your own privileges. So maybe I'm John Joyce, and I have a certain set of policies that are assigned to me when I generate an access token.
Those policy privileges will be carried over to that token as well. What I'm going to demo is actually generating an access token for DataHub that expires in one day. I'm going to go ahead and click Generate Personal Access Token, and you're going to see this panel. We've got the token itself, and then we've got a little example of how to actually use it, which is pretty helpful.
It includes the domain which I'm hosted on. So I'm going to go ahead and copy this, and then I'm going to try to make an authenticated request to DataHub. I'm going to come over here to Postman, hit the GraphQL endpoint, and request my own username. That's the query I'm going to make.
First of all, I don't have any authentication specified at all. I'm going to make the request, and you're going to see a 401 Unauthorized, because I'm not allowed to do that. Then I'm going to go over and add the Authorization header with the new token that I just generated, try again, and there you go: I've made a GraphQL request that has my username coming back. And the cool thing is that this works against more than just this endpoint.
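The Postman call above boils down to a plain HTTP POST with an `Authorization: Bearer <token>` header. Here's a minimal sketch; the host, the `/api/graphql` path on the metadata service, and the `me { corpUser { username } }` query shape are assumptions based on a default quickstart deployment, so adjust for yours.

```python
import json
import urllib.request

GMS_HOST = "http://localhost:8080"  # assumed default metadata service address
TOKEN = "<personal-access-token>"   # paste the token copied from the UI

# GraphQL query asking DataHub for the calling user's own username.
query = {"query": "{ me { corpUser { username } } }"}

req = urllib.request.Request(
    GMS_HOST + "/api/graphql",
    data=json.dumps(query).encode("utf-8"),
    headers={
        "Content-Type": "application/json",
        # Without this header the request comes back 401 Unauthorized.
        "Authorization": f"Bearer {TOKEN}",
    },
    method="POST",
)
# urllib.request.urlopen(req) would send it; omitted here since there is
# no live server in this sketch.
print(req.get_header("Authorization"))
```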
So here's a summary, for folks that couldn't make the session, of what we built. I'm going to go over a couple of the high-level key concepts we've introduced into the metadata service to make this work. First, we have an actor, which represents a unique identity or principal that is accessing DataHub.
This can be a user or, in the future, a service identity for programmatic types of requests; currently we're only modeling users. Then there's the concept of an authenticator, which is really a pluggable component in the metadata service that is responsible for taking the incoming request context and resolving an actor from it: being able to say, is this actor authenticated or not?
There's also an authenticator chain, which is basically a group of authenticators that are executed in sequence in an attempt to authenticate an incoming request. What this means is that you can actually have multiple flavors of authenticator stacked, so a request can be authenticated in multiple ways. For example, one authenticator may pull out LDAP credentials and try to verify them against a third-party LDAP server; the next may verify an IdP-issued access token using a signature; and maybe there's a third authenticator in the stack that does something completely different.
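The chain idea is simple to sketch. The real DataHub authenticators are Java components inside the metadata service; the toy reduction below (with made-up authenticator names and credential checks) just shows the contract: each authenticator inspects the request context and returns an actor or nothing, and the chain returns the first success.

```python
from typing import Callable, Dict, List, Optional

# An "actor" is just an identity string here; DataHub models it as an URN.
Authenticator = Callable[[Dict[str, str]], Optional[str]]

def bearer_token_authenticator(ctx: Dict[str, str]) -> Optional[str]:
    """Resolve an actor from an Authorization: Bearer header (toy check)."""
    header = ctx.get("Authorization", "")
    if header.startswith("Bearer ") and header[len("Bearer "):]:
        return "urn:li:corpuser:token-user"
    return None

def basic_auth_authenticator(ctx: Dict[str, str]) -> Optional[str]:
    """Resolve an actor from username/password fields (toy check)."""
    if ctx.get("username") == "datahub" and ctx.get("password") == "datahub":
        return "urn:li:corpuser:datahub"
    return None

def authenticate(chain: List[Authenticator], ctx: Dict[str, str]) -> Optional[str]:
    """Run authenticators in sequence; the first one that resolves an
    actor wins, otherwise the request is unauthenticated (None)."""
    for authenticator in chain:
        actor = authenticator(ctx)
        if actor is not None:
            return actor
    return None

chain = [bearer_token_authenticator, basic_auth_authenticator]
print(authenticate(chain, {"Authorization": "Bearer abc"}))            # token path
print(authenticate(chain, {"username": "datahub", "password": "datahub"}))  # basic path
print(authenticate(chain, {}))                                         # None
```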
The next concept is a DataHub access token, which is what's powering the Generate Personal Access Token feature. Basically, we've introduced a new token service into the metadata service that allows you to grant or generate access tokens on behalf of a user, as well as validate those tokens when requests come inbound. We do that via an authenticator which basically just validates the token. I'm just going to talk a little bit about this at a high level.
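To make the grant/validate cycle concrete: DataHub's real token service issues signed tokens, but the exact format isn't covered in this talk, so the sketch below uses a bare HMAC-signed JSON payload as a stand-in. The signing key, claim names, and encoding are all invented for illustration; the point is only the two operations (generate on behalf of an actor with an expiry, then validate inbound tokens by signature and expiry).

```python
import base64
import hashlib
import hmac
import json
import time
from typing import Optional

SIGNING_KEY = b"server-side-secret"  # assumption: a key held by the server

def generate_token(actor: str, ttl_seconds: int = 86400) -> str:
    """Grant a token on behalf of an actor, with an expiry (1 day here)."""
    payload = json.dumps({"actor": actor, "exp": time.time() + ttl_seconds})
    sig = hmac.new(SIGNING_KEY, payload.encode(), hashlib.sha256).hexdigest()
    return base64.urlsafe_b64encode(payload.encode()).decode() + "." + sig

def validate_token(token: str) -> Optional[str]:
    """Validate an inbound token; return the actor, or None if invalid."""
    try:
        encoded, sig = token.rsplit(".", 1)
        payload = base64.urlsafe_b64decode(encoded.encode()).decode()
    except (ValueError, UnicodeDecodeError):
        return None
    expected = hmac.new(SIGNING_KEY, payload.encode(), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(sig, expected):
        return None  # signature mismatch: tampered or foreign token
    claims = json.loads(payload)
    if claims["exp"] < time.time():
        return None  # expired
    return claims["actor"]

token = generate_token("urn:li:corpuser:jdoe")
print(validate_token(token))               # resolves the actor
print(validate_token(token + "tampered"))  # None: signature check fails
```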
I'm not going to go too deep into this, because it would take a little bit of time. Instead, I'm going to refer you to a document that has all of these details covered, if anyone's interested. But I'll wrap up by giving an overview of how you can actually start using this once it's deployed; currently it's in PR review. Metadata service authentication will be disabled by default for now, which means that the session-cookie-based authentication you're using right now will continue to work without interruption.
So it's kind of an opt-in system. You can enable it using a single environment variable called METADATA_SERVICE_AUTH_ENABLED: all you need to do is turn that on for datahub-frontend and datahub-gms, and you'll start enforcing authentication at the metadata service layer. There's nothing else you really should need to do, except noting that when you're ingesting new metadata, you'll need to have an access token provided.
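The opt-in behavior can be pictured as a simple gate in front of the token check. The flag name (METADATA_SERVICE_AUTH_ENABLED) comes from the talk; the filter logic below is a toy sketch, not the actual metadata service code.

```python
import os
from typing import Dict, Set

def is_request_allowed(headers: Dict[str, str], valid_tokens: Set[str]) -> bool:
    """Mimic the opt-in gate: when the flag is off, every request passes
    (today's behavior); when on, a valid bearer token is required."""
    if os.environ.get("METADATA_SERVICE_AUTH_ENABLED", "false").lower() != "true":
        return True  # auth disabled by default: nothing changes for existing users
    token = headers.get("Authorization", "").replace("Bearer ", "", 1)
    return token in valid_tokens

tokens = {"pat-123"}
os.environ["METADATA_SERVICE_AUTH_ENABLED"] = "false"
print(is_request_allowed({}, tokens))                                   # True
os.environ["METADATA_SERVICE_AUTH_ENABLED"] = "true"
print(is_request_allowed({}, tokens))                                   # False
print(is_request_allowed({"Authorization": "Bearer pat-123"}, tokens))  # True
```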
Finally, you'll need to grant privileges to actually generate personal access tokens. By default, not everyone on the platform will have that privilege; it'll actually be gated out of the box, so you won't be able to generate a token. But using the DataHub policy system, you're going to be able to assign a new platform privilege called Generate Personal Access Tokens. Of course, the DataHub root user will come with this by default, so you can spawn off privileges from there. And finally, I'll just talk about where we want to go from here.
What we want to do is really make the process of registering authenticators much more dynamic, similar to what Shirshanka showed earlier. The end goal is to have a plug-in location where you can simply copy an authenticator implementation, put it in your configuration, and start using it immediately.
We currently only support personal access tokens, but in the future you can see us supporting additional types of tokens as well. There's also Kafka ingestion authentication: we want to be able to authenticate write requests that come off of the Kafka stream via ingestion through the DataHub Kafka sink.
Currently, in the first pass, these will not be authenticated, but the DataHub REST sink requests will be. And then finally, we'd like to make access token management inside the UI a little bit more robust, with the ability to view previously created tokens, manage them, and actually revoke tokens on the fly. I think the summary is: this is in review right now.
There's a big PR, and I've got a huge document that talks about all these concepts in depth, along with an FAQ section about how to start using it, configuration examples, and more. So feel free to hit this link. I think Maggie, hopefully, will share out the slides at some point, so you can start reading more about it.