DataHub Tech Deep Dives, 27 Aug 2021

From YouTube: Fine-Grained Access Control in DataHub: Aug 27 2021 Community Town Hall

Description

John Joyce (Acryl Data) gives an update on recent improvements to fine-grained access control during the DataHub Community Town Hall on August 27, 2021.

A

Yeah, so I'm going to do a quick overview of where we are with fine-grained access control. This is something we started thinking about at the beginning of the summer, based on a lot of feedback from the community asking for the capability to control who has access to what metadata on DataHub.

A

So I'm going to start by just talking about what access control is. How we think about it is that access control is a way to declare who can perform what action against which resources, and we model this with three sub-concepts.

A

One is an actor, which determines the "who" portion of the policy. Next is a privilege, so what action they can perform. And then, finally, a resource, or an object. This is commonly known as actor-verb-object in some areas of the world; we use actor-privilege-resource. What we do is put these three things together and call them a policy. So with the new implementation, we allow you to declare a policy that includes these three things to control access on DataHub.
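The actor-privilege-resource triple described above can be sketched as a small data structure. This is a minimal illustration only: the class name, field names, and string formats below are assumptions made for this example, not DataHub's actual policy model.

```python
from dataclasses import dataclass
from typing import FrozenSet


@dataclass(frozen=True)
class Policy:
    """Hypothetical policy record: who can do what to which resources."""
    name: str
    actors: FrozenSet[str]      # the "who" (hypothetical "user:<name>" format)
    privileges: FrozenSet[str]  # the "what action" (hypothetical privilege names)
    resources: FrozenSet[str]   # the "which resources" (hypothetical identifiers)


# One policy bundles the three pieces together.
example = Policy(
    name="Example tag policy",
    actors=frozenset({"user:john"}),
    privileges=frozenset({"EDIT_TAGS"}),
    resources=frozenset({"dataset:example_table"}),
)
```

Making the record immutable is a design choice for the sketch: a policy is treated as a value that is replaced rather than edited in place.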

A

So I'm just going to talk about a few policies in English. On DataHub, you may want to restrict who can do certain things. Number one: maybe dataset owners should be able to add documentation, but they shouldn't be able to add tags, because we want a controlled vocabulary of tags. Another example: maybe the data platform team should be able to edit anything about a dataset, because they manage the platform and they're sort of the admins of DataHub.

A

Maybe Ted, our data steward, should be able to edit any dataset's tags, because that's his job, but shouldn't be able to edit the description or the ownership or anything else. And finally, maybe the administrative group should be able to manage policies themselves, so they should be able to dictate who can do what on the platform.

A

So as part of milestone one for fine-grained access control, we defined the following scope: we wanted to be able to define and enforce policies supporting the following characteristics.

A

We wanted to be able to support the following privileges: editing tags, editing ownership, editing documentation for an asset, editing the links to the institutional memory about an asset, and finally, in the case of datasets, editing schema tags and descriptions.

A

We wanted to apply these policies to resources at two levels. One is the resource type level, so imagine dataset assets or dashboards or charts. The other is the resource identity level, so being able to call out a particular dataset or a particular chart and apply fine-grained access control against that asset individually. And then, finally, we wanted to model this concept of actors using our concepts of users and groups that already exist.

A

So we wanted to be able to say that John should be able to do something to a particular dataset, or maybe a group should be able to do something to a particular dashboard. We also wanted the ability to support a sort of wildcard predicate and say all users or all groups should be able to do something to a particular asset.

A

Finally, we wanted to model one sort of edge-based predicate which allows owners of assets to perform particular actions. So perhaps the owners of a data set should be able to again update the description, maybe not the tags.
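The milestone-one scope above (type-level versus identity-level resource matching, plus user, group, wildcard, and owner predicates on the actor side) can be sketched roughly as follows. All class and field names here are hypothetical and chosen for illustration; they are not taken from DataHub's code.

```python
from dataclasses import dataclass, field
from typing import Optional, Set


@dataclass
class ResourceFilter:
    resource_type: str                        # e.g. "dataset", "dashboard", "chart"
    resource_ids: Optional[Set[str]] = None   # None means every resource of that type

    def matches(self, rtype: str, rid: str) -> bool:
        if rtype != self.resource_type:
            return False
        return self.resource_ids is None or rid in self.resource_ids


@dataclass
class ActorFilter:
    users: Set[str] = field(default_factory=set)
    groups: Set[str] = field(default_factory=set)
    all_users: bool = False        # wildcard: any user
    all_groups: bool = False       # wildcard: any member of any group
    resource_owners: bool = False  # edge-based predicate: owners of the asset

    def matches(self, user: str, user_groups: Set[str], owners: Set[str]) -> bool:
        if self.all_users or user in self.users:
            return True
        if user_groups and (self.all_groups or self.groups & user_groups):
            return True
        return self.resource_owners and user in owners


# Type-level filter: applies to every dataset.
all_datasets = ResourceFilter("dataset")
# Identity-level filter: applies to one named chart.
one_chart = ResourceFilter("chart", {"chart-1"})
# Owner-based actor filter: only owners of the asset match.
owners_only = ActorFilter(resource_owners=True)
```

The owner predicate differs from the others in that it needs the resource's ownership information at evaluation time, which is why `matches` takes the owner set as an argument.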

A

So now I'm going to go into a demo of the milestone one implementation of policies, based on what we just talked about. I'm going to go over here to DataHub, and right off the bat, this is the default deployment of the new policies world. So I'm going to go ahead and search. I've just got some of the basic sample metadata in here that you guys are all probably familiar with.

A

I'm going to go to this first dataset and I'm going to try to add a tag, so let's say "new tag". Okay, I already have one: "my new tag". And what you'll notice right away is that we've got a warning here, which says "Looks like you're unauthorized to perform that action." So why would that be? Well, that's because we haven't defined any policies yet, so by default I am not able to do anything to this dataset. This is kind of a fail-closed world.
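The fail-closed behavior seen in the demo can be illustrated with a short sketch: access is granted only when some policy explicitly allows it, so an empty policy list denies everything. The dictionary layout and the function name are assumptions made for this example.

```python
from typing import Dict, List, Set


def is_authorized(
    policies: List[Dict[str, Set[str]]],
    actor: str,
    privilege: str,
    resource: str,
) -> bool:
    # any() over an empty sequence is False, so "no policies" means "no access".
    return any(
        actor in p["actors"]
        and privilege in p["privileges"]
        and resource in p["resources"]
        for p in policies
    )


# Hypothetical policy granting one user one privilege on one dataset.
tag_policy = {
    "actors": {"datahub"},
    "privileges": {"EDIT_TAGS"},
    "resources": {"dataset-1"},
}
```

With `policies=[]`, every call returns `False`, which mirrors the "unauthorized" warning shown before any policy exists.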

A

So what we're going to do is open up this new panel on the left side here, and we're going to see a few different admin-level controls. We're going to go to this Policies tab, which is where we can construct policies, and you'll see that currently we have no policies, so that probably explains why I can't do anything.

A

So I'm going to click this New Policy button to add a new policy, and what you're going to see here is a workflow that roughly models the policy structure that I talked about previously: actor, object, privilege.

A

So I'm going to start by giving my policy a name, and I'm going to use the example from the slides. I'm going to say "dataset owners documentation policy". Basically, I want to say that owners can update documentation about a particular dataset, but that's it. So next I'm going to choose the type of the policy.

A

There are two types today: one is a platform policy which is really about who can do things at the administrative level on the platform like who can edit policies or who can view the analytics, for example.

A

The second type is a metadata policy, and that dictates who can do what to a particular data asset or metadata asset on the platform, so I'm going to go ahead and choose metadata because I'm trying to edit the policy for data sets.

A

Finally, I'm going to give it a description. Say "only owners should be allowed", or sorry, let's actually say "owners should be allowed to edit docs". That's it. I'm going to hit Next, and I'm going to choose the asset type that I want to apply the policy to, so in this case it's going to be datasets. I'm going to choose that, and then I'm going to choose the asset that I want this policy to apply to. So I can either search for a particular asset, or I can just say all datasets, because this should apply to all datasets.

A

And then finally, I can select a set of privileges. So I'm going to go ahead and say "edit documentation"; that's the only privilege I want to allow.

A

Finally, we get to the third and final screen, where we can say who can actually do this, and you'll see right away there are three options here. We can call out users specifically, so I can say datahub user or John Doe or whatever; we can call out groups; or we can say owners. So this is that edge-based predicate.

A

Finally, I'm just going to save this, and now you see I have a new policy. You can see it's in an active state, which means it should apply. So I'm going to go ahead and go back to the datasets. As you'll notice, this one actually isn't owned by me. I'm logged in as datahub, so I'm going to go to the second dataset, which is owned by me, and I'm going to attempt to update the documentation.

A

My new documentation.

A

And you'll see I was able to update it. Great, awesome. So let's back out here and try to update a dataset's documentation that I don't own. I don't own this one. I'm going to come in here and say, hey, I want to update. Oops, looks like I'm unauthorized to perform that, and that's because the policy doesn't allow me to do that.

A

So I'm going to go back and open up this policy again. I'm going to take a look at what it says, and I'm actually just going to deactivate it, because I want to revoke this policy. So I'm going to go ahead and click Deactivate, and you'll see that this policy is now in an inactive state.

A

So if I go back to that Hive dataset, which I updated previously, and see if I can do it now (I just want to clear what I added), you'll see that I'm unauthorized to perform that. It's because the policy has been deactivated.
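The activate and deactivate behavior from the demo can be sketched as a state field that the evaluation step checks before anything else. The state strings and names below are hypothetical, chosen only to illustrate the idea.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Policy:
    name: str
    privilege: str
    state: str = "ACTIVE"  # hypothetical values: "ACTIVE" or "INACTIVE"


def allowed(policies: List[Policy], privilege: str) -> bool:
    # Inactive policies are skipped, so deactivating a policy revokes what it granted.
    return any(p.state == "ACTIVE" and p.privilege == privilege for p in policies)


docs_policy = Policy("Dataset owners documentation policy", "EDIT_DOCUMENTATION")
```

Setting `docs_policy.state = "INACTIVE"` keeps the policy definition around for later reactivation while removing its effect, which is different from deleting it.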

A

So that's the first kind of policy, the ownership-based policy. The second one I'll add is sort of a point policy, and that's where a particular user should be able to do something on the platform.

A

So I'm going to go ahead and say that datahub, this user that I'm logged in as, can add tags to a specific dataset. This is that point lookup use case. I'm going to go ahead and just skip adding a description.

A

I'm going to again choose datasets, and in this case I'm going to actually look up a particular dataset. So I want to say that I should be able to update the HDFS dataset, or maybe the Kafka one as well, so I'll select two of them. Then finally I'll select a privilege, in this case editing tags, and then I will just find myself, datahub, and I'll save it. And you can see we've got the new policy, and it's in the active state.

A

So now I should be able to update the tags for this HDFS one, which I wasn't able to update in the initial case. So let's say "my new tag" again and see if I can add it. Looks like I was able to add it. I can remove tags, of course, because I have full control over editing the tags. All right, so this one works. This one's deactivated, and I'm also the owner of this one, so I can probably add a tag here as well.

A

Awesome. Okay, so we've correctly created two policies, and now there's the final thing I want to demo, which is just cleaning up policies. There are cases in which you may have created a policy by mistake. What you can do there is just come in and delete the policy, and we're back to state zero.

A

So this is, in a nutshell, what policy management and fine-grained access control will look like on DataHub. This is the MVP. All of those privileges and assets you saw, both metadata privileges and platform privileges, will be supported, with basic platform privileges including managing policies, managing analytics, things like that. Eventually that will be extended to include things like managing users and groups, so adding groups, deleting groups, things like that.

A

So I'm pretty happy about how this turned out, and we're looking for feedback from the community. We will have a global on/off switch here, which I'll talk about shortly when I get back into the slides. But let's go ahead and continue here.

B

John, there's one question about who can even edit policies. Who has admin privileges, even the ability to add or create policies?

A

Yeah, so we model the ability to manage policies as a platform privilege, and so by default DataHub will ship with a set of immutable policies. Those immutable policies will grant the ability to manage policies and to manage analytics to that core super user, which is datahub today. So when you launch a fresh instance of DataHub, that datahub user will have all privileges on the platform, and that'll be the jump-off point from which you can create additional policies.

A

So you can add, say, your data platform team to be able to manage policies themselves and do everything. So that's the model: we have the seed root node, which can then kind of replicate itself, I guess. Cool.
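The bootstrap model described here, an immutable seed policy for the root user from which further admin policies are created, might look roughly like the sketch below. The store class, the `editable` flag, and the privilege names are all assumptions for illustration rather than DataHub's implementation.

```python
from typing import Iterable


class PolicyStore:
    def __init__(self) -> None:
        # Seed policy: shipped by default and not editable (hypothetical layout).
        self._policies = {
            "root-admin": {
                "actors": {"datahub"},
                "privileges": {"MANAGE_POLICIES", "VIEW_ANALYTICS"},
                "editable": False,
            }
        }

    def can_manage_policies(self, actor: str) -> bool:
        return any(
            actor in p["actors"] and "MANAGE_POLICIES" in p["privileges"]
            for p in self._policies.values()
        )

    def add(self, requester: str, name: str,
            actors: Iterable[str], privileges: Iterable[str]) -> None:
        if not self.can_manage_policies(requester):
            raise PermissionError(f"{requester} cannot manage policies")
        self._policies[name] = {
            "actors": set(actors),
            "privileges": set(privileges),
            "editable": True,
        }

    def delete(self, requester: str, name: str) -> None:
        if not self.can_manage_policies(requester):
            raise PermissionError(f"{requester} cannot manage policies")
        if not self._policies[name]["editable"]:
            raise PermissionError(f"policy {name!r} is immutable")
        del self._policies[name]
```

Because the seed policy cannot be deleted, there is always at least one actor able to manage policies, so an operator cannot lock everyone out by mistake.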

A

So I'm just going to quickly talk about the implementation and what's going on here. Recently we moved our GraphQL API to the metadata service, so that's actually where a lot of this is occurring.

A

So what happens when a request comes in? It goes through the DataHub frontend, which is now just our UI server, a web server that serves resources as well as proxies to the API. That will go ahead and proxy back to the metadata service, which is commonly known as GMS. At that GraphQL layer we have what's called an authorizer, and that authorizer's job is to determine whether a particular action should be allowed or not. It does this by periodically syncing with the database, which has the set of all the policies. There are two cases in which it syncs.

A

One is on a cadence, so you can configure it to sync every two minutes, five minutes, ten minutes, whatever you'd like; by default it's two minutes. The other is when the cache becomes stale. So if you add a policy or edit a policy's state, as you saw in this demo, we will actually go and refetch the policies and reload the cache. And that gets us to the authorizer itself.
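The two refresh triggers, a fixed cadence and explicit invalidation after a policy change, can be sketched as a small cache. This is a simplified, lazy variant that checks its age on read instead of running a background timer; the class name, parameters, and injectable clock are assumptions made so the example stays self-contained.

```python
import time
from typing import Callable, List, Optional


class PolicyCache:
    def __init__(
        self,
        fetch: Callable[[], List[dict]],        # hypothetical loader, e.g. a database query
        refresh_seconds: float = 120.0,          # two minutes, matching the default mentioned
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch
        self._refresh_seconds = refresh_seconds
        self._clock = clock
        self._policies: List[dict] = []
        self._loaded_at: Optional[float] = None

    def invalidate(self) -> None:
        # Called after a policy is added, edited, or removed: force a refetch.
        self._loaded_at = None

    def get(self) -> List[dict]:
        now = self._clock()
        expired = (
            self._loaded_at is None
            or now - self._loaded_at >= self._refresh_seconds
        )
        if expired:
            self._policies = self._fetch()
            self._loaded_at = now
        return self._policies
```

A trade-off of the cadence approach is that a change made outside the invalidation path can take up to one refresh interval to be seen.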

A

This key component maintains that cache, always keeps the latest view of the policies, and makes a determination at request time whether to allow or deny a particular action. It does so by exposing an API that takes those three pieces of the policy that we talked about earlier. So at request time, the invoking code will pass an actor, which is basically the user principal behind the request. It'll pass the groups that the user is associated with, as well as the privilege it wants to authorize and the resource against which it wants to apply that privilege.

A

The authorizer will then take that information, take the policies it knows about, and produce a decision by executing predicates.
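A rough sketch of that authorizer interface: the caller passes the actor, the actor's groups, the privilege, and the resource, and gets back an allow or deny decision computed by running each policy's predicates. The type names, the predicate representation, and the example policy are assumptions for illustration, not DataHub's actual API.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, FrozenSet, List


class Decision(Enum):
    ALLOW = "ALLOW"
    DENY = "DENY"


@dataclass(frozen=True)
class AuthorizationRequest:
    actor: str
    groups: FrozenSet[str]
    privilege: str
    resource: str


# In this sketch a policy is a list of predicates that must all hold.
Predicate = Callable[[AuthorizationRequest], bool]


class Authorizer:
    def __init__(self, policies: List[List[Predicate]]) -> None:
        self._policies = policies

    def authorize(self, request: AuthorizationRequest) -> Decision:
        for predicates in self._policies:
            # Skip empty policies so that all([]) == True never grants access.
            if predicates and all(pred(request) for pred in predicates):
                return Decision.ALLOW
        return Decision.DENY


# Hypothetical policy: members of "data-platform" may edit tags on any dataset.
platform_tags_policy: List[Predicate] = [
    lambda r: "data-platform" in r.groups,
    lambda r: r.privilege == "EDIT_TAGS",
    lambda r: r.resource.startswith("dataset:"),
]

authorizer = Authorizer([platform_tags_policy])
```

Policies are only additive in this sketch: a request is allowed if any single policy matches, and denied otherwise.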

A

So it's pretty awesome. So, policies in practice: we want policies to be enabled or disabled globally at deploy time. What this means is you can continue to use DataHub as you're using it today, where there are no policies and anyone on the platform can do anything.

A

We wouldn't recommend that. We recommend you actually do start using policies, because I think they'll be very helpful to make sure that metadata stays clean. But by default, again, datahub will be that super user, which will be seeded with irrevocable, immutable policies that say it can do anything, and so it'll be on the operator to go and spawn off additional policies on a per-policy basis from that core admin account.
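The global on/off switch can be pictured as a single deploy-time flag checked before any policy evaluation. The function and parameter names below are hypothetical; how the flag is actually supplied at deploy time is not covered in the talk.

```python
from typing import Callable


def authorize(policies_enabled: bool, allowed_by_policy: Callable[[], bool]) -> bool:
    """Hypothetical wrapper around policy evaluation.

    policies_enabled: deploy-time switch (assumed to come from configuration).
    allowed_by_policy: callable that runs the real policy check.
    """
    if not policies_enabled:
        # Policies disabled globally: behave as before, anyone can do anything.
        return True
    # Policies enabled: defer to policy evaluation (fail-closed when nothing matches).
    return allowed_by_policy()
```

Passing the policy check as a callable means it is not evaluated at all when the switch is off.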

A

Finally, I'll just talk a little bit about what's on the horizon for policies. After we get this first code pass done, we want to release a Policies V1 usage guide that'll talk about how you create policies and how you manage them. Hopefully it's self-explanatory, but I think it will still be pretty helpful to have something accompanying a feature this big. We'll also look at supporting additional predicate types, especially on the resource itself.

A

So we've gotten requests for things like domain-based matching, so once we model domains in the resource type, being able to match on that, as well as data platform or data source-based matching, so being able to create a policy where maybe all datasets within a particular MySQL instance should be managed by someone. And then in the long term, we want to consider the utility of a role-based actor predicate.

A

So, as you saw, it's mainly users and groups which are able to do different things. We have had some requests from a few folks saying that this layer of indirection, which is commonly called a role, would perhaps be useful, so we're looking for feedback and direction from the community to understand whether that's a requirement we need to take into account with this system.

A

So that's pretty much it. Thanks, guys. I will hand it back to Shirshanka.