GitLab Manage: Access Group, 27 Jun 2023

From YouTube: Project Authorizations - Lunch and Learn Session

Description

Slides: https://docs.google.com/presentation/d/1A1a4qB3G6Z4UnPlJMhQkF39Qk7JQfzqkpP9AFBpAqg8/edit?usp=sharing

Notes: https://docs.google.com/document/d/1ZHalCCzf6OOw9vruMKKWX0tStke70yy_QXqXT6eCBak/edit?usp=sharing

A

Yes and share my screen.

A

Yes, can you see my screen? Yes, yeah.

B

Okay, so today we will be talking about project authorizations: the past, the present challenges, and the future of project authorization. My name is Manoj. I am a backend engineer in the Tenant Scale group, but previously I was in Manage:Auth and Manage:Organization, and during my time in Manage I worked a lot on project authorizations. That is how I know most of this stuff, so let's get going.

B

Yeah, so the first part is a very basic introduction to project authorization and what it is about. Project authorizations is basically a cache for storing the access level that every user has to the projects that they have access to.

B

It's a very simple table consisting of three columns: user ID, project ID, and access level. I have put a question mark in front of "cache" because I don't know if you can really call this a cache. In cache systems, if the value is missing from the cache, we call it a cache miss, and we are still able to obtain the right value by making the calculation without the cache.

B

So if there's a cache miss, you go directly to the source and figure out the value. In this case that's not really happening, because the system runs on the expectation that the access level value is present in the table. If there is a cache miss, the whole system goes for a toss. So cache is probably not the right term for it in the strict sense. Maybe you can call it a precomputed value.

B

For simplicity's sake we can assume it to be a cache, and we can always rebuild the cache. There are systems for it. But if there is a cache miss, the system goes for a toss, which means that somebody does not get access to the project. So that's the basic detail, and here I have an example of what it looks like in our app.

B

This is a Rails console where I have fetched all project authorizations. You can see that user IDs differ, projects differ, and access levels differ, and there are access level values here. There is a uniqueness scope on the user ID, project ID, and access level combination.

B

Yes, so there is a uniqueness scope on the user ID, project ID, and access level combination, which means that a specific user can have only one kind of access level to a project. You can be a maintainer in a project, but you cannot also be an owner of that project, so it's always unique that way. That is one of the constraints we have on the table. Next slide.
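A minimal sketch of the table as described so far, written in Python rather than as GitLab's actual Rails model. The class and method names are hypothetical. It assumes the behavior the speaker describes: one access level per user and project pair, and no fallback computation when a row is missing.

```python
# Integer access levels. 40 for Maintainer is mentioned in the talk;
# the constant names here are illustrative.
MAINTAINER = 40
OWNER = 50


class ProjectAuthorizations:
    """In-memory stand-in for the three-column table (hypothetical, not GitLab code)."""

    def __init__(self):
        # (user_id, project_id) -> access_level
        self._rows = {}

    def insert(self, user_id, project_id, access_level):
        key = (user_id, project_id)
        if key in self._rows:
            # Mirrors the constraint described in the talk: a user cannot hold
            # two different access levels on the same project.
            raise ValueError(f"user {user_id} already has a row for project {project_id}")
        self._rows[key] = access_level

    def access_level(self, user_id, project_id):
        # No fallback to the members table: a missing row simply means
        # "no access", which is why a stale table is so harmful.
        return self._rows.get((user_id, project_id))


table = ProjectAuthorizations()
table.insert(42, 7, MAINTAINER)
```

Looking up a pair that has no row returns `None` instead of triggering a recomputation, which is the sense in which this is a precomputed value and not a cache.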

B

A

This is, this has been bothering me.

B

Yes, so the question that can come up at this point is: why do we use project authorizations? The reason is that we use it as a cache, because we want the precomputed value to always be there. Otherwise we would be fetching it from the members table. You add somebody as a member to a project, and if you compute access via the members table, it takes a lot of time, because that computation is hard and expensive.

B

So what we do is precompute the access level that a specific person has to a project and store it in the project authorizations table. That is why we have project authorizations. The corresponding question is: why don't we have group authorizations? Users also have access to groups, and we don't have a table called group authorizations. If the computation is hard or expensive, why don't we have group authorizations as well?

B

The reason is that we have satisfactory performance on the group side of things, but we did not have satisfactory performance on the project side, because when you do a git push to a private repo, you want it to be as instantaneous as possible, and we do not have time to go into the members table and figure out: hey, does this user have high enough rights to push to this repo?

B

On the group side, if you imagine a tree, a group is at the top level and a project is at the leaf level. Once you traverse to a leaf, there are already more components involved in the calculation, so that makes it more expensive. But a group is at the top, so you can imagine that it is easier to figure out the value at the top level. That is why we don't have a group authorizations table, but it might come at some point.

B

Currently we do not bother, because the performance is satisfactory and it's also hard to maintain a cache. That is one of the reasons we don't want to go there unless it's absolutely necessary.

A

B

Next slide. So there are some existing concerns around project authorizations before we dive deep into it.

B

The first point is that this cache can be considered wasteful, because we have to store every user's precomputed value for every project that they are part of. For example, in my day-to-day job I use the GitLab Rails project, but since I am part of the gitlab-org group, I am part of thousands of projects underneath it, and my access level to each of these projects is stored in the project authorizations table, even though I virtually never access those projects. In my day job I maybe go to the handbook repo and the GitLab repo.

B

These are the two projects that I use, but I have entries for all the other thousands of projects, which in a sense is wasteful. The second point is that, like I told you, this is a precomputed value, not really a cache, so it is a cache that only grows. The number of rows will never decrease unless the user is deleted or the project is deleted.

B

So you can imagine that the size of this table is only going to keep growing as GitLab grows. The next part is that maintaining any cache is hard, but maintaining the project authorizations cache is doubly hard, because any problem that you have in this table is a severity 1 issue. If the access levels differ, then somebody is going to get elevated access or somebody is not going to get the proper access.

B

All of this can be considered a severity 1 issue, so we have to be very careful while dealing with project authorizations. Any questions this far?

A

C

A

Do I have I have yes.

C

So does the whole concept of customizable roles, since this is about storing roles and access, make it even more complicated now?

B

Yeah. I was not part of the custom roles work, but if I understand correctly, we do not store the custom role in this table, and it is calculated on top of that. So it definitely makes it more expensive. You have to find that this person is a maintainer, so they will have an access level of 40, but they may also have some owner rights, and that is calculated differently.
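The layering described here, a stored base access level with custom permissions checked on top, could be sketched roughly as follows. This is an illustration in Python, not GitLab's permission code: the function name, the ability names, and the level required for each ability are all assumptions made for the example.

```python
# Hypothetical minimum base level per ability; the ability names and
# thresholds are chosen for illustration only.
REQUIRED_LEVEL = {"push_code": 30, "delete_project": 50}


def allowed(user_id, project_id, ability, base_levels, custom_abilities):
    """Check the stored base level first, then any custom permissions on top."""
    level = base_levels.get((user_id, project_id))
    if level is None:
        return False  # no precomputed row means no access at all
    if level >= REQUIRED_LEVEL[ability]:
        return True
    # Extra lookup on top of the base level: this is the additional cost
    # that is not covered by the precomputed table.
    return ability in custom_abilities.get((user_id, project_id), set())


base_levels = {(42, 7): 40, (43, 7): 40}
custom_abilities = {(42, 7): {"delete_project"}}
```

Both users are stored as maintainers with level 40, but only user 42 passes the `delete_project` check, because that answer comes from the second lookup and not from the stored level.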

D

And this is the exact reason why we first had the technical discovery on customizable roles, because we were very concerned about this very issue, that this might have very poor performance.

C

Okay, makes sense. So what you ended up doing, at a high level, is decoupling the two.

D

Not entirely decoupling. We still have the uncustomized roles. We have the base access level, so that's reflected in this table. What we try to do is move away from relying on project authorizations for these specific permissions and rely on on-the-fly calculations, so we can check whether the user has the custom permission that they need. But that needed to be very performant.

B

It's more like one on top of the other. The existing system continues, but you also check whether this user has any custom permissions and calculate that as well. Okay, yeah. So we have this cache, but we need to keep it updated. Otherwise, like I told you, people don't get the proper access to the projects.

B

So the next question is: when do we update this cache? An update happens whenever we trigger an action that needs to update project authorizations: when you add a member to a group or a project, when you delete a member from a group or a project, or when you transfer a group and the hierarchy changes. In those cases you also have to update.

B

All these actions that warrant a change to project authorizations are when we trigger the update. To update project authorizations we use workers, which are Sidekiq workers. They do it asynchronously behind the scenes, so it does not hold up the request. The worker that we use for it is called AuthorizedProjectsWorker, and it always refreshes authorizations on a per-user basis.

B

By per user I simply mean that the argument that goes into the worker to perform the job is the user ID of the person. As an example, when a user with ID 42 is added to a new project or a group, we would usually run AuthorizedProjectsWorker.perform_async(42), and that will do the calculations.

B

It figures out whether the user with ID 42 needs access to any new projects or has lost access to any project, and we remove and add project authorization records, the rows in that specific table, as necessary. That is what we are used to, and we now call this the legacy system, because we have something new to do this.
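The per-user refresh can be pictured as a diff between the rows currently stored for one user and a freshly computed set. The sketch below is a simplified Python model under that assumption. `refresh_user` and `compute_fresh` are hypothetical names, and the real worker computes the fresh set with database queries, not from a dictionary.

```python
def refresh_user(user_id, table, compute_fresh):
    """Bring one user's rows in line with a freshly computed result.

    table:         dict mapping (user_id, project_id) -> access_level
    compute_fresh: callable returning {project_id: access_level} for the user
    """
    fresh = compute_fresh(user_id)
    # The expensive part: every existing row for this user is loaded, even
    # when the triggering action touched a single project.
    current = {p: lvl for (u, p), lvl in table.items() if u == user_id}

    to_remove = [p for p, lvl in current.items() if fresh.get(p) != lvl]
    to_add = {p: lvl for p, lvl in fresh.items() if current.get(p) != lvl}

    for project_id in to_remove:
        del table[(user_id, project_id)]
    for project_id, level in to_add.items():
        table[(user_id, project_id)] = level
    return to_remove, to_add


table = {(42, 1): 30, (42, 2): 40, (7, 1): 50}
removed, added = refresh_user(42, table, lambda uid: {1: 30, 3: 40})
```

In this toy run user 42 loses project 2 and gains project 3, while the unchanged row for project 1 and the row belonging to user 7 are left alone.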

B

So when you say the legacy worker, it is actually AuthorizedProjectsWorker that we are talking about. We had many challenges around AuthorizedProjectsWorker. The main challenge is that, as I told you, this works on a per-user basis, and the per-user calculation is very expensive. It is also not really necessary to do this on a per-user basis.

B

As an example, when a new user is added to a project, you do not need to fetch data about this user's access to all other projects. If I'm added to the GitLab project, essentially I need to add an entry for Manoj with an access level of 40 to the GitLab project. But what happens inside this worker is that it also fetches all of my existing project authorization records.

B

That is a very expensive calculation, and that was the major challenge, because the time required to do this is very high, and when you have so many of these jobs in the queue, you have infrastructure challenges. We couldn't meet SLOs on certain days. The worker is doing a recursive query, because we have a hierarchy: we have groups, then projects and subgroups, and then projects inside those.

B

So it used to do a recursive query, but in May 2023 we changed to linear queries. I think Abubakar worked on it. I still don't know if the feature flag is turned on, but we have been moving in that direction, which removes the expensive part. It's not recursive anymore, it's based on linear queries, so it should improve the performance a bit.

B

Next. So, like I told you, this is the legacy worker now, which means that we have something new in place of it.

B

So around mid-2020, when we started having infradev issues around project authorizations, we used to have a lot of problems, and that is when we decided to think about alternative solutions, something that could improve performance and not have so many infradev issues coming our way. The alternative that we came up with is called a scoped worker, which would be much more performant. Let's look at scoped workers. The scoped worker we have right now is called ProjectRecalculateWorker, and "scoped worker" simply means that it works on a specific...

A

I'm really sorry, this is happening right in the middle of the call. Okay.

B

Yeah, so this does refreshes on a per-project basis. If you remember, the last one did refreshes on a per-user basis; this one does it on a per-project basis. For example, when a project with ID 40 is transferred from one group to another, we simply run ProjectRecalculateWorker.perform_async(40), where 40 is the project ID.

B

Inside this job, it is able to fetch all the members arising from all the different areas: from within the group, the ancestors, direct members of the project, and people from project group links. It is able to fetch all of them together and figure out all of the users that have access to this particular project. That is what we mean by working on a per-project basis. This is the diagram of the case that I was trying to tell you about.

B

A project P is moved from group A to group B, and now it is part of group B. You can imagine that there are users in group A and users in group B. Previously, users in group A had access; now they will lose access, and people in group B should get access. So what used to happen earlier?

B

We used to take a union of all user IDs in group A and group B and run the legacy worker for all those users, which means there were multiple jobs running for all those users. If there were 100 users combined in both groups, this was a hundred different jobs. People in group A would lose access and people in group B would gain access. But now this is much simpler. This is just one job.

B

We simply pass the project ID of that project, and it is able to figure all this out inside one single job. When we deployed it, we had a really good improvement in all the metrics that we calculated, close to 99 percent improvement everywhere: in the number of jobs being generated, because this is N jobs versus one job, and also in the time spent in the DB and the number of queries fired. So close to 99 percent improvement in efficiency there.
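The "N jobs versus one job" comparison for a project transfer can be made concrete with a small Python sketch. The helper functions are hypothetical and only build the list of jobs that would be enqueued; they do not model what each job does or how long it takes.

```python
def legacy_jobs_for_transfer(old_group_users, new_group_users):
    # Legacy approach: one per-user job for every user in the union of both groups.
    affected = sorted(set(old_group_users) | set(new_group_users))
    return [("AuthorizedProjectsWorker", user_id) for user_id in affected]


def scoped_jobs_for_transfer(project_id):
    # Scoped approach: a single job keyed on the transferred project.
    return [("ProjectRecalculateWorker", project_id)]


group_a = range(1, 61)    # users 1..60
group_b = range(41, 101)  # users 41..100, overlapping with group A
legacy = legacy_jobs_for_transfer(group_a, group_b)
scoped = scoped_jobs_for_transfer(40)
```

With 100 distinct users across the two groups, the legacy approach enqueues 100 jobs and the scoped approach enqueues one. This toy count only illustrates the job-count side; the measured figures quoted in the talk also cover database time and query counts.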

B

Any questions?

A

B

So everything was not as nice as the last slide showed you. When we started getting good results with the new worker, we thought, why not apply the same thing everywhere, and we would have better results everywhere. That is when our problems started.

B

After we saw this approach, we wanted to try it everywhere. To do this, we started shipping ProjectRecalculateWorker even on group membership updates. When you add somebody to a group, every project under that group should get a new project authorization for this person, which means that if a group has 100 projects under it, we are giving rise to 100 different ProjectRecalculateWorker jobs. But we thought that, since the performance of this worker is far better than the legacy worker, it should be okay, and we shipped this change. This led to an incident. What happened during the incident is that the QA tests started failing, and on checking we came to know that too many jobs were waiting in the queue and the queue depth was increasing; it was higher than usual. We realized that it was because of the changes we deployed for group member updates.

B

In simple terms, what went wrong is that this is a case of a frequent action giving rise to N jobs in one go, where N is a very large number. Adding a member to a group is, if you consider GitLab.com, a very frequent action, and when you add members over the API it happens in quick succession, and each addition triggers a thousand jobs or more. Every job is in the queue, and then the queue gets clogged. That is what happened in this case.

B

So this gave rise to way too many ProjectRecalculateWorker jobs, and it happened one after the other, because we have no control over how customers use the API; they can add a thousand different members to each group if they want. The takeaway from this incident is: never have N refresh jobs generated from one action.

B

If you have one refresh job generated per action, that is fine. So the takeaway here is: when using scoped workers, use them for the right scope. Where we went wrong is that we had a project-scoped worker, but we started using it for an action that takes place in the scope of a group. That should not happen. You should scope the worker correctly and also use it for the right scope.
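The fan-out behind the incident can be shown with simple arithmetic. The Python function below is a hypothetical helper, not a measurement from the incident: it counts how many jobs a burst of group member additions would enqueue under each worker scope, before any deduplication.

```python
def jobs_enqueued(members_added, projects_in_group, worker_scope):
    """Count jobs produced by adding members to one group, before deduplication."""
    if worker_scope == "user":
        # Per-user worker: one job per added member.
        return members_added
    if worker_scope == "project":
        # Project-scoped worker misused for a group action:
        # every added member fans out to every project in the group.
        return members_added * projects_in_group
    raise ValueError(f"unknown scope: {worker_scope}")


per_user = jobs_enqueued(1000, 1000, "user")
per_project = jobs_enqueued(1000, 1000, "project")
```

Adding 1000 members over the API to a group with 1000 projects yields 1000 jobs with a per-user worker and 1,000,000 with a project-scoped worker, which is the "one action, N jobs" pattern the takeaway warns against.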

B

So we reverted the change, and what we currently have is a hybrid approach of the legacy worker and the new worker. On the left you have the new worker, which is scoped to a project.

B

We only use it in areas where the change happens directly on that project, for example when a user is added to a project or when a project is transferred. The change should directly affect that scope. On the right side you can see the legacy worker. We still use it, but we use it for cases where a group is affected: when a group is transferred, or when a user is added to or removed from a group.
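The hybrid routing can be summarized as a lookup from the kind of change to the worker that handles it. This is a Python sketch with hypothetical event names; it lists only the cases named in the talk and is not how the GitLab codebase dispatches these jobs.

```python
# Event names are hypothetical labels for the cases mentioned in the talk.
PROJECT_SCOPED_EVENTS = {"project_member_added", "project_transferred"}
GROUP_SCOPED_EVENTS = {"group_member_added", "group_member_removed", "group_transferred"}


def pick_worker(event):
    """Return the worker name that handles a given kind of change."""
    if event in PROJECT_SCOPED_EVENTS:
        return "ProjectRecalculateWorker"   # new worker, keyed on project ID
    if event in GROUP_SCOPED_EVENTS:
        return "AuthorizedProjectsWorker"   # legacy worker, keyed on user ID
    raise ValueError(f"unhandled event: {event}")
```

The rule of thumb is that the worker's scope matches the scope of the change: a change on one project goes to the project-scoped worker, and a change on a group stays with the per-user legacy worker.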

B

So we continue using a combination of both right now. Regarding the future, there are two different areas that we want to focus on. One is that we have a proposal for a group-scoped worker. We do not have it currently, and it can solve a lot of problems. Hopefully, if that happens, we might be able to retire the legacy worker and use group-scoped workers in areas where the legacy worker is currently being used, like adding a member to a group, moving a group, group transfer, all of that stuff.

B

We also want to remove the safety net workers from across the codebase. On GitLab.com, the safety net workers run on the replica, so there isn't much damage being done that way, but on self-managed instances these safety net jobs run on the primary, and for such installations it might be a problem. So we want to remove the safety net jobs at some point. But before that, we have to make sure that our new workers give us the right output.

B

We haven't been able to measure that, but once we measure it, and if the results are right, we can also remove the safety net calls from the codebase. This is how the safety net jobs function. The first line is the new worker, of course. The second line is that, until we compare the consistency rates, we also enqueue the job in the legacy worker, but with a low priority and a one-hour delay.
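The two lines on the slide are not reproduced in the transcript, so the following is only a Python model of the pattern as described: enqueue the new worker right away, then enqueue the legacy worker as a delayed, low-priority safety net. The queue is a plain list, the dictionary fields are hypothetical, and running the legacy job once per affected user is an assumption based on that worker taking a user ID.

```python
ONE_HOUR = 60 * 60  # delay for the safety net job, in seconds


def enqueue_with_safety_net(queue, project_id, affected_user_ids, now):
    # Line 1: the new project-scoped worker runs immediately.
    queue.append({
        "worker": "ProjectRecalculateWorker",
        "arg": project_id,
        "run_at": now,
        "priority": "default",
    })
    # Line 2: the legacy per-user worker runs later, as a consistency backstop.
    for user_id in affected_user_ids:
        queue.append({
            "worker": "AuthorizedProjectsWorker",
            "arg": user_id,
            "run_at": now + ONE_HOUR,
            "priority": "low",
        })
    return queue


queue = enqueue_with_safety_net([], project_id=40, affected_user_ids=[1, 2], now=0)
```

If the new worker produced the right rows, the delayed legacy jobs find nothing to change, which is the consistency that would need to be measured before the safety net can be removed.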

B

So we want to remove it at some point. That's about it on project authorizations. If you have any questions, I will be happy to answer them. I think we have questions in the doc, but if you have any more questions, you can add them here.

A

We can stop sharing the screen yeah.

B

Is everything clear, or is there any part that I can explain more?

D

I have one question. You described the incident where the queue was clogged with a bunch of jobs. I wonder what happens if, so, let's consider the ProjectRecalculateWorker. Let's assume that the queue is already clogged and a member is added to a project.

D

So a new job is created to recalculate the authorizations for the project, and then something else happens to the same project, and another job is added to the queue for the same project. Could it happen that there are hundreds or thousands of jobs to recalculate the very same project?

B

Yeah, that can happen, but the good thing with the new worker is that the deduplication key is the project ID, so in that case the jobs will be deduplicated. In the other case the deduplication key was the user ID, so if you add different users to the same group, each job is unique, because the key is the user ID. So it is never deduplicated.
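A toy model of deduplication by job arguments shows why the choice of key matters. The `DedupQueue` class below is a hypothetical Python stand-in, not Sidekiq's or GitLab's deduplication mechanism: it drops a job when an identical worker and argument pair is already pending.

```python
class DedupQueue:
    """Toy queue that refuses a job if the same (worker, arg) is already pending."""

    def __init__(self):
        self._pending = {}  # (worker, arg) -> True, in insertion order

    def push(self, worker, arg):
        key = (worker, arg)
        if key in self._pending:
            return False  # duplicate of a pending job, dropped
        self._pending[key] = True
        return True

    def __len__(self):
        return len(self._pending)


queue = DedupQueue()

# Three changes to the same project collapse into one pending job.
for _ in range(3):
    queue.push("ProjectRecalculateWorker", 40)

# Three different users added to one group never collide on the key.
for user_id in (1, 2, 3):
    queue.push("AuthorizedProjectsWorker", user_id)
```

The queue ends up holding four jobs: one for project 40 and one for each of the three users, since each user ID is a distinct key.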

B

Does that answer your question?

D

Yeah thanks yeah.

C

You've talked a lot about these jobs getting stuck in the queue. What does that actually do? Does it bring down GitLab, or is it just a case of maybe you don't have access to a project you should? What is the user-facing impact of it?

B

It's a mix of both, I would say. While the jobs wait, the access policies are not applied correctly, so you don't get access. And at a higher level, as a whole, our Sidekiq systems come to a stop, because there are other jobs in other parts of the codebase that need to use Sidekiq, and since it is clogged at this particular point, they also cannot be processed. It leads to a cascading effect.

B

Maybe we can also talk briefly about what JC has asked. JC asked whether the async nature of the project authorization workers has ever caused any problems, and I have replied. Async means that it happens behind the scenes: you add a member to a group and they do not get access instantly. It is near instant, because it has to go to the background and be processed.

B

We have no control over when that job finishes. For example, if jobs are stuck in a queue, it may finish after one minute or so. This was not always the case. We did not have async refreshes from the beginning. It used to be a sync refresh, which means that as soon as the request finished, you also had access to the project. But at some point we had to change that, because it is not always nice to wait on the request.

B

It will increase the response time, and it will also skew your metrics for the error budgets and things like that. We also had other reasons to go async everywhere, so we shifted to async mode. We are aware that access is not instant, it is near instant, and we had to take that call.

B

We are okay with having near-instant access. It might change after a few seconds or one minute or so, but we also make sure that these queues are considered high priority by the GitLab.com Sidekiq fleet, so it tries to finish them as soon as possible. But it is not guaranteed to be instant.

C

So you're on Tenant Scale now. How is the concept of organizations, and everything you're working on there, factoring into this?

B

To be honest, we haven't reached the stage where we have started to think about project authorizations yet. Projects will still exist, and users will still have access to them, so we are confident that some version of the table will continue to exist. But I think we were also talking about whether group authorizations need to happen, because now organizations will have groups, and it might be much easier to precompute that value and store it somewhere rather than figuring it out on the go. So that might happen, I guess.

B

But we haven't reached the stage where we could have a concrete plan about what to do with project authorizations.

C

D

A related question: there's also the group/project consolidation effort, and I was wondering, since you mentioned the group authorizations table that you are considering, would it make sense to just create a namespace authorizations table and use that for both projects and groups, instead of treating the two as different entities? Or are there benefits to keeping them as separate entities?

B

What I think about it is that there are variations between projects and groups in how we get memberships, because there are group-to-group shares but there are no project-to-project shares, so variations exist that way. Even if you make it namespace ID, access level, and user ID, internally we will have to differentiate somewhere between a group and a project, then fetch their members and do the whole thing.

B

So that condition would still appear somewhere. But yeah, having it as namespace would make sense, because then you can contain it in one single table. Otherwise you would have to split it into project authorizations and group authorizations.

A

B

If there are no more questions, I can stop the recording. Okay.

C

Yeah, there are no more questions. I'll just make a general comment that you did an excellent job breaking this down to the point where even I could follow along, so great job. It takes a lot to take something like this and make it easily consumable, so nice job there.

A

D

Yeah, and I think this summary and the presentation you provided are very, very useful, because there's a lot of detail and historical context to project authorizations, and this was a very nice story.

B

I always think about the fact that somebody in the group should be acting as the historian of Manage:Auth, because there is a lot of context at this point. Somebody needs to be there, because it is very easy to lose context. There is a lot of historical context around the considerations we made to reach this point.
