From YouTube: Govern: Security Policies - FCL for Incident #8159
Description
This is a pre-recorded summary of work related to the FCL for the Govern: Security Policies team.
Incident: https://gitlab.com/gitlab-com/gl-infra/production/-/issues/8159
RCA: https://gitlab.com/gitlab-org/gitlab/-/issues/387556
FCL: https://gitlab.com/gitlab-com/feature-change-locks/-/issues/34
Hello everyone, my name is Alan Versevsky and I'm an engineering manager for the Govern: Security Policies team. Today I would like to talk to you about a feature change that our team was involved with. During this video I will cover what happened, what we identified during the root cause analysis process, what we planned and what the goal of the Feature Change Lock process was, what we actually did during this phase, and what the plan for our team is going forward. So, let's start with what happened.
We were working on one of the issues to improve scan result policies with the ability to compare results from all pipelines related to a change when evaluating them. The main problem we identified is that currently, when scan result policies run, they only take the latest pipeline into account, not all pipelines created for a given MR. So, for example, you could have merge request (detached) pipelines that actually run alongside the main pipeline for your MR, and we would not be able to take them into account when evaluating the scan result policy.
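To make the gap concrete, here is a minimal sketch of the two evaluation strategies, in Python rather than the actual Ruby code; `all_pipelines` and `findings` are hypothetical stand-ins for the real GitLab models:

```python
from itertools import chain

def findings_for_merge_request(merge_request):
    """Sketch: latest-pipeline-only vs. all-related-pipelines evaluation.

    `merge_request.all_pipelines` and `pipeline.findings` are invented
    names; only the general idea is taken from the video.
    """
    # Current behaviour: only the latest pipeline is consulted, so
    # findings from detached MR pipelines can be missed.
    latest_only = set(merge_request.all_pipelines[-1].findings)

    # Improved behaviour: union the findings across every pipeline
    # related to the change, including detached MR pipelines.
    all_related = set(chain.from_iterable(
        p.findings for p in merge_request.all_pipelines
    ))
    return latest_only, all_related
```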
We were not able to remove the approval and evaluate this properly, and that is why we created this issue, so it would help us with doing this. We decided to work on this feature behind a feature flag, because we knew that at some point we were going to encounter some performance issues and wanted to make sure that the change was easily revertible.
We started enabling the feature flag: first we enabled it on staging, then, after nine days, we decided to enable it on production, and that's when the incident happened. The incident was declared as an S2 incident and was communicated to customers: background job processing degraded and, for 35 minutes, it impacted both web and API. This is the incident itself, and you can read more there about what exactly happened and what kind of slowdown we saw.
After the incident, we decided to go through a full root cause analysis to understand what happened and how we could avoid it in the future. We created an issue to help us pin down the exact root cause, the problems, and the things we can improve. So let's get started with what actually happened and what went well. You can read the timeline in the issue; I will link the issue in the description.
A
However,
you
can
see
what
actually
was
happening
during
few
days
and
and
the
most
important
thing
is
that
it
was
enabled
on
staging
on
5th
of
December
and
then,
after
nine
days
on
13th
of
it
was
starting
to
be
rolling
out
on
production
for
like
percentage
of
actors,
and
then
we
saw
the
incident,
the
aptex
dropped
to
92
percent
and
we
saw
the
alert
from
alert
manager
on
our
Channel.
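For context, Apdex is a standard score: satisfied requests count fully, tolerating requests count half, frustrated requests not at all. A minimal sketch (the 500 ms threshold here is illustrative, not GitLab's actual target):

```python
def apdex(response_times_ms, threshold_ms=500):
    """Apdex = (satisfied + tolerating / 2) / total requests.

    Satisfied: at or under the threshold. Tolerating: up to 4x the
    threshold. Anything slower is frustrated and counts as zero.
    """
    satisfied = sum(1 for t in response_times_ms if t <= threshold_ms)
    tolerating = sum(1 for t in response_times_ms
                     if threshold_ms < t <= 4 * threshold_ms)
    return (satisfied + tolerating / 2) / len(response_times_ms)

# A sustained drop below the SLO, like the 92 percent in the incident,
# is what trips the Alertmanager alert (values here are made up).
samples = [120, 300, 2600, 80, 450] * 20 + [5000] * 8
if apdex(samples) < 0.95:
    print("Apdex below SLO -- page the on-call engineer")
```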
We quickly got on a call, identified the root cause, and disabled the feature flag; after a few minutes everything went back to normal. The most important thing is that we tried to identify what actually went well. The good news is that the feature was introduced behind a feature flag. That's a clear reason to use feature flags, especially if you know that you're experimenting with new improvements that could cause performance problems.
The other thing is that the monitoring of production worked perfectly. We were able to declare a new incident, get on the call, establish that this change could have been the cause and, with the support of other teams, disable the feature flag, after which we saw performance improve. It was great to see the cooperation between multiple teams to quickly get things back to normal. Now, what could we improve?
That's what we looked at during the root cause analysis: we were trying to understand what we could improve. We noticed during this analysis that, while enabling this feature flag, we were not waiting long enough between successive portions of actors. So we thought that encouraging engineers to wait for at least 15 minutes before enabling the next phase of the feature flag would help us reduce this problem.
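A minimal sketch of what that process change could look like; the helpers passed in are hypothetical stand-ins, not real GitLab ChatOps commands:

```python
import time

ROLLOUT_STEPS = [5, 10, 25, 50, 100]  # percentage of actors per phase
MIN_WAIT_SECONDS = 15 * 60            # the 15-minute pause from the RCA

def gradual_rollout(flag, enable_percentage_of_actors, apdex_is_healthy):
    """Roll a flag out step by step, pausing and checking metrics.

    `enable_percentage_of_actors` and `apdex_is_healthy` are invented
    callables standing in for ChatOps and dashboard checks.
    """
    for step in ROLLOUT_STEPS:
        enable_percentage_of_actors(flag, step)
        time.sleep(MIN_WAIT_SECONDS)  # give the metrics time to react
        if not apdex_is_healthy():
            enable_percentage_of_actors(flag, 0)  # roll back immediately
            raise RuntimeError(f"{flag}: Apdex degraded at {step}%")
```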
A
The
other
things
that
we
not
really
have
information
about
this
and
feature
of
like
roll
up
template.
We
have
the
information
about.
We
should
wait
at
least
few
minutes,
but
it
could
be
easily
overlooked.
The
other
thing
is
that
we
are
not
really
actively
encouraging
Engineers
to
observe
dashboards
after
feature
flag
is
enabled.
We
have
one
sentence
about
that
in
the
documentation.
However, that also might be overlooked, so we decided that having it as an explicit point in the feature flag template, together with enhanced documentation around enabling feature flags, would help us avoid similar problems in the future. The other thing is that, even though the flag was enabled on staging, we identified that we had made a bad assumption that we were free to go ahead and enable the feature flag on production.
We did not have enough data on staging for our group, so we were not able to properly identify the problem beforehand. We should also investigate the potential impact on performance during the development and testing of a feature. During review, whenever we're doing something that might affect multiple parts of GitLab, we should write down what could happen and what its performance impact could be.
We were not working with the database, so we were not able to identify the problem at the database level; we were not able to inspect things like query time, because we're doing everything in Ruby to read the report.
As this was an S2 incident declared for customers and it was visible, we decided to go with the FCL route. If you want to read more about the FCL, the Feature Change Lock, you can go to the documentation in the handbook and read about the process, why and when it applies, and what we actually did. You can read more about this there.
After this root cause analysis, we created the FCL issue; let me open it for you.
We set the due date for it, and it's actually finishing today. We laid out the work plan we wanted to follow, we wanted to update stakeholders about it, and we wanted to make sure that we could focus on this one, so we stopped all other development on the backend side and started working. The first thing we identified in the work plan for this phase was that we needed to revert the MR.
We wanted to make sure that we would not reintroduce a similar issue in the future by accident or anything like that. We reverted it because we knew that we needed to work more on improvements and that this would not happen during the FCL phase. After that, the feature flag itself was removed from GitLab, so it is not available anymore, and the source code behind it is no longer exercised.
Then we decided to take a look at the recent improvements that the Threat Insights team made, which we could use: storing findings in the database and using them to synchronize approval rules for scan result policies. Right now we are using the JSON security reports, so every single time we are parsing the security reports, extracting the data, and calculating the proper value,
that is, whether the MR approval is required or not. So we're actually parsing the JSON files twice: once when we're creating security findings in the database, and again when we're updating the approval rules for scan result policies. We want to build on the work of Threat Insights, the great improvement they made, and change the way we calculate whether a given approval is needed or not.
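To illustrate the difference, here is a hedged sketch in Python; the table, column names, and the severity rule are invented for illustration, not the actual schema:

```python
import json

def approval_required_from_json(report_path,
                                severities=("critical", "high")):
    """Current approach: re-read and re-parse the full security report
    every time the approval rules are evaluated."""
    with open(report_path) as f:
        report = json.load(f)
    return any(v["severity"] in severities
               for v in report.get("vulnerabilities", []))

def approval_required_from_db(db, pipeline_ids,
                              severities=("critical", "high")):
    """Proposed approach: query the findings Threat Insights already
    ingested, instead of parsing the JSON a second time."""
    placeholders_p = ",".join("?" * len(pipeline_ids))
    placeholders_s = ",".join("?" * len(severities))
    row = db.execute(
        f"SELECT 1 FROM security_findings "
        f"WHERE pipeline_id IN ({placeholders_p}) "
        f"AND severity IN ({placeholders_s}) LIMIT 1",
        (*pipeline_ids, *severities),
    ).fetchone()
    return row is not None
```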
That's why I created this epic, where we laid out the implementation plan and what we want to do to improve this. If you want to read more, go to the epic; you can read about all the considerations we had during this phase. The other thing we wanted was to make sure that we can catch similar issues in the future, so we created a spike to investigate how we can actually check whether performance is acceptable
A
For
this
type
of
feature,
so
we
did
the
whole
investigation
and
in
this
investigation
we
we
wanted
to
to
test
and
evaluate
different
options
that
we
could
take
and
the
most
important
thing
as
a
result
and
the
outcome
of
it
is
that
we
should
work
on
staging
more
and
we
should
add
more
data
and
stating
so
you
can
identify
this
issues
in
the
future
and
we
should
observe
the
metrics
that
we
see
from
staging
to
identified
issues
in
the
future.
We created a separate issue to use the security reports data seeder to improve the data and actually have that data on staging.
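As a rough illustration of what such a seeder does (all names and volumes here are assumptions, not the actual tool):

```python
import random
import uuid

SEVERITIES = ["critical", "high", "medium", "low"]

def seed_findings(db, project_id, count=50_000):
    """Populate staging with enough synthetic security findings that
    performance regressions show up before a production rollout.

    `security_findings` and its columns are invented for the sketch.
    """
    rows = [
        (str(uuid.uuid4()), project_id, random.choice(SEVERITIES))
        for _ in range(count)
    ]
    db.executemany(
        "INSERT INTO security_findings (uuid, project_id, severity) "
        "VALUES (?, ?, ?)",
        rows,
    )
```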
That was the work we wanted to do in terms of source code, what we can do in the code itself. But then we realized we also needed to work on some improvements in terms of processes.
First, we wanted to include a step to verify dashboards after enabling a feature flag, so we updated the feature flag rollout template. Checking the dashboard is now part of the process of enabling a feature flag.
The other thing we were working on, in MRs that are ready to be merged, is adding sections to the feature flag documentation about metrics verification. You can see more about it in those MRs: what you can do, what kind of steps you need to take, and how to identify
whether your feature flag actually caused an incident or not.
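The sort of check those sections describe could look roughly like this; it is an illustrative sketch, not a documented GitLab procedure:

```python
def flag_likely_caused_incident(metric_points, enabled_at, tolerance=1.5):
    """Compare an error-rate metric before and after the flag toggle.

    metric_points: list of (timestamp, error_rate) samples; enabled_at
    is the toggle time. Names and the 1.5x tolerance are assumptions.
    """
    before = [v for t, v in metric_points if t < enabled_at]
    after = [v for t, v in metric_points if t >= enabled_at]
    if not before or not after:
        return False  # not enough data on one side of the toggle
    baseline = sum(before) / len(before)
    observed = sum(after) / len(after)
    # A sustained jump well beyond the baseline points at the flag.
    return observed > baseline * tolerance
```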
The other thing is that we want to talk with other team members at GitLab to understand whether there is a need to add a link to the dashboards when asking ChatOps to enable a feature flag. We wanted to encourage people to check metrics after enabling a feature flag, so maybe adding a small sentence like "verify metrics", with a link to the dashboard, could help us with that.
The other thing, something we should consider and are still discussing (this was shared with others), is that maybe we should add a production check whenever we're enabling a feature flag, requiring us to wait at least some amount of time before enabling the next portion of actors.
So that's what we wanted to do, and that's what we did. As for the future, what we're going to work on is this epic and, as a result of the spike, one issue that we wanted to pick up.
In terms of what we're going to do next: we want to work on the data seeder together with the Quality team, to improve it and to help them build similar tools that will help us with this. We actually already have an MR with a POC of how it could look. The other thing is
that we have created this epic, scheduled for 15.10, in which we're going to work on improvements to the worker itself so it synchronizes the approval rules using the database instead of reading JSON files, which is quite expensive in terms of resource usage. We also confirmed that we can already do that, with the small POC we built, and you can read more about it in the issue and in the MR itself.
So, I encourage you to check the links that will be added to the video description. We're going to talk about the retrospective during the synchronous meeting, where I'm going to share all the lessons learned with you. If you have any questions, add them to the meeting agenda. Thank you. Goodbye.