Description
Closing ceremony for FCL for Govern: Security Policies team.
Pre-recording: https://youtu.be/ZpOxrCIPguY
Incident: https://gitlab.com/gitlab-com/gl-infra/production/-/issues/8159
RCA: https://gitlab.com/gitlab-org/gitlab/-/issues/387556
FCL: https://gitlab.com/gitlab-com/feature-change-locks/-/issues/34
A
Hello everyone, welcome to the Security Policies FCL closing ceremony. We're here today to celebrate the Feature Change Lock process and that we finished the work for this one. We already prepared a pre-recorded session with a general update about the FCL, why it happened, and what we did during this phase. Today is all about celebrating, talking about possible outcomes for the future and what kind of plans we have, and answering questions. So yeah, I see Wayne, you have the first item.
B
I reviewed the FCL. I've reviewed it over time, but I reviewed it again today. Many issues in the work plan are closed and many MRs are merged, which is great, but not all, and that's not expected anyway. My question is: for the items still open, which ones give us a high risk of a repeated incident? Should we hold the FCL open to complete that subset?
B
In particular, the one that jumped out at me, though you and Dominic know the details better than I do, is this one, and I'm just reading from the RCA: as the issue was not reproducible on the staging environment, we might look for ways to increase traffic in staging to be able to properly evaluate how a given change affects the environment. That's from the RCA. From reading the FCL, it sounds like we started some spike work to investigate this but haven't completed it as part of the FCL.
A
Oh yeah. Among the items you mentioned, we have some suggestions that we will leave open because they do require discussion with other people at GitLab. So that's the first thing: we're waiting for feedback on that one.
A
The good thing about the whole process and the incident is that the feature was introduced behind a feature flag, so that process worked, and the main cause, the code change that caused this incident, was reverted. First it was disabled by the feature flag, then it was reverted, and we will not make any further improvements to scan result policies until we have finished the epic that tracks the improvements for approvals.
A
So that's the second thing. The first thing, on staging: we started manually creating a group with projects and MRs to get more and more data, but it will require some work with the quality team. That's why we created this spike, and after the spike we'll create an implementation issue and track things with them. If we are able to help, we would love to help, but this will require some work with the engineers in test.
A
So those are the biggest three. The last item, but in my opinion, from my perspective, the most important one: the main root cause of this problem was not paying attention to dashboards after enabling the feature flag and not keeping a proper distance between enabling the feature flag for the next portion of actors. So we made two improvements during the process.
A
First of all, in the feature flag rollout template we've added a step that you should actually verify this, and then in the documentation we added a section that explains what to look for: which dashboards to look at, what kind of data to look for, and so on.
A
During this RCA, talking with SREs and other engineers, there was also one piece of feedback in the chat, a suggestion that we do not have good materials for engineers, especially for someone who just joined GitLab, such as a training in Level Up or any other place to help them get familiar with dashboards and know what kind of metrics to look for. We were learning that during this process, and also before, but it would be great to have this.
A
This kind of training, I actually wanted to ask you about this: is this something that we should discuss with the Learning and Development team, or something that we should initiate ourselves? I'm not sure about the process for requesting something like that.
B
I really appreciate all the analysis on this, and it sounds good to me. It's somewhat low priority, but I would have left that low-priority feedback on the FCL. Oh, sorry, it's read-only.
B
And I always like providing links to make things easy to find, so I'll add that to the notes here too. That will be it from me. Steph, Dominic, any thoughts or feedback?
C
I think the most important part will be to enable staging to alert us on performance regressions, but this takes a lot of work. There's an epic to track this, and there's also a separate issue that Quality is tracking for automatically seeding the staging database, and I think this is the most promising way forward. I was also looking into ways to get alerts, and we have two options: either we increase the traffic in staging so that the production alert thresholds would work there.
C
This would mean we create a very large number of these repositories. Or we change the production rules so that the alerting works in staging. As far as I can tell from this epic, they chose the second option, to update the production rules, and I think until then the only option is that we roll a custom, homebrew metric alerter. But then you also run into questions like: where do you actually deploy this thing?
C
Then you have one single bespoke service, which maybe runs in some scheduled CI pipeline, but it's still kind of wonky to have this specific thing handle the metrics reporting. So I would also suggest waiting until this is sorted out and staging alerting works as desired.
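As a rough illustration of the second option described above (environment-aware production rules rather than more staging traffic), an alerting rule could simply carry a lower threshold for staging. This is only a sketch; the metric name, environment labels, and thresholds are hypothetical, not the actual GitLab.com rules:

```yaml
# Hypothetical Prometheus-style alerting rule: one definition, with a lower
# threshold for staging ("gstg") so the alert can fire on staging's much
# smaller traffic while keeping the production ("gprd") threshold unchanged.
groups:
  - name: security-policies-latency
    rules:
      - alert: PolicySyncQueueLatencyHigh
        # sidekiq_queue_latency_seconds and the environment label are assumed names
        expr: |
          max by (environment) (sidekiq_queue_latency_seconds{environment="gstg"}) > 30
          or
          max by (environment) (sidekiq_queue_latency_seconds{environment="gprd"}) > 300
        for: 10m
        labels:
          severity: s3
        annotations:
          summary: "Sidekiq queue latency is elevated in {{ $labels.environment }}"
```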
A
Thank you, Dominic. Will you update the document with the links to those two epics?
A
All right, so that would be it. Thank you for your questions. The last item is that we should do the retrospective. I'll ask you for input in a moment, but I've already mentioned a few things, so let me share my screen quickly.
A
I've mentioned two items under what went well and what was great. The most important thing is that most of the features, let's say all features related to enabling anything in these kinds of policies, were implemented behind feature flags. We were aware that at some point our experiments would end up in something we'd need to revert quickly, so we followed the process to actually be able to enable and disable it quickly.
A
That's a great thing we did: we were able to revert the change within a few minutes of identifying the issue, and the whole process in terms of communication with Support, with the SRE team, and with the many people involved was great. It was really quick and a great experience for us.
A
The other thing is that we started learning more about metrics, not only dashboards but also other, more specific metrics with which we can track a particular worker or a particular class that is not working properly. We would like to continue doing that and will monitor those as we implement new improvements. This is really important.
C
The whole application suffered: notes, merge requests and everything were impacted. I looked into how the GitLab Sidekiq setup works. We have these workers, a number of workers, and this is metadata, and then you have a number of routing rules, and these routing rules map the workers onto a number of shards. In total we currently have, I think, eight or nine, and you can think of these Sidekiq shards as the unit of isolation.
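For readers unfamiliar with the setup described here: routing rules in GitLab's Sidekiq configuration map workers, by attributes from their metadata, onto named queues, and each shard is a separate pool of Sidekiq processes serving one of those queues. The sketch below only illustrates the shape of such rules; the queries, queue names, and the dedicated shard are made up, not the real GitLab.com configuration:

```yaml
# Illustrative only: routing rules as they might appear in gitlab.yml.
# Each rule is a [worker-matching query, destination queue] pair; the first
# matching rule wins, and each destination queue is served by its own shard
# (a separate pool of Sidekiq processes), which is the unit of isolation.
sidekiq:
  routing_rules:
    - ["urgency=high", "urgent-other"]                              # high-urgency workers
    - ["resource_boundary=cpu", "cpu-bound"]                        # CPU-heavy workers
    - ["feature_category=security_policies", "security-policies"]   # hypothetical dedicated shard
    - ["*", "default"]                                              # catch-all shard
```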
C
So whenever one of the workers in a shard gets impacted, the whole shard potentially suffers as a consequence. I was wondering, and asking, if we cannot move our worker to a separate shard, because ideally, if you have one worker per shard, the workers are actually isolated. The reason it's not configured this way is performance: the way Sidekiq is set up contains a trade-off where reliability is traded for performance to some degree.
C
Now, there's also a quarantine shard which was configured a year ago, and there are two workers currently in that shard, and apparently there was also a problem with this. I was asking if we cannot move our worker to the quarantine shard as well, because it's still better if these two workers get impacted than the whole application suffering, as it currently does. The response I got was: if you have three workers that are potentially vulnerable, why would you put them in one shard?
C
I'm not super convinced that this is the best argument, because the consequences are still better than when you cannot access notes, merge requests and everything, right? So I was thinking maybe I just open a draft there, or assign it to the reliability folks and see what they say, because it seems worth a try to me. Either just one worker falls over, which really only limits functionality for security policies, or you cannot access your notes or your merge requests.
A
Thank you, thank you for the feedback. Yeah, we definitely learned a lot about things related to performance and reliability, we improved our relationship with those teams, and hopefully we'll work more with the Quality team in the future.
A
The good thing is that the timing for it was great, because we were starting the FCL right after the Threat Insights team finished working on moving all security findings for all branches to the database. They just finished and turned on the feature flag by default, and now we can use that data. So it will help us with moving the whole thing to the database.
A
The main problem currently is that we're not using the database; we're reading blobs, and those blobs are stored in object storage. So you need network traffic to fetch them, then some time to parse them, and so on. Now we're going to cut the time spent parsing this data and work directly on the data in the database. So this will help us as well.
A
All right, so thank you, thank you for your work, Dominic. Sashi is not here, but yeah, let's thank him as well. It was great seeing everyone collaborating on those things and being involved in the process, so that was amazing work. Now we can get back to normal and make sure that we deliver those improvements in the future.