From YouTube: RCA Retro gitlab-com/gl-infra/production/issues/1498
Description
Incident report details in https://gitlab.com/gitlab-com/gl-infra/production/issues/1498
RCA issue https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/8729
Feature issue https://gitlab.com/gitlab-org/gitlab/issues/36709
A: Repeat that? I don't know who's... who's running the meeting. Yep, it's in the agenda, so I'll be driving, the moderation part at least, and Cameron can provide some timeline details, given that he was part of the original incident report. I'm just giving everyone some time to join, because we have a large audience, and that's why we are not starting right away.
A: All right, one minute past the scheduled time; I think we are ready to start. I'll be driving this meeting, like I said earlier. I'll just give a very short expectation setting, so that everyone knows what to expect in the next 25 minutes... I'm getting distracted by [Zoom chat] things. All right. So the goal, as with any RCA, is making sure that we understand what happened and what we can improve. We're not going to be pointing fingers at any of the teams or people that were participating in this.

I'm not going to read the whole timeline; the timeline is there for async reading, and I want to make sure that we have enough time for the discussion points. So I'll first hand it over to Cameron, to see whether he wants to specifically point out anything from the timeline or the incident report before I open it up for discussion. Cameron?
B: The intention initially was a refactor, right, of the load balancing, I think it was. This was one of those things where a lot of unfortunate events ended up tying together, and it wasn't strictly related to load balancing. To put it in one sentence: what we had wanted to do was to provide an API to allow the runtime to identify itself, as Sidekiq, or the Rails web app, etc.
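[Editor's note: a minimal sketch of the kind of runtime-identification API described here. This is illustrative only, not GitLab's actual implementation; the module and method names are invented for the example.]

    # One central place that answers "which runtime am I?", replacing
    # ad-hoc checks scattered across the code base.
    module Runtime
      class UnknownRuntimeError < StandardError; end

      def self.sidekiq?
        # Sidekiq.server? is only true inside a Sidekiq server process.
        !!(defined?(::Sidekiq) && ::Sidekiq.server?)
      end

      def self.web_server?
        !!(defined?(::Puma) || defined?(::Unicorn))
      end

      def self.identify
        return :sidekiq if sidekiq?
        return :web_server if web_server?

        # A process that cannot identify itself ends up on code paths
        # that were never meant to run for it.
        raise UnknownRuntimeError, 'process could not identify itself'
      end
    end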
B: So it was really just meant as a refactoring, basically. And the load balancing code, which I believe only executes as part of EE installations (that's the part I'm really fuzzy on), was affected by this, and that's where it then tied into this change: a certain runtime should have identified itself, in this case Sidekiq should have identified itself as Sidekiq, but it did not identify itself as Sidekiq, which then led to a code path being triggered that was not meant to be triggered. That then led to all kinds of weird errors that looked nonsensical at first, and that's where Stan then stepped in and kind of connected these dots.

Yes, so specifically to the point of how the change was tested and confirmed: I do remember struggling with it, but I did all my testing on the GDK. So again, it's local instances that we ran for Puma, Unicorn and Sidekiq. So what I...
B: So I think part of the problem was that it was not really tested against Omnibus, and it specifically broke under Omnibus. Part of this check for the runtime to identify itself kind of snuck into what should have just been a refactor, and maybe I can talk about this later, but what it was: for the current component that runs, we tested for a specific script name that should execute for Sidekiq, which worked when we tested it locally against the GDK while pairing. Those script names change in Omnibus, so that's kind of where it started to break, and then that just slipped past us.
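[Editor's note: a sketch of the failure mode being described: detecting the runtime from the name of the script that started the process. In Ruby, $0 holds that script name. The helper below is invented for illustration; the check depends on how the process is packaged and launched, which is exactly why it passed locally and only failed under Omnibus.]

    # Fragile: passes under GDK, where Sidekiq is started via a script
    # literally named "sidekiq", but under Omnibus the same process is
    # launched through a differently named wrapper script, so this
    # returns false and the process goes unidentified.
    def sidekiq_by_script_name?
      File.basename($0) == 'sidekiq'
    end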
B: Had I known, I... I might have not considered this. This was one of the bigger stories I've worked on as well, so I'm still kind of learning the ropes, as you could say, so I might have not considered that. Also, yeah, I'm not sure: does that run Sidekiq alongside the Rails server as well, or does it just start the Rails app?
B: Yeah, I suspect I did not look at that, yeah. But thanks for the point; that's a really good point. I think you're right that it might not have caught it, because I think this particular code path that was affected required a particular license. But this is the bit I'm a bit fuzzy on; I think it was something EE-specific, at least, so I'm not sure, but I believe you're right.
A: Okay, so are there any points that we want to discuss here? I think it's kind of clear how this went through. It was not obvious, number one; it was not obvious how this works under Omnibus, number two; there was no testing in the package-and-QA step in the pipelines; and then we went all the way to staging, where it actually encountered the first EE instance.
B: ...to do this head-first, I guess, because it wasn't super well groomed, and I think we just looked, as we went along, at how we could solve this problem. And I do remember looking at... or, I don't remember exactly, but we were thinking: well, this sounds like something someone has probably solved before. So we looked at the New Relic client library, at how they do it, and they were actually doing the same kind of checks.
E: Is Mike here? So, not to point fingers at anybody, but had they considered using the feature proposal template? Because that template has a specific section for test planning. And if not, was it because this is backlog tech debt, a refactor, for which we use a freeform issue format? I just want to understand: how can we facilitate raising these test planning requirements earlier on?
E: I do want to highlight that we do have a refactor template now, and I think that's worth a call-out; I'll be signal-boosting this so teams can use it. Because if you look at the refactoring template, it's actually aimed more at testing than the feature delivery template is: there are sections like, what's the blast radius, what are the unintended side effects, and what are the test levels that need to be considered, starting from the unit tests.
E: And with the refactor template, I think we should make package-and-QA more of a mandatory thing: if you are doing a refactor, please ensure that we are testing everything end to end, so nothing is missed. So, thanks for facilitating the discussion, Marin. And I want to give kudos: I think you were being very humble and transparent, given you're the engineer who worked on this. Thank you for the insights. Back to you, Marin.
B: Sure.
A: Yeah, so I think that brings me to the next question: how come this was not caught by unit or integration testing? I specifically want to highlight that in the merge request we say, in the availability and testing section, that this was impossible to feature flag. That is how we usually mitigate these big changes and try to control the impact. And there is a mention that semantics shouldn't have changed, that the checks just moved behind the interface. So yeah, could you tell me, tell us, a bit more about that?
B: I can just quickly give my point of view, and then... So yeah, I already mentioned it was meant as a refactor, and some changes to semantics kind of snuck in, I guess, which I didn't notice, or which weren't really called out. We should have probably been more rigorous about saying: if there is a semantic change, make that a separate MR, and make sure that this one is purely a refactor. So that's one aspect.

The other one was feature flagging. For this change, I really don't know how we would have done that, because it touches, well, not the entire code base, but it was used in completely disconnected parts across the code base: wherever we need an environment-specific check, like "am I running under Sidekiq", or "am I running as a web server", and so forth, or "am I executing in a multi-threaded environment". For all these things, you would have had to feature flag, like, every single call, basically. I guess it's possible, but...
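[Editor's note: a hypothetical sketch of what per-call-site feature flagging would have looked like. The flag name and the Feature/Runtime helpers are invented stand-ins, only to illustrate why gating every scattered check was impractical.]

    # Every one of the disconnected call sites would have to carry both
    # the legacy ad-hoc check and the new centralized one:
    def running_in_sidekiq?
      if Feature.enabled?(:use_runtime_identification) # hypothetical flag
        Runtime.sidekiq?                          # new centralized check
      else
        defined?(::Sidekiq) && ::Sidekiq.server?  # legacy inline check
      end
    end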
B: Not sure, yeah. Again, as events went, I guess: what I did was, yeah, I went to Kibana to look at the preprod logs, where the staging logs appear. And I guess what I was expecting to see at this point... So at this time (I think I have meanwhile pushed another change that actually removes these log lines), but at the time, we were using the application logger to emit a one-liner which would print out the runtime, as it had been identified by this new class, to the logs.
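[Editor's note: a sketch of the verification step described: one structured log line emitted at boot, recording which runtime the new class identified, so the result can be checked in Kibana after a deploy. Field names are invented; Runtime refers to the sketch above. Note that if identification raises, no line is emitted at all, which is why only the working fraction showed up in the logs.]

    require 'logger'
    require 'json'

    logger = Logger.new($stdout)
    # Identify the process once at boot and leave a searchable trace.
    logger.info(JSON.generate(event: 'runtime_identified',
                              runtime: Runtime.identify.to_s))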
B: It's just that that was not enough, because again, there were certain cases, specific to certain deployments, where this led to errors. What I should have done as well is the opposite check: is anything that I'm not expecting to break actually breaking? Because I basically only saw the fraction of it that was working; I didn't see the fraction of it that was failing. So I guess I wasn't thorough enough there, then.
A: That was one of the points I have a bit below, and that is that staging has a limited set of eyes on it. To me, the only thing we depend on on staging is automation: we have QA tests running automatically after every deployment, and only when they pass do we progress further to other environments. Anything that is in between, anything that is not a failing test, becomes secondary, right? If we depend on manual action, that is not reliable; we don't catch it.
E: So, being fully transparent: this failure was raised in staging, I believe, by three of our tests. Not in the smoke tests; we caught it in three tests from the normal part of the full suite, which were either failing at the start of the test, in the middle, or sometimes at the end. We failed to escalate this to the delivery team. We do have a triage rotation, which is soon to be called the on-call, from the Quality department.
E: If you know this is a refactor, everybody working on it should be monitoring the test results when their change hits staging. And maybe we could do some sort of automation around this, where we notify the MR author: okay, your change has passed or failed the regression suite on staging. At least then there's some prompt of: hey, let's check the staging results when the MR you're working on has the staging environment, or workflow::staging, label.
A: Is it possible to not depend on humans here, and to consider having another level of QA tests? So we have the smoke tests and the rest of the suite; is it possible to have the smoke tests, then reliably passing tests, and then the rest of the suite as well? Because I would rather plug the deployment system into an automated system and have that do the gating than depend on an on-call PagerDuty rotation that still depends on humans, I believe.
E: That is something we're looking into: having another annotation of reliable tests, and then that's the next cohort that you can plug into. I can put that into the corrective actions. We're actually working hard on that now; we have only six flaky tests left in total, and we're making great progress on it. I'll make sure that we tackle that from the quality side, and then you have another level of cohort for you to select from.
D: That made it difficult to comprehend, or to narrow down quickly where to look to see what might have changed. So we were relying really heavily on Kibana and Sentry, and I'll say that, as I recall, Stan narrowed in very quickly, in Sentry, on where he felt the problem was; Kibana led me in a different direction.
D: Stan just happened to be standing by, but I don't want to... And I'll be honest, I don't really remember how well set up the dev escalation channel was when this happened. But I feel like the surrogate for what Stan did would be that we would go right into that. So I felt confident about that: that, as the EOC, I could get that help from someone from dev, to take a look at this and help where I didn't have that expertise. Yeah.
A: Theoretically, if we had had smaller deploy bundles, we could have caught it earlier, and we are driving in that direction. But we have a bit of a chicken-and-egg problem there, with, you know, depending on tests that are not reliable. Can we do it quicker? No, because we need to have reliable tests, and so on.
D: I want to say, in this case, Marin, it was not a normal bundle. This one, I believe, was very large, because we had stopped deploys, for some reason, for a longer-than-normal period. I'm not complaining about the current deploy schedule or the volume of each deploy; I'm just saying that this one seemed, as I recall, much larger, and so it made it very difficult to know what changes were in it. It was overwhelming how many issues had been committed into that deploy. Alright.
F: We're at time; you're all welcome to go over, but I'm sure I need to head to another meeting. I just want to say this is one of the best RCAs we've done, so thanks, Marin, for organizing and running it. This is really good, and I echo what Mac said: a big part of this, in maintaining the no-blame culture, is the accountability and transparency that Mateus, Mike, and others showed as part of this.
F: That's really, really important, and I know it's a chicken-and-egg sort of situation. Some people, and it's very human, might say: well, I'm hesitant to speak up, because I'm worried about getting blamed. I promise you, nothing increases the likelihood of blame more than not speaking up and not being accountable. But it's a chicken-and-egg thing, so I just want to say: let's use this as an example, and hopefully it's a signal to people in future situations that it is okay to speak up.
A: Eric... yeah, Eric is right, we are over time. But if others are willing, I'm willing to go over by five minutes to kind of wrap it up. I'm seeing some nods and thumbs up, so folks who can't stay, feel free to head out. I want to address something that Cameron said before. Well, feel free to leave, but:
A: We should consider making sure that any stoppage of deployments is treated as a possible high-severity issue. I know that it's a natural reaction, when something happens, to stop and back off, but this is exactly what happens with the large volume of changes that we have: if we slow down a bit, we will just accumulate more changes, which brings the possibility of more angles of problems coming in.
A: So what we should be doing, all of us, right, development, ops, quality, is making sure that the number of changes we ship at once is small, but also that the changes themselves are small. This is why iteration is also really valuable: you can actually commit five or ten merge requests for the same thing, and the point is that five or ten different merge requests over a span of time are easier to handle than one big change that could cause an issue. So that's just a general comment there.
A: All right, we have corrective actions written out; I'll encourage everyone to read them. And I will take another action item on me, as the representative of the delivery team here, and that is looking into why the deployment bundle was bigger this time, according to Cameron. And then another thing is whether it's possible to safely increase the cadence of deployments while still depending on the tests: finding that middle ground where we feel comfortable, from both the quality and the infrastructure side of things, as well.