From YouTube: 2022-12-09 Managing Mixed-Version Deployments
Description
Discussing the current mixed-version environment setup and tests. Related to https://gitlab.com/gitlab-com/gl-infra/production/-/issues/8145
A
Okay, so: mixed-version testing. Yesterday we had an incident. I'll just mention a little bit of what's interesting; the incident itself has the specific details, but most generally what happened here was this: we had a failure on the staging environment which led to tests failing. Those tests were then quarantined with an MR, which meant a new package was created that needed to be deployed, right.
A
So here we have our pipeline. We had gone through this already, right? We'd already put the package through staging Canary, we'd already run the tests, and then we had our failure on staging.
A
What this meant is that when we got the new package, we basically needed that new package on staging in order for our tests to begin to pass again. The reason for that is that the staging Canary QA tests at step two are designed, as well as testing the new package and the new functionality, to test mixed-version compatibility.
A
So what does that mean? Mixed-version compatibility was a problem we were previously seeing, because our original pipelines deployed to staging, then to production Canary, and then to production. Now, when we have a canary environment and a main environment, they share a database.
A
So our production Canary and our production are using the same database, and what we had been finding was that occasionally, when we had a mixed-version incompatibility, users (and unfortunately it was usually users on production) were trying to use features where the database had been updated by the production Canary deployment and the code on production wasn't backwards compatible, so users would have problems. That was the original problem we tried to solve through this mixed-version project, and we solved it in two ways.
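The shared-database failure mode being described can be sketched roughly like this. It is a toy model: the function, table, and column names are hypothetical, not GitLab's actual schema or migration code.

```python
# Minimal sketch of the mixed-version hazard described above: Canary and
# main production share one database, so a migration shipped by the Canary
# deployment is immediately visible to the older code still running on
# main production. Table and column names here are illustrative.

def canary_migration(db):
    """The new package's migration renames a column."""
    db["users"] = [{"full_name": row["name"]} for row in db["users"]]

def old_production_code(db, idx):
    """Main production still runs the old code, reading the old column."""
    return db["users"][idx]["name"]

db = {"users": [{"name": "alice"}]}
print(old_production_code(db, 0))   # works before the Canary deploy

canary_migration(db)                # Canary deploys and migrates the DB
try:
    old_production_code(db, 0)      # old code now fails for real users
except KeyError as err:
    print(f"production error: missing column {err}")
```

The point of the sketch is that the breakage hits users routed to the older environment, exactly as described: the migration itself succeeds, and only the old code path fails.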
A
Right, that makes sense. So we got unlucky: it's a known edge case, but it's not a common one. We got unlucky with our incident because what we deployed on staging failed tests, and the recovery for that was a quarantine of the tests in a new package. We deployed that to staging Canary, where it was fine and happy, but the mixed-version tests were still hitting staging, which had the previous version installed, and we don't have a way on our current pipeline to actually say "progress to staging".
B
So yeah, I think that was a really good explanation. I have two comments slash questions. The first one is: we quarantined the test, which kind of tells me that the staging Canary smoke tests and the QA tests that interact with staging...
A
...are pretty much the same tests? Yes, they might actually be the same tests.
B
So I guess what I'm more or less confused about is: when that set of tests got to staging Canary, they should have been quarantined, and my understanding of quarantine is "don't run them".
A
Oh, I see what you mean. I don't know this for sure, but I think what may happen (and perhaps this is where one of the improvements is; this is a super good question, though) is that what the staging Canary tests are actually doing is triggering the staging tests, rather than there being one suite of tests that hits the two environments, which is actually what I thought. Yeah, maybe you're right.
B
Yeah, because once we're quarantining a set of tests, we kind of expected it to quarantine on staging Canary, but it didn't, because the staging QA tests were not quarantined at that point.
B
Why run that on the canary stage? Or, if we run it on the canary stage for staging, then why do we run it again after staging? Okay.
A
Yes, that's a great question, and that one I can definitely answer. On your first question: it's a super great question. I think what we probably want to figure out, or find out from Quality, is a little bit more about that structure. I actually don't know, but your question is a very good one, which is: how are those tests basically linked to the environments?
A
On your second question: we actually don't run those tests after staging anymore. The tests after staging have moved, and they now sit on the post-deployment migration pipeline.
B
Oh.
A
Is that what I understood? So your original question, around how the quarantine didn't end up affecting staging: I think it's a really interesting one, but my understanding is that these tests are hitting both the staging Canary and the staging environments. Basically, yes, there are two suites in there (or there are various different suites), and they are designed to test a new package and also to look specifically for mixed-version problems that would be visible on staging, right.
A
You know, Zeph was super involved in designing these tests and he led all the test designs. So he'll certainly be able to explain, or point us to the documentation on, how things are set up and what we can expect there, yeah.
B
My second point is: it kind of smells like, what's it called, how we use a pick label and then it does different things with the pipeline. That kind of feels similar as a solution to this, but it also feels dangerous...
B
If
we
don't
know
what
it
does
so,
for
example,
I'm
thinking
of
a
label
that,
like
I
don't
know,
say
like
quarantine
tests
as
a
label
and
that
Mr
gets
merged
and
because
it's
like
I,
don't
know
in
an
incident,
it
would
deploy
immediately
to
Canary's,
staging
and
staging
in
order
to
get
those
quarantine
tests
passing
on
both
environments,
but
then,
like,
if
say,
like
the
release,
honors,
who
doesn't
know
that,
then
they
can
like
promote.
Another
package
that
possibly
went
through
have
a
really
weird
staging
deploy
going
at
that
point.
A
I don't know, probably something in that for sure, yeah. Somebody mentioned in the incident channel (I can't remember who it was) that this possibly ties a little bit as well to some work that Engineering Productivity are doing: they're working at the moment to allow only the tests you need to run to be running. So when you do a merge, you wouldn't necessarily run every test; it would detect what you've actually been changing and run the right tests.
A
This is an option, isn't it? Oh, when you said that, you also reminded me of one other thing you mentioned very near the beginning of the call which is relevant to this: we resolved this incident by deploying directly to staging, so deploying the package we had on staging Canary directly to staging.
A
We did take a little bit of risk with that, because it was a package with other changes in it which hadn't fully passed testing. Now, a slightly less risky option, and the one that Myra was sort of helping investigate, was whether we had a package which was a previously tested package with just the quarantine MR on top of it, which I think we probably did have somewhere, because the pick label was used. That would have been another way.
A
If things had been super, super risky, we would have had an option to roll staging back from the version you deployed, so we could have recovered staging that way, right? Because the deploy you did was completely safe there, so that bit wasn't super risky at least, yeah.
A
Exactly, yeah. There was definitely some mixed-version risk, because what you lost was the mixed-version testing, basically, because we stuck the same package version onto both environments. So you did lose a little bit of testing, but overall the risk wasn't massive, and I think, when we'd been blocked for that many hours, it was a decent trade-off.
A
We had the introduction of the mixed-version tests, which Zeph from Quality led, and those two pieces together should be giving us some mixed-version protection.
A
And this is one of the things that now gives us a few extra complexities. So, for example, all of this mixed-version testing is the reason why, when we roll back staging, we then put it back onto the right version, because otherwise our mixed-version testing is testing against the wrong version. What we really need is for staging Canary and staging to end up with the same two versions that will be on production Canary and production, because that's basically what we're testing for.
A
So that's the reason why we roll back staging and then redeploy onto it as well, to get that versioning. It also links a little bit to another issue which we have going, about how much locking we need to put around the post-deploy migrations environments. Again, it's a similar thing, which is: do we have enough protection over the staging version to be able to make the mixed-version testing dependable?
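The invariant being described, that the staging pair must mirror the versions the production pair will have, can be stated as a small check. This is only a sketch; the function name and version strings are made up for illustration.

```python
# Sketch of the version-matching invariant: mixed-version testing on the
# staging pair only predicts production behaviour when both pairs run the
# same (new, old) version split. Version strings are illustrative.

def mixed_version_signal_dependable(staging_canary, staging,
                                    prod_canary, prod):
    return (staging_canary, staging) == (prod_canary, prod)

# Staging mirrors what production will look like mid-deploy: dependable.
print(mixed_version_signal_dependable("15.7.1", "15.7.0",
                                      "15.7.1", "15.7.0"))  # True

# Staging was rolled back without redeploying the right version: the
# pairs diverge, so the mixed-version signal is no longer meaningful.
print(mixed_version_signal_dependable("15.7.1", "15.6.9",
                                      "15.7.1", "15.7.0"))  # False
```

This is why the roll-back-then-redeploy step matters: rolling staging back alone satisfies the tests but silently breaks the invariant the tests exist to check.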
B
Yeah, that actually reminds me, and this is kind of going back to investigating what happened during the 24 hours this was down for, basically. My question would be: this package got deployed to staging, the one with the failing set of tests? Yes?
A
I believe so, yes. I think it does, I think it does. This one was a bit unusual, because I think we knew quite quickly the cause of the original failures, which was down to not having enough resources available in whatever the projects were.
A
The tests couldn't create the repositories they needed, yeah. So Quality spent quite a lot of time cleaning up, which was all good, but that just wasn't enough to restore the stability of this test, basically, which was why it got quarantined. So it was kind of like two efforts to recover from there.
B
I see, yeah. I'm looking at this; it's the QA "reliable" one out of ten? I'll take a closer look at this later. But I guess, for the next time: this was kind of a special case, because we were blocked for so long and we kind of wanted to get packages through, but I think the next time we run into this, the correct path is actually to roll back staging instead of deploying this. Yes.
A
Is this a thing that we can roll back from? Because you're right, actually, yeah. Maybe we need to be rolling back staging a little bit more for these things, because actually, if this had been a quarantine MR for a test that had come from a software change, you're absolutely spot on: we should have brought staging back and got back to the right versions that way.
A
Do you know where we might... actually, I know we were able to do a push, that's fine. It's almost an example, though, for a hot patch. I wonder if we should have hot patched staging. How would you hot patch? So, we have a different process: a whole different pipeline, a whole different tool, where you can basically generate a project link and, effectively, a code patch that bypasses our deployment pipeline and gets applied directly onto an environment.
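The two delivery paths just described, a gated pipeline deploy versus a hot patch that bypasses the gates, can be modelled as a toy sketch. All gate, environment, and function names here are hypothetical, not the actual tooling.

```python
# Toy model of the two paths: a normal deploy must pass every pipeline
# gate first, while a hot patch is applied directly onto an environment,
# bypassing those gates. Names are made up for illustration.

PIPELINE_GATES = ("staging-canary", "qa-tests", "staging")

def normal_deploy(package, environment, gates_passed):
    """A package only lands once the full pipeline has passed."""
    if tuple(gates_passed) != PIPELINE_GATES:
        raise RuntimeError("package blocked: pipeline gates not all green")
    environment["code"] = package

def hot_patch(patch, environment):
    """Applied straight onto the environment, no gates: fast, but it
    skips the usual testing, which is why the process needs rehearsal."""
    environment["code"] = environment["code"] + [patch]

staging = {"code": ["v1"]}
try:
    normal_deploy(["v2"], staging, gates_passed=["staging-canary"])
except RuntimeError as err:
    print(err)                      # the stuck-package situation

hot_patch("quarantine-fix", staging)
print(staging["code"])              # ['v1', 'quarantine-fix']
```

The sketch mirrors the trade-off in the discussion: the hot patch unblocks the environment immediately, at the cost of the guarantees the bypassed gates would have given.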
A
And this is something which is a bit dormant at the moment, and what we need to introduce for our hot patch process is a bit similar to our rollbacks practice.
A
We need a way where we run these semi-regularly or regularly, so that we can actually make sure they're still working as expected, and so that everybody knows it's an option and how to use it. We don't do it on production except for an S1, but actually, if we hadn't been comfortable... say, for example, yesterday the next package headed for staging had like 500 changes in it, and one of those looked super risky, or for whatever reason we decided it looked too risky to miss those tests...
A
Yeah,
we
probably
wouldn't
have
been
to
roll
back
able
to
roll
back,
because
I
said
I,
don't
actually
think
it's
a
new
test.
I
think
it
just
became
unstable,
but
we
should
have
validated
that
and
then
our
final
option
would
be
to
Hot
Patch
onto
staging
and
store
the
test.
That
way.