From YouTube: 2021-04-27 Delivery team weekly rollbacks demo - part 2
Description
Running a dry-run rollback in Production
B
Okay, so I think we've got enough people to go ahead and get started. I see the recording is running, so let's go ahead and get started. I've notified the on-call.
B
Just to validate our pre-change steps: we have the first item already checked, which I believe, Alessio, you did earlier today, but just to validate. Yeah, I just want to make sure things are still cancelled. Yeah, the post-deploy is still cancelled. Ensure there are no ongoing deployments: there is no ongoing deployment.
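As a rough sketch of how these pre-change checks could be scripted (the flag names and data source are assumptions for illustration, not the team's actual tooling):

```python
# Hypothetical pre-change checklist: every check must pass before the
# dry-run rollback starts. The two flags mirror the steps read out
# above and would come from the real deployment tooling.
def validate_pre_change(post_deploy_cancelled: bool, ongoing_deployment: bool) -> None:
    assert post_deploy_cancelled, "post-deploy pipeline must be cancelled first"
    assert not ongoing_deployment, "an ongoing deployment blocks the rollback"
    print("pre-change checks passed; safe to start the dry-run rollback")

validate_pre_change(post_deploy_cancelled=True, ongoing_deployment=False)
```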
B
Yeah, that's fine. Yeah, take a note. So, technically, the rollback command: we really wanted to state that we want to roll back to what's currently listed as current, so the fa1dd... package, and I need to find the rest of that package name. To do that, I'm just going to look at our announcements channel to figure out what it's supposed to be.
D
Can I ask a question, or... Please, please. How do we check whether there is a current deployment right now or not? What's the command for that? Is it from ChatOps?
C
No, I wouldn't... no, I mean, I think Ken was asking whether there is an ongoing deployment, and the auto-deploy status reply would not tell you that. Okay.
B
Yeah, so this is, I guess, just a language thing. From a technical standpoint, a deploy is still ongoing; we just paused it for the purposes of this test. But inside of gdelivery I just ran the auto-deploy status, and it tells us there's an ongoing deployment, and that's because we haven't finished the production deployment that is supposed to be rolling out. And sure enough, the last statement says the production environment is locked, and it's locked because we didn't finish our last deployment.
B
So I would answer yes to that question, but for the purposes of this experiment we paused the deploy that was rolling out, so we're safe to proceed.
B
That's a good thought. Sorry, meaning the last deployment to canary finished at 9:24 am my time. Oh, 13:24.
B
Ensure the pipeline is running in dry-run mode. The easiest way I know to check that is that we would have check mode set to true, and that would also be reflected in the output of the other jobs in this pipeline. Whoops, wrong one.
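A minimal sketch of how a job can honor a dry-run flag like this, assuming a CHECK_MODE environment variable (the variable and step names are illustrative, not the pipeline's real ones):

```python
# Hypothetical check-mode guard: log each step, but only execute it
# when CHECK_MODE is not set. In dry-run mode every step is a no-op,
# which is why the jobs below are expected to "do nothing".
import os
import subprocess

CHECK_MODE = os.environ.get("CHECK_MODE", "false").lower() == "true"

def run_step(description: str, command: list[str]) -> None:
    mode = "DRY RUN" if CHECK_MODE else "EXEC"
    print(f"[{mode}] {description}: {' '.join(command)}")
    if not CHECK_MODE:
        subprocess.run(command, check=True)

run_step("install rollback target", ["apt-get", "install", "-y", "example-package"])
```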
B
I'm expecting to see a play button. Is the play button going to show up after we get through the production finish?
A
If we are running in check mode, I would expect it to do nothing, basically.
B
The reason we run this job is that we need to tell the servers to update their apt cache, because otherwise they won't know the package exists. Since we're rolling back, that database should already be populated with the package we're going backwards to, so hypothetically we don't need to run this job at all during a rollback, in my opinion.
B
The only other benefit I could see us getting from this warm-up job is that we pre-download the packages prior to running the fleet job during the deployment, and that speeds up the fleet job, which, you know, it's gonna...
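As a sketch of what that warm-up might look like under the hood, assuming plain apt tooling (package name and version are placeholders):

```python
# Hypothetical warm-up job: refresh the apt index so the servers know
# the rollback package exists, then pre-download it so the later fleet
# job installs straight from the local cache.
import subprocess

def warm_apt_cache(package: str, version: str) -> None:
    subprocess.run(["apt-get", "update"], check=True)
    subprocess.run(
        ["apt-get", "install", "--download-only", "-y", f"{package}={version}"],
        check=True,
    )

warm_apt_cache("example-package", "1.2.3")
```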
B
The apt cache clean job, which is one of the finishing tasks, this one, is what technically removes the old package from the server. Since we're in dry run, I don't expect this to do anything, and considering the fact that the package is actively running and we haven't installed the older version, this shouldn't do anything. Yeah, on every deploy we clean out the apt cache, because otherwise our servers would fill up the disk space if we don't.
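A sketch of that finishing task with the same dry-run guard applied (again illustrative, not the real job definition):

```python
# Hypothetical apt-cache-clean finishing task: drop the downloaded
# .deb files after a deploy so the cache never fills the disk. In a
# dry run it is skipped, matching the expectation above.
import os
import subprocess

def clean_apt_cache(dry_run: bool) -> None:
    if dry_run:
        print("dry run: would run 'apt-get clean'")
        return
    subprocess.run(["apt-get", "clean"], check=True)

clean_apt_cache(dry_run=os.environ.get("CHECK_MODE", "false").lower() == "true")
```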
B
So, anything to talk about while we, while we wait?
E
One other thing that's unrelated specifically to this dry run, but related to rollbacks: just before joining here I was doing a bit of an epic tidy-up. We're going to try to get the pipeline-for-code-rollback epic to be basically this, and we can consider it complete when we've done a rollback in production and it's all gone well.
E
So, whatever our follow-up test after this is. Then I will move things around: we've got quite a lot of other issues, which are all about improving information, like dashboarding and commands, lots of those pieces. I'm just going to move those up onto the assisted rollback epic, and that epic will be the thing we use to take the rollback pipeline we have and sort of fill in the steps we need to go through until we become, I suppose, confident.
E
So it's filling in all the gaps of getting from here to the point where, in the event of an incident, we automatically check whether there is something that could be rolled back, and if there is, we roll it back. So, all of the bits that go in between those two things. I'll move some stuff around, but hopefully it means we're quite close to having the code pipeline epic completed, which is very exciting.
E
We just have the small, slightly tricky task, assuming this all goes as smoothly as we're expecting, of scheduling a rollback on production. It's a happy problem, though. It's a happy problem, exactly. Once we get to that stage, it might be worth us really carefully watching incidents, because we might get lucky and there'll be a suitable incident that we could just run a rollback on, and that would be by far the easiest way to do it.
E
Well, once we've done this dry-run test, we end up with an interesting problem, which is that in order to finish this task we need to do a rollback on production. The thing that's really hard about scheduling the test rollback is that we're taking a fully working system and potentially breaking it, but if we're in an incident, that risk is a little bit different, because we're taking a broken system and potentially fixing it.
E
So it kind of comes down to confidence, but I think in some ways it's less risky if we actually have something that's broken already and we're trying to roll back to fix it. The alternative, I guess, the other, safer way, is that we set up the rollback package and deploy a very specific package that we then roll back, which I think we should aim to do. But there's just so much scheduling around that, so if we could get an incident and test it on an incident...
E
That would certainly be the easiest thing to do.
B
I think the type of incident matters. If it's an abuse problem, then rolling back doesn't apply, and in fact it might actually make things worse, because we'd be sitting here telling the service to go do some tasks unrelated to processing a user abuse request. You know, I'd be wary of that style of rolling back at that point.
E
I know, but Robert's charts are super amazing and you have to ask nicely for them. But they will be, they will be part of the deployment SLO.
B
So let's go back to our job, or pipeline, and we should be starting the fleet then. Yay. So gitlab.com/help is currently running zero-fa, and we want that to say fa1 when all is said and done.
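A small sketch of watching for that version flip, assuming a plain HTTP endpoint that exposes the running version (the URL and the expected string are placeholders):

```python
# Hypothetical poll loop: keep checking the running version until the
# rollback target (here, anything containing "fa1") shows up.
import time
import urllib.request

EXPECTED = "fa1"  # prefix of the package version we rolled back to

def running_version(url: str) -> str:
    with urllib.request.urlopen(url) as resp:
        return resp.read().decode()

while EXPECTED not in running_version("https://example.com/version"):
    print("still showing the old version, waiting...")
    time.sleep(30)
print(f"rollback visible: running version contains {EXPECTED!r}")
```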
B
No, no. The reason we're doing a dry run here is just as an experiment prior to us actually doing this in production, in real life, as a way to catch any potential error scenarios that we may run into in the future. I think... all right.
B
Doing so will prevent any version differences from still showing up after a rollback is completed, and then it would enable us to roll forward in the future: go to canary, re-enable it, validate the change is actually working as expected, and then continue on to production again.
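As a sketch of that roll-forward sequence (the deploy and validate helpers are placeholders standing in for real pipeline jobs):

```python
# Hypothetical roll-forward after a rollback: re-enable canary with the
# fixed build, validate it there, then continue on to production.
def deploy(environment: str, package: str) -> None:
    print(f"deploying {package} to {environment}")

def validate_canary() -> bool:
    print("validating the change on canary")
    return True

def roll_forward(package: str) -> None:
    deploy("canary", package)
    if not validate_canary():
        raise RuntimeError("canary validation failed; do not continue")
    deploy("production", package)

roll_forward("example-package=1.2.4")
```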
B
No, I think this is what Alessio was really trying to get at: normally, if any issues happen that force us to do a rollback, it's usually the Rails code base and not the Gitaly code base that causes the problem. So in this particular case we made it a manual job, because the two are relatively good at being backward compatible with each other, so there's a chance of speeding up the rollback procedure by simply not doing the Gitaly jobs, which I agree with.
D
So we expect the gap between roll back and roll forward to be very minimal, right?
B
Do we? I do, though "quickly" is a stretch. You know, it'll probably take a solid hour plus to actually perform the rollback, but yeah.
B
So now we're just waiting on the Gitaly rollback, and I guess after this is completed I can proceed with re-enabling canary and then just finishing up, essentially.
B
So, Myra, you had the suggestion of ensuring that we maybe have a ChatOps command to validate whether or not we have ongoing deployments. Would you mind creating an issue for that?
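A sketch of what such a ChatOps check might boil down to (the Deployment shape and status values are assumptions for illustration):

```python
# Hypothetical "ongoing deployments?" helper a ChatOps command could
# wrap: anything still running or paused counts as ongoing, which is
# exactly the case this dry run hit with the paused production deploy.
from dataclasses import dataclass

@dataclass
class Deployment:
    pipeline_id: int
    environment: str
    status: str  # e.g. "running", "paused", "success", "failed"

def ongoing(deployments: list[Deployment]) -> list[Deployment]:
    return [d for d in deployments if d.status in ("running", "paused")]

current = [
    Deployment(101, "canary", "success"),
    Deployment(102, "production", "paused"),
]
for d in ongoing(current):
    print(f"ongoing: pipeline {d.pipeline_id} -> {d.environment} ({d.status})")
```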
B
Okay, "should we always string canary?" We decided that's always a yes.
B
That sounds informational; we need to add that. I'll create an issue to address adding the prepare steps that Alessio brought forward into our runbook. So I'll tackle creating the issues for those last two, and Myra, you've got the action to create an issue for that one. So...
A
Not sure if it is okay. Is it actually wrong? Because we do have... since we cancelled the deployment.
B
Yeah, all right. So Myra and I will create some issues, but as far as I know there's nothing blocking us from continuing forward with our next actual rollback.
E
Amazing, excellent, that's exciting stuff! What do you want to do with the deployment that we paused as well, Scovic? Do you want to just run the post-deployment migrations to...?
D
Can you explain what this retry post-deployment migration job is?