From YouTube: 2021-02-23 Delivery team weekly rollbacks demo
E
Just before we go, are you going to go ahead and do a full rollback?
E
This is Big Sur and Zoom. I read all of Reddit the other day... well, quite a lot of it. Apparently there's a setting inside video sharing where you can set the thing that will hopefully fix this, which is...
E
It doesn't make a difference, I don't think. There's a thing inside your settings, under screen share, which apparently, if you click "use TCP connection for screen sharing", is meant to fix this. I haven't tested it yet, though.
C
Okay, so I think there's still some annoying work there that we can automate to make things easier.
B
Yeah, there's also another thing that I put as a note in the runbook itself, which is that it marks every type of migration as potentially unsafe.
B
So I mean, obviously it is potentially unsafe, but we should not put the red mark on when there are only regular migrations, because that is exactly the same situation in which canary and production are already running. We know this; we keep repeating it every week. Maybe we know that this is safe, but I would like the tool to be clearer in this case.
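A minimal sketch of the check being asked for here: only raise the red mark when a package actually contains post-deployment migrations, instead of flagging every migration. The names (`Migration`, `rollback_mark`) are hypothetical, not the release tooling's real API.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Migration:
    name: str
    post_deployment: bool  # regular migrations run with the deploy; post-deployment ones run after

def rollback_mark(migrations: List[Migration]) -> str:
    """Only post-deployment migrations earn the red mark; regular migrations
    leave canary and production in the same mixed state they already run in
    every week, so they should not trigger it."""
    if any(m.post_deployment for m in migrations):
        return "RED: package contains post-deployment migrations, rollback may be unsafe"
    if migrations:
        return "OK: only regular migrations"
    return "OK: no migrations"

if __name__ == "__main__":
    print(rollback_mark([Migration("add_index_to_users", post_deployment=False)]))
    print(rollback_mark([Migration("backfill_user_data", post_deployment=True)]))
```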
B
We forgot to write them down, so I just want to make sure that we keep track of what we decide to do, because I'm quite sure we decided to do other things as well, but I don't remember what.
E
On the board, and those were the main things we decided last week, so that's still true. I think we had a lot of release fun last week instead, kind of related to those issues, the deadline, their...
E
...actions. So I've lost my English, which is a bad situation. So, on the main epic, let me actually see if I can share this.
E
So, within our epic we have this "pipeline for code rollback" piece. What I did last week, as part of other epic updates, was to set a due date of March the 5th, which is next Friday. No... yes, next Friday. Does that feel like a vaguely reasonable time? We're pretty close, so we've got "rollback documentation improvements", which is basically just the things we decide here.
E
The main... we've got two big pieces, right. One is adjusting the pipeline for Gitaly and Praefect, standardizing the process and investigating the diff between the auto-deploy packages; and then we've got, I think, Alessio, you might have already done it, investigating the rollback feature. We need to write some notes there. So does that March 5th sound like a reasonable date to aim for, and if so, do we need to take some actions to actually get these items complete?
B
I think this is enough, because then we discussed the thing that on the real pipeline, once we want to automate this, we want to double-check the status on the database; but I would not count this as part of this epic, because it's another topic. So I don't know if others are aware of something that we are actually missing here, because otherwise I think this is done.
E
One piece I wasn't sure on (it doesn't have to be part of this, but I know we touched on it a little bit at times, and I don't know whether we've actually made a decision) is: what do we do with in-progress deployments?
B
It's a good question. I don't believe we have enough automation and data points right now to figure this out, because it really depends on where it broke, and why.
B
So I would consider the documentation... yeah, I would consider this as kind of a documentation thing to put in the runbook with some ideas. But I mean, it's more about: if it broke in the middle, think very carefully about what you're doing. And then we can provide some data points, like "before post-deployment migrations, do this", or... it really depends on what broke.
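As a sketch of what those runbook data points might look like once written down (purely illustrative; the stage names and recommendations are assumptions, not the actual runbook content):

```python
# Hypothetical runbook data points mapping "where the deployment broke" to a
# first recommendation; the stage names and advice are illustrative only.
FAILURE_HINTS = {
    "before post-deployment migrations": "Rollback is still a normal option; no schema changes beyond regular migrations.",
    "after post-deployment migrations": "Rollback is risky: the old code may not match the new schema.",
    "mid-fleet": "Mixed state: part of the fleet runs the new version; check node and load balancer state first.",
}

def runbook_hint(stage: str) -> str:
    return FAILURE_HINTS.get(stage, "Unknown failure point: think very carefully and pull in an SRE.")

if __name__ == "__main__":
    print(runbook_hint("before post-deployment migrations"))
```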
B
Yeah, I think there's a bit of tribal knowledge around this, and probably Scarborough and Henry are the best ones to try to figure this out. So I'm just trying to remember the discussion that I had with Jarv. There is one point about the load balancer: the thing is that at the beginning of each batch of the deployment, we remove machines from the load balancer, so it depends on where and how they broke.
B
It may require some manual action before we can actually even roll back, because when we deploy the old version again on that part of the fleet, the machines may be in the wrong state, so the pre-check may not allow us to deploy. So yeah, I think that we should defer this to our SREs, since they know better what's happening there.
A
Yeah, so, as Alessio noted, we remove a node from the load balancer. Well, let me back up. We check the status of all our nodes before the pipeline begins. If they're in maintenance mode, things are fine: what we'll do is leave the node as is, deploy to it, and we won't put the node back into the load balancer, on purpose.
A
Ansible will stop at that moment in time because the health checks are failing: it's expecting the node to be up inside the load balancer, but it's not, so Ansible will fail at some point. If it's the first node, we know that something is probably very bad with the package we just installed. If it happened in the middle, which has occurred sometime in the last few weeks because of a bug in the Ruby garbage compaction process when we upgraded Rails, that was a sporadic situation.
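A rough sketch of the behaviour just described, assuming node state is queryable; the function and field names are made up for illustration, not the actual Ansible tasks:

```python
from dataclasses import dataclass

@dataclass
class Node:
    name: str
    in_maintenance: bool
    healthy: bool

def install_package(node: Node) -> None:
    print(f"deploying to {node.name}")  # placeholder for the real deploy step

def remove_from_lb(node: Node) -> None:
    print(f"removing {node.name} from the load balancer")

def add_to_lb(node: Node) -> None:
    print(f"adding {node.name} back to the load balancer")

def deploy_to_node(node: Node) -> None:
    """Nodes already in maintenance mode get the package but are deliberately
    left out of the load balancer; on any other node, a failing health check
    stops the run, roughly where Ansible gives up in the story above."""
    if node.in_maintenance:
        install_package(node)
        return  # on purpose: do NOT re-add the node to the load balancer
    remove_from_lb(node)
    install_package(node)
    if not node.healthy:
        raise RuntimeError(f"{node.name}: health check failed, stopping the deployment")
    add_to_lb(node)

if __name__ == "__main__":
    deploy_to_node(Node("web-01", in_maintenance=False, healthy=True))
    deploy_to_node(Node("web-02", in_maintenance=True, healthy=False))
```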
A
In that particular case we had to go manually remediate that node. I think in those types of situations it's better for us to figure out what the failure scenario is and determine if a rollback is legitimately worth pursuing, because if it's something like a Ruby compaction problem, you're probably going to want to roll back. We did not do that in the previous cases, but it may have been the wise solution in that case; we would need to get the node back online, though.
A
Obviously, there's a flag we could set that ignores that, but I forget the name of the flag, something with the word HAProxy in it. But it's very dependent on the failure, what's involved in that failure, and where we are in the deployment process. I don't know how we can make a runbook that says "if this situation occurs, do this", because it might be different, so we'd have another if clause, and that may be different, and we'd have another...
B
Let's start by just documenting the gating conditions: at the beginning of the pipeline we expect this and we do this, and at the end we do that. If something happens in between, then we are in a mixed state, and later on we will figure out what to do. And I was also thinking about another flag, because you mentioned flags: there is also the omnibus "skip chef throw", or some variation of those terms.
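A sketch of the "gating conditions" idea: write down what is expected at the start and what is guaranteed at the end, so anything in between is a known mixed state. The concrete conditions below are assumptions standing in for whatever the team actually writes into the runbook.

```python
# Illustrative gating conditions for one pipeline run; the items are invented.
PRECONDITIONS = [
    "all nodes report the same (old) package version",
    "no other deploy holds the chef-role lock",
    "every node is healthy or explicitly in maintenance mode",
]
POSTCONDITIONS = [
    "all nodes report the new package version",
    "the chef-role lock is released",
    "every non-maintenance node is back in the load balancer",
]

def describe_state(pre_ok: bool, post_ok: bool) -> str:
    """Classify a run the way the turn above suggests documenting it."""
    if pre_ok and post_ok:
        return "clean run"
    if pre_ok:
        return "mixed state: failed somewhere in between; locate the step before rolling back"
    return "preconditions not met: do not start, investigate first"

if __name__ == "__main__":
    print(describe_state(pre_ok=True, post_ok=False))
```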
B
So what I'm thinking is... yeah. If something bad happens during a deployment, the rollback will fail because of chef, I think, because it's the same thing that happened when the staging deployment failed: the new package is ready, the staging deployment starts, and the pre-check fails because there's a mismatch in... I don't know, you know better than me, but does it?
A
Yeah. So if they deploy to staging and it gets all the way through, but, say, QA fails; for example, we miss the need to swap a flag in omnibus. So the next time the deployer comes around, what it wants to do is validate that it's not stepping on top of an existing deploy. We use a chef role to set a flag to prevent it, as a locking mechanism to prevent two deploys from running at once.
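A toy model of that locking mechanism, with invented names; the real implementation lives in a chef role flag, not in Python:

```python
from typing import Optional

class DeployLockHeld(Exception):
    pass

class ChefRoleLock:
    """Toy stand-in for the chef-role flag that acts as a lock between deploys."""

    def __init__(self) -> None:
        self._locked_by: Optional[str] = None

    def acquire(self, pipeline: str) -> None:
        # After a failed deploy the flag is still set, so the next pipeline's
        # pre-check lands here and refuses to start: the situation described above.
        if self._locked_by is not None:
            raise DeployLockHeld(f"deploy lock still held by {self._locked_by}")
        self._locked_by = pipeline

    def release(self) -> None:
        self._locked_by = None

if __name__ == "__main__":
    lock = ChefRoleLock()
    lock.acquire("gstg deploy #1")
    try:
        lock.acquire("gstg deploy #2")  # second deploy is blocked
    except DeployLockHeld as err:
        print(err)
```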
B
Yeah, but I mean, I suppose in a rollback situation, where we know that this may happen and we are really in control of what is happening, probably we want to have all the information up front. So let's just say these are drawbacks to rolling back. And the other thing, and this is a question: I'm aware that these things happen in canary and staging, but do we have the same thing for production? Because we don't run QA at the end.
A
I guess I could share my screen, just so you guys see what I'm talking about... share. So this is just looking at the latest staging deployment. In this case, gstg version is the name of the job that runs: we do a "set version" and we set it to the name of that item, and then we do a "set omnibus updates", which sets a very simple flag to true. That's just modifying our gstg omnibus version role, setting a flag to enable it, and that thereby removes the lock that this pipeline had in the way.
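Putting the two jobs just described into a sketch; the job names ("set version", "set omnibus updates"), the role layout, and the version strings are all reconstructed from the description, not verified against the real pipeline:

```python
# Approximate shape of the gstg rollback steps described above.
def gstg_version(target_version: str, role: dict) -> None:
    role["version"] = target_version   # "set version": pin the package to roll back to
    role["omnibus_updates"] = True     # "set omnibus updates": flip the flag, releasing the lock

if __name__ == "__main__":
    role = {"version": "13.9.x-new", "omnibus_updates": False}  # invented placeholder versions
    gstg_version("13.9.x-previous", role)
    print(role)  # {'version': '13.9.x-previous', 'omnibus_updates': True}
```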
B
So I think that, in terms of understanding what we need to do, even just starting with something simple: if something broke during a deployment and it was deemed that rollback is a good option, that's something that we want to do, because otherwise we're just wasting extra time; the check will start and will fail, and it takes a lot of time to fail, because unless you're watching it and see the failures coming, it just retries for a very long time.
E
Yeah, what do we... like, is it worth us... I guess once we've run it on staging and we're happy with it, maybe it's a case of us then starting to run the "check for post-deployment migrations" thing when we see incidents, because if we have an incident, we might be able to find an opportunity to do a rollback.
E
Because a good thing about an incident is that we're starting from "something's broken". Or... no, I guess we could just roll back something, right? There's no reason why we couldn't.
B
We're gonna have a fake... I mean, we can schedule something so that we do this in a safe environment, right? So we coordinate with development, we do a change, which is, I don't know, something in the help page, something really trivial but that you can visually see.
A
...to our documentation, so theoretically we should be okay to go. What I don't want to do is interfere with any potential upcoming releases, specifically a security release and that kind of thing. If we can avoid it, I think it'll be easier to schedule in that particular case.
B
So what about we do this, but we ask the release managers to... I mean, we can also do it as a team, it doesn't matter, but then we roll forward again right after. Because as long as we play on staging, who cares; but on production, maybe customers see something, new features or whatever, and then it disappears. Maybe there's some engineer that is testing something saying "yeah, it finally landed in production", but it is no longer there. So maybe, as...
A
...long as we communicate up front, we advertise externally as well. So maybe we should get the CMOC involved for part of this exercise, because that would be kind of disruptive for some people, depending on the change.
B
Yeah, exactly, so March 4th means March 9. Could it be... no, yeah, the 9th.
B
It doesn't make sense... I mean, again, we check the status of everything when we start; maybe we can't do it, maybe there's an incident, whatever, right. But I would... yeah, maybe we can say that starting from March 9, or maybe, yeah, the 9th or the 16th, because then there's a release day and there's a release another time; so the first occasion we can run it, we run it.
E
I'll begin sharing comms on that around infrastructure, so that if anyone's got any concerns about it we can start addressing them. But I think, from what we've seen in staging, it looks like either it doesn't work, or it's a pipeline and it rolls through, so... nice.
E
That's a great idea. Yeah, so should we find... like, it's going to be like two hours, right, to roll back prod? Is that right? I'm assuming it's roughly the same as a deployment, like a little shorter, but not by a huge amount, right?
D
Maybe we should also make a note on the production calendar, because, I mean, we do something in production, right, and I'm sure the SRE teams and Brent would like to have some notification up front and, you know, be in the picture. I'm not sure about what announcements we should do; I mean, we constantly deploy something without announcements, so I guess it's fine, but at least a note in the production calendar is maybe a good thing to have.
E
Yeah, that's great. Oh yeah, let me put an issue together, because I think it would be good for us to have a bit of a checklist of all the pieces that we...
E
...need to make sure we do before that. And then, like, I doubt it'll be the last time we do a test like this, so we can copy it later if needed.
E
Later... the same. Cool, so was there anything else people wanted to cover today?
A
GitLab doesn't have a feature to accept webhooks from someplace?
D
We had this issue today where a token was removed, right, or even a Slack application, which broke the SRE on-call and the release manager notifications. I mean, there's the place where we have those kinds of web services hooked up, I think serverless functions in Node.js or something, and if you could replace it with something that's just easier to understand, that would already be an improvement. And I know that Craig Furman is working on the incident management...
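On the "replace the serverless functions with something easier to understand" idea, a minimal sketch of a webhook receiver using only Python's standard library; the path and payload handling are invented for illustration:

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

class WebhookHandler(BaseHTTPRequestHandler):
    """Tiny webhook receiver: the kind of small, readable replacement
    discussed above. The incoming paths and payloads are hypothetical."""

    def do_POST(self) -> None:
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length) or b"{}")
        # Here you would fan the event out to on-call / release-manager
        # notifications instead of just printing it.
        print(f"received webhook on {self.path}: {payload}")
        self.send_response(204)
        self.end_headers()

if __name__ == "__main__":
    # Serves until interrupted; POST a JSON body to http://127.0.0.1:8080/ to try it.
    HTTPServer(("127.0.0.1", 8080), WebhookHandler).serve_forever()
```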
E
Yeah, for sure, nice. Okay, well, let's keep that in mind: if someone wants to think through that, then go for it, or we can keep it in mind for a future iteration. Awesome. Is there anything else that anyone would like to cover for rollbacks?
E
No? Okay, awesome. I love the fact that this has become quite routine quite rapidly, in just a few short weeks, so that's pretty exciting. So yeah, onwards towards production. Cool, all right, enjoy the rest of your day, speak to you soon.