From YouTube: 2021-03-30 Delivery team weekly rollbacks demo
Description
Testing out a staging deployment when we have nodes in DRAIN state
B
Yeah, I'm still waiting for the arm to arrive. It should be here tonight, so for now it's sitting on this thing, which feels kind of voyeuristic and creepy, yeah.
B
It's a bit tricky, because the cable is bent in such a way that if I put it more to the right, it will sort of yank the camera down. Let me see if I can maybe fix it.
C
So, do you want to kick us off, Robert?
C
So, over to... over to the...
B
Demo... hold on, give me one sec. Actually, I'll just turn it off, because I don't think I'll need it with the screen showing anyway.
B
Get out of the way so I can go. I will share my screen and just stop this one. There you go.
B
Did I close the rollback demo? Pretty sure I had it in my history. Apparently not; let me get the calendar.
B
All right... first engineer... that's not my username. Let me fix that.
B
And let's see. First steps: read through some of the rollback scenarios. Today there are... so, yep, that I read. Okay. Testing steps: engineer, communicate to Infrastructure and Quality that we're going to test a rollback in staging.
B
Let's do that up front.
E
So I will just let you know when it's ready. Do we want to do more than one server, or is one fine? I think one is fine, because we bailed pretty quickly, yeah. That's what I thought. Okay, so if I do a get server state in staging for api...
B
So I guess that's the previous SHA, or say, the previous package name. Copying it.
B
Perform a rollback... I'll run that in delivery this time.
B
I mean, in theory that applies to many things; I mean, it should, yeah. I kind of think, looking at this output from the check command: it shows things like upcoming, current, previous, which makes sense. Previous package also makes sense. I'm just kind of curious if there's anything we could do to make it more obvious that the previous package line is what we want to go in that rollback command.
B
Maybe one approach would be to add a snippet of the command there, so it's like: previous package, version number, and then, like, /chatops run deploy rollback with that version number in it. That way you can just copy that entire command and run it, instead of copying this sort of template, then replacing the placeholder with the actual version, and then running it.
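A minimal sketch of that idea, in Python; the field names, version strings, and the exact rollback command flags are assumptions for illustration, not the actual tooling:

```python
# Sketch: render the "previous package" line of the check output so it
# already embeds a ready-to-run ChatOps command. Field names and the
# --version flag are hypothetical.

def render_check_output(state: dict) -> str:
    return "\n".join([
        f"Upcoming version:  {state['upcoming']}",
        f"Current version:   {state['current']}",
        f"Previous version:  {state['previous']}",
        # Print the whole command, not just the package, so responders can
        # copy-paste it without editing a placeholder template first.
        f"Previous package:  {state['previous_package']}",
        f"  Rollback with:   /chatops run deploy rollback --version {state['previous_package']}",
    ])

# Example with made-up version strings:
print(render_check_output({
    "upcoming": "13.11.202103300620",
    "current": "13.11.202103292220",
    "previous": "13.11.202103291620",
    "previous_package": "13.11.202103291620-a1b2c3d.4e5f6a7",
}))
```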
E
Decent so far, yeah. Would there be a way... I don't know what Slack's blocks allow us to do, but I'm wondering if we can make that like a little hidden option that we only see when we click on some link to show it, or something; kind of like an arrow that hides it.
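Slack's Block Kit has no native collapse/expand, but a common workaround is a button whose action handler updates the message to reveal the extra text. A rough sketch of such a payload; the action ID, wording, and version string are hypothetical:

```python
# Sketch: keep the rollback command hidden behind a button. Block Kit has no
# built-in collapsible section, so the handler for the button would call
# chat.update on the original message and append the revealed section.

collapsed_message = {
    "blocks": [
        {
            "type": "section",
            "text": {"type": "mrkdwn", "text": "*Previous package:* `13.11.202103291620`"},
        },
        {
            "type": "actions",
            "elements": [
                {
                    "type": "button",
                    "text": {"type": "plain_text", "text": "Show rollback command"},
                    "action_id": "show_rollback_command",  # hypothetical ID
                }
            ],
        },
    ]
}

# Block the action handler would splice in via chat.update:
revealed_block = {
    "type": "section",
    "text": {
        "type": "mrkdwn",
        "text": "`/chatops run deploy rollback --version 13.11.202103291620`",
    },
}
```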
C
We do. So does DRAIN basically mean that there's just no traffic, right? Like it's just out of use?
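For reference, DRAIN is one of HAProxy's administrative server states (ready, drain, maint): a drained server gets no new traffic but stays registered in the backend, while maint takes it out of use entirely. A minimal sketch of inspecting and changing those states over the admin socket; the socket path and backend/server names are assumptions:

```python
import socket

# Path to HAProxy's admin socket; an assumption for illustration.
HAPROXY_SOCKET = "/run/haproxy/admin.sock"

def haproxy_cmd(command: str) -> str:
    """Send one runtime API command and return the full response."""
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as sock:
        sock.connect(HAPROXY_SOCKET)
        sock.sendall(command.encode() + b"\n")
        # HAProxy closes the connection after answering; read until EOF.
        chunks = []
        while chunk := sock.recv(4096):
            chunks.append(chunk)
    return b"".join(chunks).decode()

# Dump the administrative state of every backend server.
print(haproxy_cmd("show servers state"))

# State transitions: drain (no new traffic), maint (fully out of use),
# ready (back into rotation). Backend/server names are hypothetical.
# haproxy_cmd("set server api_backend/api-01 state drain")
# haproxy_cmd("set server api_backend/api-01 state ready")
```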
C
I had a discussion item that maybe we could jump to. We've got the epic for 11, and this is the pipeline for code rollback we've been working towards. Now, I think, sure, we haven't got an SEI, but I think we can see how far we can get anyway. But the exit criteria for this epic are sort of complete as they're written, and the due date is today; and actually I don't think the epic is complete, based on other issues and other things we've got.
E
I don't... I really don't want to wait till an incident occurs for us to test it. I want to test it when we're all calm and collected, there are no incidents, we're all expecting this, it's okay! If we want to create a test package to be safer, sure, I don't care. I just don't want to wait for an incident to test.
B
Yeah, I kind of agree there. Like, saying "oh, we can do rollbacks" just because we can in staging might be a little too optimistic.
B
I do think, if we do a production rollback the first time, we might want to pair it up with a production change lock, just so that we can actually pick a day knowing there are not going to be, sort of, self-induced incidents. There could always be the sporadic random ones, but at least we'd have a day where we know: oh, we're not going to introduce a new database migration that's going to take two hours, for example. Beyond that, I think, so far, yeah: so far so good.
C
So the question mark we had previously, when we started thinking about whether we can go to production, is: what are all the ways that this can fail, and how will we be able to recover from it, so that when it does fail in production it's not the first time we've ever seen it? So from the stuff that you've been looking at over the last week or so, Mira: any other failure cases that you came across whilst you were going through that stuff?
F
Yep, sorry. So basically, one that I have been thinking about is that we need to roll back and there is a deployment in progress. So in that case, what do we do? Like, so far our option is to basically cancel the ongoing deployment, but we don't really have a way to tell when it is safe to cancel. Like, we can click that cancel button, but at least I have no idea what is going on with all those logs.
F
Those are not really readable for me, so it would be nice to have some sort of tool, like we were talking about last week, to safely trigger that on Slack, and then the tool would analyze when it is the safe time to cancel. But, that aside: once we cancel a deployment in progress, we need to know when it is a safe time to do it, and then what are we supposed to do? Are we supposed to roll back immediately?
F
Do we need to do something with the servers? That would be another scenario that would be nice to have before actually testing this in production.
C
Yeah, okay. So we have an issue for, like, the automated bit, but I think there's a step before that, which is... like, we can't automate something right now, right, because we don't know what we're automating. We need to have a safe way of knowing when to cancel.
B
One issue unrelated to that: we had, look, api-01 marked as drained, and it seems to have rolled back just fine. Double-checking: was that the expected behavior, or were we expecting it to fail?
E
Yeah, either the prepare job, or Ansible when it started the deploy, would have been like: it's already in DRAIN, this is a problem. But we didn't see that, so that's a good thing. And I just did a get server state, and it looks like that node is in the rotation again; so it looks like the rollback indeed completed, just like we wanted it to.
C
Very nice, that's great news, great news. When deployments fail, or are cancelled... I don't know if those end up the same... do things get left in DRAIN state?
E
Say a server ran out of disk space when it was installing the package: that one server would have been marked as failed in Ansible, and the run as a whole is going to stop at some checkpoint. But I don't know what that checkpoint is; maybe when that task completes.
E
We could retry the job at that point safely. If the failure occurs outside of that, like if it fails trying to push a node or take a node out of rotation, then nothing bad technically happened to the node itself; the deploy wouldn't have continued, because HAProxy was the failure case.
C
Okay, that's good news, right? So, from this rough sort of outcome, it looks like the rollback pipeline handles DRAIN state very nicely.
C
Which should, in theory... like, we don't know how to cancel a deployment, but you know, that's not really a blocker for rollbacks, right? We have a slightly annoying state where, if we see a deployment going through and problems start to happen, we don't quite know what to do; we should find a solution for that, but it's not really a rollback problem.
C
What about if we thought about this in the... so, the goal we're going against is improving mean time to resolution, right? And I think Euric has been painfully aware this week, and I know Robert has in recent weeks, that this number is quite high, right? So actually, what if, for now, we just made sure that if we ever had to cancel, like in the theory we're talking about here, we just cut it before the post-deployment migrations? Right? So: oh wow, everything looks terrible, and we're deploying on the fleet at the moment...
C
Yes, it means we are potentially rolling out bad things to the whole fleet, when maybe we could have only put them on half the fleet or a third of the fleet; but it's certainly not making it worse than it currently is.
E
I think that might be a little naive, because if we allowed it to continue up until post-deployment migrations, and things are getting worse even after, say, the first two nodes deployed, we're going to make it worse until we get to the step where we could actually watch the current set of jobs complete and then start the rollback process.
E
We've stopped it, just like you said, at post-deploy migrations, just because, you know, an incident started and we wanted to have some checkpoint, so that we could have a place where we understood where our environment stood.
E
I think the one time I was involved in that, we ended up just allowing the deployment to continue, because it was discovered it was something completely unrelated to the deployment, but...
B
There was actually a case, I think a day or two ago, or yesterday maybe even... maybe it was Friday, I don't remember... a deploy was going on, or at least it felt so; it later turned out to be a dry-run pipeline.
B
An incident popped up and people, I think it was Skyward actually, said: hey, can we, you know, pause the deploy while we're dealing with the incident? And my response there was basically: pause how, and when? Like, can I just cancel the pipeline, or wait? And then the result is: okay, we have to wait until, I think, post-deployment migrations are done, or until Gitaly is done. But because it was a dry run, it happened so fast I wasn't able to respond in time.
B
So there is a need for this every now and then, I think, but it is quite rare.
B
Back... let's see: rollbacks are done, QA is now running. Do we want to await the QA results?
C
Cool, awesome. I'd add... I asked as well... this is completely disconnected from what we were just talking about, but it's just what's in my mind; I'll add it to the epic. I checked up on the QA tests, and we don't run them as part of the production deployment pipeline, but there's no reason not to have these on the rollback pipeline. So it feels to me, at least, that a good use of a bit of extra time following a rollback is to run these tests.
B
Yeah, it seems QA was just skipped here. At least... QA orchestrating, QA full... see, QA smoke: is that running? Yes, that does actually run.
B
Yeah: pipelines, QA, trigger smoke. Let's look for...
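For context, a downstream QA run like this smoke suite is typically started through GitLab's pipeline trigger API. A minimal sketch; the host, project ID, trigger token, and variable name are placeholder assumptions:

```python
import requests

GITLAB_API = "https://gitlab.example.com/api/v4"  # placeholder host
QA_PROJECT_ID = 1234                              # placeholder project

# POST /projects/:id/trigger/pipeline is GitLab's trigger endpoint; the
# SMOKE_ONLY variable is hypothetical and would depend on the QA project.
response = requests.post(
    f"{GITLAB_API}/projects/{QA_PROJECT_ID}/trigger/pipeline",
    data={
        "token": "TRIGGER_TOKEN",
        "ref": "master",
        "variables[SMOKE_ONLY]": "true",
    },
    timeout=30,
)
response.raise_for_status()
print("Triggered QA pipeline:", response.json()["web_url"])
```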
B
Yeah, it seems to vary between 15 and 30 minutes, but it's a bit inconsistent. So yeah, let's see what the next steps are in the meantime.
C
So, just on the testing on production: from all the other failure scenarios that we've already outlined, and any we haven't yet outlined, what else do we need to test before we try and set up something for...?
C
So you should get it set up; the first step... we can do this as a dry run, so that would be a good first step.
B
Yeah, I think the proper approach for doing this is probably going to take a bit of effort, because there's a discussion about kill switches, basically, for this, and some different suggestions there. I think it was suggested last year that we use Consul, and I actually agree with that, because we're now sort of running into more and more cases where some sort of central authority of state would be useful, and I think Consul is a decent solution for that.
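A minimal sketch of what a Consul-backed kill switch could look like; the key name and Consul address are assumptions, while the /v1/kv endpoints are Consul's standard HTTP KV API:

```python
import requests

CONSUL = "http://127.0.0.1:8500"   # placeholder Consul agent address
KILL_SWITCH_KEY = "deploys/halt"   # hypothetical KV key

def deploys_halted() -> bool:
    """Deploy jobs would poll this before starting or continuing."""
    resp = requests.get(f"{CONSUL}/v1/kv/{KILL_SWITCH_KEY}?raw", timeout=5)
    if resp.status_code == 404:    # key absent: deploys allowed
        return False
    resp.raise_for_status()
    return resp.text.strip() == "true"

def set_kill_switch(halted: bool) -> None:
    """Flip the switch; every node reading Consul sees the same state."""
    requests.put(
        f"{CONSUL}/v1/kv/{KILL_SWITCH_KEY}",
        data=b"true" if halted else b"false",
        timeout=5,
    ).raise_for_status()
```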
B
...that's the reason we can't stop it halfway through.
C
Yeah, I think that makes sense. Being able to halt a deployment would give us more rollback use cases, but I don't think it blocks rollbacks.
C
Is everybody confident that they understand... I think it's in the runbooks, but it's a good thing to just have in your head as well. So, you know, quickly: how do we roll back production?
E
We wouldn't touch canary, and I think we wouldn't touch staging. We would probably make sure that whatever is on staging does not get promoted, because at that point we probably need to identify what needs to be fixed, what caused the problem, get that merged in place, and prioritize getting it into staging and proceeding forward.
F
I guess we can treat rolling back production as any other S1/P1 incident, in which we block auto-deploy or pause the tasks, because we need to have our environments frozen, so to speak.
C
We may not need to, but yeah, I mean, sure, I think if that makes it easier, then yeah. We definitely have to drain canary; that's a really key bit, because otherwise canary will end up potentially ahead of... sorry, behind production.
C
Cool, okay. It is in the runbook, but I just wanted to make sure... like, I think in the heat of the rollback moment, that would just be a really good one to be very clear on, where these things sit together.
C
We should see how that goes, whether it's almost worth... it might make it just too complex, but with a production rollback, almost having a reminder or something that's like: hey, don't forget canary. But we can see how that plays out. For now, I think we just have to follow the runbook super closely, right?
C
Fantastic. And then the other thing that we have on this pipeline epic, which I'm going to propose moving out, is that we have quite a few issues on here around, kind of, information and the posting of information. So there's the issue, which I need to do something with, of putting together how we bring all this stuff together into a human-readable proposal; and there are various things around information to set up a rollback, and information that we've done a rollback.
C
Just whether those extra tasks should be part of the exit criteria for this epic, or whether we're actually saying we feel like production rollbacks will be... well, sorry: we'll feel like the rollback pipeline task is kind of complete when we actually feel happy with the production rollback, or whether we actually want to go through and call it complete once we have all the extra pieces. So we have, for example, mirror deployment tracking on ops.
E
I'm going to vote yes. In my opinion, if there's anything that we could expose to make our lives easier during a high-stress event... and, I feel like... I haven't looked at the issues, so I don't know what these are, but just based on your description so far, they sound important enough that I feel like we should tackle them.
C
Cool. So we have one issue, one action: to create an issue so that we capture when and how we cancel deployments. Who would like to take the action to open it? Good, thank you.
C
And then the other thing that would be worth doing is having a write-up on the... I'm going to propose that we separate out these two failure scenarios that you put together, Byron: have the one we've just done and the one that we haven't tested yet on separate issues, and then have a summary of what happened today on the one we've actually just tested.
C
Does that sound like a sensible way? Cool. I will create a separate issue to separate out the two scenarios, so that we have just this one, and then I'll create a new issue with the one we didn't test. Eric, are you okay to put together a summary, just of what we tested and how it went?
F
We discussed something about having nodes in maintenance mode. I think I missed what happened with those: when we are rolling back, are those ignored, or do we need to do something with them?
E
They're probably going to be treated just like we treat them when we roll forward: if they're in maintenance, they'll just stay in maintenance mode during the rollback procedure. If we want to change that behavior, then we need to address it in some way; but there could be cases outside of deployments that would lead nodes to being in maintenance mode, so it's probably wise to just leave it as is.
C
It might be worth us just checking that. Yeah, that's not a bad idea, because they do need to be left, right? There are many good reasons they're in maintenance. So yeah, that could be another... it could be another useful one.
C
Well, it's a good test for us to pose, right? Like, I don't know if we should try and get coordinated so we can get some of the production tests going, but it's a good test scenario to have if we have any delays, for whatever reason. So, cool; Mario, would you mind adding that to the failure discussion issue so that we don't lose it? Yep, sure, I will do it. Thanks very much.
C
We should keep an eye on the tests, though, and just make sure those do pass.
C
Cool, great, nice work. Very nice, very nice scenario, nicely executed, and I feel like we've learned some more stuff, so that's super good. Great, all right, thanks everyone; enjoy the rest of your day.