From YouTube: 2021-01-25 Delivery team rollbacks discussion
A
Cool, okay. So we've got the discussion notes doc up already. I think the rough scope for today, as this is the first discussion we've had on rollbacks, is to try and work out where we want to go: what might be a suitable scope for rollbacks, and how do we progress things. So feel free to lead this in whatever direction we feel is going to be useful. My initial question really was: do we want to focus this entirely on auto-deploys rolling back?
B
Well, we don't really roll back anything else — I mean, we never roll back anything, so that's the first statement. But auto-deploy is easier, and we are in a situation where it may actually make sense to do it, because we deploy multiple times a day, so the changes are really small. And I mean, I see no value in — I mean, there's value, but it's just the trade-off between the time investment and what we get out.
B
The first thing is always: can we roll back? And the answer usually is: well, we have no idea, so let's go and try to roll forward, and understand where the merge request that broke it is. We may aim for a situation where we say "yes, we can, because we know" — then we roll back, and in the meantime someone finds out where the broken merge request was, rolls it back, and then we roll forward.
C
And they're, you know, managed by Chef runs, which we don't have a pipeline for anyway. For k8s configuration changes, I think it'd be good to have a rollback option. This would be anything — any kind of configuration change that we make outside of auto-deploy, including adding environment variables, changing configuration, changes to gitlab.rb, et cetera. And then we have the satellite projects: Registry, Shell, et cetera. For those, I think it would be good to have a rollback option the same way we have for auto-deploy.
D
I'm tying into Jarv's statement about the need for rollback for configuration changes. Right now we do suck some Chef changes into Kubernetes to deploy them, and similar. If we do a rollback in Kubernetes related to a configuration change that was spawned from a change stored inside our Chef repository or our Chef secrets, we don't currently have a way to connect those two together — that's usually via someone's memory banks, hosting that data inside their brain.
A
Cool, okay. I think that's a good point. What's the scope of rollback people are thinking about? Are we thinking that we either might want, or it might be a good idea, to have parts of auto-deploy roll back separately from the whole auto-deploy? Would we want to roll back just the k8s deploy, for example?
C
The scope I'm thinking of is incident mitigation, and when I think of incidents that we've had in the past, there has been a desire to isolate rollback to either fleets or k8s clusters, even individual zones — so being able to roll back one cluster instead of all three clusters, just so that we can help isolate.
A
Yeah, we can do that. In terms of what we are trying to achieve with rollbacks — maybe from the OKR perspective — what's the direction we want to go in as a team?
C
Yeah, that's true. So I guess if we have a rollback for configuration changes, we get that for free. I would argue, though, that we also get auto-deploy for free in k8s workloads, because they all go through the same pipeline.
C
Why is that a problem? I mean, you think that it'll be hard to reason about whether the rollback —?
B
Yeah, that's the point, right? Because I'm also scared of the development process itself. It took two years for us to have a framework for developing migrations; it was actually much harder to do multiple environments at the same time. And when I pointed out these things — I don't remember where — the team was just asking: how can we deploy? We pointed out things like: how do you plan to do rollbacks?
A
Right — like, how do we know that something hasn't moved forward? So I suppose that's probably a question for me on rollbacks: presumably there's a window in which we can roll back, and it has an impact for another period of time. It feels like Registry in particular I would rather avoid, because I think it's a slightly different project; we need to work with Package to understand the impact.
E
I think a difference to take into account there, too — sorry — is that currently we sort of support multiple versions, as in old and new, but it's always moving forward and it's only two versions. If we start supporting rollbacks for, say, individual clusters, you can end up in these cases where, say, the frontends get version B, but now suddenly Sidekiq is rolled back to version A. In theory it's kind of the same concept.
E
You know, the direction is different, but I think — or at least I would suspect — our code doesn't handle going back as well as it handles going forward. So, as an example, we have the split between regular and post-deployment migrations.
E
But then, if you roll back, suddenly your code is only capable of understanding the old format, while some comments might be in the new format. To handle that, you basically have to say: okay, well, we support the old format until version X. But that doesn't really solve the problem, because if you then reach version X and have to revert, you have the same problem.
E
So you basically delay it, and you end up with these cases where you have to code very defensively, or you actually have to do a sort of three-step process: you add a change such that, you know, if there's a new format, you can't roll back; you then add another change that, you know, stops using the old approach, so that you might be able to roll back.
C
We need to support this — the application needs to do this, full stop. I mean, we upgrade Sidekiq in parallel, unless we want to start upgrading Sidekiq before we upgrade the rest of the frontend; you know, this happens now, so the application needs to support it. One recent example with the Registry: we deployed the Registry, there was a metric change, and then our Apdex dropped. In that case I wouldn't want to have to roll back the entire stack — I just want to roll back.
C
I have the Registry developer on the incident call; he just wants to roll back to the previous version. I want to click a CI job that does that — I don't want to have to roll back everything. This is why I think that targeted rollbacks for incident mitigation are essential. It just comes up over and over in my experience that you want these targeted mitigations; you don't want to have to go through an entire pipeline.
B
Right — and I wrote "skip Gitaly rollback", so it relates to my point. I said that we should not touch Gitaly by default, and it should be an option to eventually roll it back, because we have the same problem here: Gitaly is usually very backward compatible, but forward compatibility is really hard to get right, and we are not really stressing this enough in development. So this will not happen very soon — but then, think about this.
B
We're talking about changes that have something like three, four, eight hours of code changes, because we do deploy four times a day, three times a day, and we tend to deploy every version. So we really need to focus on this, because this is the key aspect: it keeps the change really small. And what I'm thinking here is still —
B
If something like this happened and we are in an incident, we should not just keep deploying. We have to figure out what is broken, fix it, and move forward, because otherwise we have this nightmare scenario where one part of the fleet is running version n minus two, another part is running version n minus one, Sidekiq is at version n plus three — and then you just can't handle this.
C
I'm just thinking that, outside of a post-deployment patch, rolling back is the only way we can, you know, fix something quickly. I agree — it's really the best option we have now. Going forward, outside of a post-deployment patch — and maybe that's what we do — I think rolling back would be the right thing to do.
A
Pause the pipelines — and I think that has to be true for whatever we're rolling back, because knowing we've rolled back is the challenge, right? And investigating why, and things like that. So that makes sense. So if we're saying an end goal is that everything can be rolled back — which I think is what we're saying — what makes sense to focus on initially? Is it auto-deploys as a whole, or is it something else?
C
Yeah, it would be terrible, right? And we've tried to create rollback pipelines that reverse jobs or reverse order. And I guess my question is: do we really want to do it? Like, is this —?
B
This is not true, Jarv — you're thinking about a human detecting the problem at the end of a complete deployment and then going back. But if you have this thing automated, you can stop earlier. So if you are 10% into the fleet and your Apdex drops, you can stop and roll back, and it's going to be faster.
C
Yeah, I think that's a good part of this, but I also think it's very rare that our metrics would detect it, yeah.
D
And a lot of people want to spend some time troubleshooting and looking at errors before we decide whether or not we want to roll back, and that also slows us down in kicking off a rollback quickly.
B
Because you have no other options — so what do you have now? Nothing. I mean, you can just try to figure out what's happening and try to detect what part is broken, because there's no rollback option today. No release manager will ever say "let's roll back", because you have to manually check and make sure that there's no post-deployment migration, which is not easy, because it really depends on what you deployed — it's not just the content of the last package.
B
What I was thinking — so when you think about selectively rolling back, you're thinking, if I'm correct, about splitting the fleet by cluster names and things like that, and by services, right? So you want to roll back only Registry, or you want to roll back the git fleet, or the git fleet in zonal cluster X. You're thinking about this, right, Jarv?
C
Yeah. Basically, epic 373, which has a mock-up of it, would just be a manual job next to each fleet or each cluster that would do a helm rollback or an ansible rollback.
B
Yeah, what I'm thinking here is this: when we have deployment and rollback pipelines that are really straightforward, so that every single job handles one thing — you don't have a catch-all "do the Kubernetes deployment", but rather "trigger this Kubernetes deployment for this cluster" or whatever — then you can selectively run only those jobs that you want, so that you have the ability to roll back everything. And by "everything", strings attached, because I will not roll back Gitaly.
B
In any case — for me, Gitaly is a manual step that you can do after, if you want. I can explain this a bit better, but the point is that if you can roll back everything, then you can also roll back selectively.
C
I see — so with the refactored pipeline there'll be one trigger, and then we would just have a rollback trigger. But I think — isn't part of this deciding whether we're going to do something sooner than the refactor or not? I'm not sure.
B
So the idea is that the most important parts of the pipeline refactoring should kind of move from the current OKR to the rollback OKR, so that we have what we need in place.
D
Precisely. So what if we create that method in some way, shape, or form? When it gets around to time to build a rollback pipeline — perhaps when a deployment pipeline gets created, it's smart enough to know whether post-deployment migrations are contained inside of it. If they are, don't create the rollback capability in that pipeline.
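A sketch of that idea, assuming a hypothetical HAS_POST_DEPLOY_MIGRATIONS variable computed when the pipeline is generated:

    # Only expose the rollback job when the package being deployed
    # contains no post-deployment migrations.
    rollback:web:
      stage: deploy
      rules:
        - if: '$HAS_POST_DEPLOY_MIGRATIONS == "true"'
          when: never
        - when: manual
      script:
        - ansible-playbook rollback-web.yml  # hypothetical playbook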
B
Because I think you would find it either way, yeah. But the thing is that if you run it just before the next deployment, it means that you don't know — you can't reason about it.
B
Deployment
and
there's
also
another
aspect
of
it
that
you
need
to
run
migration.
The
regular
one
before
the
cannery
deployment.
C
And
but
but
while
you're
yeah
I
mean
I
don't
I,
I
guess
I
I
see
your
point,
but
I
think
it
still
kind
of
puts
us
in
a
better
place
than
we
are
now,
because
how
often
are
problems
triggered
from
deployed
migrations
doesn't
happen
very
often
and
to
reason
about
that
we
could
always
just
roll
back
the
application
and
then,
if
the
problem
is
still
there,
then
we
would
say:
aha,
it's
supposed
to
play
migrations
like
we
would
just
keep
on
rolling
back
up
until
the
last
post-deploy
migration
right
and
then
you
would.
B
So what happened last time is that post-deployment migrations are kind of outside of our control, right? It requires an effort in socializing the idea in development, and making sure that we can actually do this. So if we can do our part of the homework, so that we actually detect them and know when they are there, then it's just a matter of showing — yeah.
B
We
were
able
to
roll
back
in
this
case,
and
then
you
can
do.
These
are
all
the
incident
that
we
were
not
able
to
roll
back
because
of
this,
and
because
this
tied
also
to
the
discussion
about
automating
completing
automating
deployment,
because
right
now
we
still
have
the
baking
time
and
we
click
and
we
click
the
play
button.
But
if
we
have
more
control
over
this
and
knowing
that
it
is
a
rollbackable
migration,
sorry
a
rebeccable
deployment,
then
you
can
just
be
more
easily
rolling
forward
than
because
you
can
roll
back.
F
Should we also identify the types of post-deployment migrations we have? Currently you can have two types. There are post-migrations that trigger a background migration, and I think in that case it is not possible to roll back. But there is another type of post-deployment migration, like the ones that add or remove indexes; those are declared inside the post-migrations because they are adding indexes to large tables such as projects, namespaces, or ci_builds.
F
But honestly, the indexes that are added in post-migrations are normally used for tooling — the usage research tooling, the one that is only executed every 13 days — so those are indexes that are not truly needed at the moment they are applied.
A
Cool. So the next comment we've got here, from Alessio: "we need a rollback pipeline". So how do we get to that, yeah?
B
We will not run them. What I was thinking is that if we run the deployment of a previous package, then the migrations contained in that package should already be in the database, so just running the deployment job would do nothing — I'm talking about regular migrations, the ones that we do up front — so it would be kind of a no-op, and it should be safe. Gitaly: in my opinion, we should avoid rolling back Gitaly.
B
I
was
also
working
on
the
pipeline
refactoring
before
the
school
and
in
the
original
attempt
to
run
to
make
the
rollback
pipeline.
The
rollback
deployment
is
already
there.
So
there's
there's
a
variable
that
say
if
we
are
rolling
back,
don't
do
easily
deployment
and
yeah.
This
is
what
I
was
thinking.
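Assuming such a variable, the gating might look roughly like this — the variable, job, and playbook names are guesses, not the actual refactoring branch:

    # Skip the Gitaly deployment whenever the pipeline runs as a rollback.
    gitaly:deploy:
      stage: deploy
      rules:
        - if: '$DEPLOY_ROLLBACK == "true"'
          when: never
        - when: on_success
      script:
        - ansible-playbook deploy-gitaly.yml  # hypothetical playbook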
B
This is a very good question. My point — what I'm thinking here — is that it's probably not worth touching the current .gitlab-ci.yml, because it's kind of huge. I would rather — because we still have to think about how we can — I mean, we know how to find post-deployment migrations; we need to actually code it. I think we'd enhance the Ansible script to detect this.
B
On
the
other
hand,
I'm
thinking
has,
as
I
speak,
so
what
what
will
happen
with
the
pipeline?
Refactoring
is
just
that
we
are
going
to
collapse,
jobs
together,
because
right
now
we
have
something
like
this,
so
we
have
think
about
the
web
fleet.
Just
as
an
example
right,
then
we
have
some
kind
of
skeleton
of
what
does
it
mean
to
deploy
the
web
fleet,
and
then
we
have
many
jobs
for
every
environment,
so
we
have
kind
of
gstg
web
fleets.
B
Then
we
have
gs,
gpr
dc
and
I
web
fleet,
and
then
we
have
gprd
so
all
the
environments,
but
they
are
the
same
job
except
from
some
variables
that
are
yeah
that
detects
the
stage
in
the
environment.
Basically,
so
in
the
current
situation,
this
information
are
not
detected
but
are
kind
of
supplied
by
this
gitlab
ci
yaml
file.
So
let
me
let
me
try.
B
What
I'm
working
right
now
is
doing
the
opposite,
so
you
can
have,
because
you
can
have
only
one
it
either
have
the
the
environment
and
eventually
the
canary
stage
or
not
so
the
same
job
would
just
detect
the
content
of
it
and
generate
those
information.
So
in
theory
we
should
be
able
to
run
the
same
job
and
we
and
ansible
will
do
the
right
thing,
because
the
variables
provided
are
the
right
one.
B
So if we start adding the logic for rolling back, it should just work, because it's still a matter of running it — basically running a deployment with the previous version, and cleverly skipping some jobs: do not run the post-deploy migrations, do not run the Gitaly deployment. So maybe we can.
C
So, in other words, the way you're seeing it is that we would have the release-tools pipeline, and there would be a manual trigger job that would be like a rollback, and that would trigger the Ansible pipeline with rollback equals true, which would then run the appropriate jobs at the previous version. And the deployer pipeline kind of stays the same, I guess — there's no special logic there, because you're just passing in the previous SHA and rollback.
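A rough sketch of that trigger job; the variable names and the downstream project path are assumptions:

    # Manual rollback trigger in release-tools: re-runs the deployer
    # pipeline with the previously deployed version and a rollback flag.
    rollback:
      stage: rollback
      when: manual
      variables:
        DEPLOY_ROLLBACK: "true"
        DEPLOY_VERSION: "$PREVIOUS_DEPLOY_SHA"
      trigger:
        project: gitlab-com/deployer  # hypothetical project path
        strategy: depend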
B
It
and
it
will
also
allow
us
to
start
thinking
about
dog
fooding,
the
the
rollback
feature
in
gitlab
itself,
because
I'm
quite
sure
it
works
more
or
less
the
same
way.
It
just
runs
the
same
pipeline
with
with
the
version
from
the
previous
deployment
and
something
like
rollback
equal
through.
So
I
would
just
say:
let
me,
let's
make
sure
that
the
variable
is
the
right.
It's
the
right
one,
so
that
we
are
kind
of
gitlab
wrote
back
compatible
in
a
certain
way,
but
yeah
the.
C
So then it would be the case that API, web, Sidekiq — well, API, web, Sidekiq, Pages — would all be treated as a unit: everything that's upgraded in parallel would be rolled back in parallel. Okay.
B
But we have paying customers that want this no-downtime deployment, so it's kind of what we should be doing, right?
A
Would we — on that proposal, Alessio, would you... would we?
B
Okay. So I think it was in another discussion — the one about deploying from master, but I'm not really sure — there was this concept of always testing rollback on staging, regardless of incidents.
B
So
if
we
have
this
pipeline-
but
we
let's
say
run
it
once
now,
because
we
are
coding
it,
then
we
never
run
it
for
six
months.
Then
we
just
cross
our
fingers
and
hope
that
it
still
works
when
we
actually
yeah
need
it
in
six
months,
three
months
whatever
so
one
idea
was
that
in
parallel
to
canon,
redeployment
or
something
like
that,
we
could
think
about
rolling
back
staging
and
then
rolling
forward
again.
B
So
that,
with
this
kind
of
approach
either
we
we
know
if
the
rolling,
if
rolling
back,
is
an
option
because
it
actually
works.
And
then
when
we
have
an
incident
it
will
make
absolutely
no
sense
to
start
from
from
staging,
because
you
have
the
problem
now.
So
you
want
to
mitigate
it
as
soon
as
possible,
and
then
you
can
just
drain
cannery
and
roll
back
production.
B
At the same time, we were hoping to have some possibility for our developers to actually run this. I ran some tests against canary, but with canary drained, because they were not able to reproduce it locally. But I mean, this is very much in the future. So I was thinking: right, we have an incident —
F
So I like the idea of rolling back on staging and then rolling forward, but I do wonder whether saying that everything was okay rolling back in staging is going to give us a reliable measure, because staging is quite different from production — in the traffic, in the database, in the feature flags that are enabled there. So I'm not sure about that one.
B
Yes, you're right. I don't know about the afternoons, Myra, but at least in my mornings I try to keep an eye on this. What really happens is that, because of a bug in our release-tools logic, we tend to tag two versions for every auto-deploy branch. So usually we have one merge request of difference between what is actually in staging and what is in production — because, for instance, if I have just some minor changes, I don't go for a second production rollout.
A
Are we confident enough with it to be able to test out rollbacks, or do we need to actually do more of the ideas we proposed in the deploy-from-master discussion and, you know, keep it a little bit closer to the version we're actually deploying?
B
Well, it's not exactly — no, it's still not the same, because that's the problem, right? We keep thinking about packages, but we never think about the delta. So — yeah, canary and staging will get every single package, but production usually skips one or two. We will not solve this one, but I mean, we are moving from not having this as an option at all to actually having rollback. So, I don't know, I mean —
B
Some
sometimes
seems
like
we
are
trying
to
really
solve
problems
that
are
really
far
far
away
and
we
still
know
nothing
about
the
the
process,
and
so
I
think
that
we
are.
There
are
several
steps.
We
could
do
and
start
iterating
on
it,
so
that
then
we
can.
Then
we
would
know
better
and
figure
out
if
we
actually
have
to
change
something
else.
D
I
think
alessio
raises
an
interesting
point
because
we
skip
revisions
in
production.
We
really
need
to
figure
out
if
a
post-deploy
migration
is
going
to
halt
us,
and
I
think
we
need
to
leverage
what
we
use
for
our
release
tracking
feature
to
use
that
as
a
gauge,
because
we
could
create
like
a
chat,
ups
command.
D
That
says,
can
I
roll
back-
and
it
just
knows,
what's
currently
on
production,
maybe
what's
being
deployed
to
production
versus
what
was
previously
on
production,
because
we
can't
just
compare
the
last
shot
that
was
on
canary,
because
that
might
be
a
revision
that
never
made
it
to
production.
For
example.
B
I was looking into this in issue 1061. There is a script — actually a one-liner — that tells you if you can or cannot, I think, but I'm not sure, because I'm not entirely confident in this part of the deployment. But what I'm thinking here is that you could have a pipeline that connects to the deployment box that runs migrations and runs, basically, gitlab-rails db:migrate:status and egreps for anything that's down, yeah.
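Wrapped in a pipeline job, the check described could look roughly like this — the host variable and the exact egrep pattern are illustrative, not the actual script from issue 1061:

    # Fail the job if any migration the current package knows about has
    # not run yet ("down") — in that case rolling back is not safe.
    check:pending-migrations:
      stage: check
      script:
        - ssh "$DEPLOY_HOST" 'sudo gitlab-rails db:migrate:status' > status.txt
        - '! egrep "^\s*down" status.txt'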
E
Are there any files added or changed in db/post_migrate? And then we just list those files in the JSON that we store in this metadata repository. That way, later, with tooling, we could just, you know, go to that repository, fetch the JSON, and get the list of migrations — so you wouldn't just know "oh, there are migrations", you'd also immediately see what the files are.
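A sketch of recording that list at deploy time; the job name, SHA variables, and JSON shape are assumptions:

    # Record which post-deploy migrations changed between the previous
    # and the current deploy, as a JSON list for the metadata repo.
    record:deploy-metadata:
      stage: record
      script:
        - git diff --name-only "$PREVIOUS_SHA" "$CURRENT_SHA" -- db/post_migrate/ > migrations.txt
        - jq -R -s 'split("\n") | map(select(length > 0)) | {post_deploy_migrations: .}' migrations.txt > deploy.json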
E
As I say that — the alternative is that when we record our deployments, we know the current SHA that we deploy and we know what the previous SHA is, because there we do store it in a more easily accessible manner, so we could record it then. The problem is that we then record the data after the deploy, not before it — whether that's —
E
Right, but the difficulty there is — to put us right on the release-tools side of things — there are really two points where we can hook it in. One is when we tag these auto-deploy tags, which is essentially at the start of the deploy; but at that point, figuring out what the previous deploy is is more difficult.
E
And long term, of course, we can change that — I think long term this release-tools code will change quite dramatically, you know, as we move to Kubernetes, for example. Hopefully at some point we stop deploying this big package, and then this entire tooling needs to change anyway, because right now it assumes that we deploy this big fat package and we sort of yank the SHA out of this, you know, big version identifier.
E
So
at
that
point,
when
we
do
that,
then
yes,
you
generate
this
diff
before
a
deploy,
store
it
and
then
you
can
do
things
like.
Oh,
we
changed
this
file.
Are
you
sure
you
want
to
deploy
automatically
things
like
that,
but
I
think
for
now
probably
it's
easiest
to
hook
into
the
deploy
tracking
that
we
already
have
and
then
just
store
a
list
of
migration
files
for
later
review
and
yeah.
Then
we
need
a
separate
tool
that
somewhat
takes
that
data
and
presents
percentage
shut
up
or
whatever
it
is.
E
— that, and, like, write down a procedure for this, because even if you fully automate it, it might break, might not work, so it's nice knowing what to do manually. Yeah — and then the next step is to record those migrations automatically, and then the third step would be some tool that presents the data.
B
In,
in
any
case,
I
would
like
to
so,
I
don't
really
believe
that
the
looking
at
git
diff
is
the
right
option
is
just
the
only
one
that
we
have
now
and
is
the
worst
one,
because
often
times
especially
on
staging,
we
may
have
things
that
got
reverted
or
things
like
that.
So
the
content,
that's
why
I
was
stressing
out.
B
So
the
idea
was
this
one
that
when
you
promote
a
production
build
so
when
you
promote
production,
build
your,
you
already
had
the
regular
migration
because
we
do
them
on
canary.
E
— production. I think in that case we can do it, because we track the finished migrations in the database, in a simple table that you can query. What you would essentially do is fetch all those versions, fetch all the files on disk in db/post_migrate, and basically get the diff of that.
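Concretely, that diff could be computed roughly like this — the schema_migrations table and db/post_migrate directory are real Rails/GitLab layout, but the job itself and the omnibus path usage are assumptions:

    # Post-deploy migration files shipped on disk whose versions are not
    # yet recorded as finished in the schema_migrations table.
    check:unapplied-post-migrations:
      stage: check
      script:
        - gitlab-psql -t -c 'SELECT version FROM schema_migrations' | tr -d ' ' | sort > applied.txt
        - ls /opt/gitlab/embedded/service/gitlab-rails/db/post_migrate | grep -o '^[0-9]\+' | sort > on_disk.txt
        - comm -13 applied.txt on_disk.txt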
E
The challenge then becomes that we need some sort of program running on these hosts. That's doable — you know, whether it's via grep or however you do it doesn't matter, but you need to somehow run it on the host. With the Ansible approach I think that works, because we already essentially SSH in and run the commands. I'm not sure how well this would work if we do Kubernetes, because the deployment approach is very different there. And what you could do —
E
The
problem
you
get,
then,
is:
if
you
do
that
for
20
hosts,
you
get
the
data
20
times
because
they
don't
know
which
ones
is
sort
of
the
the
leader
in
that
sense,
so
I'm
not
really
sure
how
we
would
implement
this
appropriately,
given
the
state
that
we
are
in,
but
also
the
state
that
we
want
to
move
towards
when
it
comes
to
deployments.
B
So
that's
exactly
that
type
of
things
right,
because
you
want
to
be
able
to
run
a
pod
with
certain
script
or
whatever,
after
or
right
or
right
before,
right
after
migrations,
and
things
like
that,
so
I
mean
it's
as
soon
as
we
know
how
to
do
it.
It's
just
a
matter
of
making
sure
that
we
can
port
this
over
to
to
kubernetes,
and
even
if
we
don't,
we
can
still
run
migrations
outside
of
kubernetes,
because
we
will
ship
packages
for
to
our
customers.
C
I think we're going to have migrations done outside of the Kubernetes cluster — I'm not sure how that's going to work. And I see us having, you know, the four clusters, and us being able, at least for the frontend, to just do all-or-nothing rollbacks for each cluster, because with autoscaling — I mean, we do this now, where we completely drain a cluster and then the other clusters are able to take on the extra load.
C
This doesn't work with Sidekiq currently, because it runs in the regional cluster, and I don't think we're going to be splitting Sidekiq. But at least for the frontend — web, API, and git — I see that we'll probably just do full cluster rollbacks, and we'll drain clusters if we need to.
F
I'm sorry — yeah, sure. So there is a blueprint being developed by the database team about testing database changes in a production-like environment, and the idea is to test the migration, whether it's a regular DB migration or a background migration.
A
Yeah, great — thanks for sharing that. And this is the other thing that's actually interesting about rollbacks: hopefully these will help us go back and work with development on how to make code changes and migrations safer, so that we don't need to be rolling back stuff. So yeah, that's a good one, I think, for everyone to know about, so we can help people use that as well.
A
So we've got about 20 minutes left — how do we want to take this forwards? I think, to be clear: this OKR will pick up from where we're up to with coordinated deployments, so it's not the case that we need to shut all that work down and immediately begin rolling back on Monday; one should feed into the other. But what other things do we either want to prioritize, or test out, or investigate further to help move this?
A
Cool, okay, yeah, that makes sense. And then, alongside that, or as part of that, let's also figure out the pieces of the coordinated deployments work that will help either the early stage of rollback or the next stuff. I'm thinking: is there any tech debt that it would make sense to have paid down, or anything around notifications? I suppose that's the other interesting thing about rollbacks — keeping track of what has deployed, or not deployed, or partially deployed — and then we can front-load that stuff in the early part of the quarter.
B
Are those ideas — mine and Jarv's — really different? I mean, it's kind of doing a subset versus doing it as a whole; as long as you can still do both, it's still the same process. The only thing that we kind of excluded is — no, let me rephrase. There could be configuration changes, and rolling out Registry, for instance, or GitLab Shell, is a configuration change right now, because it lives as configuration in the k8s workloads. And I think we said that those are part of what we want to do, but kind of outside the scope of the early part, because we already have this, right? You can just reverse the change and deploy again. Unless I misunderstood.
C
My
my
my
preference
here
would
be
to
focus
on
alessio's
proposal
for
the
deployer
pipeline
and
then,
in
parallel,
add
rollback
manual
jobs
to
the
gates
workloads
pipeline,
which
will
allow
us
to
revert,
which
is
basically
just
it's
just
a
manual
job
that
sits
next
to
the
upgrade
job
that
will
do
a
helm
rollback,
and
this,
I
think,
is
important
for
the
configuration
changes
and
also
we
could
use
it
for
application
changes
that
are
treated
like
configuration
changes
like
registry.
C
I
mean,
I
think,
it's
a
pretty
simple
thing
to
do.
Anyway.
It's
not
going
to
take
a
long
time
to
make
that
pipeline
change.
What
do
you
guys
think
of
that.
C
Yeah, it would. And what would happen is: if we auto-deployed after that, it would fail, because we do this diff check to see if there are any unexpected changes, and it would fail the dry run. So we would have to wait until someone made a new MR — either we would do a proper revert and revert the change so that it's clean, or we would move forward. But it would at least allow us to roll back a cluster quickly if we needed to.
C
Not
the
trigger
well,
the
trigger
will
fail,
but
what
actually
fails
is
the
downstream
job
that
does
the
diff,
because
when
we
do,
the
diff
we
also
take
a
look
to
see
are:
is
the
set
of
changes
consistent
with
an
image
only
update,
and
in
this
case
we
would
see
other
things
that
are
pending
and
it
would
just
fail.
C
Yeah
but
I
think
yeah
it
wouldn't
prevent
the
rest
of
the
fleet.
It
would
just
prevent
the
kubernetes,
but
this
is
good
because
otherwise
we
would
revert
the
change
and
then
the
next
auto
deploy
would
just
like
apply
it
back,
because
right,
because
what
like
you
said,
what's
on
the
git
repository,
isn't
consistent
with
the
cluster.
A
Oh
okay,
that
sounds
good
and
then
one
other
piece
which
I'm
sort
of
thinking
about
which
I'll
put
on
the
okr
issue
most
likely,
is
how
we
link
this
to
something
measurable,
so
marian
suggested
mean
time
to
resolution,
which
makes
sense
from
this
being
an
incident
related
thing.
That's
not
a
number,
that's
being
tracked
at
the
moment,
and
obviously
it's
also
a
number
that
when
we
start
tracking,
it
will
be
hugely
impacted
by
lots
and
lots
of
other
things.
A
So let's do that, and have a think about whether there are follow-up conversations we need to have, or whether we have some issues we could review, for example. But I think it'd be a good idea to keep these two aspects of rollbacks in one conversation, so that we actually are —
A
I
want
to
make
sure
that,
when
anyone's
a
release,
manager
you're
all
comfortable
with
here
are
rollback
options
and
it's
not
like
kubernetes
rollbacks
being
totally
separate
from
code
rollbacks.
So,
let's
even
if
they're
two
different
solutions,
let's
also
discuss
both
of
them.
A
Nope — fantastic. Okay, so I'll set up something for next week; I'll probably just put in an hour. Hopefully we'll be starting to wade through some of the uncertainty as we go, and we can check in on where we're up to and work out some next steps. Cool, all right — speak to you all soon. Take care, bye.