From YouTube: 2021-03-16 Delivery team weekly rollbacks demo
Description
Discussion on possible rollback failure scenarios and how we could test them
A
Tough, tough times, but still glad your test is negative. That's halfway! Right, awesome. Scarborough is in that incident, so I'm not sure if he'll make it; we should start without him, I think.
C
Okay, let's start then. We don't have plans for demos today, so this is kind of... Basically, the reason behind this is that initially we thought about testing this in production, and maybe it wasn't really clear what happened last week, but basically we have always tested the happy paths. We were always lucky, except for the missing machine, but that wasn't in the cleanup phase.
C
So yeah, it's more of an open question: how can we test this? Let's try to make a plan together.
B
Basically, yeah, I just commented there that the idea would just be to create a file to fill up the disk, right? I mean, that's what we can do, and it works as a manual step; I don't think you need any automation for this. So: on one selected node, create a big enough file in the temp directory, then run the rollback or deployment and see how it fails.
B
I mean, I would do this in staging, and then maybe send a note to the SREs that we are going to fill up a disk for testing, but I think there's not much more needed; it's staging anyway, so that shouldn't be an issue. And in production we have often enough had these issues with API nodes failing because they were full, so we know what happens there. Testing in staging should be good enough, I think. Okay, so.
A
I mean, so if we fail the rollback: do you have the permissions, Henry, to actually remove that file from the machine afterwards as well?
B
Oh, by the way, I don't know what permissions the rest of us have. The release managers, like developers: do you all have root?
C
The point is that the access grants, the role-based entitlements (I don't even remember the right name)... Basically, the definition of what a backend engineer in Delivery is, the permissions they have, was changed after the creation of the team. So I think each of us is really in a different situation, but in theory we should be allowed to have root access on staging machines, but not on production.
C
I re-normalized my situation myself, just kidding. These are the new rules, and I have most of them, so...
C
Oh yeah, it was a real approval, but it was kind of just stamping it, right? Just saying: yeah.
D
Well, we can discuss it in the issue, but should we also have admin access on staging? I think I have... yeah, I have admin access on staging.
C
You should have it; it's all defined in the thing: on which instances you should have admin access, whether it's with your own user or you need an extra user. It's very detailed. It also goes down into what access you should have in AWS and in DigitalOcean, if you're using them. So there's a lot of detail.
C
So, Henry, are you willing to just try to frame this? I mean, we can do this together, but you started with your idea, right, filling up the disk. So maybe we can work on this together as a team, trying to figure out: when do we want to fill it up? How can we fill it up? What do we expect? What is the failure going to look like, and how should we fix it?
B
Yeah, I think the first thing is to test how long it takes to fill up the disk, because if you need to fill up 10 gigabytes it could take a while. I mean, not too long, but I'm not sure how much you would need to fill up. So the first step is testing, so we know how long it takes to write a 10 gigabyte file, maybe. And then we need to measure how much free space there is and how much we need to fill up to really make it fail.
B
It should be an easy calculation. And then, in the issue, I already pasted a command we can use to do this with dd, to just create a file of the size that we want. Then we should be done and can start the rollback and everything else and see how it fails, hopefully. And after that we should of course make sure to delete the file again.
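A minimal sketch of that manual step, assuming a /tmp target; the file name and the 10 GB size are placeholders, not the numbers the team would actually use:

    # check current free space on the node's relevant filesystem
    df -h /tmp

    # write a large file to eat up space (size is a placeholder)
    dd if=/dev/zero of=/tmp/fill-disk-test.img bs=1M count=10240 status=progress

    # ... run the rollback / deployment here and watch how it fails ...

    # clean up afterwards so the node goes back to normal
    rm /tmp/fill-disk-test.img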
C
Yeah, I was thinking: do we want to make it fail after the warm-up or before? Because, from my point of view, when things fail during or before the warm-up phase it's kind of no big deal, since we know that nothing has started happening yet. So my suggestion here would be to make it fail when it hurts more, right, during the real deployment; so filling it up after passing the warm-up phase. And I...
C
A question for whoever may know the answer: instead of using dd, maybe we can use truncate, where you can provide the size. But I'm not sure if the Linux kernel is smart enough to figure out that there is still real space available, because with truncate you can say "create a file of 10 gigabytes" and it will not actually write it; it just says, okay, there's a 10 gigabyte file on disk. But then, I don't know, that's...
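For what it's worth, a quick way to see the difference on a Linux machine: truncate only sets the apparent size and produces a sparse file, so the filesystem still reports the space as free, while fallocate (or dd) actually reserves the blocks. Paths and sizes here are placeholders:

    # sparse file: apparent size 10G, but almost no blocks used and df is unchanged
    truncate -s 10G /tmp/sparse-test.img
    du -h --apparent-size /tmp/sparse-test.img
    du -h /tmp/sparse-test.img

    # fallocate really reserves the space, so df shows 10G less available
    fallocate -l 10G /tmp/full-test.img
    df -h /tmp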
D
So I do have a question. Could this problem of not having enough disk space lead to not being able to insert database records? I think the answer is going to be that our features just stop working for some reason.
C
It really depends. If we end up with a disk that has no space at all, the type of failure is really unpredictable, because it depends on what is happening at that moment; you may crash the running Puma, and who knows what happens. But I think the point that Henry made, and that Robert was also trying to figure out the numbers around, is that in order to install a package it has to be extracted, so that alone is kind of a big expansion phase.
A
I think also, to your point, Mario: yeah, absolutely, and that's kind of the point of this testing now, to actually find out what might go wrong and how we respond to it. We see this fairly often in production, so there's a decent chance we'd hit this if we rolled back production. So yeah, we should absolutely see how it goes wrong.
C
Just to repeat it once more, sorry: we pick one machine and take a look at the available disk space. We fill it up enough so that we have enough space for the warm-up phase, so downloading the package, but not enough to expand it. Just a trivial example: let's say we have a 200 megabyte package.
C
We leave just 300 megabytes free, because we know that once expanded it is 2 gigabytes, something like that, just random numbers. So we should be able to download it but not be able to expand it, and then we don't have to deal with timing this against the rollback itself, right? We just have to check after warm-up that there really is only a tiny fraction of space left, and it should fail.
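A rough way to leave a fixed amount of free space on the target filesystem, using the made-up numbers from the example above (the mount point is a placeholder):

    # available space, in MB, on the target filesystem
    free_mb=$(df -BM --output=avail /tmp | tail -1 | tr -d ' M')

    # leave ~300 MB free so the download fits but the extraction does not
    fill_mb=$((free_mb - 300))
    dd if=/dev/zero of=/tmp/fill-disk-test.img bs=1M count="$fill_mb" status=progress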
E
If it fails during the warm-up, though, that's not really a meaningful test, right? In reality we would clean the cache, free up disk space, and then it would continue rolling out the package. I think a more interesting failure would be certain hosts failing at actually restarting Unicorn or Puma or whatever, so that we're in an inconsistent state during the deployment. I think that's more interesting.
C
The asset is no longer there, and the job is removed as well. Okay.
C
The thing that Robert was suggesting... so, last time we identified kind of two main areas. One is something that is recoverable, regardless of what the error is: a machine that goes offline. If we can spin it up again, it's recoverable; if we can't, because it was deleted and there's no way we can regenerate it, then it's an unrecoverable failure, and we have another scenario for that one. And so we can fill up the disk, or (Eric proposed it) another thing that happened today.
C
Basically, we downloaded a corrupted package, so it was impossible to expand it. All of this involves manual interaction from, I would say, an SRE; because, yeah, we have access to staging machines, but let's say SREs in general, to fix the machine, because it's recoverable. But that doesn't mean we are targeting self-recovery here. We are not doing it for regular deployments, so why should we start with rollbacks by addressing self-recovery?
C
So the point here is: if something fails on a machine during the rollout, when a machine fails, what happens? My assumption is that we will reach a point where we have two versions running. It's not really clear if it will be just one machine missing out; it really depends on when it fails.
C
So if we just upgrade everything else except that one, can we then fix it and rerun it, and will the end result be a system which is completely rolled back? I think the real question we want to figure out is: can we really retry the job, and will it do the right thing? And this type of... I don't think... I don't think we should.
B
So, in just one sentence: we want to make sure that everything, every step we are doing, is idempotent, right, so that you can repeat it with the same result. I think that's the thing we need to figure out and want to verify by testing. And if we identify things which are not idempotent, then we need to be extra careful, and we need runbooks or something else to know how to deal with that, like restarting the whole deployment or something else, right?
C
Yeah, there's also another thing related to this, which I think is about being aware that things can go wrong and not panicking when it really happens. It's just a matter of: things can go wrong, so let's try to figure it out. Let's show that things can go wrong and that we can recover, instead of having to deal with this for the first time during a real incident, during a real production rollback.
A
Yeah, that makes sense. So, based on Henry's summary of that: is this filling up of the disk the best first test case to test this out, to test that we can just rerun rollbacks?
B
I think this failing during the install, after the pre-warm phase, sounds very much like a condition where we can't really fix it by repeating anything, right? So it needs manual interaction.
B
And after that, I guess we can rerun the job, which would be fine. So that would maybe be a good test, and we have some experience with that. And I don't know what else could break if we fill up the disk; I mean, as you said, we could have issues with processes just stopping because they can't write to temp or something like that. So it's very unpredictable.
B
As for what can happen: we can't have a solution for everything here, but if we just need to be able to roll back, then it's not that important that the single host is working perfectly, just that we can bring it back into a state that is expected, right? So if we try to cover this, that just means bringing it into a state where we can retry the same job again. Quite important.
C
Okay, something that came out of the conversation is QA failures. When we were discussing QA failures, we realized that, wait, we are not running QA triggers at the end of a production deployment. We always saw QA in our rollbacks because we were rolling back staging, but this would not happen during a production rollback. Now, do we want to run it only for rollbacks?
D
I think we should. It will give us some confidence in this package, even though it was already tested on all of our environments. And if something goes wrong and it is a legit failure, well, then we need to investigate it like any other failure in our environment; or perhaps it's a flaky one and we just need to retry.
E
I can only guess. I believe the canary deploy runs both a canary and a production QA job. I don't know why we don't do it after the production promotion.
C
I just have some guesses here. I think the deployment pipeline is already really long, and because we tested it on canary we kind of expect the same behavior, and we already run production QA smoke tests, I think once every hour, something like that. So it's probably also a matter of not putting too much load on the system, and things like that. But I mean, I'm just guessing here, so I...
A
I also wonder, if we do run them on canary, whether we've left anything behind in the database that might conflict. I mean, I hope they clean up, but they may not, right? Not all tests clean up everything, so that might be a good thing to answer. I agree with Myra that it would be nice if we can run these in production, but it might be worth trying to find out why we don't do that already.
C
So far we have always been discussing this kind of synthetic example where we decide to roll back for no reason. We are taking a perfectly working system and attempting to roll it back, which is exactly like rolling forward. I mean, there are complexities around this, but basically you have a system that behaves and you install something else.
C
So, a question. Oh, I just had some questions, now I remember. Do we know when it's really safe to stop a deployment? And if we have to make this call (and I mean the company now, not just our team), who should make it? Is it the SRE on call, who just says "kill the deployment, we are going to make a rollback", or is it us release managers who decide "no, things are going bad, we want to stop it now"?
D
Yeah, I think it should be a combination of both. In particular, I'm not quite sure when it is safe to stop a production deployment. I think it might be safe to stop it before it updates the fleet, but I'm not sure about the state of the machines when we do that. So probably an SRE can help with that.
C
Yeah, this is exactly the right answer, and I think everyone in our team has their own view on when they know it's safe to stop something; let's say prior to warming up, or between warming up and before it really starts upgrading things. But then, when you are in the middle of something, can you really hit that cancel button? Wow, this is... I'm...
C
Maybe this (I don't know what you think about it), maybe this could remove a lot of the stress around us hitting that cancel button, because it's really hard to parse what is happening in the Ansible output, so: is now the right time? Well, I don't know. So yeah, maybe this is a good action item, I think.
A
That's a great option. One question I have: in terms of your question of when we would want to trigger the halt, I think maybe the most common scenario I could think of is almost the question mark of "something doesn't look right", either from our metrics or from another alert.
C
So something like: "I'm not sure, something is going wrong here, so whenever you run the next production check, make it fail", so that it will stop what it's doing and we can investigate and decide if we want to roll back or roll forward. And this is good. But then we have the other option, which is that something failed.
C
So, if something fails during a deployment, we are in a broken state, and this is the conversation that Scarborough and I are having in that issue about our HAProxy. I think Scarborough's point is really good. No, he is not on the call, but I will just try to verbalize his idea, because I was learning a lot following that conversation. It's not exactly my area, so I'm not really comfortable discussing this, because I don't know exactly how it works. So, Henry, you can correct me.
C
So, tell me if I'm saying things that make no sense. Basically, all of our fleet is behind HAProxy, and every machine behind HAProxy has three states: it could be up (I think it's called "ready", but let's say up), maintenance, or drain. The happy path is that we drain machines (drain means that they receive no traffic), we upgrade them, and at the end we put them back in ready.
C
Maintenance means a human operator decided to put this machine offline, and so tooling should not touch it. Well, kind of: tooling can touch it, but it should not put it back into rotation. While drain means an automated system is handling it; it's in the middle of a deployment. So if it's still in drain when I'm starting a new deployment, something went wrong, and the deployment basically bails out and says: no, I'm not going to touch this, because something is wrong.
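For context, a minimal sketch of how those states are typically inspected and changed through the HAProxy admin socket; the socket path and the backend/server names here are made up, not the team's real ones:

    # list the current state of all backend servers
    echo "show servers state" | socat stdio /run/haproxy/admin.sock

    # an automated deployment takes a server out of traffic temporarily
    echo "set server be_web/web-01 state drain" | socat stdio /run/haproxy/admin.sock

    # a human operator deliberately taking it offline would use maint instead
    echo "set server be_web/web-01 state maint" | socat stdio /run/haproxy/admin.sock

    # put it back into rotation when the work is done
    echo "set server be_web/web-01 state ready" | socat stdio /run/haproxy/admin.sock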
C
So I was proposing complex solutions to this problem, but I think Scarborough's idea is very good here, which is: maybe we should just skip that check in a rollback, because when we are issuing a rollback we already assume that we are in a broken state. There will be an incident; so, for instance, the production pre-check will fail because there is an incident open.
B
To me that totally makes sense. I mean, it's an operator error if you set a machine to drain and don't move it to maintenance afterwards, if you intended to really keep it out of traffic for a longer time. And if it's only temporary, then I think the risk isn't that big that a rollback of ours pulls it up again.
C
Well, consider that this, let's call it bad behavior, is if anything an SRE thing, because I think I'm the only one that can put a machine in the drain state, but maybe I'm wrong. And if you are just draining something because you want to put it in maintenance, but you don't, you just drain it, it will also fail a regular deployment.
C
This is the question I have left in that issue: I think we need to check what is actually happening in the job, and if it's just checking for that state, then we can skip it. If not, we have variables in the deployer that tell us that we are rolling back, so we should act on that variable instead. So when we do the Ansible check, say "check if everything is maintenance or up", we just skip it if this variable is set.
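A rough sketch of that idea as a shell pre-check, assuming the deployer exposes something like a DEPLOY_IS_ROLLBACK variable; the variable name, socket path, and the grep on the stats output are all illustrative, not the real implementation:

    # during a rollback we already assume the fleet is in a broken state,
    # so skip the "everything must be up or maint" assertion entirely
    if [ "${DEPLOY_IS_ROLLBACK:-false}" = "true" ]; then
      echo "rollback in progress, skipping HAProxy state pre-check"
      exit 0
    fi

    # otherwise, bail out if any server was left in drain by a previous run
    if echo "show stat" | socat stdio /run/haproxy/admin.sock | grep -q DRAIN; then
      echo "found servers still in drain state, refusing to deploy" >&2
      exit 1
    fi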
A
And on your point, Henry, about educating people about maint versus drain: do you mean across all of infra, like, would other SREs also need to make sure they are using the right state for this?
B
Yeah, I think that's good education especially for SREs, because I know for myself I didn't know this for a long time. I always assumed: okay, I want to drain a node, so I put it into drain and then I do something with it. Maybe there's some issue with it and I leave it until tomorrow, and then I get reminded by somebody else: "oh, please put it into maintenance", because, you know, that's the right state for it. It's just a matter of knowledge; it's hard to...
B
Maybe we could have some kind of check for that, or an alert which needs to be silenced if you're only doing something with it temporarily. I'm not sure if there's a way to prevent this, but education is maybe the best way; and assuming that if you put something into drain manually, then it's mostly because of a change issue.
C
It's a bit hard to parse that error, so probably the first time you see it you just say "what is happening here?". But yeah, basically there's a list, and it involves the fact that you have this concept of frontends; there's weird math involved, and every time I'm back in release management, one of these checks fails.
C
I can't really understand how it works, but yeah: we have multiple frontends, which are HAProxy, then we have multiple backends, and there's basically a Cartesian product between the two. So you ask how many servers we have here, because it's kind of... but yeah, you see the error.
A
The halting of a deployment, as well as the skipping of the check; and do we have other stuff in here as well?
C
How can we safely stop an ongoing deployment (which deserves its own issue), and the thing we discussed about a kill switch, shut-off, command, whatever, so that when it's safe the deployment script, the deployer, will bail out and fail the ongoing deployment, instead of just us clicking cancel and hoping for the best.
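A minimal sketch of the kill-switch idea, assuming the deployer polls for a flag at its safe checkpoints; the flag path is entirely made up:

    # hypothetical flag location an operator would create to request a halt
    KILL_SWITCH=/etc/gitlab-deployer/halt

    # someone asks for a halt
    touch "$KILL_SWITCH"

    # the deployer checks the flag at each safe point and fails the run cleanly
    if [ -e "$KILL_SWITCH" ]; then
      echo "halt requested, stopping the deployment at a safe checkpoint" >&2
      exit 1
    fi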
C
The other one is about something going wrong, so we are in a broken state; broken in terms of HAProxy's view of our infrastructure, so HAProxy thinks that something is broken. How can we issue a rollback in this state, where usually we would do manual cleanup? How can we remove this manual cleanup in this specific case?
D
So, just to understand the problem for the first case: why would we need to stop an ongoing deployment to roll back? Is it because GitLab.com is down or something, and we really want to roll back as soon as possible?
C
Yeah, this is a good question. The idea is that you're rolling out a package that is broken. Let's say merge request search doesn't work at all, so a big P1/S1 issue, and instead of generating a complete outage, which means we roll it out everywhere and now it doesn't work no matter what, if we are lucky enough to spot this early we can stop the ongoing deployment, so only 10% of the traffic, for instance, is affected and rolled back.
D
Yeah, yeah, it makes sense. It would be a situation in which we, for some reason, didn't notice this in staging nor canary, and it is currently being deployed to production; which actually happened to me once, though it wasn't an S1/P1, it was an S2/P2. But yeah, I only noticed the failure on production, not on...
C
Canary, yeah. I mean, it could also be... yeah, it has happened that we figured out that something was broken during the deployment and it was not caught by QA or anything else before, and right now we have no good answers. I had to cancel some deployments, I don't remember if it was this shift or the previous one, and it's really stressful to figure out whether you can do it right now or not. So yeah, that's it.
C
We are missing a couple of checks and concepts at the beginning of it, so kind of extra cases, special cases, what to do if... Because we are always assuming that we are in a P1 incident, something bad is happening, but there's no ongoing deployment or anything like that; we just install the package, which is, as I said, and as you see it as well, specific.
C
I have another item here which is... yeah, it's more of a statement, but I am happy to discuss it. We know that we cannot start a rollback if GitLab.com goes down. We kind of discussed this briefly last week and I opened a couple of issues about it, and I was thinking I would deprioritize this work, basically. The reason being: there is a lot of ongoing work right now that we discussed, and this is a known limitation, but it is not worse than what we have today.
C
So let me put it this way: all the other failures are "we start a rollback and something goes wrong", so we need to make sure that we can get to the end of the rollback. But here it's at the very beginning of the rollback; we don't even have the option to roll back. So yes, I would love to be able to roll back also in the case of a complete GitLab.com outage, but it's not worse than what we have today.
A
One of the chances, though, is that this would have rolled through staging and, well, up to canary, right? And then taking it out well beyond canary... right, so we can't drain canary. It's scary, but it doesn't feel like the sort of thing that will happen. Like, has it... I hope it's never happened before. I say this: has it ever happened?
C
So basically we rely on that information, and if we can't get it out, there's no way we can roll back, kind of. I mean, we can try to figure out the packages by really going through the history of Slack and things like that, but the runbook will not help you; you just have to do something with what you have at hand.
A
Yeah, I feel like we should think about it along with the various... like, we've got a few gaps like this, right? So maybe we should take a look at those, but it seems like we have enough rollback stuff to be testing on whilst we think about that stuff. It'd be good.
B
I have a question on the worst-case scenario. If we would come into a very bad state by deploying something, the deployment goes wrong, and we can't reach GitLab.com anymore for rolling back, would we at least be able to manually go onto each node and try to reinstall the latest package, which is maybe still around there, or something stupid like that, to get us going again?
C
Maybe, Martin, you know better than me, but if I'm thinking about big incidents where we were offline or something similar, or where something was broken at the very core, so basically every page fails: either we patched it with a hot patch, or it was something, let's say, external to code changes, like load on the database, things that we had to handle externally.
F
The only times when we were hard down were the database (this is our recurring topic; apparently every two years we have the same thing) or when we wiped out our load balancers. We have never, ever had an issue where code broke something fundamental in the product, at least not in production. How did we catch it? We caught it on dev.gitlab.org, because it was running nightly packages and still runs nightly packages.
F
We caught it in QA, whether manual QA or automated QA, and we caught it in staging before it went out to production. The only time that something resembled core breakage was when we were fixing a security problem, and that security problem changed the behavior of the feature in question and fundamentally broke how the feature works for everyone involved; at that point in time we had to hot patch.
B
I mean, that would take a little bit more disk space, but then everything would be in place for a rollback, right, if you figure out later that we broke something with this deploy; because normally it takes some time after a deployment before you figure out that we need to roll back, I guess. I don't know if it would be worth it, but that would be an extra safety measure.
C
I don't know how we're doing this, because I'm quite sure that the cleanup is of the apt cache. So I don't know if we first download the package into some directory, or if we just pull it, basically, so we just download it from apt. And if that's the case, I don't know if there is an option in apt to say "clean up everything except this one", or something. So yeah, I don't know, but this is a good point.
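For reference, a rough sketch of the apt plumbing being discussed; whether the deployer actually works this way is exactly the open question, and the package version shown is a placeholder:

    # download a specific package version into the apt cache without installing it
    apt-get install --download-only gitlab-ee=13.10.0-ee.0

    # the cached .deb files live here; this is what a full cleanup would remove
    ls /var/cache/apt/archives/

    # "apt-get clean" wipes the whole cache, while "autoclean" only removes
    # package files that can no longer be downloaded, keeping current versions
    apt-get autoclean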
A
So what are the... what's the action from the maint and drain discussion?
A
One. And then, Myra, I think that leaves you with the...
A
Have a crack at writing out the test scenario for this: blocking a deployment and attempting a failure. I think you can probably assume you have the drain thing resolved and assume you have a halt-deployment mechanism, but it'd be good to at least think about what we would do on staging and what steps we would go through to actually test it.
C
Well, the investigation that was proposed about leaving the package in place: at the very least we should report it, so that we don't lose track of that item.