From YouTube: Delivery: Canary promotion and deploy rollback
Description
J. Jarvis, J. Skarbek, and M. Jankovski discuss deploy promotion between non-production and the production canary, and the rollback process.
A: All right, I wrote up some notes in the doc in 161, and obviously one of the big items that we need to finish first, before we get to that issue, is the scheduled deploys: automating the scheduled deploys to staging. But that can't really move until we decide how we are going to promote things to canary, I would say, because us just automatically deploying to staging will not bring any real benefit.
A: I think there is already a situation where staging is not sufficient for us; I think we will probably have to have another non-production environment. But to keep it simple, let's try and think about how things would look when we are happy with what staging is actually producing when we deploy the newest nightly package.
B: So personally, I don't think I have any less confidence in a nightly build off master than I do in the RCs. To be honest, I feel like it's probably the same amount of risk as far as deploying to staging goes. I'm not saying we shouldn't have another environment, but I feel pretty comfortable deploying nightlies to staging, at least as comfortable as I do deploying RCs to staging.
B: Well, I guess it's a bit early to consider it; we need to get the staging deploys going first, but I would say it comes shortly after we do these staging deploys. I would say we just do canary, you know, and I think once you roll back canary, you wait until the fixes are on staging and then you promote again.
A: That's right; what we are doing there is, we have a stable branch and we create an RC out of only that. So this is where my other proposal comes in, where we would have a simpler way for people to know what version, or what SHA, all of the components are running from. So even if we go and roll back canary — say we found an item and we want to roll back canary —
A: — we don't bring in more items accumulating, right? You know how many commits are going in daily, so it would not be unexpected for us to not stop the train, but at least say: well, we have evidence about these 15 items, and that one item was broken. So how about we create a version of GitLab that will deploy with just that fix?
B: I think the two questions then are: where in the pipeline do we create the stable branch — or a branch for which we say we're only taking small incremental changes to fix regressions — and how long that time window is. Those are the two questions. Right now the answers to those two questions are: when we deploy to staging is when we do it, and the time window is from the 7th to the 22nd.
A: The window is one day. We say that we want to deploy to staging earlier, right, and we kind of agree — well, I don't know whether we agree, but the discussion is that everything that is on staging today will be on canary tomorrow, right? At the same time, we want to promote the previous day's work. So say that the item we are deploying to canary tomorrow has, for some reason, a broken item. We are already on day 2 on staging, right?
A: You need to backport this into this temporary new stable branch — whatever you want to call it, a temporary branch containing only this fix — so that we can build off of the head of that branch, right? It contains everything that we already deployed plus the fix, and we can do a fast promotion to canary. Or maybe we have another environment in between that will only be used for these types of things. But the idea being: we already know that the package worked.
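
A minimal sketch of that backport flow, assuming the fix already landed on master as its own commit; the branch name and the plain git calls are illustrative, not the team's actual tooling:

    import subprocess

    def run(*cmd):
        subprocess.run(cmd, check=True)

    def cut_hotfix_branch(deployed_sha, fix_sha, branch="temp-canary-fix"):
        # Branch from exactly what canary was running, not from master's head.
        run("git", "fetch", "origin")
        run("git", "checkout", "-b", branch, deployed_sha)
        # Bring in only the regression fix; it still lands on master separately.
        run("git", "cherry-pick", fix_sha)
        # Push so CI can build a package off the head of this short-lived branch.
        run("git", "push", "origin", branch)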
A: Please, let's make sure that we don't use the same terminology as right now, because I think that's possibly going to confuse us. Right now, "stable branch" is for us a synonym for a long-running frozen branch, and I don't want to do that. I want to be always branching off master and having those short-lived branches, right?
B: I guess we may differ a little bit on the design. I was actually thinking we would be a bit more aggressive and do this on the production stage, not the canary stage. So canary would continue to get daily snapshots of master, and we would actually do post-deployment patches on prod until we are comfortable with canary, and then we would just promote it again, which would bring with it a whole bunch of other changes, right?
A: I think that would be a big jump. I think it might make sense to do that at some point, but I think we can't be that aggressive at the moment, given — I wouldn't say all the unknowns, but we've seen quite a lot of different things, and maybe because we've seen quite a lot of those different things we need to... I wouldn't say this is playing it safe; this is definitely not playing it safe. This is more being cautious in how we promote things, I would say.
A: Now, about the set of boxes: what I found out is that we need to ensure that whatever is done in staging — or rather, whatever is inside of the package, inside of the release artifact — is not altered. So this is also the case for patches, right? Patches also need to be approved, because as soon as something changes the source code of our environment, it needs to go through additional approval.
B: So we'd branch at the commit, and then any critical bug or regression fix would not only go into master but also go onto this branch, and we would cut a new package. What I'm missing is what happens next. Do we promote that package? Does that package go directly to canary, or does it go to staging? Because now we're going to be — and this is why you said that we need another environment?

A: Yes.
B: Well, that's shared, okay. So we could have maybe one staging database, one staging Redis cluster, and then where we run into problems is Sidekiq, because there's really no way to isolate Sidekiq without namespaces, and there's no open issue for this, and I don't think it's going to be resolved soon.
B: So the current proposal is that we deploy to canary — and this is where we create a branch — and then we take small incremental changes for critical bugs and regressions, limiting the amount of commits that go to prod. What I'm suggesting is that we do patches, like post-deployment patches, on production until we're ready to promote canary, which would have those fixes plus everything else. It just seems to me that if we're trying to get to full CD, we should be deploying from master.
B: I still don't think I understand how long this window would be. You said a day, but how does that work exactly? We would freeze — we would create the branch as soon as we deployed to canary, we would let it go to production and soak for a day, and then do incremental changes for a day, and then unfreeze? Okay.
A: So day one, purple is on staging; day two, purple is on canary; day three, purple is on production. Basically, from the moment we have this on staging to the moment we have it on production, it would be two extra days, allowing us one day on staging to actually find anything and stop the move, and one day on canary to do the same, right — because there's going to be way more traffic there and so on — and ideally, by the time this arrives on production...
A: You can expect that this is going to increase in volume as soon as we speed this up. The alternative is to do a slower rollout so that we can get a feel for what's going to happen — say, instead of deploying every day to staging, deploy every two days, then promote once a week to canary, and then promote onwards. But that's way slower, and I don't want to do that. I want to do faster deploys.
B: ...an environment for this gap where we want to take small incremental changes — basically the window where we want to examine incremental changes. What if, instead of going directly to canary, we go to a single-instance server — call it pre-prod, whatever — very low maintenance, like we don't have a full HA topology, and then we go to canary? Do you think we could sell that?
B: I think what makes a new environment expensive is the HA topology of Patroni plus Redis plus everything else. Make all of those a single server — that's cake. I mean, we could even do an HA database and a single server for everything else, and it would still be a lot less overhead. So yeah, I think that sounds like something we can sell.
A: To repeat, basically, best-case scenario: we have this graph that says on day one we deploy something on staging; on day two that same purple circle is deployed on canary and we verify nothing is broken on canary; on day three we have the same thing in production. So this is the best-case scenario, and that goes for every deploy — everything lands in production. Now, where things start to get complicated is if we didn't find something on staging: we are on canary and we found a regression that we think is going to cause a lot of pain for our users.
B: So, I mean — not to say we couldn't; it's better than nothing, and I think we can probably sell it. I don't think people understand this architecture well enough to complain; at least the people who are complaining don't understand it well enough. But I think, yeah, I think we should probably...
A: We do care about what data is there, because of the items. Theoretically, if all developers did their work correctly, they would have already noticed this breakage in their local data. The problem is staging brings in another set of data — a larger data set — that all of a sudden their item is breaking down on.
A: That means, if we don't do the security release quickly enough, we'll have a large backlog of things to deploy later, so we are back in the situation we have now. The alternative is: as soon as we have a security patch, we don't pause anything when it comes to deploys to staging, canary, and production, but we have security patches continuously applied until we have a release that we can publicize, and then master would receive the fix from the security merge requests.
A: But I'm afraid that if we pause the pipeline, we run into a situation where — say, for example, we had seven days of a security release, right? That means we have seven days of not deploying. This brings us back to our current situation of having things build up and then deploying a week or two weeks later, right? Once you actually unpause the pipeline, you get this torrent of things into your environments again. So pausing is something that we really, really should not be doing.
A: Honestly, I think a single codebase for GitLab would help immensely, like lower the difficulty level of this, because now you need to think about four repositories versus, I guess, not even two — just the one that you would be needing to use — because we have one on dev and one on .com.
B: Yeah. So this file-based metric that we use for the version dashboard is the parsed output of dpkg. It was kind of a joke when I did it, and now it's something we depend on. It's funny — I think it's a total hack: it's a cron job that runs dpkg, parses its output, and gives you the Omnibus version that's installed. It's a file-based metric.
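
A rough sketch of what that cron job amounts to; the package name and output path here are assumptions, purely illustrative:

    import pathlib
    import subprocess

    def record_omnibus_version(package="gitlab-ee",
                               out=pathlib.Path("/var/opt/omnibus_version")):
        # Ask dpkg for the installed version of the omnibus package...
        version = subprocess.check_output(
            ["dpkg-query", "-W", "-f=${Version}", package], text=True).strip()
        # ...and drop it in a file for the metrics exporter to pick up.
        out.write_text(version + "\n")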
A: I mean, you know that the whole world runs on hacks and jokes, right? If you take a look at /opt/gitlab, you have a version manifest — a JSON file — in the root of that directory. That one is obviously a JSON file, and it has all the versions that we'd ever want, with SHAs, with URLs. Okay?
A: So if we do that — Skarbek, I pasted this in the doc, so below the picture you can actually see how it looks — it has the locked version, which in our case is the version of the nightly deployed. So this is the version of GitLab Rails Community Edition, the code SHA that is deployed on GitLab.com at that point in time, and we have a similar thing for Gitaly and all the others.
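
A minimal sketch of reading that manifest; the key names follow the "locked version" wording above but should be treated as assumptions:

    import json

    def component_versions(path="/opt/gitlab/version-manifest.json"):
        # Map each bundled component to the exact version/SHA it was built from.
        with open(path) as f:
            manifest = json.load(f)
        return {name: info.get("locked_version")
                for name, info in manifest.get("software", {}).items()}

    # e.g. component_versions().get("gitlab-rails") -> the deployed SHA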
A: ...creating a ChatOps command — if we want to use ChatOps for this; if you want to use something else, I'm fine with that, but I'm spitballing here — a ChatOps command that will just tell us, you know: chatops run canary version, or chatops run staging version, right, to print out all of the SHAs that we have currently deployed.
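
A sketch of what such a command handler could look like; the per-environment manifest endpoints are invented for illustration, since in practice each node has the manifest on disk:

    import json
    import urllib.request

    # Assumed endpoints, purely illustrative.
    MANIFEST_URLS = {
        "staging": "https://staging.example.com/version-manifest.json",
        "canary": "https://canary.example.com/version-manifest.json",
    }

    def version_command(env: str) -> str:
        # Fetch the manifest for the environment, one line per component.
        with urllib.request.urlopen(MANIFEST_URLS[env]) as resp:
            manifest = json.load(resp)
        lines = [f"{name}: {info.get('locked_version')}"
                 for name, info in sorted(manifest["software"].items())]
        return f"Versions on {env}:\n" + "\n".join(lines)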
A: Once we have those two things, I think it's a matter of adding a schedule in a repository that will continuously trigger a build and deploy to staging automatically. This last one is two minutes of work; the previous one is a bit more. And you're already saying that we have things in place for pushing things to Prometheus — we just need to change from parsing apt to parsing a JSON file, yeah.
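
For the schedule itself, a pipeline schedule — or even a plain cron job hitting GitLab's pipeline trigger API — would do; the project ID, token, and ref below are placeholders:

    import json
    import urllib.parse
    import urllib.request

    def trigger_staging_deploy(project_id="123", token="TRIGGER_TOKEN",
                               ref="master"):
        # Kick off the nightly build-and-deploy pipeline for the given ref.
        data = urllib.parse.urlencode({"token": token, "ref": ref}).encode()
        url = f"https://gitlab.com/api/v4/projects/{project_id}/trigger/pipeline"
        with urllib.request.urlopen(urllib.request.Request(url, data=data)) as r:
            return json.load(r)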
A: Okay, and then we can start enabling this and start to see how things break. We can then pause it and continue with the current workflow that we have, right, with tagging and so on. But then, as we enable it again and figure out new breakages, we can try and introduce this. It's not going to be smooth, I expect.
B: I was sort of joking about this basic auth thing, but now that I think about it, you could also envision, for the staging environment, a ChatOps command that will allow us to turn it off and on and rotate the password. And, you know, when we have a security release, we then protect staging with a basic auth layer, and the password is different every time; you can ask ChatOps what the password is. That could work in place of the VPN — I think it probably would give us enough.
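
A toy sketch of the rotate half of that command; the secret store and how the proxy layer consumes the password are assumptions:

    import secrets

    def rotate_staging_password(store) -> str:
        # Fresh password on every rotation; never echo it into the channel.
        password = secrets.token_urlsafe(16)
        # `store` stands in for wherever the frontend proxies read credentials.
        store.set("staging/basic_auth_password", password)
        return "staging basic auth password rotated; use the reveal command to read it"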
A: Okay, so we basically say migrations are taken out of this requirement for now: we want to roll back, but without touching migrations. Okay, that makes sense. So basically, our rollback should just be simply rolling the nodes and then rolling back, literally — just making sure that the order is correct so that we don't cause more damage.
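
A sketch of that ordering; the drain/install/verify helpers are stand-ins for the real tooling, and migrations are deliberately absent:

    def drain(node): print(f"draining {node}")
    def install_package(node, version): print(f"installing {version} on {node}")
    def verify(node): print(f"health-checking {node}")
    def undrain(node): print(f"re-enabling {node}")

    def rollback(nodes, previous_version):
        for node in nodes:            # order matters, e.g. canary fleet first
            drain(node)               # stop taking traffic before downgrading
            install_package(node, previous_version)
            verify(node)              # health check before re-adding to the pool
            undrain(node)
        # No migration step on purpose: the schema stays at the newer version.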
B: I know you don't like it as much as I do, but I do kind of think we would be better off to delay the post-deploy migrations until the next deploy. I think the main issue with this is that you kind of don't wrap up a release until you're done with everything, right? And maybe I don't fully understand all your criticisms of this approach. Yeah.
A: They were known to cause a lot of load on the database, because they have been known to be unoptimized. So if we delay this — it depends on the delay, but let's say we delay it until the next release, whatever the next release is — we run into a situation where people have already moved on. That brings the context switch of understanding what's happening right now and how to fix it, and then this becomes another level of items that needs to be tracked.
A: What we're doing right now is: we do the deploy, then the post-deployment migrations. If something goes wrong, we need to handle it right then and there, and we need to fix the problem within whatever amount of time it takes to develop the fix. But it really depends a lot on what you consider the next release on production, because if we are talking about this continuous pipeline, that will mean it takes 2-3 days for things to arrive on production.
B: We'd change that so the first thing we would do is run the post-deploy migrations at the current version, then we would upgrade the deploy box, and then we would do the regular migrations. So basically, the first step would be to do post-deploy migrations at whatever version is currently running. How...
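
A sketch of that ordering as steps; the command strings are illustrative, and SKIP_POST_DEPLOYMENT_MIGRATIONS is assumed to be the switch gitlab-rails honors for deferring post-deploy migrations:

    import os
    import subprocess

    def deploy_with_reordered_migrations(new_version):
        # 1. Finish the outstanding post-deploy migrations at the version
        #    that is currently installed.
        subprocess.run(["gitlab-rake", "db:migrate"], check=True)
        # 2. Upgrade the deploy box to the new package (illustrative command).
        subprocess.run(["apt-get", "install", "-y", f"gitlab-ee={new_version}"],
                       check=True)
        # 3. Run only the new regular migrations; defer its post-deploy ones.
        env = dict(os.environ, SKIP_POST_DEPLOYMENT_MIGRATIONS="true")
        subprocess.run(["gitlab-rake", "db:migrate"], check=True, env=env)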