From YouTube: 2021-07-22 - GitLab.com k8s migration APAC
A
B
C
C
C
At ours, we're in like the low 30s right now.
D
C
C
It's like not somewhere you would go on holiday, right? Like, Hackney is a pretty poor part of London, so yeah.
A
C
The climate change thing was probably about 10 or 15 years ago; the government gave everyone loads of grants and things to really increase insulation on properties so that you don't need to heat them so much. So all modern buildings now are fully insulated against cold weather, and when you get a heat wave, that's just not what you want them to do.
A
A
C
C
Virtually all tower blocks in the UK have the same flammable cladding on the outside, which you can't get insured anymore; tower blocks need to have fire safety certificates, which you pretty much can't get, yeah. It went really quiet in our place when someone asked, do we actually have a fire safety certificate? No one's managed to confirm it one way or the other, so...
C
A
It's about a million; that's what they've priced it out at for us. And we've had all these experts; we keep having these calls, these Zoom calls, and we keep having these experts on, and they're like: we've seen some bad situations, but this is one of the worst, because the freeholder was also the managing agent, and it's all collapsed, and... oh, anyway.
F
A
C
That's being optimistic, yeah, yeah, it actually is, yeah. Well timed, though, for this. So rather than a demo (I mean, happy for you to show stuff as well, Graeme, if you have stuff), I thought it would actually be useful to follow on from the Slack thread that started up yesterday around the K8s workloads rollout process, so let me find the link. What we talked about briefly yesterday was, following the incident with the artifact uploads...
C
We changed the K8s workloads documentation. One thing that was tricky in that incident was we didn't have a great audit trail; it wasn't super clear for the SRE on call what had changed. So we've added a step in the documents: basically, all of these MRs will have a change issue associated with them. That was all we've changed so far. There was a discussion; in fact, Jeff put the comment: did we envisage an MR to make the tests a blocking job?
C
Well, yeah, I was going to say, actually, before that, one thing that would be really helpful: could we just do a quick overview of how the K8s workloads rollout process looks? Because that might help all of us, I think.
F
Yeah, so I can go. I understand this pretty well because I designed most of it way back when we first put it together, and I designed it going off various experience and thoughts I had which I now realize were actually incorrect, but I'm going to get into that a little bit later. How it actually works is: we've got a few different steps now, but basically, more or less, we do...
F
We run, essentially, a diff against the environment. So for every single environment we have, we show what this change does to that state: you know, what you're committing, what this is actually diffing against in the real environment. And then, more or less, when you merge to master we do an apply, and that apply goes over different stages.
F
So I think it does pre and staging in one group, and then we have QA tests and pre tests depending, and then we have canary, and then we have production. But production is now split down into the regional cluster and zonal cluster B, and then, after those are complete and successful, zonal clusters C and D. And I think that mostly works fine, but there's a really critical... well, there's a couple of critical points, and maybe I'm getting a little bit too far ahead, but I will point them out now.
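For illustration, a minimal sketch of how a staged rollout like the one described could be expressed in GitLab CI. All stage names, job names and script paths here are hypothetical placeholders, not the actual k8s-workloads pipeline.

```yaml
# Hypothetical sketch only: merge requests get a diff; merges to the default
# branch are applied environment by environment, with production split into a
# regional/zonal-B group followed by zonal clusters C and D.
stages: [diff, pre-staging, qa, canary, production, production-zonal]

diff environments:
  stage: diff
  script: ./bin/diff-environments.sh            # hypothetical helper
  rules:
    - if: '$CI_PIPELINE_SOURCE == "merge_request_event"'

apply pre and staging:
  stage: pre-staging
  script: ./bin/apply.sh pre staging            # hypothetical helper
  rules:
    - if: '$CI_COMMIT_BRANCH == $CI_DEFAULT_BRANCH'

qa smoke tests:
  stage: qa
  script: ./bin/qa-smoke.sh staging
  rules:
    - if: '$CI_COMMIT_BRANCH == $CI_DEFAULT_BRANCH'

apply canary:
  stage: canary
  script: ./bin/apply.sh canary
  rules:
    - if: '$CI_COMMIT_BRANCH == $CI_DEFAULT_BRANCH'

apply production regional and zonal b:
  stage: production
  script: ./bin/apply.sh production-regional production-zone-b
  rules:
    - if: '$CI_COMMIT_BRANCH == $CI_DEFAULT_BRANCH'

apply production zonal c and d:
  stage: production-zonal
  script: ./bin/apply.sh production-zone-c production-zone-d
  rules:
    - if: '$CI_COMMIT_BRANCH == $CI_DEFAULT_BRANCH'
```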
F
A lot of these systemic issues are around the fact of the way that Helmfile sets things up (arguably Helm and Helmfile and everything), and how GitLab CI works, which is based off how normal software code works: when you touch a file, when you change a specific file in that repo, there is a very complex web of determining what environments that file will change, if that makes sense. Because we have this inheritance model: all of these values touch every environment, these values only touch staging, these values only touch canary; but then, on top of that, the Helm chart takes those values and further translates them, and all those different things. And we have other sources: obviously we pull values from Chef as well, so we have all these external dependencies which could feed changes into that.
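A rough sketch of the inheritance model being described, assuming hypothetical file names and keys; the real repo layout, release names and values differ.

```yaml
# Hypothetical helmfile-style values layering. A key set in the common file
# reaches every environment; environment files only override their own one.
# Which real cluster settings a given line ends up changing also depends on
# how the Helm chart translates these values, plus values pulled from Chef.
---
# values/common.yaml      -- feeds every environment
gitlab:
  webservice:
    image:
      tag: "14.1.2"
---
# values/staging.yaml     -- only overrides staging
gitlab:
  webservice:
    minReplicas: 2
---
# values/canary.yaml      -- only overrides canary
gitlab:
  webservice:
    ingress:
      annotations:
        canary-weight: "5"
```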
F
Even though you're not expecting those changes to be fed in. When I designed all that, I was working off a model like Argo CD and Flux CD, which are Kubernetes deployment tools based off the GitOps model of applying at run time, so they're very much more of a pull model: once you commit, it applies. It's not a CI/CD workflow, it's slightly different, which is fine for that model, but translating that to a CI/CD model over the last year...
F
Now, especially, we've started to see that model fall apart. The big thing I see in particular is: we have to run jobs to diff every single environment, and we have to run jobs that apply every single environment, and the only way GitLab CI can intelligently determine what jobs we should skip or not skip, and what order we should do them in, is based off what files change in the repo.
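A minimal sketch of that kind of per-environment job gating, using GitLab CI rules:changes with hypothetical paths. Because the shared values file feeds every environment, touching it has to trigger every environment's jobs, even when the rendered result for some of them would be identical.

```yaml
# Hypothetical sketch: each environment's diff job is gated on the file paths
# that can affect it. Any edit to the shared file triggers all of these jobs.
diff staging:
  stage: diff
  script: helmfile -e staging diff
  rules:
    - changes:
        - values/common.yaml        # shared by every environment
        - values/staging.yaml

diff canary:
  stage: diff
  script: helmfile -e canary diff
  rules:
    - changes:
        - values/common.yaml        # shared by every environment
        - values/canary.yaml
```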
F
But, as already talked about, we have some files that will touch everything; and even if you touch a file that touches everything, it may not actually end up touching an environment. There's just so much complexity around where you change something versus what actually changes in a real environment. That, I think, is a real systemic problem that a lot of these other issues hang off of, and I think the biggest way we get around that is the following.
F
It's moving to a model where, kind of, technically, inside the repo we want somewhere that's like a staging area: the final files that are rendered, the ones that are actually applied to the cluster, are separate from this execution model of pulling in variables at runtime. If that makes sense.
F
So the classic example is the runbooks repo with the jsonnet stuff: you can modify the jsonnet source files, you run a make generate, and you have to push all of that in one merge commit, but that allows us to use GitLab CI to intelligently determine what jobs we need to run, because you're only worried about the final rendered files.
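A sketch of what that runbooks-style approach could look like here, assuming a hypothetical make target and a committed rendered/ directory; CI then keys off only the rendered per-environment files.

```yaml
# Hypothetical sketch: render everything up front, commit the output, and gate
# per-environment jobs on the rendered files rather than on shared sources.
check rendered output is committed:
  stage: check
  script:
    - make generate                          # hypothetical render target
    - git diff --exit-code rendered/         # fail if output wasn't committed

apply staging:
  stage: apply
  script: ./bin/apply.sh staging             # hypothetical helper
  rules:
    - changes:
        - rendered/staging/**/*

apply canary:
  stage: apply
  script: ./bin/apply.sh canary
  rules:
    - changes:
        - rendered/canary/**/*
```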
F
F
If there are no pre changes, then we don't need to do QA for pre. And so my problem with the current approach now (and I'm not even saying, overall, it's a big problem), the side effect, is: because we don't have that tight control, and by introducing QA and staging into the pipeline, I can make a change that only affects production, but our pipeline will still run jobs against staging and pre to a degree, and if they fail... and myself, and I won't call out names, but other people in Asia...
F
F
Exactly, it's a bit of a contrived use case. It doesn't happen a lot, but it totally could happen: I've done these MRs in pre and staging and it's fine, and I've got the change request, that's fine, I've talked to the EOC and David and they've approved it, and then the change request will be: now, please merge your production change request and apply that. So I merge, and that goes on to ops.gitlab.net...
C
The problem is our process, right? So I think what we should do... These are the tests we run for auto-deploys, right? So they are stable; we run them regularly. Now, what you're saying is, when they fail they're super hard to debug. What I think the gap is, and kind of how we've talked about it a little bit, is that we don't have Quality super close to infra, and I think we should solve that, right? So Quality have an on-call and they're very responsive, but they don't have 100 percent coverage.
C
Obviously, yeah, you have Quality on call, yeah. Then you have a couple of people, primary and secondary, who we can escalate to. The problem is they don't have someone across all time zones; they do have people in APAC, but I think that's the bit we should tighten up. So we should put our process in place and trust the smoke tests; yes, we will get blocked, but we should work with Quality too, sure.
G
Wait, one sec, one sec, before you continue, Graham. You said an unrelated change blocks your... or an unrelated failure blocks your change, yeah. So, two things. One is: how do you know, how are you absolutely certain, that it's not related? The system is very large, so it's really hard to know whether your change breaks something at a completely opposite end of the system, right? That's one. And number two, we should always have a flip switch, right? Like, if you are absolutely certain...
G
...if you went through, you know, a couple of people verified this, QA helped you out and verified that, oh yeah, this is really unrelated, we saw this in another place, and so on, you should be able to flip a switch and say: I'm skipping this, approvals are here one way or another, whatever that means, and I'm moving on, with more eyes on this. Sure, right?
D
I think one thing I've seen is that you can be absolutely sure when there are no changes to staging, but QA runs in staging, and this happens occasionally. Maybe, Graeme, you kind of described this: when you're making a change in staging, you update the staging YAML file, and now you want to make that change in production, so you move that change into the common YAML file, but we recognize that as a change too.
D
So, even though there were no diffs for staging, we're going to run QA, because it's in the change list in CI. In that case there were zero changes for staging, but we still run QA, and if there's a QA failure, then you've got... yeah.
C
...pretty sure it's not related. So I think it sounds like we need a bit of process around QA tests failing and how we deal with that. We need to have a way of overriding, and we also need to have a way of logging this stuff, right? So, exactly the same as we do with auto-deploys: we can override our auto-deploy blocks, but when we do, we document it, and we have that log somewhere. So it sounds like we just need to put... it doesn't have to be super...
F
Yeah, I guess, maybe, yeah. Like, I guess that is the thing: you're correct. Obviously the big one is if you move it into the common file. I actually do agree with that use case: I've moved it into the common file, and because it's touching a common file that affects all environments, I don't have too much of a problem with that. I think it's definitely, yeah, it's a small use case, and, as I said, I'm not even against any of this.
F
I'm just calling it out because there have been frustrations, because stuff like this has happened in other repos for us. It's just hard when you're absolutely confident a change is not... like, if I add... I don't know. Yeah, I'm not even sure entirely what I'm trying to say, but we could work on our repo.
F
G
Maybe, maybe regroup for a sec on what Andrew is saying, because, like, I would like to hear one train of thought; right now you're starting four, so it's kind of hard to follow.
A
Just, just one of the things that I was thinking about was the environments, and I was thinking about this particularly in the context of last night's incident. The environments are not that similar. Frankly, they're pretty different, and over time they drift. And, yes, the thing about fixing this is it would cost us money.
A
But you know what else costs us money? Incidents. If we had almost exactly the same environment except for secrets in staging... and obviously with Kubernetes, maybe we set the minimum number of pods down, that's an easy optimization, but other than that, the configuration that you apply to staging and the configuration you apply to production is the same. There's a cost involved with this, and I think we've always said...
A
...oh, that's too expensive. But, you know, we're getting big now and we can't keep doing that. And that would make what you're talking about, I think, Graham, much easier, because effectively you've got one config and you're applying it to two environments. The objection is always: oh, that costs us too much money; so maybe we should challenge that.
G
G
The environments are really hard to make actually similar, right? Because the database size is different, the data set inside the other environment is different, the load is different. All of those things add up.
C
A
F
F
Configuration drift between staging and production... it has drifted, but yeah, I actually agree with you, Andrew: they shouldn't, they really shouldn't, have drifted as much as they have. I don't know how we can rein that in, although it is easier on Kubernetes, because now, all of a sudden, we do actually have everything in the one repo.
F
I'm collecting my thoughts a little bit now. What I guess I'm trying to imply is that, now that every pipeline is likely to have testing in staging, that's okay. All I'm saying is that, even if I only do changes in production, just the way our repo is set up, and unfortunately how we've implemented some things, we have to overdo the amount of jobs we have in CI, which is fine; that's obviously the safe way to do it.
F
I do think, ultimately, I could rewrite that repo and change things around so that it would be more efficient and better, both for using it, like detecting configuration drift, and actually in the efficiency of the CI jobs. We could change it to, you know, a rendering model like the runbooks repo, so you just run, like, a make generate and it generates the Helmfile with all the values in one single file per environment. So we'd have, like, eight files, yeah.
F
F
Exactly, and then you could literally do canary pipelines: we could do canary based off canary changes, stage based off stage changes, QA based off QA. Actually, that was my overall train of thought: there are problems with the way we currently do this; I know they're problems, and they're problems by design. I actually know how we could fix them, so I'm not sure if now is the time, but I definitely just want to bring it to everyone's attention moving forward.
G
Graham, just one question for you, sorry Andrew. I want to kind of visualize what you're saying. So, basically, the runbooks repo has, like, the generate command; it templates out basically everything you need, right? Like, you're going to have one file per environment, and then, basically, using our CI you can say: oh, only this file actually changed, so run only this thing, right?
G
No, it's not necessary... I think we have a lot to discuss, but one question that comes with that, though: how do you ensure that you always test your changes in prior environments? Like, will that be enforcement on the reviewer, or on the pipelines, or a general process change? That's another question I have.
F
So that is completely separate from the concept of splitting MRs. So, I think, to answer your question: I don't think, technically, I can enforce it, or we could do a bit of tooling to say, hey, this MR touches more than one environment. If multiple files are changed across environments, we probably want to flag that, right? Similar to how we have the Chef one that puts a big warning when you touch a lot of servers and things like that.
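A rough sketch of that kind of warning tooling, analogous to the Chef "touches many servers" warning mentioned here; the paths, job name and wording are hypothetical.

```yaml
# Hypothetical MR job: warn when one merge request changes values belonging to
# more than one environment.
warn on multi-environment changes:
  stage: check
  rules:
    - if: '$CI_PIPELINE_SOURCE == "merge_request_event"'
  script:
    - |
      envs=$(git diff --name-only "origin/${CI_MERGE_REQUEST_TARGET_BRANCH_NAME}...HEAD" \
               | grep -oE '^values/(pre|staging|canary|production)/' | sort -u | wc -l)
      if [ "$envs" -gt 1 ]; then
        echo "WARNING: this MR touches values for ${envs} environments." >&2
      fi
```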
F
A
No, but you can enforce it by the way that you structure the libraries in jsonnet, right? So you could say that you basically have, like, features or something like that, and you say: this feature, like we're talking about the delta of KAS, or the container registry Redis, or something like that; and then in the base environment you basically say this is on or this is off, but you can't just completely customize...
A
You
know
each
environment
you've
got
very
you
structure
the
code
in
such
a
way
that
you
make
it
very
difficult
to
to
diff
right
you.
Basically,
it's
basically
toggles.
So
canary
is
like
these
toggles
on
and
production
is
these
toggles
on
and
when
you
promote
something
it's
effectively
flicking
that
switch
and
that
you
enforce
through
the
code
and
through
merge
request
reviews.
But
you
know
that
yeah,
it's
kind
of
quite
deep,
but
but
that's
how
you
would
do
that.
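The suggestion here is about structuring jsonnet libraries; purely to illustrate the "environments as toggles" idea, the same shape sketched as plain values files, with hypothetical names. Promotion becomes flipping one boolean rather than copying blocks of config between environments.

```yaml
# Hypothetical illustration of per-environment feature toggles:
# features are defined once; environments only say on or off.
---
# base.yaml        -- the feature definition and its default
registry_redis:
  enabled: false
---
# canary.yaml      -- canary turns the toggle on first
registry_redis:
  enabled: true
---
# production.yaml  -- promotion is flipping the same toggle
registry_redis:
  enabled: true
```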
A
I think, if you were doing it in jsonnet... and another thing that I think would be really important is that, at the moment, we'd have jsonnet and we'd also have, like, YAML anchors and all that hellscape, and then also the Go template stuff. If we could simplify that down... first of all, I think, just get rid of the YAML anchors, please; and then, you know, the files that you actually get out at the end of it are these really, really boring specs.
A
F
D
C
If the biggest action we have is the one you've suggested, Andrew, which is: finish the migration... how do we prioritize between these two? Like, is it enough for now that we use change requests and set the tests to blocking, or do we need to do other stuff before we continue with the web migration?
A
I think we should just go. I think the number one priority, before any other changes, frankly, you know, personally, is to get web done, because at the moment every conversation we have is prefixed with "this is what we'll do for the Kubernetes environment", and there's so much cognitive overhead; every change is like a set of changes, and that was a major contributing factor to last night.
A
So
kubernetes
has
got
the
mechanisms
to
catch
this
problem,
and
so
that
is
our
first
line
of
defense
and
it's
the
most
boring
thing
we
should
do
and
like
we
should
just
do
that.
First,
I
think
and
then
and
then
once
we're
there,
we
don't
have
to
say
well.
This
is
what
we're
going
to
do,
but
we're
going
to
have
to
make
other
plans
for
chef
and
omnibus
it's
just
this
environment.
Obviously,
there's
radish
and
everything
like
that,
but
most
of
these
kind
of
horrible
changes
are
in
the
application.
E
B
That happened in another incident recently as well, where somebody asked me, like: can we do this to Sidekiq? And then the questions started: where is staging? How does that look in staging? You do the change in staging; is that the same in production?
E
There was the incident last week where the on-call engineers were trying to reason about it on staging, but they were reasoning about it on the VMs on staging, versus Kubernetes on staging, versus the VMs in production and Kubernetes in production, all of which were in different phases of the rollout. So it was very difficult, again, to try and keep not only two environments in your head, but across two different stages.
F
B
That's more of a... I think it's by accident; some stuff is left there. But I think Sean has it planned in the whole migration: when we're done with this one Kubernetes thing, one of the last steps was to clean up the VMs and put these things in Kubernetes. Jobs were on VMs before because we needed to wait for stuff in order to migrate them, and then we haven't gotten back to it.
B
G
I want to make this clear: it's not by accident, right? There were reasons why the Sidekiq VMs were left behind, and all of them had to do with the application not being able to support the new platform, right? So we needed to isolate some of these things and leave them out. That's the only thing I want to clarify.
B
Yes, and the thing is, so we left them out, we left them until the end. One of the reasons was shared folders and stuff, and we haven't looked back at it yet. So now we've just tagged these workers again, so the project that we are working on now wouldn't move them around, but we plan on looking at it again, to get rid of those VMs after this project.
C
I think next steps on actually progressing the Kubernetes web migration: the big one, and maybe a good one for us if we can get people to help us today and tomorrow, is... one of the incidents we had this week was around mounting of the temp files to allow the asset uploads, correct, and that is our short-term mitigation around the sort of actual, proper web blocker.
C
That's prioritized for fixing in 14.3, so under 3b I've dropped in the issue that we need to get resolved before we can actually... like, that's our current web blocker, right?
G
D
C
A
Wait, is it like an open issue? Because the issue I've clicked on is closed.
F
A
The issue with the actual single-source-of-truth bug, like the core bug, the actual thing that we're working around: where's that, and can we get it onto infradev and make it a P1, S1 infradev issue? Because I haven't... it's not in my stack; I don't know if I'm aware of it, yeah.
C
A
C
C
G
F
So then, to wrap things up, I guess, from the earlier discussion and everything: I'm okay with the changes and everything we've done with K8s workloads; the repo now gives us the maximum amount of safety, which is what we need. I was just highlighting that there are deficiencies in how it's done now; it's sub-optimal.
F
We should keep focused on the web migration, and possibly even other bits like Sidekiq, and we should really focus on doing the migration as much as possible. I want to give a quick update on where we are at with the web migration. So, obviously, it's been running: all of pre and all of staging have been running 100% of web in Kubernetes.
F
Now, we've had flakiness with different tests, which is what I spent time looking at, and it basically kind of paralyzed me a little bit, but we're kind of over that now; as of today, everything looks good. There were some problems with staging because the database hosts were all configured wrong with Consul, and pre was basically constantly out of memory because it was set so small. All these things seem to be resolved.
F
So we're in a good state there. Skarbek and myself should now be looking at getting to canary. I've been working through the configuration audit for today: the configuration audit being, how do we set stuff on the VMs, how is it set in Kubernetes, what are the differences? By and large it looks good, but I have identified some critical things we need to check and basically confirm, going off the post-mortem of the API migration, this configuration drift, missing things out...
F
And with your suggestion now to move towards doing the change requests in, like, a vertical silo, actually calling out each environment in those change requests, I think that's great. I'm actually thinking now, in general (and if we have enough time to quickly discuss this), we take that one step further. So far we've done the migrations environment by environment: we do all of pre, we try and get pre completely solid.
F
Then we move to staging, then when it's completely solid... and we just find so many differences in the environments. Maybe we do try that more vertical approach. So what I'm thinking is, instead of us saying we're going to try and put web into canary, we try and put it into canary and production, and do one change request for both.
F
What do I mean by that? I mean we do a change request to add nodes to canary and production for web; we do a change request to turn on web pods in canary and production, in one change request. Even if we send zero traffic for another two weeks, three weeks, it doesn't matter; we get everything there now. So if I come back and I'm like, oh, I found there's a license problem in pre, I can go and check the license problem in prod, because what I have at the moment is all these problems...
F
...that I find, and I don't know if they're going to be problems in production, because there's nothing in production for me to check against at the moment. So we switch to a model where every single step will be done in every single environment, across the board, rather than us trying to do it from pre and staging upwards. If that makes sense. That's my theory; I'm happy to be shut down.
D
D
That's exactly what I was going to say: let's just take it very, very slow, maybe do a max of one or two, and that would be safe. But let's not blow it out; I mean, you just need to be careful.
F
That's a good point, but I think that, yeah, at least that way we're less likely to miss configuration problems, because when I fix a configuration problem I should be trying to compare it and fix it in every single environment at once. I think, definitely for Pages, we should do it like this as well: we should be like, okay, day one we're going to turn on Pages; if we want to build nodes to take Pages, let's do it in every single environment with a change.
F
Let's actually do the steps we need in all environments and treat them at some kind of proper level, rather than trying to work upwards, because the environments are so different: you fix a problem in one, and you've forgotten about it by the time you roll into production because it was a month ago, and things like that. Obviously, tuning the levels of production pods to be sensible in staging, I guess.
C
Yeah, okay, let's make sure when we retro on this we capture those thoughts and make a plan for Pages. In terms of... we've just got a few minutes left for this meeting, so I just want to wrap up: do we know the next steps for the web migration? I think everyone's happy with that. Jarv, do you want to just go through and verbalize the process changes you've added in?
F
D
Yeah, everything else looks reasonable, yeah.
D
B
D
A
B
D
I think, like, for the change requests, we're saying for all production changes we need to have a change request at a minimum, and we're also saying that, maybe, if you're making a change in non-prod and prod, you would have a change request that references both, I think.
A
D
C
Yeah, I think so. What we said, based on yesterday, is the web fleet: not just the migration, but web fleet changes.
D
I mean, that's a big change we need to socialize to all SREs, because, I think, you know, we make frequent...
C
...changes. We want to respond to what we've got so we can keep moving forward. So let's just keep it simple, and, yeah, web migration: we're specifically concerned about the K8s workloads going out, and particularly, once we complete this migration, that stuff gets even more risky. So let's keep it focused, yeah.
D
Graham, do you find yourself iterating on non-prod changes over and over again before... I guess right now you're kind of in that mode, so this is where...
D
Right, because not every non-prod MR is going to have a prod MR, because you're probably going to batch up the changes; or how will that work?
F
F
Yeah, at the moment I'm doing a lot of stuff in non-prod, and it's not really a change request, because I know I'm going to be iterating in non-prod; and then eventually, right, we're going to get web into canary and production, and then it's going to be a cut-and-paste job. But, as I was saying before, I'm happy to kind of flip it now: let's try and get something running, and I'm actually okay to get something into canary and prod.
F
And then, even if I am changing anything for canary and staging, I kind of have to do a change request, because chances are I'm also changing it for prod. Like, I'm actually okay to push myself into change control territory deliberately, because it will simplify what I'm trying to synchronize across all the environments, if that makes sense. So what I'm trying to say is, I'm okay with the overhead if I'm actually making changes that go into prod, like if I'm actually doing work that is going to make it into production one day.
F
I mean, it sounds like, if I'm reading this now: at the moment we've got no production things for web, so this change shouldn't have gone into production; it just didn't need to, in general. But, doing the right thing, I would be changing things in non-production and I won't be having change requests, and that's fine; but, you know, in the next week or so we're probably going to have stuff in canary and production, and I will be pulled into doing change requests for those.
F
C
Great, right, so I need to drop. Thank you so much, everyone. Jarv, would you... should we work together on getting this into the... yeah.
D
That's true, sorry, it's documented. I have to run, but after lunch, maybe we can start an MR for... okay, contributing for this. That sounds good.