From YouTube: 2022-06-09 GitLab.com k8s migration APAC
D
Awesome, so should we get started? Graham, you have the first item.
A
Yeah, so I've got nothing really to demo today, and I realized I didn't put this on the agenda. I will do another announcement about this in Slack, but just a very small note so people are aware: we now have no external syncing in the gitlab-com pipeline. The k8s-workloads gitlab-com repo doesn't sync from Chef anymore for anything.
A
We do have another repo that still synchronizes secrets from Chef into Kubernetes secrets, but I would suspect that the Vault secrets management work that Reliability is doing will cover that; it seems to be progressing.
A
I would suspect that at some point in the future all of that will be handled by that project, and the Chef secret syncing will just go away in general. So it's just something to be aware of. I have noticed, at least in APAC hours, because that's when ops.gitlab.net goes down for its upgrade, that we have had a lot more stability in the CI pipelines for gitlab-com, especially during auto-deploys.
A
However,
we
haven't
had
the
best
amount
of
stability
for
auto
deploys
for
other
reasons,
so
it's
kind
of
been
lost
a
little
bit,
but
you
know
that
that
part.
I
do
feel
a
lot
more
confident.
Now
it's
a
little
bit
quicker
and
yeah.
It's
a
lot
more
safer.
So
when
you
do
see,
if
you're
a
release
manager,
you
do
see
failures
in
the
get
lab
com.
Part
of
auto
deploys
it's
just
another
thing:
you
can
rule
out:
it's
not
going
to
be
chef
or
any
external
sources.
B
So, just so I understand: the first part of what you said, where we don't have any external dependencies. You're talking about the k8s-workloads pipeline no longer reaching out to... I guess before it was doing outbound HTTP fetches.
B
And it was also doing calls to GKMS, so all of that, basically the values from external sources. Yes, that file is gone. And so, just so I'm clear, we've duplicated all of that configuration in YAML files in k8s-workloads/gitlab-com.
A
Yes, yes. Everything that was in the gitlab-com repo, there was that gitlab-secrets Helm file. That's basically it.
A
Yes, so, good point. I think over the past few months we've kind of whittled that down. There was stuff where the upstream charts were like, 'you need to put this in; this is secret, so please support doing this in secrets', and then the other pieces. A lot of them were environment variable stuff, and technically we considered them secrets, but arguably they're not secrets, or for other people they might be.
A
So
the
classic
example
is
rack
attack,
so
the
upstream
chart
now
has
support
for
you
to
just
generically
say
using
because
there's
kubernetes
generic
support
for
this.
If
there's
a
kubernetes
secret
object,
you
can
take
the
value
from
that
and
set
it
as
an
environment
variable
for
the
container.
So
when
you
do
the
end,
instead
of
just
key
value,
it's
n
secret
source,
key
secret
key
name
or
something
like
that.
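A minimal sketch of that pattern in a container spec; the variable, Secret and key names here are invented for illustration:

    # Pod/Deployment container spec: populate an env var from an existing
    # Kubernetes Secret instead of inlining the value in chart values.
    env:
      - name: RACK_ATTACK_SETTINGS        # hypothetical variable name
        valueFrom:
          secretKeyRef:
            name: gitlab-rack-attack      # hypothetical Secret object
            key: settings                 # key within that Secret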
A
It is a Kubernetes secret now, even if before we had it as an environment variable, or as a value that had to be pulled in at Helm time; some of them were actual chart values and stuff. Most of that is gone, and we want to try and move forward with the idea that if we can ever consider something secret, we're not just going to fall back to putting it in Chef and pulling it from Chef. It's just, like... well.
A
I haven't done this part yet, but the next step will be to take away the k8s-workloads users' permissions to view, create, update or do anything with secrets. Once we do that, it paves the way for making those pipelines public and improving the user experience for people who aren't SREs and want to roll out changes, because they can see the pipelines and they can see the diffs.
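As a sketch of what that narrowed permission set could look like in Kubernetes RBAC terms; all names are hypothetical, and this is not the actual manifest:

    # Role for the CI service account: full access to workloads,
    # deliberately no rule granting any verb on "secrets".
    apiVersion: rbac.authorization.k8s.io/v1
    kind: Role
    metadata:
      name: k8s-workloads-deployer    # hypothetical
      namespace: gitlab               # hypothetical
    rules:
      - apiGroups: ["", "apps", "batch"]
        resources: ["pods", "deployments", "configmaps", "services", "jobs"]
        verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
      # note: no entry for "secrets" in any verb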
B
That makes a lot of sense, but I'm still unclear about my first question, which was: are we still depending on those JSON files? Like, we had regular config, not secret config, but regular.
A
It's just that we may as well get you to do the change manually. I spoke to a few people who make a lot of those types of changes and they were kind of okay with it, and now that the only things actually left on VMs are the database and Gitaly, pretty much the first place you'd go to change something like that is probably k8s-workloads.
B
Anyway, I guess my only concern here is that we still have Omnibus VMs, the deployer node specifically, I guess, and I guess there should be like...
B
Yeah, of course, of course, but for now I just want to make sure that SREs are aware. Let's take an example: say we add a new storage shard, since that's a pretty common thing. I guess for now I would add that to both Chef and k8s-workloads as two MRs, right?
B
Yeah, this is the... and yeah, hopefully it's a near-term thing that we're going to get rid of, but once that goes away, then I think it's time to clean up, because you're right, at that point we're basically just left with Gitaly, and Redis uses the Omnibus, but that configuration is pretty isolated, I think. There isn't a whole lot we can clean up until then, okay. I'm just kind of wondering whether we're going to make the mistake of not... yeah.
B
I
don't
know
how
to
avoid
like.
We
could
add
a
danger,
bot
warning
or
something
to
chef
repo
to
say.
Like
hey,
I
see
that
you're
changing
the
json
file.
Maybe
you
need
to
do
this
in
kate's
workloads
as
well,
but-
and
I
think
that
answers
that
question.
The
second
question
I
have
is
so.
With
regard
to
secrets,
you
said
that
we've
moved
everything
to
this
new
project
and
we're
still
sourcing
secrets
from
gkms
right,
but
for
new
secrets,
you're
saying
you
don't
want
to
do
that.
A
What I'm trying to say, and I'm trying to work with upstream, the charts team, on this as well, is that if something could ever be considered a secret, if it could conceivably contain secret data, the first thing you do is make sure we can read it from a secret. Because if, let's say, we didn't have GKMS and we were just putting all the values in that git repo, we just couldn't put some of that stuff in there. So, and even if the...
A
...chart doesn't support it. My other point was that we have enough leverage where we can still put it in as a secret and work around that to pull it in; we can modify the init container. We've got enough, I wouldn't say hacks, but we have enough knowledge and workarounds. But yeah, the goal is just to try to get everything into secrets, and then we'll figure out from there how we can consume it.
B
So
if
I
have
a
new
secret
I'd
kind
of
do
the
same
thing
I
was
doing
before,
which
is
add
it
to
gkms
and
update
the
secrets
chart
in
this
new
project.
One
more
question:
we've
had
the
problem
before
where
I
need
to
change
a
secret
and,
like
you
know,
the
secret's
updated,
but
nothing
like
the
application
doesn't
know
right.
So
what's
the
run
book,
what's
the
run
book
for
that,
is
it?
Does
it
change
at
all?
Or
is
I
guess
it's
the
same?
You
just
re-trigger
a
deployment
or
like
what?
A
So, to expand on that, because I don't think enough people know about this: the way we actually create those secrets, we use a Kubernetes interface that is create-only. So if you go into GKMS and change a secret, then go into the gitlab-secrets repo and just run a pipeline again, you would expect it to resync and that existing secret to change. It actually won't. At the moment the helm diff and everything will say there are no changes, and that's because...
A
I'm hoping that, as the secrets management project takes off, they're going to be using an external secrets controller, and then it won't be like this: they'll just have a process which creates these CRD objects that sync it, and it will just happen.
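For context, this is roughly the shape of an external secrets controller object. The sketch uses the External Secrets Operator API; the store, names and paths are assumptions, not the actual setup:

    apiVersion: external-secrets.io/v1beta1
    kind: ExternalSecret
    metadata:
      name: gitlab-rails-secrets        # hypothetical
    spec:
      refreshInterval: 1h               # controller re-syncs on a schedule
      secretStoreRef:
        kind: ClusterSecretStore
        name: vault                     # hypothetical Vault-backed store
      target:
        name: gitlab-rails-secrets      # Kubernetes Secret kept in sync
      data:
        - secretKey: db_password
          remoteRef:
            key: gitlab-com/db          # hypothetical path in the store
            property: password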
B
Okay, so if I need to change a secret, I have to create a new secret and then I use...
B
Yeah. So then there's two MRs, right? One in the secrets project to update the name of the secret, and I run that first, yes. And the old secret's still there, right? So you have two secrets now, two different versions of the same secret, I guess. And then I run another pipeline in k8s-workloads/gitlab-com, which updates the name of the secret, and that will trigger a deployment and it'll update everything.
A
And
you
will
have
to
open
that,
mr
with
the
like,
I'm
changing
it
to
like
a
v1
to
a
v2
or
whatever
right
like
there's
no
automatic,
and
I
think
that's
honestly,
that's
kind
of
a
little
bit
better
because
you
can't
do
them
in
two
deliberate
steps
right.
So
what
I've
been
doing?
I've
been
doing
changing
secrets
recently.
Is
I
do
a
v2
of
the
secret?
I
roll
that
out.
I
manually
go
in
and
check
the
new
contents.
A
Look
how
I
would
expect,
then
I
do
my
mr
in
the
gitlab
com,
repo
and
I
change
the
version
number
to
like
version
two.
I
roll
that
out
and
then
I
go
back
and
do
an
mr
to
clean
up
the
old
secret.
You
know,
and
it
is
a
lot
of
mr's.
It
is
a
little
bit
of
overhead,
but
it
is
that
if
you
never
need
to
roll
back
the
old
secret's,
still
there
you're
not
stuck
in
this,
like
oh
it's
sinking
the
new
secret
from
chef.
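A sketch of step two of that rotation in chart values; the keys and secret names are invented for illustration:

    # k8s-workloads/gitlab-com values change, step 2 of the rotation.
    # Step 1 created the -v2 Secret from the secrets pipeline; this MR
    # repoints the chart at it, and a later MR deletes the -v1 Secret.
    global:
      railsSecrets:
        secret: gitlab-rails-secrets-v2   # was: gitlab-rails-secrets-v1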
B
I think the README might be the best place to give the full picture of what to do when a secret changes, because I didn't even think of the cleanup, but I guess that has to be done as well, or at least at some point, I mean.
A
I will expand that out, and I can fling it through to Jarv, just to make sure that from your perspective, and especially from Reliability's perspective, they understand the...
B
...pieces. Another thing that would be nice to standardize: there have been secrets that are needed for the Kubernetes side that aren't present yet in Omnibus, and I think what I've done in the past is put like an underscore-prefixed secret name in the GKMS JSON.
B
Yeah, so I don't know if we could standardize that a bit better, because normally the GKMS structure, that JSON structure, matches the Omnibus config structure.
B
Yeah, yeah, it's a mapping of the gitlab.rb, right? It's the same structure as the gitlab.rb.
A
I completely agree, and most of the new secrets are unfortunately all ones that are not. And, you know, Tanka does its secrets management in-repo as well, through CI, so that's another kind of fragmentation we have there. So yeah, it's not great at the moment, I will admit.
B
Are
we
like
pretty
much
like,
don't
have
any
external
dependencies
now
in
our
pipelines?
Like
I
guess,
in
the
ideal
situation,
we
could
even
block
outgoing
requests
except
to
the
kubernetes
cluster,
to
see
like
if
we
depend
on
anything
outside
of
where
the
pipeline
is
run.
A
Yeah,
so
I
had
a
look
at
this
the
other
day,
because
I
was
interested
with
the
bitnami
issue
we
had
last
week
yeah
and
so
so,
when
I've
looked
at
other
companies
in
the
past
as
well,
all
their
build
systems
were
completely
off
the
internet
like
they
were
isolated.
You
have
to
have
everything
in
the
local
location,
everything-
and
I
do
think
it's
a
it's
a
nice
goal
to
have
as
a
larger
goal.
A
I
was
looking
to
see
if
there
was
some
way
we
could
actually
do
it
at
the
runner
level
and
that's
where
I
kind
of
got
distracted
and
lost,
but
I
I
overall
I
agree.
I
we've
got
that
hack
at
the
moment
to
try
and
manually
block
sites
in
the
ci
job,
but
it's
really
error
prone.
I
would
love
us
to
have
a
new
set
of
runners,
preferably
even
not
managed
by
us
right,
like
by
the
runner
team
or
whatever,
that
we
could
use
that.
A
Don't
have
internet
access
and
then
we
could
just
start
using
like
tagging
jobs
with
that
and
start
testing
the
idea
of
completely
offline
processes
or
things
like
that,
both
for
kate's
workloads
and
maybe
to
expand
that
to
other
things.
I
I
do
think
there's
merit
in
that
short
answer
is,
I
agree.
I
think
we
we
really
need
to
start
getting
a
little
bit
stricter
about
our
external
dependencies.
B
Yeah,
I
think
it's
worth
like
I
don't
know.
I
think
we
have
an
issue
for
this
to
ensure
that,
but
the
problem
is,
is
that
we
use
the
the
runner
manager
and.
B
Interesting. We would have to reconfigure things a bit, but we could definitely do that.
A
We
do
have
a
long-standing
epic
open
for
basically
making
like
I
it's
something
like
yeah
making
deployments
not
depend
on
gitlab.com,
and
you
know
I
I
set
up
some
monitoring
around
that
already
and
yeah.
I
can
see
like
we
pull
in
release,
manage
like
release
manager,
yamas
from
gitlab.com,
there's
a
few
different
pieces,
just
not
across
case
workloads,
but
across
release
tools.
None
of
it
is
unsolvable,
but
it
is
interesting
when
you
actually
look
at
it
from
a
bigger
perspective.
What
the
dependencies
are
yeah.
C
I
I
was
fighting
with
this
as
real
as
manager
recently,
because,
even
though
we
put
a
lot
of
effort
in
making
our
own
stuff
say,
offline
or
independent
from
things
like
baking,
dependencies
and
images,
and
things
like
that
development
is
not
doing
this,
so
we
just
we
and
so
best
case
we
can
deploy
something
that
already
exists.
Yes,
if
we
need
to
package
something
there's
no
way,
we
are
going
to
make
this
qa
download
stuff
from
ruby
jams,
and
there
are
other
dependencies
there
are
just
maybe
they
are
in
the
cache.
A
That
that's
why
we
need
to
like,
I
think,
if
we
had
runners
like
if
we
just
had
like
here's
the
runners
with
this
tag,
and
then
we
can
just
start
like
that's
when
you
start
the
process.
Okay,
I'm
trying
to
try
and
use
the
runner,
because
people
won't
actually
know.
I
I
mean
it's
kind
of
a
condition
of
software
in
general.
Now,
right,
like
you,
don't
know
until
you
try
it
and
then
you,
like
figure
out
like
what's
actually
happening.
B
Okay,
I'll
bring
this
up
with
the
runner
team
to
see,
I
think
what
we'd
have
to
do
is
create
a
new
runner
manager
and
associate
it
with
a
network.
I
don't
know
how
this
would
work
with
the
kubernetes
end
points
I
mean
is
that
yeah
that
should
be
on
the
internal
network
or
not.
A
I
think
all
of
them
go
to
internal
ips.
If
you
want
my
longer
term
solution
to
that,
we
could
be
using
kaz
and
agent
k
and
then
they've
got
a
built-in
ci
tunnel,
in
which
case
it
all
goes
through
that,
and
that
also
could
be
an
idea,
but
we're
not
ready
for
that
yet,
but
yeah.
I
think
I
think
we
can.
Our
kubernetes
clusters
should
be
able
to
be
accessed
by
internal
networks.
I
I
don't.
D
Great, nice job. In the interest of time, can we move on to the pipeline isolation stuff, because I think that might be fairly involved as well. So, Graham, do you want to give us a bit of a walkthrough of this epic?
A
The
discussion
issue
on
this,
so
the
epic
itself
is
based
off
the
discussion.
We've
had
previously
about
some
of
the
problems
and
pain
points
we
have
with
the
gitlab
com
repository
in
particular,
now
focusing
on
the
ci
jobs
and
the
ci
setup.
In
specifically,
the
epic
is
like
trying
to
make
sure
that
we
can
make
the
pipelines
safe
as
quick
and
as
discreet
as
possible,
and
this
issue
is
a
discussion
and
kind
of,
I
guess
an
investigation
into
what
we
have
so
far.
A
So
when
we
look
at
gitlab
com
repo,
specifically
it's
interesting
in
that
there
are
actually
three
different
types
of
pipelines
that
we
actually
run
in
that
four.
If
you
actually
encounter
scheduled
pipelines
that
do
chart
bumps
but
they're,
so
small
and
isolated,
I'm
just
and
they're
so
different.
I'm
just
going
to
exclude
them
from
this
discussion
for
the
moment,
so
we
actually
have
auto
deploy
pipelines
that
are
called
called
by
the
mac
triggered
by
the
main
auto
deploy
pipeline.
A
We
do
so
so
the
big
part
of
this
as
well
at
the
moment,
all
of
these
pipelines
to
some
degree
can
overlap.
They
can
interleave
a
little
bit,
and
this
leads
to
lots
of
interesting
things.
It
can
lead
to
auto
deploy,
applying
changes
that
it's
not
meant
to.
Although
we
we
stop
that
by
having
tooling
in
place
to
try
and
prevent
that
it
can
lead
to.
You
know:
diff
pipelines,
displaying
information
from
other
change,
requests
or
unapplied
changes,
depending
on
when
it's
run
and
and
more
problems
like
that.
A
So
try
to
understand
what
we
can
do
to
improve
this
and
really
what
we
need
to
do
is
isolation.
We
need
more
pipelines
to
be
isolated.
We
don't
want
them
to
interleave.
So
when
we
see
pipelines
running
or
things
happening,
we
get
an
accurate
understanding
of.
What's
going
on
at
that
point
in
time,
so
I'm
gonna
pump
text
up
for
people
so
yeah.
So
at
the
moment
we
actually
do
do
locking
in
that
repository,
but
it's
only
at
the
environment
level
for
each
individual
job.
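That job-level lock is GitLab CI's resource_group. A minimal sketch of roughly that shape, with invented job and script names; it also shows the gap being described, since the lock is released between the diff job and the apply job:

    diff-staging:
      stage: diff
      resource_group: staging        # serializes jobs sharing this key...
      script:
        - ./bin/k-diff staging       # hypothetical diff script

    apply-staging:
      stage: apply
      resource_group: staging        # ...but another pipeline's job can
      script:                        # still run between diff and apply
        - ./bin/k-apply staging      # hypothetical apply script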
A
So,
for
example,
if
you
have
a
diff
job
and
then
an
apply
job
straight
after
you
have
another
pipeline,
say
an
auto
deploy
pipeline,
that's
just
doing
an
apply.
It
is
possible
for
those
to
interleave
between
the
diff
and
the
applier,
which
should
be
almost
an
atomic
operation.
An
auto
deploy
pipeline
could
come
through
or
the
opposite
way
around
right.
Auto
deploy
is
going
along.
Does
it
stiff
and
apply
auto?
Deploy,
theoretically
can
say,
I'm
doing
a
diff.
A
This
diff
looks
clean,
there's
no
outstanding
changes
and
then
someone
else's
change
request
pipeline
can
sneak
through
because
it's
just
the
level
of
locking
we
have
and
things
like
that
so
yeah.
So
this
is
kind
of
what
I'm
covering
here.
The
following
scenarios
can
happen:
an
auto
deploy
to
an
environment
and
a
configuration
change
from
another
environment
can
take
place
very
close
to
together
seconds
apart.
A
Oh-
and
this
is
another
thing
that
I
didn't
realize
until
recently-
that
so
we
do
qa,
obviously
as
part
of
the
main
auto
deploy
pipeline,
but
when
you
just
do
a
normal
change
request
in
the
gitlab
com
repo,
we
also
try
and
do
q8
there
as
well,
because
you
know
safety,
and
so
what
apparently
qa
told
me
that
we
can't
or
should
not
have
multiple
qa
pipelines
running
against
the
same
environment,
and
we
do
see
problems
with
this,
where
they
we
get
errors
like
things
already
taken
like
you
know,
project
group
already
taken.
A
You
know
things
like
that.
So
that's
one
kind
of
problem
is
that
we
can
have
qas
caused
by
like
different
pipelines
running
at
the
same
environment,
giving
inaccurate
test
results.
Obviously
it
would
be
nice
if
that's
not
the
case,
but
that
is
the
case.
So
we
need
to
accommodate
that
an
auto
deploy
to
an
environment
and
a
configuration
change
to
an
environment
can
each
run
a
qa
pipeline
that
clash,
yep
configuration
change
to
all
environments
gets
held
up
by
deploy
to
staging,
which
means
the
chain
configuration
change,
gets
delayed
going
to
production.
A
So
if
I
want
to
change,
say
a
container
memory
limit
in
production,
because
our
gitlab
com
pipeline
is
so
unintelligent,
it
has
to
do
a
job
on
every
single
environment.
We
just
do
everything.
It
just
basically
applies
everything
I'm
just
going
to
apply
to
every
environment.
So
if
an
auto
deploy
is
going
to
staging
at
that
point,
it
can't
get
the
lock
for
staging
and
therefore
my
change,
which
I
only
want
to
affect
production,
gets
held
up
by
something
which
is
happening
in
a
different
environment.
A
So we want to lock over that, initially. I'll talk through this in a little bit more detail, but I am going to go a little bit off-script from the epic. Initially I was kind of like: we should just lock the whole repo, so if you ever do anything in gitlab-com, only one pipeline can run at a time. But after thinking about that more, it would actually cause a lot of problems and a lot of slowdown.
A
It
does
mean
that,
like,
if
you're
applying
to
change
to
staging,
you
cannot
do
an
auto
deploy
to
production
which
would
be
or
pre
like
if
you're
applying
to
change
to
pre,
you
can't
do
auto
deploy
to
production,
so
it's
probably
not
ideal.
So
you
know
an
entire
pipeline
lock
as
much
as
it
would
make
our
lives.
A
...a lot easier and simpler to understand, is probably untenable for us. So the next level is pipeline locking per environment, which, as I said, is kind of what we already do, but expanded to the whole pipeline. So if a pipeline is targeting an environment, we lock that environment for that pipeline.
A
It
does
also
mean
that
we
need
to
refactor
and
make
our
gitlab
pipelines
more
intelligent
and
part
of
the
work
I
have
been
doing
by
removing
external
sources
changing
around
parts
of
that
repo
is
so
that
we
can
start
leveraging
the
gitlab
ci
functionality
based
on
files
changed
right.
So
if
you
change
the
staging
file,
we
will
apply
a
change
to
state.
You
know
we'll
have
the
jobs
for
staging
there's
some
exceptions
like
the
root
values,
file
and
things
like
that,
but
we
can
try
and
make
that
more
intelligent.
A
We
could
look
at
things
like
dynamic
child
pipelines
where
we
run
a
diff
against
every
environment
and
if
the
diff
against
an
environment
reports
empty,
then
we
just
do
not
have
the
job
for
that
environment.
We
do
not
lock.
We
do
not
do
and
apply
for
that
environment,
which
would
also
be
an
improvement
to
what
we
have
now.
So
I
think
we
have
a
lot
of
options
for
doing
this.
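A sketch of what the dynamic child pipeline variant could look like; the generator script is hypothetical and would emit jobs only for environments whose diff is non-empty:

    generate-environment-jobs:
      stage: prepare
      script:
        - ./bin/render-child-pipeline > child.yml   # hypothetical generator
      artifacts:
        paths:
          - child.yml

    run-environment-jobs:
      stage: deploy
      trigger:
        include:
          - artifact: child.yml
            job: generate-environment-jobs
        strategy: depend    # parent waits on the child pipeline's result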
C
Do we have evidence of failures or incidents caused by something like this that could not be detected by regular metrics instead of running QA? Because I see no value in running QA in this situation: it takes hours, it's flaky, and we're not testing new code, we're changing configuration. I think we're looking at massive-scale error rates if something like this breaks, because you're breaking the connection to the database, or you're just misconfiguring something; that should produce big, big errors.
A
So, for that question: I've definitely been saved by it. I've definitely been like, okay, I've applied a change to staging, then staging QA has failed, and I'll be like: oh, that was not great. Probably the biggest problem we have, unfortunately, is that we don't page on staging errors, and our staging metrics and alerts in general are not as good as they should be. But you are absolutely right; ideally, and to be clear, that's not even just for staging, right, for any environment we should just have better alerting, yeah.
C
Or even have some kind of, and this is even better: when you change something, you may consider providing some custom metrics to check, so that when you want to change something as part of your merge request, there is a place where you say: I want to monitor this, look at this, before.
A
Yep. I'm just going to quickly jump to this epic, which is the one we're planning on doing after this. I'm basically keen for that, 100%: the pipeline would have a test stage where we validate and do any other static analysis, then apply, and then monitor the rollout of any changed resources. This is a high-level epic, and basically I agree. At the moment...
A
We've
got
two
things
that
determine
if
a
change
was
successful,
if
helm
says
it's
successfully
applied,
and
I
don't
like
the
way
home
validates
it,
it's
very
anemic.
It's
not
that
great,
it's
okay,
but
it's
not
great
and
b.
We
do
qa,
but
we
can
replace
those
we,
but
we
need
probably
a
custom
tool,
no
not
a
custom
tool,
but
we
need
to
define
better
tooling
for
both
generating
the
manifest
applying
the
manifest
and,
most
importantly,
monitoring
them
out.
As
you
said,
you
want
to
probably
have
something
where
you
define
the
service.
A
You define some clear Prometheus metrics, like an apdex or something, that indicate whether that service is healthy or not, and you monitor those. You probably want to monitor the pods, the deployments, a bunch of Kubernetes stuff; there are other Kubernetes things we may want to monitor too, like custom resources. But we also need to define application-level metrics to help with that. So I am in 100% agreement with you: we need to actually do that to replace QA.
A
I
I
don't
think
what
we
have
is
with
helm
and
qa.
I
just
don't
think
is
is
is
getting
us
where
we
need
to
be,
but
that
I
kind
of
consider
that
a
little
bit
out
of
scope
for
at
least
just
the
pipeline
work
yeah.
D
Let's
plan
that
as
a
future
iteration,
I
think
it's
just
for
the
clarity
of
this
recording
as
well.
We're
saying
this
is
for
conflict
changes
right
that
we
would
yes
yeah.
C
Yeah, yeah. If we're going to invest time in making the QA integration with config changes better while what we actually want is a monitoring solution, I would say: absolutely, let's skip it as it is and focus on what we want to have, instead of working on improving that one. Yeah, okay, so we are all in agreement here. Thank you.
C
So
the
second
question
is
about
the
actually
the
the
locking
itself,
so
I
do
think
that
so
your
reasoning
is
is
good
environment.
Locking,
I
think,
is,
is
the
way
to
go.
I
do
agree
with
dynamic
child
pipelines
so
that
we
can
actually,
because
we
need
to
validate
this,
because
we,
I
think
we
never
tested,
but
the
problem
is
that
our
ci
is
designed
to
lock
by
job
and
we
want
to
lock
by
pipeline.
C
So
I
think
that
the
only
way
for
doing
this
is
by
dynamic
pipe
child
pipeline
so
that
we
can
lock
at
the
trigger
level,
so
the
trigger
level
which
is
dependent
will
have
the
lock,
and
so
we
hope
that,
because
this
is
running
while
the
child
pipeline
is
running,
this
will
help
the
the
lock.
But
I
don't
think
this
is
enough,
because,
as
soon
as
the
child
pipeline
fail,
the
lock
is
really.
C
That's
very
good
well,
but
there's,
but
this
brings
me
to
another
point,
which
is:
if
we
are
going
towards
running,
deploy
migrations
in
kubernetes,
I'm
expecting
to
have
another
trigger
before
the
effective
deployment
that
will
just
run
migrations,
and
I
would
like
to
have
this
unlocked
as
well.
So
I
think
we
should
spend
time
in
investigating
distributed
lock
across
pipelines,
which
is
not
part
of
the
product
and
that's
the
problem,
but
I
mean
we
already
have
locking
happening.
C
What
is
it
is
in
chef
right
because
we
do
we
do
this
from
deployer
and
we
did
this
from
release
tools,
but
the
the
storage
of
the
lock
state
is
is
chef
itself,
so
I
don't
really
like
that
one,
but
it
could
be
okay.
The
point
is
that
it
looks
like
that.
We
are
going
forward
more
something
that
is
re-entrant
lock
with
multiple
pipelines
and
condition
variables.
C
When
you
have
the
system,
the
environment
is
locked
because
the
process
is
running,
and
then
you
may
have
some
processes
running
in
that
locked
system
would
maybe
waiting
for
different
conditions
to
happen,
which
is
basically
the
how
you
do
concurrencies
in
a
completely
complex
system.
But
then
here
we
also
have
the
addition
yeah.
This
is
also
distributed.
So
it's
more
challenging
yep.
A
So
I
agree:
I've
been
trying
to
not
get
too
deep
into
implementation
so
far,
but
I
I'm
100
agreement
with
you.
So
I've
also
come
to
the
conclusion
that
we
need
to
define
a
locking
mechanism.
If
it's
not
just
relying
on
gitlab
for
a
specific
pipeline,
I
would
almost
say
it
could
be
time
to
unify
the
chef
locking
we
have
with
the
kate's
workloads,
locking.
So
I'm
looking
at
say
the
deployments
and
environments
api
for
gitlab
on
ops.gitlab.net
is
like
a
central
locking
manager,
but
you
know
it
could
be
anything.
A
The
implementation
is
kind
of
just
an
implementation
detail.
What
is
interesting
to
me-
and
I
discovered
it
or
or
thought
about-
I
thought
about
this
more
this
week
and
it's
not
really
captured
here
as
much
so
here.
I
just
kind
of
like
talk
about
how
long
some
of
these
pipelines
take
and
like
how
long
different
pipelines
are
locked,
but
on
top
of
auto
deploy
pipelines
on
top
of
like
changes
to
gitlab.com
pipelines,
we
have
another
type
of
locking
that
we
do
entirely
via
people
and
that's
change,
request
pipelines.
A
So
at
the
moment
for
me,
in
apac
hours,
I've
been
losing
a
lot
of
time
in
my
day
because
of
the
cid
composition
project.
To
be
clear,
it's
a
very
valid
project
super
important.
I
fully
support
all
the
work
they're
doing,
but
it's
interesting
because,
as
you
can
see
in
the
channel
in
the
delivery
channel,
we
rely
on
human
locking
we're
sitting
around
trying
to
like
coordinate
with
each
other,
like
very
frantic,
on
trying
to
understand
what
who's
deploying
where
or
what's
going
on
what
we
would
really
like
to
do.
A
I
think,
as
maybe
a
stretch,
kind
of
implementation
detail
for
this
project
is
not
only
just
have
okay.
Auto
deploys,
can
lock
lock
the
pipe
like
lock
an
environment.
Kate's
workloads
can
lock
an
environment.
We
may
even
want
to
see
if,
like
engineers,
doing
change,
requests
actually
have
to
fetch
like
a
chat,
ops,
command
or
a
tool,
or
something
fetch
and
control
and
lock
the
environment
themselves
for
the
change
request.
A
They
are
doing
because
the
two
reasons,
one
right
if
they
get
that
change
lock
and
we
have
our
pipeline
set
up
correctly,
we're
sure
nothing
can
we
don't
even
have
to
do
anything
right.
We
just
we
know
the
pipelines
will
just
not
back
up
or
block
until
the
pipeline's
released.
It
does
encourage
people
to
be
more
diligent
about
their
locking
and
unlocking,
which
I
don't
think
is
a
bad
thing,
because
we
do
rely
on
people
at
the
moment
and
people.
You
know
we
forget
and
do
things
the
second
part.
A
...is that if we have a locking mechanism that everyone is using, we can actually track it in metrics. I think we could use the delivery-metrics tool to pull the lock state out from GitLab or something, and then we could get metrics on that.
A
We
could
show
that
oh
look,
look
at
the
staging
environment
or
look
at
the
production
environment
80
of
the
day,
it's
locked
for
changes
for
whatever
reason
and
maybe
they're
this
change,
maybe
their
case
workloads
pipelines,
maybe
their
change
requests
or
whatever.
But
we
really
need
to
understand
that
because
at
the
moment
it
feels
like
everyone
is
getting
frustrated
understandably,
but
there's
no
real
way
for
us
to
detail
how
this
contention,
how
much
of
this
contention
is
happening.
A
So
I
think
not
only
do
we
want
to
do
this,
locking
work
for
safety.
We
need
to
do
it
for
observability.
To
show
that
like
auto,
deploy
is
a
a
good
one
as
well,
so
with
the
new
auto
deploy
pipeline,
we
lock
staging
for
long
because
we
lock
it
for
the
whole
deployment
cycle
of
production.
Now,
so
we
deploy
staging,
we
deploy
production,
we
qa
staging.
D
Let's say for now that, yes, that's the case, and hope the post-deploy migrations work lands before we even get to this point. So I think for now, yes: let's look at the times for the current pipeline. We currently run the staging deploy, then we wait, then post-deploy migrations, and then test. But that won't be the case for too much longer.
D
That
exactly
like
yeah,
you
know
we
never
really
think
about
it
in
terms
of
we
probably
like.
We
don't
need
to
necessarily
try
to
accommodate.
For
that.
I
guess
it's
the
thing
it's
certainly
true,
and
yet
I
don't
think
very
many
people
would
realize
that's
how
it
is
set
up
if
you're
not
watching
the
announcements
channel,
you,
wouldn't
you
wouldn't
even
know
that
you
wouldn't
see
that
so
for
now
I
think
yeah.
We
have
to
just
accept
that
staging
and
production
would
both
be
locked
in
parallel.
A
And
that's
fine,
but
once
again
that
I
think
we
want
to.
We
want
people
to
see
that
clear
yeah
I
mean
like
talking
away.
Even
our
gitlab
com
pipeline,
which
we've
already
mentioned
a
few
times
tonight,
is
all
is
not
efficiently
set
up.
If
we
want
to
lock
the
environment
from
when
it
does
a
dry
run
to
when
it
doesn't
apply,
production
can
be
locked
for
up
to
95
minutes
with
our
current
setup.
If
we
don't
change
anything
because
we
would
do
it,
we
would
we
would
lock
here
to
do
the
dry
run.
A
We'd
upgrade
everything
else.
Then
we
upgrade
production
at
the
end,
including
running
qa,
on
staging.
So
that's
like
95
minutes.
We
could
conceivably
lock
production
for
if
we
want
to
lock
over
these
two
jobs,
so
we
can
see
that
the
job
setup
is
inefficient.
So,
in
terms
of
this
epic,
at
the
moment,
it's
still
just
discussion.
It's
just
capturing
this
and
I'm
trying
to
put
together.
A
I
guess
a
solution.
What
is
this
going
to
look
like?
What
do
we
want?
It
sounds
like
if
I'm
understanding
people
correctly
we're
kind
of
all
in
agreement,
that
we
want
a
general
locking
mechanism
that
both
the
auto
deploy
pipeline
and
kate's
workloads,
gitlab
com,
repo
need
to
interface
with
both
of
them
needs
to
be
the
same
lock.
It
sounds
like
we
probably
at
least
for
the
moment.
A
We
need
to
lock
over
both
applying
them
applying
a
change
and
if
a
qa
needs
to
run
for
that
environment,
we
should
probably
lock
on
that
as
well.
So
we
don't
have
multiple
qas
confusing
things
and
I
think.
D
The
other
part
just
one
thing
related
to
that
bitcoin.
It
would
be
great
to
get
that
make
sure
that's
written
down
somewhere
I'd
like
to
see
something
from
quality.
That
actually
says.
That
has
to
be
the
case,
because
in
the
future,
that
will
be
a
really
good
thing
for
us
to
be
able
to
work
with
them
on
in
terms
of
like
how
does
that
not
need
to
be
the
case
once
we
know
how
many
sort
of
we
would
like
to
be
running
in
parallel,
I
think
yeah.
B
That's the first time I'm hearing this; I'm really surprised, but also not surprised, I guess. But I thought this was possible before.
A
It's
it's
one
of
those
things
where,
as
I
said,
the
errors
I
saw
and
we're
like,
oh
yeah,
you
know
we
can
understand
in
certain
parts
of
certain
jobs
run
at
certain
times.
It's
probably
more
of
a
case
of
like
before
we
had
qa
in
kate's
workload.
So
before
we
even
put
it
in
there
right.
Probably
the
only
people
running
qa
jumps
was
auto,
deploy
right,
and
it
was
already
in
auto.
Deploy
had
good
enough,
locking
that
there
was
never
any
contention.
Yeah.
A
There
is
qa
schedule
yeah,
and
I
agree
like
this
is
like
we've
looked
at
this,
and
this
is
like
yeah.
This
is
not
great.
We
can
see
situations
where
qa
tests
could
clash,
but
you're
right.
This
is
not
an
unsolvable
problem
if,
if
they
actually
said,
if
we
defined
this
as
what
we
need-
or
they
had
the
probably.
A
All I was going to say is: in which case, if we do get confirmation one way or another, if they're like 'yes, we can fix this' and they do fix it, then we can change the scope of this project. We can say we only need to lock the deploys; we don't need to lock the QA. Although, once again, I still kind of feel like, if you make a change, you only want that change's QA running; otherwise your QA covers multiple changes. I don't know.
B
So that kind of brings me to 3i on the agenda. I was just wondering: does it make the most sense, if most of our issues are about QA and changes being deployed while QA is running, for the QA job itself to be the holder of the lock for the environment? Does that make sense?
C
Deploy is not running QA in production.
B
Okay, well, whatever. If we managed to sort that out, I was thinking maybe QA should be the holder of the lock: right before QA runs, it blocks everything, including itself, and then releases that lock.
B
Unlock, yeah. And just as an aside: I'm just wondering, if we introduce more locking, we're going to start stacking up pipelines waiting for locks, and this is super scary to me, because when that lock gets released, it's a race to see who can grab it first, and that might not be... so would we need extra logic there to...
A
What I'm hoping, and this could be a completely bad assumption, is that the GitLab deployments API and environments API, which I was thinking we would use to implement this, has the concept of who gets the lock next; I think the one that gets the lock next is the next numbered pipeline. Does that make sense? Yeah. But you're right, we should check this: if changes are going to back up, they need to happen in order.
A
Ideally,
we
would
do
something
like
a
merge
train
and
I
looked
into
that
as
well.
But
the
problem
we
have
is
a
merge.
Train
setup
would
happen
on
dot
com,
but
ops
has
no
concept
of
the
merge
request.
It
just
gets
get
commits
on
branches,
so
it
just
can't
con.
It
can't
do
a
merge
track
unless
we
wanted
to
switch
everyone
to
doing
merge,
requests
on
ops,
which
I
assume
we
don't
want
to
do.
B
Yeah, of course you want smaller changes, right, but at the same time we could be backed up forever. So we...
A
Can
we
can
do
we
can
start
doing?
I
want
to
do
more,
intelligent
stuff,
like
you,
can
put
labels
on
issues
for
like
skip
qa
and
stuff
like
at
the
moment.
We
we
don't
have
good
ways
of
controlling
the
pipeline,
but
we
should
be
able
to
be
like
you
know
this
change.
We
could
skip
qa
or
something.
I
look.
A
And I'm just going to make one more point, which is obviously months away, if not longer, but it becomes interesting too: if we want to move to service ownership done by the teams, then a global lock may not be good, right? You might want to lock per service as well.
C
I
was
also
thinking
when
you
mentioned
the
idea
of
the
merge
strain,
which
is
obviously
not
appliable,
because
you
got
what
you
said
if
we
are
willing
to
say,
take
the
risk
of
batching
changes
together,
because
we
need
to
double
check
the
details
of
what
we
support,
but
we
can
say
that
job
is
locked,
so
you
can
run
only
one
and
it
also
is
superseded
by
more
recent
version
of
the
same
one.
C
What
I
think
is
it
happens
right
now,
which
I
don't
like
is
that
if
a
new
one
gets
in
the
current
running
get
cancelled,
I
would
like
to
have
something
like
the
current
one
continues.
All
in
between
get
cancelled,
only
the
most
recent
one
runs
in
in
line
right.
So
if
we
have
something
like
this,
we
could
say
something
like
detect,
as
you
say,
detect
changes
per
environment,
and
so
you
identify
that
we
are
changing
production.
C
So
you
do
dynamic
child
pipeline,
which
is
dependent,
which
is
locked
by
the
research
and
is
marked
as
a
deployment
to
an
environment
that
has
the
only
relies
on
the
most
recent
versions,
which
means
first
one
gets
in.
We
get
a
pipeline
and
it's
running,
then
we
merge
three
changes
when
we,
so
this
is
the
first
one
is
running,
then
we
have
three
changes
when
this
kicks
in,
and
this
is
running
these
two
get
cancelled
because
the
only
one
is
this
one,
which
means
it
will
accumulate
changes,
but
it
gives
us
the
sequence
right.
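GitLab resource groups have a process mode that comes close to these semantics: newest_first picks the most recent queued job first, and combined with the project's 'skip outdated deployment jobs' setting, older queued deploys are dropped rather than run. It is set per resource group through the API; a hedged sketch, with the token variable invented:

    # One-off job (or run the curl by hand) to switch the production
    # resource group to newest-first processing, assuming a recent GitLab.
    set-production-process-mode:
      script:
        - >
          curl --request PUT
          --header "PRIVATE-TOKEN: $OPS_API_TOKEN"
          --data "process_mode=newest_first"
          "https://ops.gitlab.net/api/v4/projects/$CI_PROJECT_ID/resource_groups/production"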
A
Yeah. Maybe if we're going to do that, we just say: okay, we're not going to do QA as part of gitlab-com change pipelines, we just run it regularly, once every two hours. I think the reason we had it there was because people were rolling out changes that would break staging, and we weren't watching or monitoring it. Then staging would be broken, we'd come along to do an auto-deploy, the environment was broken, and then it had to be figured out.
D
It could be worth looking at the data. Let's figure out, if we were to do something like that and rolled out changes every two hours, how many changes might that be? That might help us figure out what good cadences are.
A
That's
actually
a
really
good
point.
If
I
was
to
look
at
the
data
on
how
many
merge
requests
and
stuff
we
do
a
day
for
gitlab.com,
it's
not
a
massive
amount.
It's
it
varies
day
to
day,
but
it's
not
a
huge
amount.
So
maybe
it's
not,
but,
as
you
said,
as
we
scale
infrastructure,
that's
that's
going
to
change,
but
yeah.
D
Well,
we
can
fix
some
of
that
stuff
in
the
future.
I
think
so,
just
as
we
are
getting
to
the
end
now
we're
really
close
to
oh
we're
kind
of
officially
over
time,
but
very,
very
close
to
kind
of
top
of
the
hour
as
well.
Graham,
what
do
you
want
from
us
like?
What
will
help
us
get
move?
This
discussion
forwards.
A
So
I
I
think
everything
we've
talked
about
tonight
has
been
good.
I
think
the
next
step,
for
me
probably
is
to
actually
define
a
proposal
and
probably
get
feedback
on
that
I've
put.
The
discussion
has
been
there
and
really
I
was
just
trying
to
the
discussion
issue
itself
was
to
get
out.
Did
I
miss
anything,
and
I
think
talking
to
everyone
here
even
tonight,
I
don't
think
we've
missed
anything
big.
I
mean
we've.
Just
we've
thought
about
a
few
new
options
and
yeah
just
kind
of
explaining
the
problem
getting
feedback
and
and
seeing.
A
If
there's
anything,
we
I
haven't
considered
or
thought.
The
next
stage
is
to
really
just
distill
it
into
some
proposal
and
then
I
think,
probably
a
next
round
of
feedback,
and
I
think
that'll
obviously
be
a
much
more
targeted
discussion
about
what
are
the
actual
problems
with
this
proposal
that
we
or
you
know
what
do
we
see
that
we
have
to
do.
D
Awesome, sounds good. Thanks so much, Graham; I appreciate how much effort you've put into thinking through this stuff and getting us to a point where we actually have a single issue with all of it really clearly captured, because I know it's not at all straightforward to pull all these threads together. So thanks for putting all that work in. And thanks for the chats, everyone; great discussion today, and yeah, let's follow up on the issues.