From YouTube: 2022-04-28 GitLab.com k8s migration APAC/EMEA
A
Awesome, so welcome everyone. This is the 28th of April, APAC-timed. So, Graham, you have the first item.
B
Thank you. So, basically, the only agenda item I really have is to go through, for those of you who haven't been following along or aren't completely aware of it, the work I've been doing around trying to plan what we want to do with the gitlab-com repo, which is the main repo that we use for deploying gitlab.com on Kubernetes.
B
I've done two bits of work recently: a discussion issue that highlights the problems, and then a document, which we converted into epics and issues, about what the solutions are or what we can look at as solutions. So I'd like to just briefly run through that. I'll share my screen. I'm going to be very sensitive to time here and try to keep things as quick as I can.
B
The issue is in the meeting doc if people are interested. We've had a lot of different issues around specific problems we've had with the gitlab-com repository, and we've had a lot of issues about ideas for fixing them, but perhaps a little bit low-level. So what I wanted to try and do with this issue is just pull back a little bit.
B
Take all those issues, look at them at a bit of a higher level, and try to give one big overview picture of what all the problems are. Because what I found with a lot of the issues about specific problems, and a lot of the issues we had about specific solutions, is that there was a lot of duplication or slight duplication, or perhaps the issue or the problem was missing the bigger picture.
B
So I go through how things work at the moment. I've broken it down into a few different sections: we have user experience problems, CI problems, repository problems, and auto-deploy problems. I won't go through every single description here, but to give a quick overview, starting with the user experience problems: diffs on merge requests can often show unrelated changes, or changes that haven't been applied.
B
Users outside of infrastructure as a group are unable to see the diff output, and unable to see CI output, because of the way we've set up the security and the fact that we have secrets in this repository. And then users outside of infrastructure who don't have access to our Kubernetes infrastructure, probably even people like backend engineers and delivery, are unable to get a very simple picture of what our Kubernetes deployment objects actually look like.
B
Then the CI problems: when master breaks on the repository, we get absolutely no notification that it happened, beyond an email saying a pipeline is broken, which is very easy to miss. Our CI pipelines are very bloated, due to the fact that we can't make intelligent decisions about which jobs we include. So, for example, if I make a change to one cluster in pre, or even just one deployment in pre, we run every single job: deploy to production, deploy to staging, everything, because we just have no intelligence to say, okay, this only touches staging.
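A minimal sketch of the kind of path-based job selection being described, using GitLab CI `rules: changes:`; the job name, entrypoint, and directory layout here are assumptions, not the repo's actual configuration:

```yaml
# Only run the pre-environment deploy job when files under the (assumed)
# pre/ directory change, instead of running every environment's jobs.
deploy-pre:
  stage: deploy
  rules:
    - changes:
        - pre/**/*                      # hypothetical path layout
  script:
    - make deploy ENVIRONMENT=pre       # hypothetical deploy entrypoint
```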
B
Helm's detection of problems and rollback is very slow, if it works at all, and this often gets us into a state of cleanup. This is similar to the above: at the moment, if you're a release manager or anyone doing deployments, if something goes wrong you probably won't know about it for at least an hour, if not two hours, after the deploy actually happens, simply because helm will sit there for an hour, it will time out, and then it'll roll back, taking about another hour.
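For context, the wait-then-rollback behaviour described here is what a helm-driven deploy job with a long timeout looks like; a rough sketch, where the chart reference, values path, and timeout are illustrative rather than the real job:

```yaml
deploy-staging:
  stage: deploy
  script:
    # --wait blocks until workloads report Ready; --timeout bounds that wait,
    # and --atomic rolls the release back if the timeout is hit. With a long
    # timeout, a crash-looping pod means roughly timeout + rollback time
    # before anyone sees a failed job.
    - >
      helm upgrade --install gitlab gitlab/gitlab
      --namespace gitlab
      --values values/gstg.yaml
      --wait --atomic --timeout 60m
```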
B
So it's extremely slow and not the greatest experience, and pipeline execution time is very long due to some of the tool choices. Repository problems: we have no easily viewable ledger of changes and what order they happened in, because we push stuff through git, but we also push a lot through pure pipeline execution, in fact pure pipeline execution just on ops.gitlab.net, so not even pipelines visible on gitlab.com.
B
So between those two it's very hard to just get a clear picture of what happened when. Diff jobs showing actual changes in MRs can get stale, similar to the above. We can't do any dynamic, sorry, static analysis or other preemptive validation of manifests in CI jobs. A classic example is that we recently had to spend a lot of effort going through all our Kubernetes objects for API deprecations, because we were upgrading Kubernetes.
B
Tooling and logging around all this is poor, and we've got some manual jobs acting as band-aids to make sure we don't pull dynamic data in from gitlab.com. So there are a couple of themes emerging here, as you can see: we do secrets management inside the repo, and we pull data from chef when there's no reason to. And now the auto-deploy problems, which are probably the most interesting to us in delivery.
B
Similar to the above, changes outside of auto-deploys can be left unapplied, or if a job breaks or anything, they will basically block us from doing auto-deploys. It's kind of by design, but it's not an ideal situation. Ultimately it shouldn't be that way: it shouldn't be us noticing something's happened, going and investigating, and then finding out we have to hassle someone else because they didn't do the change in the correct way, or didn't follow through with the change if it broke something.
B
Auto-deploy pipelines can interleave with configuration changes on a per-cluster basis. So, once again, all these pipelines can run: auto-deploy pipelines, merge request pipelines, everything can just run, and sometimes, if they're running at the same time, you get this weird crossing of changes happening, or even just in the diffs, and trying to understand who is trying to do what is incredibly difficult.
B
So, once again, this is the Google doc which I'm starting to use to put together the solution. The reason I started with a Google doc rather than actual issues is that it's a lot easier to edit, track changes, and do suggestions in Google documents. But if there's general agreement on the approach here, I do want to try and push this into epics and issues, and then get further refinement and feedback from the team from there.
B
This is kind of a first pass, and the reason I want to present it here today is, once again, to make people aware of it and to spark conversation and feedback if necessary. I've got four key epics, or things I've tried to identify, as at least a start on the solutions to some of these problems. Some of them are very simple and straightforward.
B
Some of them need more investigation and a bit more refinement. So I'm going to start with the easiest and simplest one and then work my way through to the hardest one at the end. They might not be in that order; I've ordered them by how I think they should be done, in terms of needing to do this before that, but once again that's not concrete, and as we dig into this it might become a little clearer and that might change.
B
What we commit to the repository is like the recipe to bake a cake, but the problem is that a bunch of the ingredients we put in are from external sources. They're from chef, they're from CI variables; there are all sorts of different random things that we include in it. So what we try and do, for example when we do an auto-deploy, is run a job to bake the cake with all of this stuff in, and then see if it's what we expect, and when it's not, we just fail.
B
So a release manager has to come along and go, okay, the job has failed, and then there's a whole process of determining why it failed. Is it a change from here? Is it a change from there? Has someone not applied a change? So it's very manual when that happens, and error-prone, and it's very frustrating. And if any of those external data sources go down, we fail our deploy.
B
We fail everything, even if it's just a transient issue. So this first epic is just about removing all that: there will be no external data sources in the repository anymore. There are a couple of things here. One, we pull data from chef dynamically because there are certain settings for the GitLab application that are in environment variables. They shouldn't be, and there are issues open to actually move them into the gitlab.rb and the gitlab.yml, so into actual file-on-disk configuration state, but they're not there yet. So at the moment we need to modify the upstream chart to allow us to set those settings from Kubernetes secret objects, because right now they're literally put in at run time.
B
Basically, we call out to chef, even though these settings don't need to be in chef, pull them in, put them in an environment variable, and then ship that off to Kubernetes. The problem with that is obviously that it's an external data source, and also, because we consider this secret data, it's a big blocker for us being able to expose diff output, or just expose more of the repository and what's going on to other people, because it contains secret data.
B
I made that call when I first set up this repository, because I was big on the monorepo approach and not fracturing things into different places. But when I made that call I was an SRE, and I didn't have as much of an overall view of the company as I do now, especially not of delivery. So I missed the key point that I, as an SRE, especially in reliability, could see everything.
B
Permissions were never an issue, so it was a very easy decision to make. But obviously now we have auto-deploys and delivery team members who don't necessarily have the permissions, and ultimately we want to open this up. The goal is to open this up more to, you know, GitLab developers, and make the workflow easier for everyone. Having secrets in this repository is a huge barrier to us, because we always have this consideration:
B
oh, we can't do this or we can't do that, because there are secrets here and the data classification for this is high, so therefore we are restricted in what we can do. There is no reason for it to be like that. It's a very simple task for us to take the snippets of code we have now that do the secrets management, put them in their own git repo, and just say, okay, the permissions on that are locked down and only these people can see it. Most people using this repository, including ourselves in delivery, don't really care that much about what's going on with the secrets; we rarely change them. So I think just pulling that out solves a huge problem. This one, actually, I might drop; I might strike through this point.
B
It's covered a little bit later. And then this one is just to check whether, if we remove all the external data sources, we still need some of the cumbersome tooling we have around blocking access to gitlab.com during CI runs. We might still need it, and it might still be there, but I do think it's another task. So, as I said, all in all this one should be relatively simple. I understand concretely what needs to be done, and I personally don't think it's too controversial or that anything is very unknown.
B
The big thing in the end result should be fewer breakages of auto-deploys: there are fewer external sources, and less breakage caused by other employees, where it's like, oh, someone changed something in chef or whatever. We just get past all of that. It also, not straight away perhaps, but it paves the road for us to then drop the permissions for the CI accounts to actually look at secrets. They might have secret data, but if we make the repo safe, and the permissions that the CI service accounts have just don't allow them to see the data, then it's a lot easier for us to open that up, and therefore it's a much better user experience for people, because they can see a little bit more of what's going on. But that's kind of a future thing, and I think what would happen is we'd still do an audit and figure that out.
A
Yeah, I'd suggest breaking that bit off, Graham, because I think it ties so tightly to our goals of helping teams get to self-serve. But before we can do that we have so many other pieces, like improving tooling and just generally a lot of other stuff. So I think it'd be good to have it as its own thing, but something which we maybe separate off from this first phase, at least.
B
The first thing that always comes up is: oh no, but it's got secret data and that's classified, I think it's red status or whatever. So yeah. The second epic: that first one is very easy and concrete; this one is a little bit trickier, but I think still sensible. What we want to do is stop this interleaving of GitLab CI pipelines. So there's a lot of work there.
B
So yeah, basically, to try and not just get stuck down in discussion on this issue, I'm trying to make it clear that we at least start with the hypothesis that we only want one pipeline running in this repo at once, and then work backwards: is there any situation where that hypothesis would be bad? And we refine the hypothesis from there.
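One GitLab CI primitive that maps onto the "one pipeline at a time" hypothesis is `resource_group`, which serialises jobs sharing the same key; a minimal sketch, with the job name and entrypoint assumed:

```yaml
apply-production:
  stage: deploy
  # Jobs with the same resource_group key never run concurrently, even across
  # pipelines, so two pipelines cannot run their apply step for gprd at once.
  resource_group: gprd
  script:
    - make apply ENVIRONMENT=gprd   # hypothetical entrypoint
```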
B
I still think this hypothesis is valid, in which case this epic is probably just a little bit of investigation, a little bit of implementation, and probably a lot of testing and rollout, and making sure that everything still works as expected. I really think merge trains from our product would be invaluable here, but I realised today we have a very awkward problem with that, because we do our merge requests on gitlab.com, but it's ops.gitlab.net that actually runs the pipelines we care about, the ones that deploy.
B
So I don't think we can make merge trains work: the whole workflow around them, some of the safety they give you, and the locking only work within a single GitLab instance, the same GitLab instance. So if a merge request is on gitlab.com, I actually don't think we can use them on ops.gitlab.net.
B
This is why this is a little bit of an issue where we need to understand what the options are. If we go with the goal of repository-level locking, what are the options, and how do we make that work with both gitlab.com and ops.gitlab.net? That's why I think there's a little bit more work here than just, say, a single issue.
B
On to the next one, just because I'm doing this in order of least complicated to most complicated: modifying release-tools, and patcher because we still use that, so that both of these tools no longer depend on CI variables.
B
What this covers is really looking at release-tools and patcher. The model that we currently use to run auto-deploys is that we simply invoke a pipeline and pass a variable in: this is the version we want to deploy. We're trying to change that to a more descriptive model and change the API. At the moment, the API that release-tools uses is running pipelines with these variables in, but that is very hard to read in terms of the history.
B
You've got to find the pipelines, you've got to go look, and even when you find the pipeline you can't easily see what variables were passed in. It's a tricky kind of situation. And if we want to try and move to a pure GitOps model for this repository, which I'll get to soon, and I do think that's valid, then the interface for release-tools or patcher or anything that wants to do a deploy is not running a pipeline, passing dynamic variables in, and then expecting a different outcome, but the same interface
B
that we expect our users to use with the repo: a merge request, or pushing a commit straight to master. So we give release-tools the permission to push commits straight to master; it creates a commit saying, I'm going to change this version of the image, pushes that onto the repo, a pipeline runs on master, and we can match that pipeline to the commit we created.
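Under that model, an auto-deploy "commit" from release-tools would just move a pin in a committed values file; an illustrative, not actual, snippet of what such a file could look like:

```yaml
# hypothetical values file in the repo; release-tools would commit a change
# to this tag, and the normal master pipeline would roll it out
gitlab:
  webservice:
    image:
      tag: "14.10.202204280620-abc123def45"   # assumed auto-deploy tag format
```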
B
So then we know exactly: this pipeline was run for this commit, which was created by release-tools during the deploy. Or, alternatively, you can do a merge request, merge that, and it's the same thing.
C
One option is a self-committing pipeline, where you are still invoking that pipeline: the pipeline receives the variable and commits itself onto the repo.
C
But if you are not doing something like that, you basically end up having a double pipeline, and then you have the timeout problem, which is the same one we now have with staging-ref, and that we have with tagging and package creation: you commit something, and then you have to actively wait for it to complete. And when do you stop? Sometimes you stop after 30 minutes and that pipeline takes three to five, and so everything blocks.
B
Sure, okay, no, that makes sense. So this one I might rewrite a little bit, because there's another direction or discussion that can happen here. I already had an idea, slash issue, for this, and I think maybe this merges into it as well: we basically have release-tools talking directly to Kubernetes, just making the Kubernetes API request to patch the resource and change the image name instead, and then we just bypass CI and release-tools handles it. It takes the locks, and we basically bypass k8s-workloads.
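As a sketch of the "talk to Kubernetes directly" alternative, release-tools could send a strategic-merge patch to the Deployment instead of triggering CI; the resource name, container name, and image below are assumptions for illustration:

```yaml
# Body of a PATCH against an apps/v1 Deployment (names illustrative);
# only the container image is changed, everything else is left alone.
spec:
  template:
    spec:
      containers:
        - name: webservice
          image: registry.example.com/gitlab/webservice:14.10.202204280620   # assumed
```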
B
That's got its own pros and cons, so maybe I'm being a bit too specific here, but I do think re-evaluating the process by which patcher and release-tools currently work with this repo is the overall thing I'm trying to get at with this one. Maybe I can rewrite it to be a little more generic, as in exploring what options we have for how release-tools specifically does this.
B
What I was hoping is that, by using something like just pushing commits or a merge request to the repo, we write the one pipeline for k8s-workloads that can be used for auto-deploys and everything else, and everything just becomes the same workflow. A configuration change or a deploy is still just a commit and still just a pipeline, and then we can do things like cluster targeting based off the files that were changed in the commit or the merge request, and things like that.
A
I think a question for me, that I'm not super clear on, is: as an SRE making changes, how does an SRE want to get their cluster changes out to production? I assume they don't want to wait on an auto-deploy, so we have sort of two requirements for how changes reach production, and then we can maybe start to look at how those things happen right now and what options we have to get closer to what we want.
B
Sure, yeah. It definitely wouldn't be waiting on an auto-deploy; it would be the opposite. Basically it's a unifying of the pipeline. At the moment, when you run a pipeline in k8s-workloads, you can pass a variable for whether this is an auto-deploy, and it does different things based on that. But what I'm trying to say is, whether an SRE or anyone is changing a config item, or whether an auto-deploy is changing the image that we're running, under the hood these are just changes to lines of text for the deployments, right?
B
So we can unify the interface, so that, honestly, you're just using git: you do the git commit, and when you commit, a pipeline rolls that out. Release-tools is creating a commit which creates a pipeline, or an SRE or a human being is creating a commit which creates a pipeline, and it's just a unification of that idea. But, as I said, there's probably a lot more that I haven't thought about.
B
We need investigation and discussion, and this one might link into how that works as well. It also makes the CI pipelines more isolated, because I think one of the options, even with this, is to do dynamic child pipelines. Believe it or not, the GitLab documentation on this is pretty good; they've done a lot of work since I last looked at it on dynamic child pipelines and the like. It's not unreasonable for us to do dynamic child pipelines per environment as well, and things like that.
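A minimal sketch of the dynamic child pipeline shape mentioned here: one job generates per-environment pipeline YAML, and a trigger job runs it; the generator script and file names are assumptions:

```yaml
generate-environment-pipelines:
  stage: prepare
  script:
    - ./scripts/generate-child-pipelines > environments.yml   # hypothetical generator
  artifacts:
    paths:
      - environments.yml

run-environment-pipelines:
  stage: deploy
  trigger:
    include:
      - artifact: environments.yml
        job: generate-environment-pipelines
    strategy: depend   # parent pipeline reflects the child pipeline's status
```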
C
I have a question on this, because, yeah, I was making my point before but then someone rang the doorbell, and you've come back to the topic. So I'm really supportive of the dynamic child pipelines; it's something that I wanted to try myself. There is an open question to me here, though, and I don't think this is defined in the documentation.
C
If we put a resource lock on the trigger of the dynamic child pipeline, will that lock extend to its dependent pipeline? Will that lock extend to the whole pipeline? Because if the answer is yes, and we can get QA smoke tests confirming this in the product, then this is a great way of having isolation at the environment level, which will give us more parallelization in terms of what we can run on several different clusters.
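The configuration being asked about would look roughly like the sketch below; whether the `resource_group` lock is held for the whole downstream pipeline, rather than only the trigger job itself, is exactly the open question raised here, so this is the shape to test, not a confirmed behaviour:

```yaml
deploy-gstg:
  stage: deploy
  resource_group: gstg            # the lock whose scope is in question
  trigger:
    include:
      - artifact: gstg-pipeline.yml            # hypothetical generated child pipeline
        job: generate-environment-pipelines
    strategy: depend              # trigger job keeps running until the child finishes
```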
B
Okay, so then the last epic, which is kind of the biggest one. The very short description of this, once again with my very bad analogy: at the moment this repo is very much about putting in the recipe to bake a cake, and then in CI we actually run that recipe.
B
We bake the cake. Does the cake look like what I expect? What is the cake in production, and how do we diff that? So there are always surprises that happen at CI time. The common thinking around this in the upstream community, on how to do things, is the model called GitOps, pure GitOps as they call it, which is a bit of a marketing term more than anything. But at the end of the day,
B
what it means is that, instead of just putting all these pieces in the repo on how to, quote unquote, bake the cake and then baking it at CI time, you change your tooling and workflow so that when you make a change in the repo you can more or less compile it and get the actual manifests, the actual output that's going to be run against the cluster, and you put that in the repo as well. We do this for other repos; the runbooks repo is a classic example.
B
It's already a fairly well-known model. So when you're putting a merge request up, you're not only putting up the merge request that says, oh, I changed this value for a helm chart, or I changed this; you're actually committing the output as well: this cluster will change this file, this cluster will change this file. Then in CI you validate that the person hasn't done the wrong thing, so you take their input,
B
you confirm their output is the same, purely as validation. And what that means is that when people are testing locally, everything is there for them to see: I'm going to test this input; oh, it has changed this output; I can confirm that locally and I'm ready to push that up, before they even do a merge request. There's no kind of development in CI; it all happens locally, and they can see the changes.
It's
a
lot
clearer,
it's
a
lot
more
sensible
and
then
the
merge
request
is
always
just
I've
changed
this
source
and
you
know
you
can
see
the
output
here.
Ci
validates
that's
correct.
We
no
longer
have
to
run
diff
jobs
against
actual
clusters,
so
we
can
remove
some
of
the
permissions.
We
need
from
that
making
ci
jobs
a
lot
faster
as
well,
because
any
diff
like
I've
changed
this.
What
is
it
changing
on
the
output
side?
Is
there
in
the
merge
request?
So
it's
very
easy
to
validate
you
use
it
from
a
user
experience
perspective.
B
It makes it incredibly easy for anyone inside, or even outside, the company to ask: what is the current value of the memory requests on the git https deployment pods? It's right there in git. They can search for it; it's just there. It's a very easy, obvious process. So this would be a bit of a shift, hopefully not too much. The other thing this covers is our use of helm.
B
Helm does a few things for us: it generates the manifests, it applies them and rolls out the change, and it determines whether the change succeeded. It does an okay job at the first thing, a good job at the second thing, and a very poor job at the third thing. So by moving towards this model of the manifests being in the repo, we will still use helm, because we have the GitLab helm chart, to generate the manifests. That's always going to be there in some way: we're going to use helm, pass the helm chart in, pass some values in, and get the output out.
B
That's always going to be there. But if we don't use helm to do the other stages, we are free to choose better tools and develop our own workflow, which I think could actually become a standard workflow that we adopt, like a best practice or a standard we use in delivery, for applying things to a Kubernetes cluster, rolling out a change, and then, most importantly, determining if that change was successfully applied or not.
B
I mentioned a little earlier the way helm works: let's say we're doing an auto-deploy, helm applies it to the cluster, and it will wait until all of the pods are running the new version, which is fine. But if a pod is crashing, it'll just crash over and over, and helm just sits there saying, okay, I'm going to wait for the pods to finish crashing. This is all by design, because Kubernetes is about eventual consistency; it's not like you do a request
B
and get a valid response back straight away. But what it means is that, if no one's watching, the CI job will just run, the pods will crash over and over for an hour, and then we have a timeout that says, okay, if it hasn't succeeded in an hour, or whatever the value is, I think it's two hours, I don't know, then just roll it back. Which is fine, so then it rolls it all back.
B
But that means that, unless you're sitting there watching the pipeline, like the output of the pipeline, you probably lose an hour, well, two hours, before the pipeline fails and you notice something is wrong. Fortunately, with the way we do Kubernetes rollouts and everything, it's very safe; we usually don't have incidents or anything going wrong from a customer perspective, but it's very slow.
B
I don't see any reason to use anything different, and then we can come up with our own tooling solution, or just use what's already available, to monitor the rollout. I think this is incredibly valuable, because we can start with something simple that's probably better than what we have now, which could even just be a set of tests, a script: are the pods crashing?
B
Just very simple stuff. At the moment we can use kubectl rollout status to give us the output of the number of pods that have changed, but over time, as we have more incidents and find more problems we need to check for, we can expand that test suite. So we can actually start to say, okay, now we want to check the pods are valid, or we may want to watch Prometheus metrics, right, like we may want to say...
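A very simple version of the rollout check being described, assuming kubectl access and an illustrative deployment name; the point is that the job's test script can grow over time:

```yaml
verify-rollout:
  stage: verify
  script:
    # Block until the Deployment converges or the timeout expires, instead of
    # relying on helm's long built-in wait.
    - kubectl --namespace gitlab rollout status deployment/gitlab-webservice-default --timeout=10m
    # List anything not Running (informational here; a stricter version could
    # fail on non-empty output, check crash loops, or query Prometheus).
    - kubectl --namespace gitlab get pods --field-selector=status.phase!=Running
```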
B
Yeah, and I think we can develop, you know, we can start doing a lot more with this. At the moment we rely on CI variables again, which is okay, but I think we can do things like having labels on MRs: if you put a label like skip-qa-test, or if you have a label like emergency, then maybe it just applies, like that's the only job, right, it just skips the rest of CI and applies it, and things like that.
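A sketch of the label-driven behaviour mentioned, using the predefined `CI_MERGE_REQUEST_LABELS` variable (available in merge request pipelines); the label name, job, and entrypoint are assumptions:

```yaml
emergency-apply:
  stage: deploy
  rules:
    # CI_MERGE_REQUEST_LABELS is a comma-separated list of the MR's labels;
    # only offer this job when the (hypothetical) "emergency" label is set.
    - if: '$CI_MERGE_REQUEST_LABELS =~ /emergency/'
      when: manual
  script:
    - make apply ENVIRONMENT=gprd   # hypothetical entrypoint, skipping other validation
```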
B
We can do things like that, or, you know, emergency might not validate that this was correctly changed; it will just do an emergency kind of apply. So I think we get a lot out of that. But yeah, I think the freedom there to define our own validation of a rollout is going to become more and more important, because, and I don't think this is very well known since it doesn't cause customer-facing incidents, when something does go wrong, our mean time to detecting that change and fixing it is poor.
C
So I have a counter-proposal to this, which, I mean, is even harder to implement, but what about a gitlab.com operator? Because what it looks like to me here is that we just want to change variables and versions. So there are some things that we want to control, like the versions or some configuration options, and then what you define as wanting to build our own rollout-checking strategy sounds more like wanting,
C
yeah, an operator in the cluster that is making sure the system converges to what is expected, and alerts if it's not converging, and things like that.
B
I agree, I think that is the ultimate long-term goal; I'm just not sure if we can get there right away as a first step, but...
B
Yep. We can, you know, it's useful for blue-green: when the platform team starts looking at blue-green, it's invaluable for that. I think it solves the auto-deploys helm chart issue, because the helm chart becomes part of the operator, and therefore, if we're auto-deploying the operator, we can get the helm chart in there. So I do agree that that is definitely something we should look at.
B
It's really just a question of whether we look at that now; I'm happy to look at that now, if we think we can do that now.
A
So, an interesting point in time, since we are now at time for this meeting. Next steps: Alessio, do you want to join in on this document, or do you think your comments are covered? The doc gives us an easy way to edit things, or do you feel happy enough with what we have?
B
I don't think there's, like I said, I'll freely admit these aren't a solution yet; further refinement needs to be done. But I do think we're at the point where doing it on the issue, where the conversation is recorded, works now. It was really just a case of: is there anything major that needs to change? I did rewrite some stuff, but I think now we can kind of move forward.
A
Awesome, thank you so much for going through that, Graeme, and thanks for the thinking you've been doing on it. I hope everybody appreciates that this has not been a trivial problem to pull apart, so yeah, it's great to have a sort of direction for this.
A
We are at time, but was there anything else anyone urgently wanted to go through today?