From YouTube: 2021-09-30 GitLab.com k8s migration APAC
B: Sure, so I can give a quick, hopefully, demo of where I'm at with restoring deployment visibility, once I can find the right window. So at the moment I'm still experimenting with this, and I've still got a couple of pre-tasks to do. This is using a newer copy of Helm, a newer copy of kubectl and some other bits and pieces. But at the moment, for those who are aware: when we run Helm to do our deployments, we look in the CI job and all we see is it saying it's applying the new manifest. Then it just sits there spinning and spinning, and eventually it says it's done, job's completed, and we get very little feedback as to what's going on. I've been looking for ways to restore this and get some visibility into it. It was surprisingly difficult to do, and I'm not even sure I've got the best solution, but I've got one solution, which is the smallest and simplest one.
B: So what I'm going to do - this is in the pre environment, which is good value for this - is temporarily dirty the pre environment by putting a label on it, and then I'm actually going to deploy to that. This part of the command here is just the normal Helm upgrade, and it's this second part...
B: I should have actually kicked this off already, because it takes a while to run. There are some interesting pieces about this, though - and this is probably going to be just as much work, if not more, for me to get this to work.
B: So this means we have to drop the Helm --wait flag and the --atomic flag, so Helm is no longer blocking on resource completion, and it's also no longer atomic, as in doing that automatic rollback if something goes wrong. There's a linked issue to this which discusses dropping the atomic flag, and then there's the second part of this. Why is it always...?
B: It might help if I make the command valid. So the second part of this, which I can get into while we're waiting for this, is basically moving away from using Helm to do the atomic, automatic rollback when something goes wrong, and figuring out how we still capture that something has gone wrong and then either roll back automatically or have a mechanism where we can fire off another CI pipeline that does a helm rollback. There's already an issue lodged discussing this.
B: So these are all linked, and unfortunately we kind of have to go down this path if we want to actually have some visibility, because there's just no way for Helm to give us that visibility. For some reason - I don't know why - they took this functionality out. It's really annoying, but yeah.
D: Does that mean that we would block in the pipeline until the rollout is complete?
B: I still need to do a full test of this - this is why it's a work in progress - but the plan is that the rollout status command, which we'll see in a minute, will watch the rollout status of the pods, and we can set a timeout on that. There are actually two mechanisms. Kubernetes Deployment objects already have a setting where you can say how long a rollout should fail to progress before it's considered failed, if that makes sense - that's on the Deployment object. And then I can also tell kubectl rollout status on the command line that if it takes more than X amount of time it should bomb out as well. Either way, what will happen is that you'll see these logs from rollout status, and at some point it will return with a non-zero exit code, so the CI job - the apply CI job - will fail.
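(For reference, a minimal sketch of the two timeout mechanisms being described; the resource names and values are illustrative, not taken from the actual charts. The server-side one is progressDeadlineSeconds on the Deployment object, which defaults to 600 seconds; the client-side one is the --timeout flag on kubectl rollout status. Note that Kubernetes only marks the Deployment as failed, it does not roll it back by itself.)

    # Hypothetical Deployment snippet showing the server-side deadline:
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: registry
    spec:
      progressDeadlineSeconds: 300   # rollout is marked failed after 5 minutes without progress
      replicas: 2
      selector:
        matchLabels:
          app: registry
      template:
        metadata:
          labels:
            app: registry
        spec:
          containers:
            - name: registry
              image: registry:v3.0.0
    # Client-side equivalent, run from the CI job:
    #   kubectl rollout status deployment/registry --namespace gitlab --timeout=10m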
B: You'll be able to look at it and hopefully glean from the output what's going on. However, just because the apply has failed doesn't mean things are in a bad state. Let's say it's tried to spin up two new pods and those are crash looping: the Helm release object has got the new code, but Kubernetes has not progressed. It hasn't taken away any of the old pods, because the new pods are crashing and not ready, so things are still in a workable state. But then we need to figure out how we fix up the Helm release again. Do we just do a git revert? Or, in the case of auto-deploy, do we have a pipeline where - or, like you mentioned, every pipeline having a helm rollback job that you just click if you decide that, actually, this was a bad idea? The other thing we could do with the helm rollback job is make it run on failure: if the apply job does fail because rollout status returned a non-zero exit code, we could automatically run the helm rollback job, if that makes sense. So we're basically moving the rollback logic from being managed inside Helm to being managed by our CI pipeline.
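(A minimal sketch of the kind of pipeline being described, with the rollback logic moved into CI; job names, chart path, release name and namespace are illustrative, not the real k8s-workloads configuration.)

    stages: [deploy, rollback]

    apply:
      stage: deploy
      script:
        # Without --wait/--atomic, Helm returns as soon as the manifests are applied.
        - helm upgrade --install registry ./registry-chart --namespace gitlab
        # Watch the rollout ourselves; a stalled rollout exits non-zero and fails this job.
        - kubectl rollout status deployment/registry --namespace gitlab --timeout=10m

    rollback:
      stage: rollback
      # on_failure runs automatically when an earlier job failed;
      # when: manual would give the click-to-roll-back variant instead.
      when: on_failure
      script:
        # With no revision argument, helm rollback returns to the previous release.
        - helm rollback registry --namespace gitlab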
D: So let's say I do a registry application upgrade and I fat-finger the tag, the version, so obviously there's an image pull failure. Previously this would have been rolled back, but now we're in this state where you have an image pull failure. At that point, would we be able to just go forward, or would we roll back? What would we do?
B: In that situation - and two points for those of you who are watching the screen: this is the actual output now, you can see in real time what rollout status is giving us, showing new replicas coming online and old replicas going offline. So, to the question, there are two things to note here. One, I'm not convinced Helm is actually picking up when new pods are crashing and automatically rolling us back. I haven't seen a solid case of it.
B: I think what we see now is that Helm just times out, from what I've seen recently - and I don't know if something's changed, and I need to do further debugging on this - but I'm not convinced Helm is actually smartly watching and rolling back. I can't remember the last time I've seen helm --atomic automatically pick up a problem and roll things back, and I don't know if that's our setup, a bug in Helm, or something else, but it's always had to be us coming in and killing pods.
B: Remember how we have to kill the jobs ourselves before Helm sorts itself out? It never just seems to pick up the problems where we fat-finger something and end up with a crash-looping pod. So in that case - let's say Helm did do that.
B: The pod autoscaler should also still be using the old replica set - that's the one it should be modifying. It won't touch the new one until the new replica set has been marked as completed, as in the rollout has progressed completely. So in theory, yes - in theory there's probably never a situation where you're in a bad state that you can't just keep going forward from.
A: Could you write this out as well, in terms of the process? Because this is pretty similar to all the discussions we've had around deploying things and rolling them back, but kind of in parallel, so I'd like to make sure we bring those together and have the same kind of responses on rolling changes out, logging them, what happens when they fail, and how we recover from that. It doesn't have to be an identical tooling setup, but just so that we manage those processes in a similar way.
B: Yeah, sure, that makes sense. I think I've still got to flesh this whole thing out, because even just talking about it now, it sounds so simple just to restore. Basically, the reason we have to go down this path is that Helm used to have Tiller, and Tiller would log what was going on, so we could capture that log and put it in our CI jobs.
B: Helm - excuse me, Helm 3 - lost Tiller, and thus we lost those logs, so we can't do that any more. There are issues upstream with Helm saying please give us logging, please tell us what's going on, and they've just been brushed aside, or it hasn't been done. So we have to rely on some other process besides Helm to get output about what's going on - hence why we have to do this.
A: Are there people out in the industry who are also using Helm and have documented how they've recovered this kind of logging?
B: Yeah, so I've been looking at trying to figure out what other people are doing. This has rolled out successfully, so I'm just going to quickly un-dirty the pre release.
C: I have one question, because the way you're doing it right now is that you're trying to get the details of the deployment and list why things are happening. Could we spin up another pipeline, or another job in parallel to the Helm apply?
B: Yeah, so I was thinking about that approach - and I will get back to your question, Amy, sorry, I didn't want you to feel like I'd forgotten it. The problem is the race condition. If you run the rollout status command before Helm has actually applied the manifest, it will return straight away and say the rollout is complete, everything is updated, nothing's changing. You've got to kick off the change and then run the rollout status afterwards. I haven't even looked at whether Helmfile itself has a way to fork it off at the same time as it runs Helm, because it's got those kinds of hooks, so I do need to investigate this a little bit more. But yeah - if we could somehow guarantee that when we fork them both, the rollout status definitely runs after the Helm apply has happened, we could leave things the way they are.
B: They have stripped back a lot of the Helm logic for detecting whether something has succeeded or failed, but for pods and Deployments, which you'd think are a very simple, concrete concept, you would hope it works. So, as part of this, I definitely want to test what we currently have and capture, you know: if I fat-finger a config setting or an image tag, does it actually pick that up and roll it back, and how quickly?
B: And yeah, Amy, to your question: there are a few, but honestly a lot of it is custom tooling - a lot of it is people just writing their own. I've put on the issue something like four to six different tools I found that people use to wrap around the Kubernetes client libraries and provide pretty output of things changing, plus a lot of bash scripts that just wrap kubectl rollout status, or there's another command called kubectl wait.
B: The other big thing to note as well is that a huge part of the larger community now uses the quote-unquote GitOps operators - that's Flux CD and Argo CD. With those, you have a software agent running inside your cluster; it pulls the git repo down, and you can define that there's a Helm release in there to be run. So it's running Helm, but it wraps Helm and basically applies its own logic. Argo CD has this concept called Argo Rollouts, and Flux CD has its own kind of model, so those products have their own monitoring systems for how they determine whether a rollout, or the Kubernetes objects, are complete, and that's agnostic to Helm - Flux, for example, supports Helm, Jsonnet and other things - and they wrap it all up in a nice little package with a pretty GUI. Honestly, for what we do, it's actually kind of cool. They do all sorts of auto-canarying and promotion concepts as first-class citizens in the application. So it's a cool piece of technology, but it is different in that you're no longer using CI to deploy: your CI, or your git system, just merges, and then it's the controller in the cluster that rolls that out.
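(As a rough illustration of the "define the Helm release in git and let the in-cluster agent run it" model: a minimal Flux CD HelmRelease sketch. Chart name, versions and namespaces are purely illustrative, and the field names are from memory of Flux v2's helm.toolkit.fluxcd.io API, so worth checking against the current docs.)

    apiVersion: helm.toolkit.fluxcd.io/v2beta1
    kind: HelmRelease
    metadata:
      name: registry
      namespace: gitlab
    spec:
      interval: 5m                  # how often the in-cluster controller reconciles the release
      chart:
        spec:
          chart: registry
          version: "1.x"
          sourceRef:
            kind: HelmRepository
            name: gitlab-charts
            namespace: flux-system
      values:
        image:
          tag: v3.0.0               # bumping this value in git is the deploy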
D: Yeah, and what happens now, assuming we implement this and we get rid of Helm rollbacks and helm --atomic - what happens when I retry a failed deployment? It'll just fail again, right, or will something else happen?
B: Right, yeah, you're right. And to be fair - and once again, this is a big if - if you're currently in the process with --atomic and you cancel that job while Helm is waiting, and it doesn't pick up and roll it back, we're in the same situation. I think, as part of this, I should do some proper testing and investigation of what the current state is in terms of which failures we're picking up.
B: Picking them up, and how long that takes - because there have been a lot of times where I thought, oh, this is taking a long time, I can see the problem, this is crashing or not working, and I've had to cancel the pipeline, and Helm just hasn't picked it up. But I don't know whether it's just been, say, that it can detect crashing pods but not crashing init containers. Maybe it's that.
B: Honestly, this exercise, although it's growing in scope a little bit, is probably worthwhile. All of this is just around: how do we understand whether a Kubernetes deployment is progressing or failing, and what do we do with that? So, even stepping away from technologies for a second, maybe we should capture that, possibly even as a runbook - how do we determine, what is the tooling and pipeline that determines whether something has succeeded or not, and what do we do then.
A: Yeah, I think that would be really useful to go through. Feel free to just put it all in an issue or something for now, and we can pull out the bits that end up being things we need to know - we can have runbooks for those bits and some other kind of tooling things - but it would be good to understand where things are.
B: Well, I mean, last night there was a discussion of how we feel about GitLab.com infrastructure and the k8s-workloads as well, right?
D: I totally see us triggering k8s-workloads from release-tools, right? That's the direction we're going to go, I hope, where, as we move things to Kubernetes, deployer becomes smaller and smaller - really, deployer right now is only doing migrations, post-deploy migrations, Pages, Gitaly, right? So maybe release-tools does the trigger to k8s-workloads.
A: Yeah, I think that's probably right, and that's kind of where - so, we have a lot of logic and process built around deployer, so I think it'd be really good to do a comparison of how that maps to k8s-workloads. What are we missing? What do we gain? What's different? And then solve this as a delivery problem rather than as a Kubernetes problem.
B: Yeah, I mean, that's the interesting thing, where the lines blur, right? Because k8s-workloads is interesting in that it is now - and someone can correct me if I'm wrong - the first intersection of SREs doing standard infrastructure work, "I just need to change this config or whatever", versus Delivery delivering code and code updates for GitLab the application. Now they're kind of blurred, because before, with Chef, your change would just go onto the host and update the package or whatever, and it was separate from whatever else was going on. But now they're actually in the one repo, and they're kind of intertwined.
A: So yeah, I mean, from the point of view of growing into a bigger company and needing more confirmation around this stuff, they should probably both go through the same level of tracking and documentation, and, you know, approvals and things. But there must be something there - it's not a great idea, in the context of continuous delivery of small batches and being able to quickly identify what you changed.
B: Well, we've got protection so that doesn't happen, right? We do have that in k8s-workloads. I mean, this is the other thing: we've built a lot of stuff - I wouldn't say cruft - around k8s-workloads related to exactly this, because that's the problem at the moment. We have a lot of bits and pieces and scripts in there to make sure that someone doing a config change isn't trampling on or fighting with someone doing a delivery change, kind of thing.
A: Like, fully automated, with testing and rollbacks and things in place? I'm not sure - would you want config to be running on a separate pipeline? Because it seems like what we struggle with a lot is these things colliding.
D: Sorry, yeah - I was kind of thinking of maybe flipping this around. Why can't release-tools, using the API, submit the merge request for a version bump in k8s-workloads? And we use something like merge trains, which sort of gives us what we want, because we want to queue up changes that are either application or config differences.
D: So, instead of triggering a CI pipeline with a CI variable that overrides the image version, we just have everything in git, and instead of triggering a pipeline we submit an MR over the API and merge that MR over the API, so everything is going through MRs to the main line.
B: I would love that, yeah. I think if that was possible to do, it would be great, because it gives you the git history of when image bumps happen. At the moment, with the CI variable approach, we've also got the problem - not even a big problem, but we want to get more "what you see is what you get" - that we actually use Helmfile to call back to Kubernetes to get the current running version of something.
B: We do some strange loops so that Helm can reconcile: if I'm not doing an auto-deploy, what's the version of GitLab I'm running? I don't know, so it goes and checks what the current version is and just uses that, kind of thing. Whereas if it's all in git and it's just in files, you can see it clearly, you've got a good history, and then...
D: I don't know why we were so set from the beginning on using pipeline triggers. I think it's because we'd done that in other situations, but I guess it would be just as easy for us to use the API: update the values file with the new version, submit an MR over the API, and merge that MR over the API. And we could do all of that from deployer or from release-tools, right?
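(A rough sketch of what that flow could look like as a job in release-tools, using the standard GitLab REST endpoints for updating a file, opening an MR and merging it. The project ID, token, version variable, file paths, local values file and MR_IID are all placeholders, not the real setup.)

    bump-registry-version:
      script:
        # 1. Commit the updated values file to a new branch (start_branch creates it from master).
        - >
          curl --request PUT --header "PRIVATE-TOKEN: $API_TOKEN"
          --data-urlencode "branch=bump-registry-$NEW_VERSION"
          --data-urlencode "start_branch=master"
          --data-urlencode "commit_message=Bump registry to $NEW_VERSION"
          --data-urlencode "content@registry-values.yaml"
          "$CI_API_V4_URL/projects/$K8S_WORKLOADS_PROJECT_ID/repository/files/releases%2Fregistry%2Fvalues.yaml"
        # 2. Open the merge request.
        - >
          curl --request POST --header "PRIVATE-TOKEN: $API_TOKEN"
          --data-urlencode "source_branch=bump-registry-$NEW_VERSION"
          --data-urlencode "target_branch=master"
          --data-urlencode "title=Bump registry to $NEW_VERSION"
          "$CI_API_V4_URL/projects/$K8S_WORKLOADS_PROJECT_ID/merge_requests"
        # 3. Merge it over the API (MR_IID parsed from the response above),
        #    or leave it for a merge train to pick up.
        - >
          curl --request PUT --header "PRIVATE-TOKEN: $API_TOKEN"
          "$CI_API_V4_URL/projects/$K8S_WORKLOADS_PROJECT_ID/merge_requests/$MR_IID/merge"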
D: Yeah, I mean, this is where I was thinking maybe merge trains could work, because that queues up the changes that need to be merged. But then, yeah, you do have this problem where you could have people stepping on each other.
A: Which we don't have a way of handling at the moment - at the moment we have the problem where we must have incidents where something happens, we see something strange, and we've recently applied some config changes and recently deployed some code, and it's not clear where that comes from. Presumably that only increases in the future.
D: I was going to say, it's almost like you want something like a lock on the pipeline. You would lock it in release-tools, say "I'm going to make a config change", lock the pipeline - which would stop all auto-deploys - make my config change, and then unlock it.
A: Why would you want to lock it, though? Because, say, in the future when we achieve small batches - my assumption is we will have so many of these things that you'd want, you know, a five-minute config change going out, a five-minute code change, a five-minute config change, code change...
B: So, in that case, really what we need - I wish Alessio was here, because he was telling me about this as well; he had some ideas on this when I was telling him about the whole "lock a whole pipeline" idea. He reckons there's a mechanism you can use where you basically have a pipeline whose one job is to spawn a child pipeline, and you can lock that job, and therefore you more or less lock your whole pipeline.
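(This sounds like GitLab CI's resource_group applied to a trigger job - a minimal sketch, assuming the actual deploy jobs live in a child pipeline; the job name, resource name and include path are illustrative. strategy: depend is what keeps the parent job, and therefore the lock, held until the child pipeline finishes.)

    deploy:
      stage: deploy
      resource_group: gprd            # only one running pipeline can hold this at a time
      trigger:
        include: .gitlab/ci/deploy.yml
        strategy: depend              # wait for the child pipeline, holding the lock the whole time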
D: Yeah, I think right now we lock at the job level, so changes can interleave. For example, you could be doing QA on your staging environment while a change is being deployed, and you probably don't want that, right? So we want to lock at the environment level. And if we did everything through MRs, where starting an MR would take the lock, and it's not released until that change has been fully applied, then I think that would solve the problem.
A: Interesting, yeah. So I think, Graham, you've got a fun task ahead of you, which is - I think it would be really interesting to actually do a comparison of what we're doing at the moment with k8s-workloads versus deployer, and then, say, in a year's time, how do these things come together? And then maybe that will help us decide on the next steps for the Helm logs.
B: Yeah. It also sounds like I probably need to spin up an issue - a tech-debt issue - around making the k8s-workloads pipelines fully pipeline-locking. That's a discrete, small task we could do, orthogonal to everything else. Even if we just make sure the k8s-workloads pipelines never overlap, for whatever reason, I think that's a good discrete change.
B: We could do that, and it gets us some value, because at the moment we do interleave and there's potential for problems. Then there's a separate issue of trying to get some visibility back into deployments, and then there's the separate issue of what k8s-workloads currently does versus deployer - what's the safety and rollback process around that, and so on. The thing with deployer is: how does deployer make the call that something needs to be rolled back? Because it's the same problem I've got here with Kubernetes - how do I make that call? There are some very simple, basic things I can do, like roll back if a pod is crashing, but beyond that it depends on how deep you want to go into it.
A: Yeah, let me send you the thing. It builds on what we have at the moment, but there will need to be some kind of health checks and metrics added in.
B: Oh yeah - anyway, this is good. I will definitely continue fleshing this out and understanding what we currently have, and then try to figure out what we want to do moving forward to get some visibility back. I think having something where we can see what's happening with a deployment is good, because at the moment you basically have to get on a console server and start looking around; you don't even know which pods are crashing, and so on.
B: We don't monitor jobs at the moment either - oh, Henry's here as well, okay. So, 100%, Helm does not monitor Job success. We use Jobs for the registry database migration, and Helm will happily say thumbs up, green tick, even if the registry migration Job is just failing. They've made a conscious decision not to support monitoring Jobs - although there might be a flag to do it which we don't have set.
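(The flag being alluded to is probably --wait-for-jobs, which newer Helm versions, 3.5 and later, accept alongside --wait so that Helm also blocks on Job completion. A sketch only, assuming our Helm version is new enough and using an illustrative release name.)

    script:
      - helm upgrade --install registry ./registry-chart --namespace gitlab --wait --wait-for-jobs --timeout 15m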
B: So it sounds like we need to understand that and get a handle on it. But potentially, if we do that and get a handle on it, it also means we could move the GitLab migration jobs into Kubernetes as well, and that would come out of deployer, potentially, if we got it into a good state.
C: I'm not sure about the plans for that. I think the Distribution team thought about this for a long time, and I think they have some ideas about how to do it, but I think for good reasons we stayed with a separate approach in our infrastructure for database migrations - especially if you want to hold back and things like that.
C: I think that was the easiest solution, and it's also the solution customers are using - our customers normally also use the Kubernetes DB migrations, I guess. But in our complex setup, with a lot of parts and pieces, I think we decided to do it separately. For the registry, it was easy to just use the mechanism that was already in our charts for DB migrations, and it's easy because we don't have the case of complicated post-deploy migrations yet - we only have pre-migrations, and so...
C: ...you run the Job and then run the Deployment, and you just have to make sure you have some kind of version compatibility, so that both the old and the new deployment can work with the migration. And so this is fine for now.
B: Probably, yeah. I'm just curious, because we're doing some migrations in Kubernetes and some migrations in deployer, and I was just curious why we chose different ways. It sounds like, realistically, if we keep the registry migrations in Kubernetes that's fine, but then, as part of this, we need to make sure that pipelines are failing, or that we've got some kind of monitoring, because at the moment there is zero monitoring. If those jobs fail, nothing - no pipeline will fail, nothing will happen.
B: Maybe it's because the actual registry pods detect that the migration hasn't been done and then crash out - maybe that's what's saving us there.
C: The issue that I saw, I think - but I'm not totally sure - is that if the DB migration Job failed because it couldn't connect to the database, for instance, then the Job failed, and then it was retried again and again with a back-off, and so on. And after an hour Helm would say, no, I can't get forward, so I will roll back. So that's what happened, and that's what protected the registry from rolling forward, I think.
B: Okay, so I guess two things there. One: maybe I'm wrong, maybe Helm does monitor Jobs - I was pretty sure it didn't, but maybe it does. And two: back to what I was saying earlier about Helm not detecting and not rolling back - I was wrong, it does roll back, but only after the timeout. So that's the problem: helm --atomic will roll us back, but it waits an hour. We have to crash loop for an hour, or whatever our timeout period is, before Helm rolls back. It's not that it didn't roll back.
A: Awesome, okay, great stuff. Is there anything else people want to go through?
A: Nope? Okay, fantastic. Thank you very much for the discussion. Feel free to put up several issues with these sorts of topics - don't feel like you have to work out how to stitch them all together. I think we need to pull this into the slightly bigger question of what this sort of stuff looks like in a year's time, and then we can tie the short term into that. Awesome, all right. Well, I hope you all have a good rest of your day. Speak to you soon.