From YouTube: Trigger separate deployments
Description
Related to https://gitlab.com/gitlab-com/gl-infra/delivery/-/issues/1578
A: All right, so I prepared, like, a small agenda. I don't know if you saw it.
A: So I was reading through the issue and through the epic, and I do have some questions about it, mostly about what we are trying to fix and how we plan to fix it. So when I read about this pipeline, this coordinated pipeline, I have some trouble imagining it, or visualizing how it's going to be, in my head.
B: So, okay, so we're thinking about this one, 1578, and this is a representation of the desired state of the pipeline that we want to have. So, right now... where is the powerful zoom thing here? It is "annotate".
I hope it works. So right now we have this. These things are already in place, yeah? Okay, so, and here is a trigger. This is a fire-and-forget trigger from release-tools to the deployer. One of the reasons why the deployer is so complex is because it supports multiple environments.
Everything is at least duplicated four or six times, I don't remember, because we also handle pre-prod and other environments. So every single job is generated, or just edited, or duplicated for every environment. Okay, and then we have special rules with special jobs that are only there if you have more than one environment. Think about baking time or manual promotion: this is a special job, in a special stage, that is there only if the deploy environment variable contains canary and production, at least canary and production, right? So this is the level of complexity.
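For illustration, a hedged sketch of what such a conditional job could look like in GitLab CI YAML. The job name, stage, and variable name are assumptions for the example, not the actual deployer configuration:

```yaml
# Hypothetical sketch: a promotion job that only exists when the
# deployment spans more than one environment.
promote_canary_to_production:
  stage: promote
  rules:
    # Only create this job when the environment list mentions both
    # canary and production (variable name assumed for this example).
    - if: '$DEPLOY_ENVIRONMENTS =~ /canary/ && $DEPLOY_ENVIRONMENTS =~ /production/'
      when: manual
  script:
    - echo "Promoting canary to production"
```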
B: So we would like to... let me draw here, okay. So we would like to trigger every single environment as a single, isolated deployment. Okay, and because we then want to trigger the next one, we need to wait now. We need to wait for the end of the triggered pipeline. But there's an extra thing that we want to do, right? So this is part of this epic here, phase two. It's a big epic, which is also about what we can move back into release-tools.
B
Because
many
things
are
just
triggers
back
to
release
tools
or
just
triggering
other
thing,
or
just
things
are
already
implemented
in
chat
ups
or
in
release
tools
and
just
just
duplicated
code,
which
often
times
is
in
in
ansible
or
is
in
python
or
some
scripts
are
in
ruby.
So if we could move more of the knowledge back to release-tools, where most of the team can operate freely, that would be better, and we'd have more control over what is happening, right? So, but this is the end goal of the epic. This epic... we're working on this right now because it's important for rollbacks. So this is not the main epic of this quarter, so we'll try to do what we need, but not complete the whole epic.
B: And manual promotion is exactly the same thing, but it's a manual job and not a delayed job. And I'm quite sure that there's basically one environment variable of difference, which basically informs release-tools whether it's a baking-time check, so basically the difference is in the message that gets printed out. One writes to the monthly issue (the manual promotion writes to the monthly issue); baking time writes to Slack. I'm quite sure it's just handled by an environment variable.
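A minimal sketch of that difference, assuming the description above is right: the same promotion step as a delayed (baking-time) job and as a manual job, differing by one variable. All names and the 30-minute value are assumptions for the example:

```yaml
# Hypothetical sketch: two variants of the same promotion step.
baking_time:
  stage: promote
  when: delayed
  start_in: 30 minutes          # assumed baking period
  variables:
    BAKING_TIME_CHECK: "true"   # hypothetical variable name
  script:
    - ./bin/promote             # hypothetical promotion script

manual_promotion:
  stage: promote
  when: manual                  # manual job instead of delayed
  variables:
    BAKING_TIME_CHECK: "false"
  script:
    - ./bin/promote
```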
B: This is just moving it into the GitLab YAML in release-tools, and it's there, yeah? That's why I was putting this here. And then the final one is triggering the production deployment. Now, I want to keep this in parallel with this trigger here, the old one, for a reason. There are many moving parts here, and as we simplify the deployer, we want to remove the ability to trigger more than one deployment. So here it is.
A: Yes, yes, it does. So to summarize, we are going to basically trigger a deployer pipeline for every environment. We are not going to do the radical idea that jarv mentioned about moving the CI into release-tools, at least not for this iteration. Perhaps later, but not for this one. And you mentioned that you want to have two deployments, so to speak: the one we have today, the long one, the multi-environment one, and another one that is going to use a different branch, the next-gen branch, which is going to be, kind of, the new implementation, right?
B: That's the check mode. I should have explained, because I know what it is. When you set CHECK_MODE to true, I think, in the deployer, basically it runs Ansible in check mode. Running Ansible in check mode means that it just goes through all the steps, telling you what would have changed, without touching anything.
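As a sketch of the behaviour being described: `ansible-playbook --check` is standard Ansible; the job name, flag name, and playbook are assumptions for the example:

```yaml
# Hypothetical sketch: a deployer job that honours a CHECK_MODE variable.
deploy_environment:
  stage: deploy
  script:
    - |
      if [ "$CHECK_MODE" = "true" ]; then
        # Check mode reports what would change without changing anything.
        ansible-playbook --check --diff deploy.yml
      else
        ansible-playbook deploy.yml
      fi
```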
B: And when we like what we see, we can tweak it, we can add a flag, whatever we think is the right thing to do here. We can have something that allows us to say, for a deployment, we want to test the new one instead, right? So we just flip it, run it, say "yeah, it's working", and then we can go back.
B
Okay,
so
that
that's
that's
the
idea.
I
would
like
to
talk
about
a
bit
more
about
the
waiting
part,
but
if
you
have
other
questions,
we
can
go
through
your
questions.
First,
no
problem,
no.
A: No, I think that one basically answered my first question. The second question that I have was basically related to the waiting time, because right now... let's say that we are going to deploy staging, since we are going to trigger separate deployments.
B: So it's a tiny bit different, okay? Okay, so Omnibus and CNG are clever, in the sense that we know that they always take at least 30 minutes, sometimes even more, so they would never take less than 30 minutes, let's say. Okay, so we delay the pipeline, I think... I don't remember exactly, basically, before starting the active waiting.
We just have a delayed job, so that we don't waste money on waiting.
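A sketch of that pattern using GitLab's delayed jobs; the 30-minute value and names are illustrative, not the actual configuration:

```yaml
# Hypothetical sketch: delay the wait job so the runner is not
# busy-polling while the build is known to still be running.
wait_for_package:
  stage: wait
  when: delayed
  start_in: 30 minutes        # builds never finish faster than this
  script:
    - ./bin/wait-for-package  # hypothetical script doing the active wait
```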
Okay, then we start waiting, and we have... you can see it, because it's based on what is in the Retriable gem. So we have this Retriable gem that catches exceptions, right? And it has a special context, with special default values, for handling long waits, because by default Retriable never waits more than 15 minutes.
It is designed for things that should happen quickly. It's basically designed for APIs, right? So you're asking for something, and maybe the system is under heavy load, so you don't want to DDoS it. You start waiting more and more and more, but you are waiting for something that should be quick; we are waiting for the system to no longer be under heavy load.
While here we are waiting for something that is designed to take a long time to run. So basically the numbers are tweaked a bit, so that it doesn't time out after 15 minutes and things like that. And so that's basically active waiting. So it's a waste of time, but for CNG and Omnibus there's no way around it, because it's on another instance; we trigger on dev.
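A hedged Ruby sketch of that tweak, since release-tools is Ruby: the Retriable gem's default gives up after 900 seconds (15 minutes) elapsed, so the ceiling is raised for a long poll. The numbers, the error class, and the helper are assumptions for the example, not the release-tools values:

```ruby
require 'retriable'

# Hypothetical error raised while the downstream pipeline is pending.
class PipelineNotComplete < StandardError; end

# Stub for the example; the real check would query the pipeline status.
def pipeline_finished?
  false
end

Retriable.retriable(
  tries: 200,                 # poll many times...
  base_interval: 60,          # ...roughly once a minute
  multiplier: 1.0,            # constant interval, no exponential backoff
  max_elapsed_time: 3 * 3600, # allow up to three hours overall
  on: [PipelineNotComplete]
) do
  raise PipelineNotComplete unless pipeline_finished?
end
```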
B: There are a couple of nice things here. The first one is this: if we generate this type of report here, an env report, the job that loads that artifact will have those variables defined, which is super cool.
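This sounds like GitLab's dotenv report mechanism: variables written to a dotenv file in one job become CI variables in later jobs that consume its artifacts. A minimal sketch, with job and variable names assumed:

```yaml
record_versions:
  stage: prepare
  script:
    # Write variables to a dotenv file...
    - echo "DEPLOY_VERSION=1.2.3" >> deploy.env
  artifacts:
    reports:
      dotenv: deploy.env

use_versions:
  stage: deploy
  needs: ["record_versions"]
  script:
    # ...and they are defined as CI variables in the consuming job.
    - echo "Deploying $DEPLOY_VERSION"
```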
Then here we have a bridge job that can trigger another pipeline in the same environment, in the same instance.
So basically here, instead of this test here, we may trigger the deployer, and when we say that the strategy is "depend", we say we want this bridge job to wait until the end of this triggered pipeline and reflect its status. And this is cool, because it's no longer active waiting; it's just for free. It's just the same instance that will move your pipeline along when the triggered one is completed.
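A minimal sketch of such a bridge job, using GitLab's `trigger` keyword with `strategy: depend`; the job name, project path, and variable are illustrative assumptions:

```yaml
trigger_staging_deploy:
  stage: staging
  variables:
    DEPLOY_ENVIRONMENT: gstg               # assumed variable name
  trigger:
    project: gitlab-com/gl-infra/deployer  # assumed project path
    # Wait for the downstream pipeline and mirror its status,
    # instead of fire-and-forget.
    strategy: depend
```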
B: So what's the problem here? As jarv mentioned, if the thing that we trigger fails, the whole pipeline just gets stuck, because basically the bridge job will fail, and the failure of a bridge job will set the next stage as skipped. So in our case, think about it: the trigger for staging fails because of QA, maybe because of a flaky test. I mean, for QA it can also be a real failure.
It's a bug, it's definitely a bug, because if you retry it, it will be green, and the whole pipeline will be marked as succeeded, even if more than half of it is skipped, yeah. So what I was trying to do is reach out to the PM for Verify, to figure out if they can kind of prioritize that fix. I mean, there are full-blown issues about whether we can retry the whole pipeline, which would be nice, but I don't care, right? Because, as release managers, we can manually retry the job that failed. Usually it's one or two, no more, so it's still doable. But then the triggering pipeline should pick up the next stages, basically, yeah. If it is stuck, we can't use it. So because of that, I hope that we get a fix in time, but I would consider doing active waiting, because we already have the code, more or less, right?
B: I would say, why not do something different, which is: by default, we delay by, let's say, 40 minutes. It would never take less than 40 minutes to run a full deployment, not even on staging; it should be around one hour. So we say we wait 40 minutes, and then we start the job, which can wait another...
My point is... because, let me try... so right now we can do something like this: we can do active waiting on staging and canary, and passive waiting on production. The reason being: I want to see the red pipeline at the end if production fails, but I don't care about waiting, because I have nothing to do after that, right now, in the pipeline that we are designing.
B: Okay, so even given that, I would say that staging and canary take roughly the same amount of time to deploy, because it's the same amount of machines, you see; it's a small environment.
A: Yeah, I wasn't sure if staging and canary deployments last the same, because staging actually executes post-migrations and canary doesn't. So, for me, it is kind of larger. For me... I think canary, off the top of my head, normally lasts less than an hour, and staging can be an hour and a half.
B: Yeah, but you know, on staging this is true, but staging has a smaller footprint, so fewer concurrent requests, so both regular and post-deployment migrations take less time, because there are no busy tables there. Which is not true for canary, because you do the regular migrations there, and oftentimes you may have a busy table, so it may take a long time even for them, because I think the maximum timeout for a single migration is 40 minutes, because we do the retry thing, right? So it can require up to 40 minutes.
I don't remember the correct numbers, but I'm trying to give a ballpark number here that kind of works. But if it doesn't work, we can tweak it later, because right now this would not cancel an ongoing deployment.
A: Okay, yeah, I think it makes sense. I guess I was remembering, like, an edge case in which staging actually had millions of records and production had, like, 100, so I think we shouldn't consider it, okay. So, just another question regarding the implementation that you told me about. I guess it's not clear to me, this idea of having a delayed waiting, like the same that we have for baking time, and then we actually start waiting.
I'd like to get the answer, because... yeah, yeah, I'm trying to wrap my head around the idea that you proposed about having a delayed waiting.
B: So basically this job here (and you can check it out by yourself) is just searching for the pipeline.
Knowing the tag, it just finds the URL of the pipeline on the dev instance, and it waits. It just loops, and every minute checks: is this done? Not yet. Is it done? Not yet. Up to the maximum timeout. But instead of starting immediately, it will start 45 minutes after we complete the tagging. That's it.
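A hedged sketch of that delayed active wait as a CI job. The API call shape is standard GitLab (`GET /projects/:id/pipelines/:pipeline_id`); everything else (job name, token, IDs, timeout) is an assumption for the example:

```yaml
wait_for_dev_pipeline:
  stage: wait
  when: delayed
  start_in: 45 minutes   # the build cannot finish before this
  timeout: 3h            # assumed maximum wait
  script:
    - |
      # Poll the downstream pipeline on the dev instance once a minute.
      while true; do
        status=$(curl --silent --header "PRIVATE-TOKEN: $DEV_API_TOKEN" \
          "https://dev.gitlab.org/api/v4/projects/$DEV_PROJECT_ID/pipelines/$DEV_PIPELINE_ID" \
          | jq -r '.status')
        echo "Downstream pipeline status: $status"
        case "$status" in
          success) exit 0 ;;
          failed|canceled) exit 1 ;;
        esac
        sleep 60
      done
```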
A: Got it. Yeah, it is the same idea. Okay, cool. So we just passed our time, but just one last question: the scope of this issue is just triggering a separate deployment for staging, then for canary, then moving baking time and promotion to release-tools, and finally triggering production, right? Yeah, plus the other jobs, QA and whatever they are.
A: Okay, all right, yeah, okay! Well, thank you for answering all my questions.