GitLab Delivery Team, 9 Nov 2022

Previous Meeting Next Meeting

⏯

youtube image

►

From YouTube: 2022-11-09 AMA about GitLab releases

Description

Delivery Group's monthly AMA about GitLab deployments and releases

A

Hey everyone I'll wait just a few more seconds to let people join.

A

Okay, I'm gonna go ahead and get started so welcome. This is November the 9th 2022, and this is the delivery groups. Monthly AMA. So thank you all for joining. We have a question in the agenda, um but I don't think cash is here so I'll verbalize.

A

What are the main challenges you've faced daily so who would like to go ahead and share? I'm, not sure Harper has enough time, but uh keep it give us a summary who would like to share some of their uh the daily challenges of uh kind of I guess like working in delivery group or managing deployments and releases.

A

Here we go, hang on just pause yourself, one second Kesha's just joining hey record cashier. We were just kicking off on your questions, so thanks so much for adding it so uh delivery group who would like to go first. What are your main challenges.

B

I'm looking at Myra smiling, but she doesn't seem to be talking so I'll.

A

B

So I'll say that uh there is um a main challenge that is a recurring one, which is that um as part of the delivery group through, you are doing some shift as release manager. So basically you can think of what you want to achieve during a year, and you have your quarters, your okrs and everything, but then it's a full month where you are doing release management where basically everything else stop.

B

So this this is really hard on personal level, as well as on a team level. Then, if we go on what we do as delivery outside of being released manager, one of the main challenges I think in our case is communication, because we are kind of at the end of the process. But when we want to change something when we want to Implement something new, we have to either implement it, but also convince everyone starting from product development, whatever. Why we're doing something?

B

What we're doing, and so this is another big challenge and then finally, uh challenges release manager.

B

So the when you are doing release management, I would say the big challenges are incidents because you, these are the things there are. These are the things that are blocking your day-to-day activities, because we see a deployment to gitlab takes hours and basically we try to fit as much as we can into the 24 hour, but then getting outside of an incident if it involves a package that it has to be rebuilt or fixed, or something like this can easily take eight plus hours so yeah.

B

Those are I, think the three main challenges, at least for me,.

A

Nice thanks for sharing that last year, yeah I I definitely um definitely uh would resonate with a lot of those I. Think it's very interesting as a kind of release manager you have sort of some days are incredibly quiet and it's just promoting to getlab.com. We just have one manual step that we have on our process, so we just hit the button and hopefully it all just moves through slowly.

A

um You know and we're all we're all good there, um and then there are other days like particularly around kind of security releases and the monthly release where I actually trying to coordinate all of the different tasks you're trying to accomplish, and if you have a failing pipeline like it it's incredibly time consuming, and certainly that's the probably the bit that I find most uh challenging you're kind of in this uh reactive mode and trying to coordinate all the pieces.

A

um So I think the challenge is really the um sort of the volume of tasks and um the significance of getting the ordering right getting the communication right and meeting any deadlines um as we go through.

C

Right um I was also told as a follow-up question, um or maybe not quite actually. um I was also told that um when someone merges an Mr Mr with a feature flag- and they did not enable it in staging um that piece of work is not going to like, they won't be able to enable it in production.

A

Correct, yes, I believe there is actually a check on the feature flag, the checks that staging has already been fully enabled. Yes,.

C

um I have another question: uh if um so, at my previous company, we obviously had uh sandboxes and pre-production environment, and so all these different environments, where you could actually test out your code without actually deploying it, it would actually emerging in into a master and when I joined gitlab.

C

Recently I realized that um we primarily do testing locally and then we merge it into master and that basically goes to different stages where you QA in staging and then in Canary, and then in you know, um and I was just wondering uh like the decision um that let's not have any like uh testing QA QA environments, rather pushes all towards like after merging into a master I wonder what um um like. How did that decision uh come to be, um and are there any challenges around that? Because I guess people that are new join?

C

um They might merge something that wasn't quite ready. Maybe I don't know, but they didn't really have a Sandbox to test it in. um If you know what I mean, if that makes sense, I hope that makes sense.

D

Yeah, uh this is a fun one Mara! Please, please go first. If you want to yeah.

E

Yeah, no I think it is a great question and I think there are different different angles to it.

E

uh So one angle is that git love receives a lot of commits daily and even hourly, so we need to be uh fast enough to try to keep the pace uh and actually a funny thing is that in the beginning, um three or four years ago we only performed two or three deployments to production, because we actually were testing like commits in a in a different environment, but that was so slow and by the time we promote to production when everything was broken, because we were not able to to test it like in a real environment.

E

So because of it, we decided to be more agile and and try to have this Auto deployed pipeline that actually uh promote everything in a very fast paced, and we do have some q, a environments uh that is like staging Canary. We have is also staging ref and each on the well.

E

We also have production Canary and in each of these environments, we have that respective q a that try to test well a smoke, reliable tests and well also in the terms of when something is not ready, and you want to ship it just to continue like to your development. Well, we have another tools like feature flux in which you can merge something into master and it shouldn't impact, because the changes are that like under a feature flag.

D

And even with merges into the primary gitlab project or other projects that make up this whole Suite, there are actually pipelines and abilities to run all of these and build the commit deploy. It run. A full end-to-end test even run some surrounding full application Suite tests. In some scenarios they aren't always used to be fair, and but they are there, they're, just not always enforced.

C

um Are we talking about prior to merging to master or after murder.

D

Yeah prior prior to prior to the merge request even being accepted there.

F

D

A suite of tests that have the ability to do a full, buildup and deployment of your code and all the related code related to whatever your changes are so if, for example, there's an API change in Italy and we need to make that same, API change in the rails makes use of it. You can actually deploy all of those components as a whole Suite that will build everything and deploy it and actually perform tests against it.

D

So back in 19 years you get to make a certain amount of call on exactly which tests get run, but those environments are actually fully automated.

E

I think those tests are also automatically triggered, depending on the files that you modified on the gitlab repo. But yes, sometimes they are optional and you need to actually click like a job in the pipeline to trigger and to build the package.

G

Yeah but I think that what you're getting at is that, ideally, we would have a pre-production environment, that's very similar to what we have production, usually click a button deploy your code and just do your manual testing that way right. We don't have that right now, because it's a it's difficult to set up a data set that size and um you know and enable like a thousand people to be able to work on that. So we have these proxy things where you know we have these staging Canary deployment.

G

We have these cute, automated QA, it's not completely. The same I can see there being a you know: there'd be a case for having a much nicer environment. That's actually mimicking production as closely as possible, but I don't think we're there yet I, don't think it's a constitution. Is that there's a lot of little things we can do before? We need to do that, um but it's something we should probably consider.

B

We are actually working on something that is say partially related to this, because uh we are extending the standard, backboard policies from only the current Milestone to three version back, and so we actually have the problem of uh not having a long-running environment for the all the three stable uh versions, and so we are working on a way to have those environments set up and running in a kind of um say now we have continuous delivery, disabled branches instead of just installing the packages and as we are going through this, we are actually thinking about.

B

Can we build on top of this short-lived environment that could be tied to a given merge request so that you can create and run prior to merging something and yeah? So this is all there's an old discussion that are happening. There are problem with dataset.

B

There is a problem with the number of merge requests, because if you take a look at the I mean the number of merge requests that we have in GitHub the project, if everyone is going to spin up a kubernetes cluster or just a namespace inside a cluster and run pause, this is going to be also a huge cost issues. So there are many things connected to this.

C

Thank you thanks.

A

Thanks for asking great question: does anybody else have a question they would like to verbalize.

A

I would just mention, um as a sort of a final note, then so uh near up the top just above the agenda, I've added um a couple: Three Links uh to the kind of three big things that delivery group is working on this uh this quarter, so um the first one maintenance policy extension is what Alessio just mentioned so hopefully we'll be able to open up the maintenance policy a little bit within the next quarter or two and be sort of regularly accepting back uh bug fixes to go back a couple, more versions, we're also we've just started working on our deployment pipeline observability work.

A

So in the moment we actually don't have brilliant sort of data or trending around our deployment pipelines. So we can't really easily see if we have a long-running pipeline like whether we commonly have long-running pipelines or whether we you know a certain job has trended up or does it fail in sort of certain patterns? So this is going to be the sort of first piece if it's actually starting to build that out.

A

So we can make better decisions about how to improve the Pipelines and then our final piece is we're working to make the kubernetes Clusters easy to rebuild. So this is sort of a good bit of Maintenance that we've sort of fits on top of our kubernetes migration work that we've been doing to actually allow us to, hopefully make the classes a bit more flexible and hopefully that unlocks some sort of other deployment approaches in the future. For us as well.

A

Awesome Kesha: do you want to verbalize your question.

C

um Yeah so I have another question like what happens when, um like a maintainer emerges, a uh an MR and um it's it's actually breaking staging, but because I'm asleep I can't be I I, don't know that it was merged um Etc and it wasn't under a feature flag, let's say um and so uh yeah what happens in that situation.

A

Yeah, this is a great question, so what would happen in this case is, um it would most likely hopefully be caught on our staging Canary, so the first uh environment, where we run QA tests on the package, then we would end up with failing uh most likely. We are now with failing tests um and I should say. Actually. The very very first step is with a bit of luck. It would have failed to merge right.

A

Hopefully it would have failed tests on the merge, Pipeline and wouldn't have actually mentioned, but in the event it did, it would most likely have been caught on our stage in Canary environment. The tests would have failed, so the deployment would have rolled out the package tests would fail. At that point, we would ask the quality on-call engineer who's one of the software engineers in test.

A

They they have a rotation, they would have done an investigation to sort of figure out which test is failing and what's the cause of that most often they would identify the Mr quite quickly, and we, if, if you weren't online, we would uh use the development um the dev escalation process. The development dri would revert the change to to unblock the pipeline, so that's the sort of the most usual one.

A

There are times where um the failure can be a bit obscure, and maybe we can't necessarily identify the um the exact Mr causing it, in which case um a bit more investigation again. Dev escalation is our process that we use to engage people from within development if we don't know who specifically or which specific Stage Group to go to until we investigate and find the actual cause, and then we revert that out. So we have processes to catch us. They are fairly um involved.

A

They do generally require sort of like three four, maybe more people to be involved in this, um and they also take a while one of the things um which I think Alessio mentioned earlier was one of their actual pain points we have on our process is reverting it's a really slow process, because a revert Mr is basically the same as an MR, so an MR has to be created, it has to get merged and we have to build a new package and we have to deploy that.

A

So actually, the turnaround time of a broken stage in Canary and a recovered stage of canary is is quite a lot of hours, um which is why it's certainly in delivery.

A

We are huge fans of feature Flags, so if you're feeling in something that might be risky a feature flag is a really really great way, because what that does is it puts the control back in your timeline because we can just get the code changed to be deployed to the environment, but it's completely up to you when you turn that on so you can do that in your day, you can see the tests running. If there are any problems, you can turn that off. So it's certainly a much shorter recovery sort of loop for us.

B

There's also something to be said here that is, uh we have rolled back as an option which is lowering the time it takes to recover the environment, but is not lowering the time it takes to get out of the Sadie incident, because, in order for us to restart the deployment process, the regular one we still need to have that thing revert or fixed, usually revert and then merged package and everything. So the detailed amount of hours is still.

B

The same is just that we may restore the functionality of the environment earlier if we, if we have the opportunity to roll back. Thank you.

C

And how often does this kind of situation happen.

E

More often that we would like to admit yeah.

A

Yeah- it's probably a it's probably maybe once a week, I would guess like on the whole I think how much uh I sort of um approvals on maintenance for reviewers. You know our merch pipeline test as you do. A great job um and a lot of feature flags are in use as well, so I think an awful lot of um of these sorts of things were avoided, um but we probably I would say on average around one a week.

A

The real the real pain point for these things is um the impact it has on other teams. The most significant time is in the sort of several says three four between two and four days before the monthly release, because it what it can end up with is, if we don't do a deployment for seven or eight hours. Actually, quite a lot of changes don't make the monthly release because of a hard deadline. So the impact can be quite wide, um but we do only see a reasonably small number.

A

But I do think, like with our with our process our kind of reviews and all the tests that go on in the merge pipelines. Like you know, I I think an awful lot of stuff is caught very early on. um So it's not a. We shouldn't be in fear of of this happening, um but a feature flag is a great way if we have got something. That's slightly risky.

B

There's also the impact of the things that is broken that has to be considered right, because, if you're talking about uh major features broken with no workaround, this is exactly what we described right. So we're gonna, stop everything and find work to revert and revert, and nothing will continue. But if we're talking about Minor feature that is not behaving correctly, but there are workarounds or is not really use that often because maybe there's something that is not even under a QA reliable test.

B

So we can't notice this until it's in production, and at that point you say: okay, what's the impact of this, and we can say we can keep it buggy and just fix it and in the next 6 to 12 hours. This will be rolled through all the environments, and so we get fixed so not always rolling back is an option. We're talking about uh priority one and two issues mostly.

C

And when something goes into, the product to production, um I I might have read or heard somewhere that people actually like manually, enable the deployment to next stage is that is that what happens? Correct.

D

B

Right there is a baking time of one hour in uh before that. So when we have the package running on the canary stages for one hour, release managers receive a pings. They have a pink and there's. uh There are information about the healthy of the system that get taken at that point in time, and so it's just you click the button.

C

And why is that not automated, but rather done manually.

B

There are no, there are release. Manager are not wearing a page, a pager, so there is no expectation for us to be actually online when this happened, and so this is one of the main reason we have been working on automating this couple of quarters ago, but we were starting from far far far away from that point that we actually implemented rollbacks and all this all the tests. But we never went through the phase where we're just going uh automating automating the rollout, because we have no availability list of release managers and knowing.

B

If someone is online or not.

C

A

C

More mean uh sorry.

A

I was going to just add to that so like there, when, when a deployment is rolling out um to the production environment, if uh if there are any problems- or you know like questions or things, is the release managers who would join the incident to help the um the engineer on call, so we have a kind of responsibility for when changes are rolling to production. We guarantee that there's a release manager available at that time, so just to keep those two things in sync, um we have still the manual promotion.

B

And is not just to be around at that point in time, but it's kind for the next two hour, even more because their the rollout takes time. So this is the commitment that relates manager are saying I'm here, I'm online I will I can pay attention for the next two hours see.

C

Right and we have um release manage assuming like in every time zone, so they're like 24 hours a day, that's right, yep,.

A

Okay, we're getting close to time. Does anyone have a final question they would like to verbalize today.

F

I got something if you don't mind, go.

A

F

So this most recent line of questioning reminds me of the fact that we still have to be available, but the a delivery team member must be available in another deployed package is going out so I guess the challenge here is: what could we do to remove ourselves from having to be online and make it such that it's fully automated without us needing to be around to kind of babysit? The situation like? What can we do to enable I, don't know infrastructure or maybe some other team to just be like? Oh well, here's a problem.

F

Let's just hit this button and roll back.

A

Like actually, rather than sort of um passing the responsibility to another team, I think having the automated rollback is actually the is the solution right so um increase health checks on our environments and roll back. If you know if certain uh flags are hit, and then that way, you know, even if it paused at the next stage, then we're not sort of relying on any human um to have to be the one to watch the risk.

A

Of course of that, is you really have to get those flags to be accurate, because otherwise you have a lot of uh rollbacks, which also take time when you didn't need to I. Think that's probably the biggest challenge um to actually enabling something like that.

E

I think another challenge should be our ability to Halt our deployment when it is going like when a certain metric has been reached right now. uh There is a point in timing which we cannot simply just cancel a deployment in the middle of it and for us to be removed from being like release manager Watchers. We will need something like that to Halt a deployment and then to start automatically a rollback.

A

Yeah agreed great thanks for asking that scar back so Kasha we're going to make your question the final question of the day, thanks so much for having them uh bringing them along today. Do you want to verbalize it.

C

um Sure uh so um so would you say we talked about a lot about like feature flags and and that being an option for everybody. uh Would you say that by default, people should just throw feature flags at a majority of their work, just to make sure that uh it's easy to just flip like disable it when it needs to be it's easy to kind of test it and make sure that everything works out.

E

I will say that for sensitive changes. Yes, when you are involving projects, CI builds or something that might affect a lot of users. It is easier and safer to roll it out under a feature flag and then just ship it gradually and actually feature flags are very cheap. Like developing wise, you just need to add a couple of lines, and then they discard that and then to remove it. It is also easy for trivial changes. I will say that is not necessary.

A

Do we have any official sort of guidance from like the back end maintainers like is there actual? Do we have any documentation about when to use a feature flag and when not to.

E

I believe we do, um it is documented somewhere. Yes, okay,.

A

We'll see if we can find out for the ketchup, because uh I think it's a it's a little bit nuanced, but I think generally um they're cheap, um but we do need to remove them. So there's a little bit more overhead. There.

C

Yeah I did read something um in our in our docs about that um it is quite high level. I guess it just says. If you have a sensitive change, then put a feature flag on it. That kind of thing. So it's not exactly saying like what area or like what um level of like how many changes or anything like that, it's more like. Basically, the the the the engineering needs to decide what they think and then just go by that.

A

Yeah and I'd say that's very much going to be down to impact, um as as um as Alessia was mentioning earlier, we wouldn't always roll back a change. So if it's a low impact uh problem, you know it it it's something we would fix forward. So in that case it probably wouldn't be expected. A future flag would exist.

A

Fantastic, okay: we are at time, so thank you so much. Thank you Kesha for having uh so many questions and uh everyone else who uh joined in the discussion like really great to chat to you all today and um enjoy the rest of your day. We'll, hopefully see you next month thanks a lot team thanks a.

F