From YouTube: 2021-04-23 Deployment SLO discussion
Description
Discussing the Deployment SLO OKR and Epic setup
A
We could share it out. So, okay, awesome. I just wanted to kick off before we dive into the details. So, the deployment SLO, right. I put a suggestion on the issue, which is that the deployment SLO focuses on tracking the time taken from starting a deployment through to completing the post-deployment migration step, which is the final stage of rolling out changes to the environment. But as I copied that out, I was going to ask you what we want to use this for, and I realized this description doesn't actually include what we want to use it for. I think that's the big piece that's missing from the deployment SLI.
B
I think we will be able to set a target. People usually ask us, "When is this change going to hit production?", and we say, okay, it's normally one hour to staging, then one hour to canary, then baking, and then production. But we always calculate that in our heads. We don't really have a measure, because this is something that we have been calculating on the fly.

So having that, I think, allows us to set some targets, and once we can set targets for our environments, or a general target, we can also set targets for the subcomponents of each pipeline. Let's say that italy normally takes one hour and suddenly it is taking two hours. We can say, okay, it has been taking one hour for the last 19 days, and now it is taking two hours, so what the hell happened?
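[Editor's note: the check B describes here, a stage that normally takes one hour suddenly taking two, could be sketched as a comparison against a trailing baseline. Everything below (the function name, the sample durations, the 1.5x threshold) is illustrative, not the team's actual tooling.]

```python
from statistics import mean

def stage_regressed(history_hours, current_hours, factor=1.5):
    """Flag a stage whose current run exceeds its recent average by `factor`.

    history_hours: durations of recent runs, e.g. the last ~19 days.
    Returns (baseline, alerted).
    """
    baseline = mean(history_hours)
    return baseline, current_hours > baseline * factor

# A stage that averaged one hour and suddenly took two
baseline, alerted = stage_regressed([1.0, 0.9, 1.1, 1.0], 2.0)
```

Applied per stage, the same check turns "what the hell happened" into "which stage happened".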
A
Do you think it's more, or maybe different? How do you think about it? Do you think we'll want to focus our time on, say, how do we make italy deploy faster? Versus, how do we make the pipeline not have 10 steps, for example? Like, if there are 10 steps in the pipeline, do we want to be able to measure and say step three is now taking a lot longer, or do we want the measure to be a thing which really motivates us to not have ten steps?
B
Yeah, exactly, I was going to say that they might be related. I think the first insight we are going to have is that, okay, this component is taking X time and should always stay around this average, and then we can probably start thinking about the deployer's stages and steps. Like, we have, I don't know, 10 steps, each one lasting around 60 minutes, so why don't we have, say, seven stages? Like the conversation we are having about having QA executed at the same time as the baking on canary.

From the top of my head, I think QA on canary, and then staging, lasts around 30 to 40 minutes. For example, in canary we start a QA, and after it finishes we start the baking time, and it doesn't really make sense. If you think about it, the baking time could be executed at the same time as the QA, because once the QA is being executed, that means that all the changes have been deployed. And I mean, if the QA fails...
B
There is no point starting the production deployment, because, well, the QA may fail. So I think it could give us some insight about the time, and it could help us also try to make our deployer pipeline faster and shorter, which is obviously a target that we are moving towards. Yeah.
A
Yeah, okay. It's definitely an interesting one, isn't it? Because there are actually points where it's not worth optimizing certain steps. For example, you're totally right: maybe we should just be running QA and baking time at the same time. But if we did make that change, there would actually be very little point optimizing one of them. At that point it wouldn't matter if the QA tests took 15 minutes instead of 30 minutes, because we'd still be waiting out the baking time, so the overall time becomes the driver.
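[Editor's note: the parallelism point is simple arithmetic, but worth making concrete: once two activities run side by side, only the slower one matters. A toy sketch; the stage names and numbers are invented.]

```python
def canary_to_production_minutes(qa, baking, parallel):
    """Window between canary and production for two activities.

    Sequential: durations add up. Parallel: the slower one dominates,
    so shrinking the faster activity buys nothing.
    """
    return max(qa, baking) if parallel else qa + baking

sequential = canary_to_production_minutes(30, 60, parallel=False)
in_parallel = canary_to_production_minutes(30, 60, parallel=True)
faster_qa = canary_to_production_minutes(15, 60, parallel=True)  # unchanged
```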
B
Yeah, but then, if that's the case, using the QA as an example: if we now know that it is basically 40 minutes between canary and production, because both the QA and the baking time are executed at the same time, then we can always reduce the baking time from one hour to 40 or 30 minutes. That is something that we have control over, because it is just... oh.
A
I wonder, because all of these metrics go together, right? That's the whole useful thing about a metric: it doesn't really exist on its own. It has counterbalances to tell you if you're going too far. So we already have MTTP. I think MTTP is probably the overall measure of, I suppose, how fast the overall process is, and the frequency. For me...
A
The big difference is frequency, because, as I've said a few times before, actually the easiest way for us to fix MTTP would be to give up sleeping, right? MTTP would be so low if we just deployed every single time we could deploy, but that's not the way we want to go. We know we want to automate, we know we want it to be fast, and so I think the deployment SLO needs to be the measure that forces that.
A
So let me say it a different way. Have you watched The Wire?
B
A
Yeah, that's a shame. Okay, it's my favorite example for why stats are bad. So let me think of a better example, then. I get your idea, though: you kind of want to make it so that we can optimize the metric in any way we can, and the end result is still what we wanted it to be. So, for example, with MTTP.
A
If it wasn't our own sleep, if we had a team of people deploying whom we didn't care about one bit, the easiest way to optimize MTTP would be that they don't get to sleep anymore. They've just got to keep hitting that deploy button, and MTTP would be the fastest it's ever been. But we personally know that we don't want to do that, because the negative side is our health, so we're not really going there.
A
So I think the deployment SLO is the measure that goes against MTTP: if we optimize the deployment SLO, we're forcing the deployment to be faster; the pipeline is faster. And then we say the deployment SLO is whatever number, say six, and MTTP becomes, I don't know, something low like 10, and we would force both of them. We would say the deployment SLO gets lower, and that naturally leads to MTTP getting lower, but they go together.
A
And then what we might actually be able to reason about, which we can't right now, is the difference between the speed of a deployment and our ability to deploy. What we have right now with MTTP is a combined metric: we say MTTP is 20 hours, and what someone outside the team can't see is, did the release managers take a day off, or did we actually have an incident that blocked deployments for 12 hours on top of a six-hour pipeline?
A
We
have
no
subtle
information
coming
out
of
our
mttp
measure,
so
maybe
deployment
slo
is
the
the
bit
that
goes
alongside
that.
It's
the
this
is
not
at
all
the
definition
of
it,
but
like
it's
the
I
guess,
it's
the
end-to-end
pipeline
measure
and
that
has
a
target
and
we
always
want
that
to
be
super
low
and
the
lower.
We
can
get
that
we
would
assume
release.
Managers
remain
awesome
and
therefore
mttp
becomes
low.
B
Exactly, yeah. I also see the deployment SLO as part of, or as a subcomponent of, MTTP, because MTTP measures from getting merged, to package, to deploying to staging, canary and production, and the deployment SLO basically measures the last three. So if the deployment SLO is low, it directly impacts MTTP: the lower the deployment SLO, the lower the MTTP.
A
Yeah, exactly. And then, from some of the stuff you've seen earlier, I'm wondering almost if it doesn't really matter, when we start out, where the optimizations are, if that makes sense. So I wonder if we have the deployment SLO and we just say it is seven, or whatever the current number is, it's probably about seven, and then there are probably two ways we can go with that.
A
Which maybe makes me think that we just need to have the data. Maybe it doesn't matter. I mean, it might not be an option, Andrew is usually very thorough, but it might not matter whether the measure includes the components, if that makes sense. I almost wonder if it's just that, when the deployment SLO changes, we have insights so that we can find out why.
B
I mean, it will give us some sort of alert, like, okay, now it is lasting eight hours and it should be seven hours. But I think when that happens, it might be useful to have some sort of breakdown, even if it is only at the level of stages. Like, okay, the updating-fleet stage that usually takes an hour and a half is now taking two and a half hours; something changed, and we need to pinpoint where that is coming from.
A
Exactly, that's exactly it. But I don't know if it matters, whether we necessarily need to worry about that. I mean, we might get it, but I don't know if we necessarily have to have it, certainly in a first iteration, something like "the italy stage alerts when it takes two hours". It may not matter if we just track the overall thing, the deployment SLO, and then, like you say, when it alerts, we have insights that tell us what's going on.
B
I think having some sort of analysis on each component would be useful to actually set that number, because we cannot just say, okay, it is seven hours, out of the blue. We need to base that on data, and we probably need to analyze the data from, I don't know, one month, two months, three months back, to try to estimate what it's going to be. In my head...
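[Editor's note: basing the target on a few months of history, as B suggests, usually means picking a percentile rather than an average, so occasional outliers don't define the target. A rough sketch; the sample durations are made up.]

```python
def nearest_rank_percentile(durations, pct):
    """Nearest-rank percentile: smallest value with pct% of samples at or below it."""
    ordered = sorted(durations)
    rank = max(1, round(pct / 100 * len(ordered)))
    return ordered[rank - 1]

# Hypothetical end-to-end durations (hours) from recent deployments
history = [6.5, 6.7, 6.8, 6.9, 7.0, 7.0, 7.1, 7.2, 7.3, 8.5]
target = nearest_rank_percentile(history, 90)  # a p90-style target
```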
B
What I am basically imagining is that, okay, we have a deployment, and we have some sort of SLOs for staging, canary and pre-production, because timing is different for each of them. We probably have one general one that is based on the last three, and it would be very good to also have some sort of estimates or SLOs for each kind of component, for example migrations and post-deployment migrations. We don't really have control over those, right? Normally, migrations take five to ten minutes, but there have been...
B
There
has
been
some
times
in
which
migrations
have
taken
around
60
minutes,
because
someone
actually
created
a
large
index
on
a
tv
migration
yeah,
and
that
shouldn't
happen,
because
that
impacts
the
deployment
slo
and
the
mttp
and
blah
blah
blah.
But
now
we
have
something
to
point
out
to,
for
example,
the
database
team
or
whatever
it
is.
We
can
relate
it
to
the
error
budget
team
that
is
associated
with
the
error
budget
or
something
right.
A
So, can we jump to number seven? You kind of already said what the next steps are, and just in the interest of time: I think we're on pretty much the same page about a lot of this, so we probably just need to get it specced up and in front of the rest of the team. The "define deployment SLO" issue, I think, covers a lot of this, so we can kind of wrap it up there. I think we do need to get a tighter definition.
A
I
think
we,
I
think,
everyone's
on
the
same
page,
about
what
this
roughly
is,
but
we
maybe
just
need
to
work
this
out.
So
in
terms
of
next
steps,
I
think
we
should
work
on
like
just
getting
this
all
set
up
and
then
in
front
of
the
team,
and
then
we
can
start
breaking
down
like
what.
A
What
would
we
need
to
do?
First,
like
I
actually
think,
I'm
kind
of
analyzing
the
data
from
last
90
days.
That's
probably
one
of
our
last
steps,
like
I
think.
First,
we
should
focus
on
like
what
do
we
want
deployment?
Slo
to
look
like
like
what's
the
measure
and
then
once
we
have
that
in
place,
we'll
naturally
have
the
data
like.
A
And
then
I
I
think,
for
the
I
mean
we
may
have
to
iterate
a
bit
on
descriptions.
I
think
the
defined
deployment
slo
issue
is
probably
like
the
first
issue
of
that
epic.
So
the
epic
itself,
like
can
be
a
kind
of
a
introduced
deployment,
slo
type
of
epic,
and
we
can
then
define
what
it
is
like
put
it
in
the
handbook
work
out
what
graphs
we
need
measure
it
set
targets
and
that
sort
of
stuff?
A
Okay,
so
I
think
totally
fine
if
you
want
to
keep
it
quite
like
high
level
in
terms
of
like
what
the
actual
description
of
deployment
slo
is
right,
we
can
work
that
out
as
we
go
yeah
for
sure
for
sure
I.
B
I think right now we can use the one that you said: it focuses on tracking the time taken from starting the deployment on staging through to completing the post-deployment migration, and then we can probably flesh it out as we go.
A
Yeah, I think that's probably the right measure. I just don't like the fact that it's not the full pipeline. I know we're going to have something later that picks up the final bit, but I don't like the fact that it cuts off the tracking tasks, because that's one thing I think is really important about the deployment SLO.
A
Tell you what, maybe it'd be easier. We know we're going to have a QA SLI, or test SLO, later. Maybe it doesn't matter if the deployment SLO includes that, because what I don't want us to lose is the bits of time in between the components. So I think we need to be really careful that we don't just go, hey, staging takes an hour, canary takes an hour,
therefore it's two hours. Because actually, within our pipeline, we do have handoffs, and I think those might be almost the most important bits of how we make the whole pipeline string together neatly. Yeah.
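[Editor's note: the handoff time A doesn't want to lose is exactly the part that disappears when you only add up stage durations: it's the wall-clock total minus the time the stages were actively running. A minimal sketch with invented timings.]

```python
def handoff_minutes(pipeline_start, pipeline_end, stage_durations):
    """Time in the pipeline not attributable to any stage (queueing,
    waiting, handoffs between environments). All values in minutes."""
    return (pipeline_end - pipeline_start) - sum(stage_durations)

# Staging and canary each actively ran for an hour, but the wall clock
# shows 150 minutes: 30 minutes sat between the components.
gap = handoff_minutes(0, 150, [60, 60])
```

This is why a start-to-end measure and a sum-of-stages measure are not the same number.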
B
A
Because that makes it a whole lot easier. That's a slightly different definition to the one I wrote on the issue, but I think that might be good, because actually maybe we just say it's between the pipeline starting and the pipeline ending. Yeah, basically the pipeline starting.
B
A
Because that's super easy to track, and it's a true picture: that's how long a deployment takes, exactly. And then maybe in the future there's a time where we say, hey, we're going to skip staging, or we're not going to... you know, that measure is future-proof. It can be whatever we want it to be. So yeah, maybe that's the easy thing, and then we just have a separate SLO that we'll introduce later, which just looks at the QA jobs.
A
Okay, yeah, I like that more; that feels better. So, on the epic, I think it's probably a case of focusing the epic on why we want a deployment SLO, and then we can use the "define deployment SLO" issue to actually be the thing that captures the what, if that makes sense.
B
A
I think it's going to be super useful. Actually, I guess Andrew probably has thoughts already, but it'd be good to have, if you're suggesting this is Apdex, the same sort of style and place as the Apdex that we use for the notifications as we go through deployments, like the fleet health. Not health, exactly, but how far we've deployed, like "we're at 50% of this thing"; that's all measured somewhere.
B
A
Wherever we put this new deployment SLO stuff, we already have Apdex being tracked for the deployment notifications that we get.
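[Editor's note: if this does end up expressed as an Apdex, as suggested here, the standard formula scores each sample against a target duration: satisfied runs count fully, tolerating runs (up to four times the target) count half. The 7-hour target and sample durations below are invented.]

```python
def apdex(durations, target):
    """Standard Apdex score in [0, 1]: satisfied <= target,
    tolerating <= 4 * target, anything slower is frustrated."""
    satisfied = sum(1 for d in durations if d <= target)
    tolerating = sum(1 for d in durations if target < d <= 4 * target)
    return (satisfied + tolerating / 2) / len(durations)

# Four reasonable deployments and one blocked for ages
score = apdex([6.0, 6.5, 7.0, 9.0, 30.0], target=7.0)
```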
A
Just
an
extension
like
another
one
of
those
or
something
like
that,
but
it'd
be
good
just
to
have
here's
the
place
where
all
of
these
things
live.
Oh
yeah,
yeah,
of
course,
okay,
so
that
all
sounds
good
so
just
to
very
quickly
summarize
before
we
wrap
off.
So
we
are,
we've
got
like
a
we're
going
for
like
there's
an
overall
deployment
slo,
which
is
that
issue,
but
we
also
want.
A
I don't know if it's just assumed already, but I'll just mention it: we definitely want to have something about a trend. The insight is kind of a number plus a trend, because I think that's what was interesting in that graph that Robert had: it wasn't particularly long, but you could see it trending up. And then we'll see the same with the post-deployment migrations.
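[Editor's note: the "number plus a trend" A asks for can be as simple as a least-squares slope over recent values, which catches the slow drift that no single alert would. The sample series is made up.]

```python
def slope(values):
    """Least-squares slope of values against their index (change per sample)."""
    n = len(values)
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    cov = sum((i - mean_x) * (v - mean_y) for i, v in enumerate(values))
    var = sum((i - mean_x) ** 2 for i in range(n))
    return cov / var

# Durations creeping up week over week; none individually alarming,
# but the drift is visible in the slope.
trend = slope([6.9, 7.0, 7.2, 7.3, 7.5])
rising = trend > 0.1
```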
A
Nice, okay, super. Is there anything else? Sorry, I apologize, I've skimmed through a whole load of your agenda. Oh, that's fine! I think we covered it all, actually. We covered a lot of the pieces, and some of these are kind of details; I think we should just put them in the issue, discuss them there, and get other people's insight.
A
One thing I was going to ask you about: do you want to just see how this goes async, in terms of working out the next piece of work, or should we set something up for next week?
B
I think we can start async. I mean, we have the next steps about creating the OKR and responding to Andrew. I can work on the epic, creating the epic and the next steps, and we can continue discussing there. And, you know, if we don't get that far, we can always set up a meeting. Okay, that sounds good, awesome.
A
Thank you for that, that was super helpful. Great, is there anything else we need to cover? No, no, I don't think so; we covered it all. Fantastic, cool. Thank you very much, I will see you in the issue. All right, take care, bye.