From YouTube: 2021-09-09 Scalability Team Demo
Description
No description was provided for this meeting.
A
Yeah, so it's about an issue. It started when Jan and Sean were talking about having a separate SLI for the service desk: the thing that receives emails and creates issues for those emails, and so on. Everything like that runs mostly in Sidekiq, so that's where that SLI would live, but it's nowhere near used as much as all the other Sidekiq jobs that run, so we would need, like...
B
But if you cast yourself back far enough, we didn't have Jsonnet generating all of this stuff. It was all hand-coded rules, and it was really hard not to mess things up, because you had to make sure that everything was here and here and here, in 10 different YAML files, which was really horrible. One of the things that I chose, to keep things a little bit simpler, was to have a single SLO per service, because it was just one less thing that you could mess up. The intention was never to stick with that, and now we have it in the Jsonnet.
B
And, you know, everything's got a single source of truth, instead of having ten sources of truth. I don't see any reason why we have to keep it that way. That's my take on it.
A
Cool. Do you have thoughts on where to keep it? Because the suggestion Sean made on the merge request was to keep it right next to the SLI definition.
B
Right, so, okay, it's slightly... So I think if you aggregate up: we'll have to treat those as the ones for the service, and at the service level we'll have one set, but then at the individual level, I guess it will inherit the ones from the service. But I think we should definitely keep them on... You're smiling, or wincing.
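(A minimal sketch of the inheritance being discussed. The real catalog is Jsonnet, and all names here are made up; Ruby is used purely for illustration, consistent with the other sketches in this transcript: one SLO set at the service level, which an individual SLI inherits unless it declares its own.)

```ruby
# Illustrative only: the real metrics catalog is Jsonnet, and all names are made up.
ServiceDefinition = Struct.new(:name, :slo, :slis, keyword_init: true)
SliDefinition     = Struct.new(:name, :slo, keyword_init: true)

# An SLI inherits the service-level SLO unless it declares its own.
def effective_slo(service, sli)
  sli.slo || service.slo
end

sidekiq = ServiceDefinition.new(
  name: 'sidekiq',
  slo:  { error_ratio: 0.005, apdex: 0.995 },  # the single service-level set
  slis: [
    # A low-traffic SLI with its own, looser SLO kept right next to its definition.
    SliDefinition.new(name: 'service_desk_emails', slo: { error_ratio: 0.05, apdex: 0.95 }),
    # No override: inherits the service-level SLO.
    SliDefinition.new(name: 'other_jobs', slo: nil)
  ]
)

sidekiq.slis.each { |sli| p [sli.name, effective_slo(sidekiq, sli)] }
```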
A
Yeah, but even for an SLA: if you don't specify a different one for deployment or for MTBF, then we'll just not record it, and everything's fine.
A
Well, the deployment one is used, I think, but not, like, for this kind of super-low-traffic SLI.
B
Although, what came up in a meeting I had, that technical debt assessment meeting with Christopher on Tuesday: they wanted to have, for every feature flag toggle, basically a more rigorous set of dashboards and everything. And what I was encouraging is actually pointing to an SLI instead. So in those cases with the feature flags, there would be, you know, an action on the individual...
B
You know, if the SLI of a particular feature went over an edge, then the feature flag could be automatically reverted, for example. That would be one approach to that, but that's a slightly different discussion. But I think, yeah, if we just had a single one, which was effectively the monitoring one.
B
Yeah, yeah, because the deployment one doesn't matter. The MTBF one... I mean, ultimately, over time, the MTBF one is kind of where we want to be. So those two should theoretically come together, but we'll see that happening when it happens. But yeah, I think, let's do it first with the monitoring one, and then we'll have a better understanding of it. And then we can move forward, and if we need the MTBF one, we can do it.
B
Yeah, yeah. I think that might get quite... like, there's a lot of, yeah.
A
There's a lot of conflict there. This is my main concern about...
A
That's
like
implementation
detail,
I'm
going
to
take
note
of
what
you
said
like
keep
them
close
together.
The
threshold.
B
There's another reason as well, Bob, and that's that we're starting to see some places where people are sort of duplicating services, and what they're actually doing is kind of denormalizing SLIs out of there. The case that I can think of is Alejandro with the Patroni and the Patroni registry. I think it's basically another copy of the Patroni metrics catalogue: rather than DRYing it up, he's copied and pasted, right? And in that case it makes more sense.
B
Yeah, yeah, exactly. But in that case, when you're looking at that SLI that's shared between multiple services, you wouldn't want to have to remember to go to another place to see the SLO, right? It makes sense for those things to be kept together. Although it might be problematic if you want different ones for different services, but that's just...
C
Since we are talking about error budgets, and having both of those thresholds listed for each of them, I just wanted to bring up that there is a conversation some people are having about splitting the view of error budgets into two parts, so that if anyone wants to contribute to that, they can. Some of the teams want to treat different...
C
They want to treat the different portions with a different level of urgency: they want to treat the errors first and the Apdex portion second. But when they're looking at it as a whole, they're only seeing it as a whole; they have to dive in to be able to split them apart.
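(A sketch of the split being asked for, as two PromQL expressions held in Ruby strings. The metric and label names are hypothetical stand-ins, not GitLab's actual recording rules: one term for the error portion of the budget and one for the Apdex portion, which today are only surfaced combined.)

```ruby
# Hypothetical PromQL; stand-in metric and label names, illustrative only.
window   = '28d'
selector = 'stage_group="some_group"'

# Portion of the budget spent on outright errors.
error_ratio = <<~PROMQL
  sum(rate(gitlab_sli_errors_total{#{selector}}[#{window}]))
    / sum(rate(gitlab_sli_total{#{selector}}[#{window}]))
PROMQL

# Portion spent on requests that were too slow (Apdex violations).
apdex_violation_ratio = <<~PROMQL
  1 - (
    sum(rate(gitlab_sli_apdex_success_total{#{selector}}[#{window}]))
      / sum(rate(gitlab_sli_apdex_total{#{selector}}[#{window}]))
  )
PROMQL

# Teams currently see roughly the sum of these two; rendering them as separate
# panels would let a team burn down errors first and Apdex second.
```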
A
...the time to be doing that. And I would also prefer to go to a place where we monitor groups the way we monitor services, so they have a short range: they watch the last six hours, like we do on the service overview, and then they see things happening there. And then the big number is more of, like, how much have we improved over time.
C
If that's the case, then I'll get the conversation to pause, so that they first see the advantages that are going to come from having the custom thresholds per endpoint. And if that doesn't give them enough, then we can start having a conversation about what they actually need further than that. But I think you're right: pausing the conversation is more effective. Yeah, I mean...
B
The patient is on the table, Bob is performing the heart surgery; now's not the time to give him a new pair of lungs. Well, maybe it is. Let's just... I would say we're going to have a very good idea of what that's going to change with regards to the Apdexes, so let's not change the other thing at the same time.
A
Makes sense, okay, yeah. What are your thoughts for the future on this, with what I just mentioned about having a short view? Like, what do we use the 28-day thing for? Because that's more of an SLA number. What's...
C
...really nice about having the short view is, people want to make changes and watch the changes affect the error budgets right now. They want to almost watch it while it rolls out, and watch the little graph change, and feel good that they can see the direct impact of their work. And over the 28 days you can't really; it takes a long... the gratification is not instant. So having a six-hour view is, I think, good. Yeah.
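(A sketch of the two windows in play, assuming a Prometheus server and hypothetical metric names: the six-hour number gives the instant feedback described, the 28-day number is the SLA-style view.)

```ruby
require 'net/http'
require 'json'
require 'cgi'

# Hypothetical Prometheus address and metric names.
PROMETHEUS = 'http://prometheus.example.com'

# Query the same availability ratio over a given window.
def availability(window)
  query = "sum(rate(gitlab_sli_apdex_success_total[#{window}])) / " \
          "sum(rate(gitlab_sli_apdex_total[#{window}]))"
  uri = URI("#{PROMETHEUS}/api/v1/query?query=#{CGI.escape(query)}")
  JSON.parse(Net::HTTP.get(uri)).dig('data', 'result', 0, 'value', 1)&.to_f
end

puts "last 6h:  #{availability('6h')}"   # fast feedback while a change rolls out
puts "last 28d: #{availability('28d')}"  # the slow-moving, SLA-style number
```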
B
So, I mean, Bob, this actually ties into that brief mention that I had earlier of the feature flags, right? So if you can tie a feature flag to an SLI, we could give the engineering teams an out, to avoid, you know: please give us the query that you want for your logs, please give us the query, the dashboard, you know, the Prometheus query, and fill in these 20 things. You can do the 20 things, or you can give us the name of the SLI, and...
B
It
in
the
first
version-
it's
just
oh,
this
sli
has
gone
south
okay,
cancel
it,
but
in
a
future
world
it's
like
turn
off
that
feature
flag
straight
away.
Like
no
human
involvement.
B
You
know
obviously
raise
an
alarm
and
everything
else,
but
just
just
toggle
it
off
and
you
can
actually
start
doing
stuff
like
you
know.
There
might
be
multiple
feature
flags
on
at
the
same
time
and
you
can
kind
of
statistically
look
at
where
the
which
one
is
contributing
the
most
actually.
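(A rough sketch of that future world. `Feature.enabled?` and `Feature.disable` are GitLab's feature-flag API; the flag name, SLI name, SLO target, Prometheus address, and metric naming are all assumptions.)

```ruby
require 'net/http'
require 'json'
require 'cgi'

PROMETHEUS = 'http://prometheus.example.com'  # assumption
SLO_TARGET = 0.995                            # assumption: the SLI's SLO target

# Hypothetical mapping from a feature flag to the SLI it should affect.
FLAG_TO_SLI = { new_service_desk_parser: 'service_desk_emails' }.freeze

# Current short-window success ratio for an SLI (hypothetical metric names).
def sli_success_ratio(sli_name)
  query = "sum(rate(gitlab_sli_#{sli_name}_success_total[5m])) / " \
          "sum(rate(gitlab_sli_#{sli_name}_total[5m]))"
  uri = URI("#{PROMETHEUS}/api/v1/query?query=#{CGI.escape(query)}")
  JSON.parse(Net::HTTP.get(uri)).dig('data', 'result', 0, 'value', 1).to_f
end

FLAG_TO_SLI.each do |flag, sli|
  next unless Feature.enabled?(flag)  # GitLab's feature-flag API
  if sli_success_ratio(sli) < SLO_TARGET
    Feature.disable(flag)             # toggle it off, no human involvement
    # ...obviously raise an alarm and everything else here as well.
  end
end
```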
B
That's it. And sort of the response that I got from Christopher was like: well, do you expect every endpoint to have an SLI? And I said: no, I expect every feature to have an SLI. And he was like: well, that's a huge ask. And I said: well, you know, do we want to monitor these features? I think it is a big ask, and it will take time, but if we can...
A
No, but the thing that Jan wanted to do with the service desk stuff, that's like the first thing. That's exactly it, yeah.
B
Yeah, because from the calls that I've been in, there's a huge amount of pressure; there's a lot of things behind that 525 at the moment. Like, another example was in the infradev call, or the engineering allocation call, on Tuesday. Eric did a great job of saying: well, you know, let's look at the error budgets that you published, Rachel. I don't know if you watched the recording of that call, but it was fantastic. It's like, okay...
B
Well,
so
we
breezed
over
it
and
then
eric
said
no,
no,
no,
let's
go
back.
What
are
we
doing?
You
know
these.
These
teams
are
out.
What
are
we
doing
about
this,
which
is
which
is
amazing,
which
is
like
really
kind
of
like
the
enforcement
of
it,
which
was
fantastic,
but
anyway,
one
of
the
things
that
came
out
of
that
was
well.
Then
the
teams
have
to
develop
modern.
B
You
know
one
of
the
things
that
they
might
have
to
do
is
develop
better
monitoring
and-
and
I
said
like-
let's
not
go
in
and
have
like
lots
of
teams,
adding
like
rules
and
and
alerts
to
to
to
the
run
books
like
that's
not
what
we
want
here.
B
Let's
rather
wait
for
bob
and
for
five
to
five
to
be
finished,
and
then
they
can
get
that
monitoring
by
adding
slis
and
that
will
ultimately
benefit
the
whole
product,
not
just
gitlab.com
and
also
it's
not
gonna,
be,
like
you
know,
a
reversion
back
to
everyone,
adding
like
the
10
alerts
for
the
10
last
incidents
and
going
back
to
that
cause-based
approach
right.
So
so
the
and
the
pressure
there
was
like
those
teams
are
getting
told
to
add,
alerting
and
I'm
saying
like.
A
But that's the reaction; that's just the thing that we are doing it for. Because the ask at first was: we want to set a different duration for a request. But the way we're getting there is a way to define an SLI, and we're using that thing for the requests. But then the groups could already start defining their own stuff earlier, like, yeah, before we've gone through all of the request endpoints and set appropriate thresholds on them.
A
It's
also
andrew
the
way.
I
currently
think
this.
It's
not
magic
like
when
a
new
sli
record
like
not
for
requests
but
something
else.
For
example,
the
service
testing
desk
thing
would
be
defined.
It
would
still
need
to
be
added
to
the
run
book.
Somehow
like
this
is
a
thing
that
sidekick
will
emit
and
yeah.
B
It
still
has
to
be
added
to
the
like,
I
mean.
Ultimately,
we
should
be
able
to
just
kind
of
pick
that,
like
like,
we
should
be
able
to
publish
that
label
through
the
you
know
through
through
prometheus
and
then
just
basically.
A
Inherit
that
label
it's
not
a
single
metric
like
everything
that
every
new
sli
that's
defined,
will
have
a
new
metric
and
that
should
have
a
feature.
Category
label,
but.
C
A
Right
now,
the
the
metric
is
like
the
requests
count
and
an
abdex
success
count
or
something.
I
don't
remember
why.
B
And then, whatever. And then on the runbooks side, we don't have to configure each one. We just say: this is a special kind of SLI, and we pick up, you know, the name; we inherit the name from the label. We don't define each one separately, and then literally, as it rolls out, you know, we pick it up. We'll probably still need to pick up metadata for, like, dashboards.
D
This is the point where we need to kind of talk a bit about how this will actually look, this negotiation between infra and development, because that needs to happen. Like you just said, there is a possibility someone will say: well, 10 seconds is fine, because my things are green; and our infrastructure can only take five seconds, for example. How do we want that negotiation to look?
A
For now, there's an issue inside that epic that says we're going to set the most important ones, for the endpoints that we're monitoring now, but that's again just for requests. But I haven't thought about the new-SLI bit of that, and I haven't thought about SLOs on the other side. So, on our side, for the request duration thresholds, in the beginning I would just do the thing that doesn't scale, and have people ask us.
C
I'd also say, in the beginning, when they are set, someone from Scalability should be a reviewer on those requests. And when we observe them to be excessive for what the endpoint is, we have to have a conversation with them about what it should be.
B
Can I offer an alternative proposal? So, I think the methodology that we used for Sidekiq, for saying that things can be slow, but then, you know, basically having trade-offs. So you can have a 95% SLO, but then... with Sidekiq it was: you can take a minute to run a Sidekiq job. I mean, you could take 10 minutes if you want, but then we're not gonna treat it as urgent. You can have, you know... maybe we can do a different...
B
Yeah, and so we give... it's kind of like a give and take. And this is, like, really kind of off the bat, but I mean, maybe we could do something like where we actually start sending traffic to, like, different parts of the fleet. And it's like: sure, your request can take five seconds, but there's going to be a lot more queuing on that Puma, and it might take, like, a second for that request to be seen, you know, in a busy time, because we're going to have different scaling characteristics.
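(The Sidekiq trade-off referenced is expressed in the GitLab codebase as a worker attribute. A sketch, with a hypothetical worker, of how a team opts into the slower, non-urgent SLO; `urgency` and `feature_category` are attributes of GitLab's ApplicationWorker.)

```ruby
# Hypothetical worker; `urgency` and `feature_category` are worker attributes
# from the GitLab codebase's ApplicationWorker.
class ServiceDeskEmailWorker
  include ApplicationWorker

  feature_category :service_desk

  # The trade-off: this job may take minutes to be picked up, and in exchange
  # it is held to the looser, non-urgent latency SLO.
  urgency :low

  def perform(email_id)
    # ...receive the email and create an issue for it.
  end
end
```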
B
The lever there is actually the autoscaling of the fleet, right? Because that's the thing that controls that queuing latency, the amount of time it will take for Puma to see a job. And so, you know, we'll say: by all means, you can have five seconds, but we're going to [route] it. The problem is, it's kind of difficult to route traffic according to those SLIs, right? But maybe that's something we can overcome.
D
And
just
to
reiterate
rachel's
question,
because
I
don't
think
I
understood
the
answer
to
it.
Is
there
a
way
for
us
to
make
this?
This
is
rachel's
question.
Is
there
a
way
for
us
to
do
this
smaller
right
now
with
one
team,
even
if
it's
super
janky,
because
I
think
one
of
the
other
items,
one
of
the
other
items
that
I
think
is
worth
mentioning
in
this
call-
is
in
that
same
engineering
allocation
call
after
eric
brought
us
back
to
the
error
budgets
topic.
D
He
said,
and
I
quote,
buckle
up
everyone
and
quote
this
is
going
to
be
a
rough
ride,
so
we
need
to
be
prepared
that
there
is
going
to
be
a
bit
of
a
back
and
forth
until
we
get
it
right,
and
I
think
we
should
be
using
that
warning
right
now,
if
possible,
because
it
works
in
our
advantage.
So.
B
We'll buckle up and see how far we can get, but I do think that using consistent metric names might...
A
What I did there... no, what I plan to do is enforcing a feature category label on an SLI. Well, "enforcing": it's something we check if we review it; that's what I'm going to mention. But then the feature category label, for example for requests, is going to be different all the time, and the naming is also prefixed; like, the names are also all going to be the same. It's prefixed with gitlab_sli, then the SLI name, and then it has a total and a success count. So, two counters with that name.
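(A sketch of that convention using the plain prometheus-client gem. The SLI name and label value are examples, and the enforcement mechanics described above are still being defined: every SLI gets two counters, prefixed gitlab_sli, each carrying a feature_category label.)

```ruby
require 'prometheus/client'

registry = Prometheus::Client.registry

# Two counters per SLI, following the naming convention described above.
total = Prometheus::Client::Counter.new(
  :gitlab_sli_service_desk_emails_total,
  docstring: 'Total operations for the service_desk_emails SLI',
  labels: [:feature_category]
)
success = Prometheus::Client::Counter.new(
  :gitlab_sli_service_desk_emails_success_total,
  docstring: 'Successful operations for the service_desk_emails SLI',
  labels: [:feature_category]
)
registry.register(total)
registry.register(success)

# On every operation:
labels = { feature_category: 'service_desk' }
operation_succeeded = true # stand-in for the real success/threshold check
total.increment(labels: labels)
success.increment(labels: labels) if operation_succeeded
```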
C
Can we... I don't have this fully formed in my head, so this may not come out straight, but can we get it to the point where we can tell the stage groups what we want them to do, while we actually build it to work in the background? So if we know that this is the structure that we want it to have: it needs to look like this in the files.
B
Out of interest, is that an API that you've defined?
A
I am defining it, yeah.
B
Yeah, but yeah, because you can have an API and get everyone to implement the API, and then what you do on the back end of that, you can parallelize those things.
C
Okay,
so
so
bob
as
part
of
updating
that
effort
today,
can
you
please
perhaps
actually
you
and
I
need
to
have
a
separate
conversation
after
this,
so
that
we
can
figure
out
from
that
epic
which
pieces
get
pulled,
because
I
want
to
be
able
to
communicate
to
people
we're
doing
this
part
of
it
at
this
line.
We
need
you
to
start
defining
slis.
C
While
we
do
these
other
things.
I
just
want
to
be
able
to
have
the
communication
that
we
can
send
out
to
people
so
perhaps
united
chat
after
it.
A
Would
be
very
nice
and
like
even
before
we
start
doing
stuff
with
the
sls
that
they
define
in
the
rails
code
base
and
that
metrics
are
there
when
we
start
doing
stuff
with
them
for
alerting
and
monitoring
like
because
otherwise
we
have
nothing
to
work
with
like
if
we
have
seven
days
worth
of
data,
that's
handy.