From YouTube: Scalability Team Demo 2022-02-17
A
So the idea is, I want to include these kinds of SLIs in the error budget for stage groups. So that's here: query durations. I think only the query durations for the primaries and the secondaries would go into the error budgets for stage groups, because then, if slow queries get triggered too often, that would be visible there.
A
The problem with this, as you can see, is that these things already have a feature category on them. So if we changed this to the feature category from source metrics (the magic string), then these things would feed into the error budget, and everybody would be very happy, because we do a lot of fast queries all of the time, and those would hide everything else in the budget. So the thing I was thinking about to make this possible was weighting SLIs.
A
So we could say, for example, that these SLIs only weigh, I don't know, a tenth of what a request weighs. I don't know about the exact numbers. So, does anybody have thoughts on whether we should do that? Whether it's going to be too complicated for people to understand, or whether it's even valuable to have these kinds of more hidden things included in the error budget for stage groups?
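The weighting proposal above can be sketched roughly like this. The SLI names, event counts, and the 0.1 weight are all illustrative assumptions, not GitLab's actual metrics-catalog values:

```python
from dataclasses import dataclass

@dataclass
class Sli:
    """One SLI's event counts over a window (names are illustrative)."""
    name: str
    good_events: float
    total_events: float
    weight: float = 1.0  # 1.0 = counts like a request; 0.1 = a tenth

def weighted_availability(slis):
    """Aggregate several SLIs into one availability ratio, scaling each
    SLI's events by its weight so a high-volume SLI (like fast SQL
    queries) doesn't hide everything else in the budget."""
    good = sum(s.good_events * s.weight for s in slis)
    total = sum(s.total_events * s.weight for s in slis)
    return good / total

requests = Sli("rails_requests", good_events=9_900, total_events=10_000)
queries = Sli("sql_queries", good_events=995_000, total_events=1_000_000,
              weight=0.1)

# Unweighted, the million queries would dominate the 10k requests;
# at weight 0.1 they count like 100k events instead.
print(weighted_availability([requests, queries]))
```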
A
If you pull up the gitlab_sql_primary_duration_seconds_bucket metric, you will see them, and also in the Patroni logs. I think we do this fancy thing where it should be there in a comment, because we annotate the queries we send over to the database, if I remember correctly.
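For reference, a query-duration SLI like this is derived from the cumulative histogram buckets Prometheus exposes. A minimal sketch with made-up bucket counts and threshold (the real buckets and thresholds are not specified here):

```python
import math

# Cumulative histogram buckets as Prometheus exposes them for a metric
# like gitlab_sql_primary_duration_seconds_bucket: bucket[le] is the
# count of observations <= le. The numbers below are made up.
buckets = {
    0.05: 950_000,
    0.1: 990_000,
    0.5: 999_000,
    math.inf: 1_000_000,
}

def fast_query_ratio(buckets, threshold):
    """Share of queries at or under the latency threshold -- the
    'good events' side of a query-duration SLI."""
    return buckets[threshold] / buckets[math.inf]

print(fast_query_ratio(buckets, 0.1))  # 0.99
```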
A
That's about right, okay. You can see it: if you expand the details, you will see which feature category has the slowest queries. But we only look at that during an incident.
A
And this leads into: right now we kind of have weights for things that we do in the general SLA dashboard and so on, but there's also an issue about trying out an internal measurement of general availability, not using time, but using events.
A
...they are there, or where they come from, but that's what we do. My proposal would be: the error budgets for stage groups are event-based, so the whole thing is how many events were good over how many total events there were, and that's the availability number.
A
It becomes very hard to explain, because I'm already having trouble now. So I'm wondering what people's thoughts are: whether it's something we should explore or not, or whether there are better avenues to get stage groups to focus on database queries and the performance of their database queries. The same goes for Gitaly calls in the end.
A
Yeah, they contribute more or less equally. We already have a single metric, but the numbers have lined up well enough that it works: for most groups this just means requests from Rails and Sidekiq jobs, and the numbers are similar enough that one doesn't outweigh the other.
A
Well, not troubleshooting; it wouldn't change the way we look at a single SLI on a dashboard. It's more about including this in the budget, in the overall number. People can now say that a request is allowed to take up to five seconds. If all of those requests include queries that take five seconds, then each one is going to look good, but that's really not healthy for the database. So.
B
I wonder if that's more an issue of how you set up each SLI. Maybe what needs to be adjusted here are the thresholds within an SLI, while each SLI keeps the same weight. It's just that maybe this SLI is more permissive with its specific metric because of X or Y reason. Right, exactly: as you said, what constitutes an acceptable database query time is different from what constitutes an acceptable request time. But then that's what you need to adjust.
A
But the thresholds for the SLIs are already different, right? If you have a request that takes five seconds, and inside that request you're doing a thousand database queries that each take less than 100 milliseconds, then you would score very well. That's why I'm proposing the weights: because we perform a lot more queries than we do requests.
A
Right now we don't have a problem, but the thing we want to end up with is people looking at the database queries they are writing: whether they are going to perform well, whether they need to add indexes after the fact, and that kind of stuff. We want people paying attention to that before we have an incident. We have a lot of slow-query kinds of incidents.
C
Well, first of all, I guess by including the service weight in the calculation you're multiplying the numerator and the denominator by the service weight, which is the same value, so it doesn't change anything.
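The point C raises can be checked in a couple of lines: for a single SLI the weight cancels out of the ratio, and it only changes anything when SLIs with different volumes are combined (all numbers here are illustrative):

```python
def availability(good, total, weight=1.0):
    # Numerator and denominator are multiplied by the same weight,
    # so for a single SLI it cancels (up to float rounding):
    return (good * weight) / (total * weight)

def combined(slis):
    """slis: list of (good, total, weight) tuples."""
    good = sum(g * w for g, _, w in slis)
    total = sum(t * w for _, t, w in slis)
    return good / total

# Single SLI: the weight makes no difference.
assert abs(availability(99, 100, weight=0.1) - availability(99, 100)) < 1e-12

# Combined SLIs: down-weighting the high-volume one shifts the result.
unweighted = combined([(99, 100, 1.0), (9_000, 10_000, 1.0)])
weighted = combined([(99, 100, 1.0), (9_000, 10_000, 0.1)])
assert abs(unweighted - weighted) > 1e-3
```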
C
Yeah, okay! That's what I thought. So I guess for metrics where we don't have as many requests, this helps to balance them out a bit.
A
And the reason why we want to try out an event-based thing is because then a Sunday night will not weigh the same as a Monday morning, if we're doing it right.
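The difference between time-based and event-based availability can be illustrated with two made-up periods, a quiet one and a busy one:

```python
# (good_events, total_events) per period; the numbers are made up.
sunday_night = (990, 1_000)          # quiet, 99% good
monday_morning = (90_000, 100_000)   # busy, 90% good

def time_based(periods):
    """Average each period's ratio equally: a quiet Sunday night
    counts as much as a busy Monday morning."""
    return sum(good / total for good, total in periods) / len(periods)

def event_based(periods):
    """Pool all events, so busy periods dominate the number."""
    good = sum(g for g, _ in periods)
    total = sum(t for _, t in periods)
    return good / total

periods = [sunday_night, monday_morning]
print(time_based(periods))   # ~0.945: each period weighs the same
print(event_based(periods))  # ~0.9009: the busy morning dominates
```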
C
To put it another way, you're sort of normalizing the number of operations per service, right? You're trying to normalize them so that they're on equal footing. Does that sound right?
A
That's the initial goal. It's also kind of a lever. I don't know if we're going to look at it like: we're generally doing this many database queries per request, so we're going to say we do five times more database queries than we do requests (that's a random number, I have no idea). But then we might say the database is doing pretty well, so we might not divide by five for database queries, just because of that. We might.
C
Yeah, because basically what you're saying, or maybe what was missing, or what I didn't understand at first in this proposal, is that you have many different services that have this ratio, and then each service will have its own service weight. And then, when you aggregate all of that together, you're looking at: okay, this is the SLO for the entire, I guess...
C
...web API, right? And then these weights help normalize the individual services so that they're more equal, yeah.
A
Yeah, and my initial idea, the one you read in the issue I just had open, was to put it on a service. But now I'm leaning more towards actually putting a weight on a service and/or on an SLI itself, like we have monitoring thresholds for both: you can define the global one on the SLI, but you can also be more specific inside the SLI itself.
C
Yeah, okay. So really it's not so important that this rolls up into the top level, the SLO, or the SLA for gitlab.com. I mean, I guess that's also important, but this is more about the SLAs each stage group or each service has.
A
...averages. But, since you are better at screen sharing than me, could you open up the monitoring dashboard, which I've recently been working on?
A
So the top numbers there, Apdex and error ratio, those can also get muddied if there's an SLI below them that has way more traffic than another one.
A
That's why I'm saying maybe we should weight those as well. For the web it's kind of okay, because we have the Workhorse thing and Puma, and they're actually kind of the same thing, but they aren't: if Workhorse alone goes bad, we'll know; if Puma alone goes bad, we'll know; and when they go bad together, which they often do, then we certainly know. But then, for example, the image scaler is a bit special.
A
It becomes more interesting when you add an SLI like the monitoring service, which is now a big pile of different things and different services mushed into one.
C
I think it would be useful to work backwards from our dashboards. I know this isn't explicitly for troubleshooting, but we should think about what we want to see here first, because we'll be updating our dashboards after we make this change, right? Like whether or not to have these metrics broken out by service.
A
...endpoints that they own, yeah.