Description
Andrew shows Bob how we automatically generate recording rules from high-cardinality metrics, and how to include a new feature_category label in that.
A: So, okay, just to explain the background for the video: we're looking to add Bob's new feature_category attribution onto our error budgets page, which is here. At the moment it only gives us the attribution for Sidekiq queues, and what would be really nice is if we could extend this so that it covered the web as well. So we were discussing ways to do this, and the most obvious way is to just use the metrics that have got these new...
A: That's interesting. Oh, not everything has it, right? Presumably not.
B: If I remember correctly, I haven't merged anything yet, but I saw Sean added a bunch. Only the merge request controller has attribution right now, if I remember correctly.
A: Okay, so this is what we've got at the moment, right? We've got this feature category over here: source code management. That's awesome, but the problem is that if we had to take this metric and put it into here, especially since this has got a seven-day range by default, it'll just time out. Basically anything with these metrics times out at the moment, and so we have to use a recording rule.
A: Now, there probably are some recording rules that already have these labels in them, kind of by accident, because when you do aggregation in Prometheus you can either say "sum by" and give it a bunch of labels, or you can say "sum without" and then it drops the listed labels.
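A minimal sketch of the two aggregation styles, with made-up metric, label, and rule names:

```jsonnet
// Two illustrative recording rules (hypothetical names throughout).
// 'sum by' keeps only the listed labels; 'sum without' keeps everything
// except the listed labels.
{
  rules: [
    {
      record: 'component:http_requests:rate_5m',
      expr: 'sum by (environment, type, route) (rate(http_requests_total[5m]))',
    },
    {
      record: 'component:http_requests_keep_rest:rate_5m',
      expr: 'sum without (fqdn, instance, pod) (rate(http_requests_total[5m]))',
    },
  ],
}
```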
A: But what I was suggesting to Bob was that we take the significant labels and use those as a way of doing this. So, just to explain what significant labels are: on all of the service overview dashboards (like this one, the service overview dashboard for the web) you can see each of the SLIs here.
A: We've got the load balancer, Puma, Workhorse, all the kinds of things that we're monitoring for the web servers, and then we have these collapsed rows, one for each component. If we open one up, you'll see that inside we have even more detail than we have outside, and for each row...
A: Basically, we break the data down by some label. In this case this is the non-aggregated version, and then we've got per fully qualified domain name and per method, so we're breaking these metrics down by these different labels. And if you go into Workhorse you'll see we've got different labels (or maybe they're the same), so yeah, for Workhorse we've got per fully qualified domain name and per route, and the way that we do that is in...
A: ...You probably don't care about job; you kind of know what the job is. So we say the labels that we're most interested in, as operators or people running the system, are fully qualified domain name, which we'll slowly transition away from as we move to Kubernetes, because we don't have FQDNs anymore, and then the other one, which is super useful, is route: Workhorse has got about ten different ways that it handles things, and that's on the route label. So that over there is what drives these breakdowns here.
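As a rough sketch (the field names follow the metrics catalog's conventions, but this exact definition and the rateMetric stub are illustrative, not the real code), significant labels sit on the SLI definition like this:

```jsonnet
// Stub standing in for the catalog's rateMetric helper (assumed shape).
local rateMetric(counter, selector) = { kind: 'rate', counter: counter, selector: selector };

{
  workhorse: {
    requestRate: rateMetric(
      counter='gitlab_workhorse_http_requests_total',
      selector={ type: 'web' }
    ),
    // The labels operators care about most; these drive the per-label
    // detail rows and, later, the generated recording rules.
    significantLabels: ['fqdn', 'route'],
  },
}
```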
A: So you basically get the metrics aggregated by each of those labels inside this detail row. By default you don't see that, but if you're in an incident you might open this up and say: oh, all the errors are on this node, or all the errors are on this route. It's a way to speed things up. So you're probably still wondering...
A: If you have any questions, just shout. So, I don't know if you've seen this... okay, you'd seen some of it before, okay. Basically, what we used to do is we had the metrics catalog, and some of the metrics would rely on recording rules, but those recording rules were manually maintained.
A
All
that
add
another
label,
and
there
were
all
these
different
things,
and
so
I
was
kind
of
getting
quite
frustrated
and
also
the
other
thing
that
might
happen
is
that
we
would
need
another
label
and
it
wouldn't
be
on
these
metrics.
And
so
we
built
this
small
thing
called
recording
rule
metrics
and
we
we
use
it
for
sidekick,
because
sidekick
also
has
very
high
cardinality.
A
So
what
it
says
is
in
the
metrics
catalog
whenever
you
are
dealing
with
these
metrics
metrics
on
on
on
these
names,
try
and
use
a
recording
rule
instead
and
then,
when
we're
generating
the
the
service
level
metrics.
A
So,
to
kind
of
give
you
an
example
like
if
I
go
to
this
sidekick
jobs
fail
total
over
here
and
then
I
add
for
the
selector
I
add
like
whatever
wombats
yes
right,
kind
of
in
the
old
world.
That
would
break
because
this
recording
rule
well,
the
recording
rule
that
we'll
be
using
would
also
need
to
have
that
on
it
and
and
like
everything
you
know,
you'd
have
to
manually
go
and
update
something
else,
and
that
was
error-prone
and
now
what
the
metrics
catalog
does
is.
A: ...when it generates the recording rule for sidekiq_jobs_failed_total, one of the things it filters on is this wombat label, and it will automatically add that to the recording rule. Likewise, if I remove it from here, it automatically removes it from the recording rule that gets generated. So if you go and look at the auto-generated rules...
A: ...you can see here, this is the recording rule that gets generated. It's looked through the entire metrics catalog and it's seen the labels that we use for that metric: environment, feature_category (that's useful), and le, you know, less-than-or-equal, for the bucket. This recording rule is totally automatically generated from that definition up here. And we also use these in... I'm pretty sure, if you go into...
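The generated rule has roughly this shape; the record name, selector and grouping here are illustrative rather than the exact output:

```jsonnet
// Hypothetical shape of an auto-generated (intermediate) recording rule for
// a Sidekiq histogram; 'le' is kept so histogram_quantile() still works.
{
  record: 'sli_aggregations:sidekiq_jobs_completion_seconds_bucket:rate_5m',
  expr: |||
    sum by (env, environment, feature_category, queue, le) (
      rate(sidekiq_jobs_completion_seconds_bucket{env="gprd"}[5m])
    )
  |||,
}
```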
A: ...it's part of it. Well, not the JSON, but the recording rules; I'll show you in a second. But here you can see I've used that recording rule. So when I build that dashboard, I don't use this thing; I actually just say, you know, Sidekiq jobs completion total, and then there's kind of a pipeline, and in that pipeline it says...
A: ...well, the labels that were requested were x, y and z, and the aggregations were these, and therefore it matches the recording rule, so I'll use the recording rule. So it's done the substitution for the recording rule. Oh, and the other thing that's really important is that this has got a rate on it, right?
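Sketching that substitution with hypothetical names: the dashboard asks for a query over the raw metric, and the pipeline emits one that reads from the intermediate rule instead, with the rate() already applied:

```jsonnet
// Before: what the dashboard definition asks for (hypothetical names).
local requested = 'sum by (environment, queue) (rate(sidekiq_jobs_failed_total{type="sidekiq"}[5m]))';
// After: what gets emitted once a matching rule is found; the rate() and
// range selector are gone, because the rule has already applied them.
local emitted = 'sum by (environment, queue) (sli_aggregations:sidekiq_jobs_failed_total:rate_5m{type="sidekiq"})';

{ requested: requested, emitted: emitted }
```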
A: Yeah, compared to the original query. And so (I'm still adding to the stack, but it'll start unwinding quite soon) the reason why this is important is that when it's deciding what labels to use for a recording rule, one of the things it looks at is the significant labels. Any label that gets added to your significant labels gets included in the recording rule automatically.
B: So this is how it would get added to the details, to that part on the service dashboard?
A: You open up that detail page, and what you see is that all the errors are coming from a certain feature category, and then you know who to speak to. And going back to the Sidekiq example, where we've already sort of done this (let's just go back here), you'll see that we do do that with these component details, per feature category. But these are... hey, here we go: there's a bug which causes these latency ones not to work.
A
It's
quite
a
complicated
bug
to
resolve
as
well.
So
it
just
needs
a
lot
of
time,
but
there
you
can
see
you
know
the
spikes
are
on
a
certain
error
category,
a
feature
category
I
don't
know.
What's
happened
to
my
grafana,
looks
like
it's
crashed,
hello.
A: Well, I don't know what's going on there, but hopefully it'll come back. So what I was going to say is: if we go back to the web over here and we take a look at Puma, for the error rate we're using... does that have the label on it? I don't think so. I don't know where... okay, well, then that's...
A: Yeah, so if we go to here, and we go to Puma, and we just add here... what's the status of moving these...
A: ...the status of moving off the histogram and onto the... not the histogram, onto the...
B: While you're running that, also show us where you're aggregating all the information to generate a recording rule.
A: It's also one of those things where people say, you know, the code behind the metrics catalog is really complicated, but it does do stuff like this, which I'm really happy about, and it ultimately leads to less manual work and less maintaining three different things, which I hate doing. So we have this thing called the recording rule registry, a name I've been moving away from, because "recording rules" is a super overused term and it's quite complicated to understand what we're talking about.
A: So I've been moving away from calling them recording rules to calling them intermediate recording rules, because what we have is a metric, and then we have an intermediate recording rule, which is not the end game (it's kind of a halfway house), and then from that we generate the SLIs and everything else. And so I might call this something... and I think I've got that name somewhere. No, I don't.
A
There
are
certain
places
where
I
call
it
the
intermediate
recording
rules,
but
that's
kind
of
beside
the
point.
They're
kind
of
temporary
right,
they're
kind
of
like
a
a
pre-processing
step
before
the
next
step,
and
so
what
we
do
here
is
it's
got
on
its
public
methods.
It's
got.
This
thing
called
resolve,
recording
rule
four,
and
so
you
give
it
the
type
of
aggregation,
you're
doing
the
labels
that
you're
aggregating
over
and
the
the
function
that
you're
using
is
so
kind
of
all.
A: ...all the queries have to be of a similar form, which is basically a function applied to a range vector and then aggregated over a bunch of labels, which is like 90% of what we do, right? Obviously there are some clever things, but a lot of it is that, and so this only deals with those kinds of functions, and I think that's a reasonable thing to do. But basically it asks: what's the aggregation function, what are the labels, what's...
A: ...what I call a range vector function, and what's the interval (because, like I said, we have to do each of those), and then what's the selector, what's the query that you're running.
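Put together, a lookup might look something like this; the method name is the one shown on screen, but the argument names and the stub registry are guesses, not the real API:

```jsonnet
// Stub registry so the sketch is self-contained; the real one lives in the
// runbooks repo and holds far more state.
local registry = {
  resolveRecordingRuleFor(aggregationFunction, aggregationLabels, rangeVectorFunction, rangeInterval, metricName, selector)::
    'sli_aggregations:%s:%s_%s' % [metricName, rangeVectorFunction, rangeInterval],
};

registry.resolveRecordingRuleFor(
  aggregationFunction='sum',
  aggregationLabels=['environment', 'feature_category', 'queue'],
  rangeVectorFunction='rate',
  rangeInterval='5m',
  metricName='sidekiq_jobs_failed_total',
  selector={ env: 'gprd' }
)
```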
A: I don't like the one-minute ones; they're just noisy and they don't have enough data in them, so I think those are going to go. We pretty much don't have much left on that. Anyway, what this does is it'll go and look things up in an internal registry, which is basically a big hash...
A: ...mostly keyed by the metric name. But then it also validates that the labels you're using in the selector and the aggregation are a subset of the labels it's using on the recording rule (that they're contained within them), and that, obviously, the range vector matches and the aggregation function matches.
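A minimal sketch of that matching check, with assumed field names rather than the actual implementation:

```jsonnet
// True when every element of 'needed' appears in 'available'.
local isSubset(needed, available) =
  std.setDiff(std.set(needed), std.set(available)) == [];

{
  // A query can be served by a rule when the functions and interval line up
  // and every label the query needs was kept by the rule.
  matches(rule, query)::
    rule.metricName == query.metricName
    && rule.rangeVectorFunction == query.rangeVectorFunction
    && rule.rangeInterval == query.rangeInterval
    && rule.aggregationFunction == query.aggregationFunction
    && isSubset(query.aggregationLabels + std.objectFields(query.selector), rule.labels),
}
```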
A: Service metric rates, right. In the metrics catalog we have this thing called a rate metric, and that's how we model these for every service. Yeah, exactly. And we intercept it quite low down: when, for a definition of a rate metric, you basically say "give me the PromQL for this definition", at that point it goes to the registry.
A: The thing in this code that I don't like, that I want to fix, is this: this is a kind of library that I want to extract from the runbooks and have as a standalone library, but it has a dependency on the registry, which is part of the runbooks repo.
A
So
it's
kind
of
like
an
inverse
dependency
that
I
need
to
get
rid
of
because
yeah,
it's
the
the
registry
of
where
of
all
those
things
is
part
of
the
of
the
implementation
where
this
and
this
library
depends
on
it,
so
there's
kind
of
a
circular.
Well,
there
is
a
circular
dependency
there,
which
is
horrible,
but
I
can
fix
it,
but
it's
just
more
work
and
then
the
other
part
of
it,
which
is
kind
of
required
reading,
is
back
in
the
metrics
catalog.
We
do
this
thing
where
we
collect
all
the.
A: ...sorry, go through all the definitions. I step through everything as a pre-step, before I start generating anything, and I build those up. So I start here, I say collect metrics and labels, and then I go through each service in the metrics catalog, and then through each function on that, and I say: hey, give me the labels, the aggregations... yeah, yeah. And then obviously, for some of them...
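A toy version of that collection pre-step, with the shapes assumed from the description rather than taken from the real code:

```jsonnet
// Toy catalog: two services, each SLI names a metric and the labels it wants.
local services = [
  { name: 'web', slis: [{ metric: 'http_requests_total', labels: ['type', 'route'] }] },
  { name: 'sidekiq', slis: [{ metric: 'sidekiq_jobs_failed_total', labels: ['queue', 'feature_category'] }] },
];

// Fold every SLI into a map of metric name -> labels requested anywhere.
// (Labels may repeat if several SLIs share a metric; std.set would dedupe.)
local collectMetricsAndLabels(svcs) =
  std.foldl(
    function(acc, sli) acc + { [sli.metric]+: sli.labels },
    std.flattenArrays([s.slis for s in svcs]),
    {}
  );

collectMetricsAndLabels(services)
```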
A: Yeah. One thing to note about this is that we only use selectors that are hashes, because I didn't want to go through the bother of writing a parser for PromQL selectors. So if you look at these, the way that I do selectors now, it's much easier for me to parse out the keys.
A: I mean, there are lots of reasons why I prefer it: you can union them together, and you can manipulate them much more easily. So if you want to use this, the components need to use hash selectors rather than raw PromQL. If we didn't want that, we could write a PromQL parser in Jsonnet, but let's just not do that. So that was that. Oh yeah, let's go take a look at what that code looks like; I'm kind of interested to see if it worked. Yeah, see.
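For example, a hypothetical selector hash; the { re: ... } matcher convention here is an assumption about the shape, not the exact library API:

```jsonnet
// A selector as a Jsonnet hash: keys are label names, values are matchers.
// Rendered to PromQL this might read: {env="gprd", type=~"web|api"}
local baseSelector = { env: 'gprd' };

// Hashes union and override with plain object merge, which is far easier
// than splicing PromQL strings together.
local webSelector = baseSelector { type: { re: 'web|api' } };

webSelector
```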
B: If you add it now, it will all just be empty, and that's fine, and then, as soon as you start emitting those labels on those metrics...
A: Yeah, it'll just pop up. The one thing... oh, you know, one reason why I don't want to do that, and would rather wait until it's there, is that we can actually cause more cardinality explosion. With these recording rules, you want to take something that's got really big cardinality and bring it down to something much smaller, where it's manageable, and then what's nice is that all the other recording rules are using that.
A: So we take a lot of load off the Prometheus server. Instead of having to go through 100,000 or 250,000 series every 15 seconds, or every minute, it's going through a much smaller set. So what I like to do if I ever change those is go and check the definitions that have been created...
A: ...you know, like these ones, and run them, and then just see how they look and how much cardinality they have. And if we... oh, what? This wasn't it.
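One quick way to eyeball that, assuming the generated rule name from earlier, is to compare series counts before and after:

```jsonnet
// Hypothetical sanity-check queries: compare the series count of the raw
// metric against the series count of the generated rule.
{
  raw_series: 'count(sidekiq_jobs_completion_seconds_bucket)',
  rule_series: 'count(sli_aggregations:sidekiq_jobs_completion_seconds_bucket:rate_5m)',
}
```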
A: Right now it would be that. Yes, it would, it would. But that's probably not going to work then, because it's going to add several hundred... how many feature categories are there? Yeah, something like that.
A: ...itself, yeah. So that might be the best way for us to do it, and then, you know, there are all the added benefits: we automatically get it in the pull-downs...
A: ...and everything else. We're also just saying, more generally, that this is a useful label, and maybe in future we can auto-generate queries for SREs, you know, when there's an incident, or even just auto-generate documentation.
B: Whatever... like, it would be cool. Right now, when we need someone, we go through all engineers; when we bring in on-call, we go to all of them. And then maybe with the feature category we'd have somebody specific. Or maybe not.
A: Yeah, and that's where we've already got some of those alerting rules that will go to the pages channel or the Gitaly channel. At the moment it's only driven by service, so you can only do it for services that are very obviously correlated with a team. But there's actually an issue about that, about routing the alerts on the feature category, which is awesome.
A: This? Yeah, it's not hard, but the problem is it's going to break a bunch of stuff when we do it, so it requires caution.
B: Yeah, I will get to that today. And then, what do we need to do? How do you imagine it, if we had feature_category as a significant label inside the Puma service?

A: Yeah, we'd have a new row there that shows the rates per feature category.
A: We can do that. So I think, as a first step, the first iteration, we don't aggregate up to a single feature category score. We kind of have: here's your information for Sidekiq, and here's your information for web, or not necessarily web, but HTTP-requesty stuff, and then we just kind of duplicate what we've got on the error budgets. I can share my screen if it makes it easier. But we just kind of have the same... I mean, that's not really used.
A
I
think
ultimately,
where
we
want
to
go,
is
we
want
to
kind
of
just
have
an
aggregated?
You
know
the
everything
rolled
up
to
the
level
of
a
feature
category
and
it's
kind
of
like
your
overall
score.
B: It should be possible: if I was in the Create: Source Code group, I would go there and see, like, yeah, we're doing well. And if we're not doing well, I want to be able to expand and see where we're losing.
A: This is... now I'm just spinning off a little bit, but I'll just mention it: in the past I also thought it might be useful to generate a dashboard per feature category, one that's got each feature category's, or maybe each team's, information on it, which would be almost like their own dashboard in Grafana. And we can totally do that, because we've got all the mappings: we've got the stages YAML, we've now got the feature categories, and we have a map, yeah.
A: So that's all we need, really, and then we could start coming up with a thing where... I'll give you an example of why I thought about this. If you look at the stuff Dylan and his team are doing with search, they're adding a lot of stuff to the web dashboard, I think it is, and I'm really happy that they're doing that; it should be encouraged. But really, they're the only ones that are ever going to use that stuff. So having a dashboard where there's a whole bunch of auto-generated stuff, like here's your feature category stuff, and then they can add their own stuff into that dashboard further down...
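A toy sketch of that idea; the category list and the dashboardFor helper are entirely made up:

```jsonnet
// Generate one dashboard per feature category, leaving room for teams to
// append their own panels after the auto-generated ones.
local featureCategories = ['source_code_management', 'global_search'];

local dashboardFor(category) = {
  title: 'Feature category: %s' % category,
  panels: [],  // auto-generated panels would go here; team panels follow
};

{
  ['feature-category-%s.json' % category]: dashboardFor(category)
  for category in featureCategories
}
```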
A: I suspect that the numbers from the request ones are going to be much, much higher in most cases, right? And so, if you're aggregating, you've got to do it in a careful way, because a plain average is probably not that good, but then with a weighted average based on number of requests you'll totally drown out everything from Sidekiq, because maybe you get 100 times more requests on the web than you do in Sidekiq, so your weighting will be vastly skewed in that direction.
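To put illustrative numbers on that: with 100 times the web volume, a request-weighted score barely notices a bad Sidekiq day.

```latex
\text{score} = \frac{w_{\mathrm{web}}\, a_{\mathrm{web}} + w_{\mathrm{sidekiq}}\, a_{\mathrm{sidekiq}}}{w_{\mathrm{web}} + w_{\mathrm{sidekiq}}}
             = \frac{100 \cdot 0.999 + 1 \cdot 0.90}{101} \approx 0.998
```

So even with Sidekiq at 90% for the day, the combined score stays near 99.8%.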
A: That's why I don't really have a story for that, so I think it's better just to keep them separate until we can figure out how to, yeah, how to...
A: Yeah, because giving them 50/50 weighting also feels wrong, but then just wiping out the low-volume stuff is probably wrong as well. Yeah.