GitLab Delivery Team, 24 Jun 2019

Previous Meeting Next Meeting

⏯

youtube image

►

From YouTube: Andrew explains general metrics to the Delivery team

Description

Andrew N. talks about general metrics with Alessio C., Mayra C. and Marin J.

A

Jim, hey man, how you doing yeah.

B

Thanks yeah, like.

A

This week has been a total, it's recording, isn't it so we did great this week has been quite something at the with the incident that we've got ongoing and I'm tired and I mean I haven't even looked at this talk, so the last time I looked at it I'm really. Sorry. Last time I talked looked at this talk was about a month ago and I was hoping to get some prints on today, but we've been fighting fires for like a week, solid and I haven't had a chance.

C

You know you've got this yeah.

A

Okay, well, let's, let's see how we do, let's, let's load it I, just trying to find it here we go.

A

Okay and I'll just turn off my shameless screen.

A

A

B

It wasn't having up it.

A

Was there we go hey, there's back cool um so like this actually started off as the omelet man, I dropped it and it's stuck being a talk from monitor Robert and it's kind of a sitting around and I was talking to her another day and we were discussing like how did he metrics on get lab? Calm and I was talking about the general metrics that we've got and they've been around for about six months. I use them in lots, but generally in the company, they're not really used very much and I.

A

Think the main problem is a lot of people, don't really understand them, but certainly in the last week, there'd been the kind of primary signal that things are really slow and so a lot of the other metrics that we have you know haven't been a nursing. We don't we don't see a lot of alert, saying gitlab is really slow at the moments and and and these are nerds have been alerting and so meirin suggests that I give a little. I give you guys a practice run of my talk that I did.

A

That was on the original subject. So probably only show you guys, but there's obviously, architecture as it was about a year ago, and it's obviously changed a bit since then, but you guys probably have a better understanding of the original target audience and I'm not sure if you're aware of this excuse that terrible graphic like because it's never got into like being done. But this is what our monitoring infrastructure looks like, and I just remember that you can't really use an Apple Mouse with Google slides. It's like jumps around like crazy.

A

So like are you guys? Do you guys know how interests are monitoring infrastructure works if.

C

You don't, if you don't, I can also link you. The recording I did the other day with skarbek and jarv, and it has like the details of what you see right here, but in talking form rather yeah.

A

Yeah me neither that's that that was done, so this is actually before that and I'm just trying to find some time to look at what they did this yeah. So so what we have is really got the finer to refiners. We've got dashboards, get lab, there's some dashboards, get lab chrome and obviously what we don't want to do is have the public facing Gravano kind of crash our infrastructure.

A

If there's like a hack, a news article or something like that, and so we have this little thing called trickster and tricksters quite a clever little proxy that sits between the Prometheus infrastructure and Ravana and what it does is. It does a whole bunch of caching, but it does something else. That's really clever.

A

So if you ask for like a time series from Gravano it'll, look from now like now being like the number of milliseconds after right now, and then it will look back every you know, 30 seconds or whatever going back in time and that isn't very cashable, because if I send you the link, can you look at it you're going to look at it like 1.5 seconds later and then you're gonna get a whole bunch of different data which is like really stupid, and so what trickster does it goes?

A

I'll give you sort of what you want and it nurtures everyone's time series so that they align right. So, if you're looking at now and I'm, looking at now, we kind of round down to the nearest minute.

A

So then we all get the same time series and that's much more cashable then like everyone having a slightly different, skew and I, actually think we should probably put this on on on our private instance as well, particularly because it's a bit slow at the moment- um and you know I'm- never really- that worried about the last 15 seconds, like it's really up to the last minute that that we really care so so we might as well aligned on those on those things so anyway tricksters there, and that goes into a thing called Thanos and that ice is kind of like a view onto lots of what Stannis does lots of different things.

A

But then this in this context, Thanos query the view onto lots of under underlying religious service. So if you go to you know, we have Prometheus at Prometheus, DB, Prometheus and they're all collecting different metrics, and sometimes it's much more useful to just have a single view of those and Thanos will effectively. When you make a crater, thermos it'll fan those arts it'll query, all the underlying Prometheus instances get back the results and then present you with a single view. So it's it's a much better sort of way of accessing Prometheus.

A

It's especially useful for if you want to have the same dashboard and then just be able to pull down like the environment is proud or staging. um You know you don't have to change the data source for that and that's really really useful. So we've got that also, then, we've got lots of different Prometheus servers and and I've just lost my headphones because they have run out of battery.

A

Give me a second: can you hear me? Okay, cool? Hopefully, I can hear you and it's just good.

C

Can you hear me yes,.

A

I can okay cool so.

A

Yes, that's kind of what we've got on that oh yeah, and the last thing it's got is we've got this thing, called fairness, compact and panel store and what they doing is they take all the data from premiers and they writing it to Google, Cloud Storage buckets, and so, if you, if you look for something over the last week, it'll generally going on Prometheus, but if you're looking for data in Prometheus over like an eight week period, it'll go off to the the GCSE brackets and we'll fetch it from there and one of the things that Thanos come back to does.

A

Is it also aggregates it down to like an hour interval? So that's instead of like looking at every 15 seconds, you get a new metric, it's just looking at an hour so that you can get like a longer time series.

A

So that's like a super brief overview of what the metrics infrastructure looks like at the moment so like this is just for the for the talk, but really what we were seeing like last year, or at least in 2017 was you know we had really bad availability on github.com kind of like feels like how the last week has been actually- and you know, things were going down a lot and the worst part is that it's not that we don't have enough metrics right.

A

We've got lots and lots and lots and lots of metrics, um but it's not that on its own, isn't gonna solve anything, so you can have lots of metrics, but if you're not using them properly, then they're not very helpful and kind of irony which still exists.

A

Obviously this was kind of polished up for monitor on it's always like hey, everything's, great, no, but I think this is still like a major problem and the probe the other problem is that not only did we not have a lot of metrics, we also had a lot of alerts, but the alerts that we've got, especially in the production channel. If you go look in there and in pager duty, they're just firing all the time and nobody seems to be bothered about them.

A

But you know, if you look at the the uptick scores over the last week in the performance slowdowns, we're not getting alerts on that. We getting alerts on very specific, like production issues which may or may not be critical, but we're not getting alerts on like what the users are being impacted by. So the way that people talk about this is our learning is based on like causal Allah. So click is the cause of this and here's an alert for that. So the CPU on like gotelli, oh five, is at 98% it's a cause, but.

C

You're real, basically trying to fix someone's broken leg, while they're actually dying right.

A

C

Broken leg instead of yeah exactly.

A

Yeah and and and I mean so so really what we kind of what this is about is about instead of using causal alerts, we're gonna go to like suits and based alerts right. So it's like what is the user experiencing and we're gonna we're gonna monitor that instead of saying you know like hey, this is broken and and fix it so yeah we, you know we still had like all these alerts and every time something goes wrong.

A

We'll quickly go and like add another alert to do too Prometheus and we'll say like well when this goes above this threshold, then we'll sit in other loads and then of course like. If you look at the way it lab works, everything's, growing and growing and growing, and so you know I think, there's a there's. One of the alerts says: if we have 60 errors in rails in a in a one minute period, then we need to send out an alert. Now.

A

If you look at the when that came in the box, it made sense because 60 was like a reasonable number. But if you look at like the three or four thousand requests per second now like 60, is like nothing right, and so so we just have this like whack Amole, where we like stick an alert on it and that's gonna fix it for next time, and so we we have all of these alerts that that are firing.

A

That don't mean anything and we kind of ignoring it, because they they just a little bit irritating and we playing whack Amole and that's why there's a mall, you're, just sort of hitting this and then another one pops up when we hit that one and and so we you know, even with all of these alerts, we still have. You know they're, not kind of full warning us of problems like often the first time we really know that things are going is is when someone goes into the production channel and says: hey guys.

A

Have you seen that this fight this the the the Tanooki on on github.com? And actually I was talking to someone this morning and it was like like the way we seem to treat things is like notes and nicky no problem. It's like if someone comes in and goes hey, there's a tanooki on the on the page and it's like okay, everyone scrambles, but you know if you've got like five seconds latency on every single web request. This is sort of attitude is like.

A

Well, you know, we don't see it sanuki, yet everything's, fine and and like it's the wrong way of looking at things so like I was trying to like model what our matrix look like at the moment. So we have only these metrics coming in and then we'll create a bunch of recording rules and those recording rules will get consumed by dashboards and then some of the recording rules will get consumed by alerts.

A

But the problem is like often what happens? Is the dashboards to be stopped working, or maybe we change the the configuration of the dashboard and so like a lot of the recording rules? Don't seem to actually be there for any purpose other than they were there as an optimization for a dashboard that was that was around like five years ago that nobody actually uses anymore and same with the list. We might actually remove the alert, but we don't remove the recording rules.

A

So we have this whole sort of like rabbits warren, like I'm so rat's nest of like like things that are wired together and it's pretty chaotic, like I don't know, if you've ever tried to sort of audit the the alerting rules that we have in in the rand books project, it's like a little bit crazy and and like what's actually even worse about this. Is that if you go look through our dashboards like most of my dashboards, don't work like I would say more than 50% of our dashboards, they're just empty and and like. When?

A

Did they stop working? Nobody knows what's actually more scary than that is the same as true for alerts right, because Prometheus has got like a very clever and very sophisticated, alerting system compared to many other alerting systems that are there and it gives you an expression and if their expressions sort of resolves to data, then you have an alert and it fires really well.

A

But the problem is that the reason why the alerts doesn't resolve to dates it might be because it's false, like the the expression, hasn't been met, but it also could be because the actual underlying data has like long been removed from the system right, and so you can't tell the two so I would guess that maybe 25 or 30 percent of the alerts that we have in our run books haven't seen any data for, like the last year, like like they're, just they're, giving us like this false sense of security.

A

Are we having a nerd for that there's no way to actually well? There is a way you have to manually go and look the loot and sort of decompose it and look for the online series and make sure that they did. But nobody does that. There's no CI for that, um and so we have this like big problem, where we have like a lot of alerts. That don't mean anything. We have a lot of dashboards that don't work and so like what I was trying to do with the general.

A

Let the general metrics was to try to like sort of reset the board a little bit and kind of like kind of come up with something a little bit more saying, and so I came up with this idea of building like a pipeline where we have alerts coming in and then we apply the same steps to all of the alerts and and then what we can say is like we sort of normalize all the lids dance. So for every service that we have from gitlab.

A

You say we want to have an error rate, a single error rate metric for that service. We want to have a single apdex for that service. We want to have a single variability and we want to have a single operation rate per second like how many operations are going in, and those are the four that we've kind of picked on and I'm very close to you adding a fifth one which is saturation, and so what saturation is? Is it kind of depends on the service? But if you think about it, like the saturation of.

A

Postgres one of the saturation parameters for Postgres would be in Postgres. We can have a hundred concurrent connections into the Postgres server at any one time we are currently sitting at 70, so so the saturation level would be 70, PG bounces, the same thing. You know we have lots of owned and for unicorn the saturation would be. We have 1000 unicorn workers, and so far at the moment we are we're using like 700 of them, and so we have.

A

These different, like saturation is different for each kind of service, but effectively and and each service would have multiple saturation parameters like another one that would be common across all services is the CPU on on a service right and in at the aggregate saturation of the service. Isn't the average of like the CPU and the connection pool size. It's the max or whatever, the whatever the most saturated and lying metric is that's your saturation. So sorry I, spun up on on the saturation.

A

But what we trying to do is come up with like a simple set of key metrics, so we we just want like a few ways to tell the health of a service instead of a thousand ways that we can keep track of and so like with this. The whole plan is like: we have a, we have a containment pipeline and we can tell when things are broken.

A

We can have like a small set of dashboards, that we can kind of use to to look at all of our services and and kind of tame everything and have a single set of alerts and single set of dashboards.

A

So and the results are promising because, like just this stuff is very working and I feel very confident in it. So obviously like what I wanted to do with this was was the goals of the project were to reduce the antigens, like the problems that we that we sing like as we see at the moment and then trying to get it early warning on two on two issues and then I really wanted the sticky like system-wide. So like this? Isn't a bark like individual teams like losing their dashboards like?

A

We will still always have like a PG bouncer, a red, ass dashboard. That's got all sorts of minute.I about the way Redis works or about its particular metrics. But this is about like having like a like an overall view of the system and being able to say, like Redis, is healthy right now gets it is healthy right now the web is healthy, but API is not like.

A

We need to zoom in on the API and then maybe at that point you go down to the to the more detailed dashboards, and so it's like a high level view. You know, as the application gets more complicated like you know, we kind of like every single ops person being like an expert in every single component, and so this is just like.

A

We have the same set of Health metrics for each service and then- and then you know we, you just know at a high level whether or not they wither or not their work, and so like this is the the main dashboard for a service, and you know these are the four things that I was talking about. So we've got like optics and optics is a sometimes I think is a little bit complicated for people to understand, because it's not a latency.

A

It's like it's a percentage of your requests that come into that service that are satisfied within an acceptable amount of time right. So what's really nice about that compared to latency is that you can give me an apt X for any service and I. Don't need to know whether it is good like if you give me if you give me a latency if you say like this service like if you say Redis is responding to requests in a hundred milliseconds.

A

You need to know that, whether that's good or bad, in terms of redness and that's like not everyone knows that I know Redis very well and I know like if Redis is taking 100 milliseconds to respond like your son's gonna be 20 down and flat, but not everyone knows that, and the nice thing about optics is that all you have to say is that the apdex score for Redis is like 99 percent. That's fine, like maybe one set of requests, are, are a little bit slow but then get silly.

A

You know it's like, oh well, a second is probably okay for some get early calls. So it'll have a twenty different apdex, but the range are sorry totally different latency, but the apt X's will be on the same level, and so because we we converting it from from the time in milliseconds to a ratio. We can compare it all of that. All of the pronounced X's or all of our services and come up with a single value, and we can then start talking about the SLO draft X. We can say well for giddily.

A

We want it to be that 99.9% of requests come within a satisfactory period. The other thing about app dates, that's really really powerful. Is that if you want to say, like you know, some some requests like a garbage collection or Kristy, literally don't care how long it takes it could take like some garbage collection or Chris and get early take half an hour, and we don't want those requests to kind of mess up. The whole thing, so the optics will forget ly specifically excludes garbage collection right because we just ignore it.

A

We just throw it out and we don't count that as part of a big school, um but obviously for for other things. We do another example of that is on the API. We have long polling requests that come in that nearly always take 50 seconds. So if you look at the p99 in Japan or the P 95 value for our API requests, workhorse it's a flat line.

A

It's 50 seconds and and if you look at you like oh there's, something wrong with our metrics, but actually because we have these long polling requests that always take 50 seconds. It makes perfect sense and so for the apdex school for the API. We actually specifically exclude like long polling requests because they kind of they don't give you anything useful, and so that's another reason why using app takes rather than p95 is a much more powerful measure and then error ratios is basically just like the number of requests.

A

The number of errors that you see in a minute divided by the number of requests in a minute and the reason we use a ratio, is exactly the same reason because you know talking about sixty errors in a minutes doesn't make any sense. You know if it's sixty areas IRA's in a minute and there were a hundred requests.

A

Well, that's pretty bad, but it was sixty areas in the minutes and there was 10,000 quests like who cares, and so again it's about like taking your your raw data and putting it into like a into a ratio.

A

Availability is the same again. It's a ratio, but this time it's a ratio of how many services that are supposed to be serving that process are a running versus the number that are vitamins so like if you, if you got lots of server restarts or like hangs or something like that, that will dip down or like during a deployment, go down and back up and it's the the way we calculate.

A

That is super easy, because Prometheus actually gives us to you built in Prometheus as a metric called up and every time it tries to spray per server the Konkan hold of the server it just it says up to 2-0, and so all you have to do is basically figure out the ratio of the number of servers serving say the web or their get. Let me say like well, you know ninety percent of them are now at the moment.

A

We don't have s ellos on that, but like what I would like to see in future is that we actually have quite a learn like 75 percent s alone, because we should always have enough redundancy that we can take like a hits in our availability. And then it shouldn't matter.

A

Like you know, if a single like at the moment, if a single web node hands, we have like lots of alerting going on like to me like in an ideal world like that, machine should just be destroyed and new one created like automatically like who cares like like everything's gonna, fail and having failing nodes, is not a problem. It's only when you you get to like critical redundancy that it's a problem now, obviously with with single points of failure like Italy, it would be much higher.

A

But for the word we should have it low and we shouldn't be alerting on like like a like a web machine or an API machine crashing, because our redundancy should should handle that and then the last one is the operations per second Oh like Arceus is like requests per second, and that's the only one that is actually a metric in its real form right.

A

So the others are all ratios, and this one is a is an actual number and that's because you can't really represent it in a in the form of a of a ratio, because if you look at something like giddily we're doing about 5,000 Arceus across the Gimli fleet, if you look at Postgres we're doing like 80,000 our peers and so they're different numbers, but there's actually something that we do do, which is that we measure the Zed school and if I've got time at the end of this.

A

I'll show that to you, because I'm really really happy with that and I presented about it at monitor, Amin, and then we lots of people that we like that's a good idea. So I'll show you that so okay, we explained all these just need one, and then I really described up next to you on the last slide. So it's not from then as well. So like the the pipeline that we use on get lab comm with with these with these metrics is first thing we do. Is we collect the metrics and then for each service?

A

We kind of adapt the metrics and and and put them into the pipeline, and you aggregate them into into specific ways so like, for example, you know, collection is, is pretty much just check with the next slide rules: yeah, okay, so I'll explain this more. So there's basically four steps, and this is kind of just the overview.

A

The first step is, we collect the metrics and that's exactly as we do at the moment. So there's no difference there. We just collect them into parameters in the second step is where we normalize the metrics and we turn them into like aptX scores and error ratios and availability ratios, um and we aggregate them and- and that's super important so with the metric comes into the system. It is like an arbitrary number of labels on it right, so it might have like the name of the host.

A

The name of the G RPC method, like the status of the G RPC, call afterwards all the HTTP codes. So if right, like 401 for a 243 for a 4 and the problem with that, is that, with all of those you sort of have a like it, unbounded set of metrics on your data, and so it's very difficult to automate things unless you aggregate it down to like a limited sense and that's one of the things we did and that's really important in this.

A

In this whole thing, then, once we've got that we can start building up statistics and you can do anomaly detection, because we have like a finite set of dimensions and we can sort of look over those and then, finally, with that we can build alerts and dashboards. That are that that riu, that that sort of reuse, the same metrics and I'll, show you that as well.

A

So this is sorry, don't don't use Apple Mouse with Google Spreadsheets and Google slides so like this is just another explanation of what I was talking about on the last page, so we have all of the normal exporters coming in and the thing one of the things I really like about this approach is that we don't need to like rebuild our application. Metrics. We don't need to say, ok, we're going to fork H a proxy exporter and make it support at dick's. Now, because, like that ain't, nobody got time for that, you know.

A

So what we rather do is we take the existing exporters and we adapt them to our system, and you have like whole bunch of recording rules that we that we use to take the existing metrics and turn them into what we need, and so the recording rules will do the the apdex does the error rate. It does the request per second and it.

A

This is a left of the availability, but that we also do and then using the error rate CRISPR rate, we figure out the error ratio, which is obviously just one divided by the other, and then from that we can start building up alerts. So we build up an alert for the SLO, the error ratio, and then we also have a warning for when the the that says, apdex ratio, anomaly warning so that actually don't do that anymore.

A

The the requests per second anomaly warning: we we we use as well, but these is two warnings over here. We don't actually use anymore because I've learnt a little bit more about statistics and figured out that they are that that's wrong, but I'll talk about that. A bit more later,.

A

So like this is probably all a little bit confusing at the moment. I'm run out of time already sorry, how much time do you guys have so.

C

Maybe not going to details of explaining like every everything about this yeah.

A

C

Have 15 more minutes, I, don't know about others. Yeah.

C

A

Cool so like what we have is we have like you know. This is what we have coming in for one of our services right, and we just have this like raw metric and then what we do with the recording rules. Is we take in the middle of it? She could convert it into this metric, so this is, as you can see here, this has got like a whole bunch of like arbitrary dimensions on it and and it's really hard, then what we have here is like a metric with a fixed set of dimensions.

A

It's, like the dimensions, are super simple.

A

It's like it's, this environments, it's this service, so we use type to represent service on in our world, but it's basically think of it as a service and then what we have nozzle dimension is stage, and so what that means is we can compare canary to product to not canary right in the same environments, and so we can say what is the aptX of our canary versus what is the aplex of our of our production environments, and we can also say what's that, what's a canary versus an hour ago before we did the deploy as it is, it is it in worst state.

A

Are we seeing higher ratio and we have all of those that those metrics now so yeah? This is just kind of explaining how we kind of turn it down.

A

So, what's important is that for each service we have a specific metric to turn it, so a specific recording rule to turn it into the like of the optics and the error rate and the and and the way that we implement, that is like curse service, and so this was trying to show an example of how we take the H a proxy for canary or for not for not for canary for non canary environments and turn that into an applic score, and this rule is very specific to this environments.

A

And this and and so what we get out at the end, is like a nice clean metric like that, and we can kind of just look at those together and then this is just about statistics, and so once we've got the metrics, we can start doing very clever things like we can look at the standard deviation and the the mean for the statistic, and then that's what we can start using to do anomaly: detection right, because if you haven't normally distributed metric, like you know, if you pick a value now, where does it lie?

A

Well, it's the chances are like. If you look at request for a second, it's a it's a classic example of a normally distributed metric and so like you will find that about. 68% of your values will lie. One standard deviation, either side of your mean right and then it's 95 percent will lie within two standard deviations of school. It's also called a z-score.

A

So if you take a value- and you say how far is this from the from the mean that if you measure that in standard deviations, that's called a z-score and then within three standard deviations, you get about ninety-nine point seven percent. So that's what we used to do. Nominee, detection and I've got a whole talk on this. I kind of I won't go into too much, but that's how we can kind of spot. These outliers and I did alerting on.

A

It turns out that some of our metrics are not actually normally distributed, and so that's why I've stopped using it for the error ratio and the apdex score, but it are our requests per second rates are normally distributed, so I've carried on using it and then I don't got time to go into this. But this is how we we do the anomaly detection or CRISPR second, but basically we use the seasonality in the data.

A

So what we see is that, like week in and week out, we have the same values and so using that and we predict where it should lie and I'm not gonna go to now but like this is basically a whole talk about like how we we kind of do it and I can give you a link to the monitor arm at all, because actually do a much better job at explaining all of this in the monodrama talk- and it's quite short so.

A

Anyway and then, finally, once we've got all of this data in this format, I'm just going to quickly login to here, we can go to.

A

So so what it means is that we have the data that we have, that we collect into each service. We can plot it on like a single set of dashboards and this triage people that I use right and so for one of our services. We plots our apt X's alongside one another here and we plot our era ratios requests per second and our availability.

A

Is that we've kind of brought like lots and lots and lots of different data and we've kind of nudged it down so that it's all the same, and so we can compare like with like so when something goes really wrong. It's really obvious, because you know all the values are the same as acidity the only one we don't do.

A

That with is the request service per second, because it's not really possible, but what we can do with requests per second is we can we can basically plot how unusual that value is for this time of day right. So we can say: well it's a it's a Thursday afternoon, and so this request should be very high, but actually it's like twice as high as it should be, and so this is called Zed school.

A

So if I go over the last seven days, we'll see it's taking a look at this. So so what is this so this room this service over here has been acting again. You usually what it is so.

C

I get right or sidekick.

A

Yeah, it kids, maybe your point, one minus point: one having read us as well so like would like, if you look here, we zoom in on this you'll, see you'll see like this. So you can see here. This is the graph. If we kind of comparing the requests per second rate for all of our services. You can't really see like a big sorry about the birds. You can't really see like much detail in here like it's not, but, but when you, when you compare like the standard deviations, you can see hey wait a minute.

A

This is a very unusual value like Redis was spiking up there, but you know when you're looking at eleven here, it's it's not really clear, and so, when these values go back into the red, that's when we start sending out alerts on them and we've got built-in alerting for all of this stuff. um I feel like I've spoken a lot at you guys. So if you've got like similar knowledge, questions nihongi under standard I I do feel like this has been super rushed so so, like yeah.

C

Let's go for questions if you have any.

D

Well, I'm not sure if this is related, but we have an alert general channel on a slack. So here's.

A

D

Alerts, you were mentioning are those that are like being displayed every I, don't know how many minutes yeah.

A

So there so so, then, that channels kind of still there's several alerts that go in there right. So one of them is like, if you go into. If each of these services here has its own dashboard like here, not I'll, explain the alerts that go into their.

A

Service interests so there's a whole bunch, different alerts that go in there, and some of them are like more important than others, but the loads that go in there are all like on a per service level, and so the really important ones are I'm just going to zoom in on a little bit of this graph over here. So like the first one that we see, which is the one that's been firing for the last week non-stop is the SLR on the apdex score is below so for each of our for each of our services.

A

We have a service level objective, and so what we say is we would like to achieve like that. 99% of our web requests come in within the allotted thresholds for web requests right and what you can see that that dotted line over there is that, if follow, it's that it's, that targets represented on a graph and the the yellow line, is the actual value. So when that actual value is underneath the target, that's when it for five minutes. That's when you get an alert right.

A

The same goes for the error ratio, so these are the SLO targets, and so here you can see our actual error rate is 0.01%, but our targets, error rate is 0.5%, so we haven't been getting alerts on that if you zoom, you know last seven days um you can see here like during this period over here now. This is turned out to be kind of a like.

A

What happened was we enabled a bunch of machines and then you turn them off, but we didn't turn them off fully, and so they were health checking and erroring and then we're like in a bad state, and so here we got a very clear signal that something was wrong and we we could fix it once quite nice, for the SLO is, is that each service has a different level. So you get this to the right. This is 0.5 percent, but the Islamic of Italy is trimmed much closer to so forgive me.

A

The SLO is 0.1 percent right. So it's a much lower SLO and what I'd like to see over time is that trimming these down. um It's effectively a recording rule that the SRO is a recording rule that says SLO get to the error rate and it's just a fixed value. Zero point: five, since most recording rules are like a Prometheus expression, and these recording rules are just a fixed value.

A

So, if you want to, if you want to adjust it up or down you just going to make a merge request and and that value will go up or down and then automatically any service that exceeds its SLO will automatically fire an alerts. There's it's not like: we've wired together, one alert forgivingly and one alert for the web and one for the API, just the fact that the API is generating a very metric numeric get this other.

A

So that's the first. Those are the two SLO nodes right, so those are when you see SLO. Those are the ones that I'm listening to the most one of the things that's important to realize is that those kissel are alerts because we break everything down by stage now and we have the canary stage and the main stage a lot of the iserlohn nodes that are firing at the moment off for the canary stage. And oh, we don't have canary giving you obviously.

A

So if you go canary weird, you know what what we find is that it's a little bit more. It's you know it's a little bit more jerky at the moment and so like I, don't think we'll read those two pager duty yet, but when I see in that tunnel, I see like SLO alerts like this service main stage, then I'm like okay, like things, have gone wrong and that's when I start reacting and that's why I really feel those actually need to go to page duty.

A

Now you had a meat cheese day and we agreed that those actually now like me to fire like and get to keep them out a bit because generally, when they go wrong, it means that there's a real problem. um The availability I haven't, got the the representation on here, but at the moment for all services this is fixed at 75%. Oh, these are probably easier to understand if I turn off the deploy annotations.

A

So these are all fixed. Basically, at this level here and I think what I need to do is the next step is adjusted in the same way that that each service has a haptics and an error ratio is low. We have an availability SLO in the last type of alert that we get in that channel, actually there's several more times, but the one of the last is is when this.

A

So this is our prediction for what this metric should be like the the requests per second for this service at this time of day um this over the week right. So what we find with with our data is that it's very for the request per second data- it's very, very seasonal. So we see exactly the same. If you look at our data, we'll have a small hump which is European lunchtime.

A

We will it see a little hump and then the Americans come online, see that and on Fridays we see the same slow both week in week out week in which we got and every week we just have the same. Gradual growth and this green line actually takes the green growth, the growth intercoms, and then it looks at like three weeks with the data and the whole anomaly. Detection thing talked that I was mentioning is about that. Did the problem at the moment is because we've chopped our Prometheus down to one week's retention.

A

I've only got one week's worth of data to guess this, and so it's not as good as it was like a few like a month ago and and I'm pushing to get us the Prometheus data back. But that's a that's another question, so the the next time. Let me kick this line jumps outside of the green boundary. Then we can give me alerts on that, but I don't think that that will ever go to pay to GE, because we what's better is that we build services that that can handle traffic spikes.

A

So you know at the moment, if you go to pages pages, is the one that gets the most traffic spikes and so let's just look the dotted line by the way is what was happening last week. It's not totally so, let's just look over.

A

C

A

If we go take a look at the spike over here in pages, you know you can see. That is what we were expecting the range to be, and then now that, on its own shouldn't, get someone on a bed right, but over here you can see what that did to the they're so low because they just sort of faints at the drop of the hats when it gets in traffic that just led to a massive spike, and so because the error rate wins over the the pages. This alone, someone would have got paged for that.

A

So then they would have woken up in an ideal world. They would have come to this page and they would see like hey, like look at this, like the the reason is because of this anomaly here in the in that, and so we we haven't a look that goes to to slack but I, don't think that'll ever go to page of duty. Another thing that's kind of worth pointing out is that four pages we don't have an app that school and that's that because at the moment we don't have the data right and so like.

A

What's quite nice about this, is we can optionally put things in one of the ways? One of the nice things about this approach is that we, because the recording rule can be very thin for services. We actually used like H a proxy data to figure out like for the registry, the error rate that we use actually comes from a trade proxy and some kind of adapt the HF proxy data and make it the the error rate for the for the registry service.

A

And so you know, if you go, look at the registry and I think it's the same for pages, but we still missing something, and so this, like the real news, feed, doesn't emit an error rates value, but we've kind of because of the the flexibility of the recording rules. We've used this instead. Sorry I'm, just talking like again so did that answer your question.

A

D

Thank you, cool.

A

B

Any other questions does.

A

It make sense so.

B

Yeah, this makes a lot of sense. I really like this I think you should be part of the onboarding for engineering, this recording, because it's really helpful and the other things it's that I really love that you say it is, and you finally explain me where why, when I go through the rough on, uh most of the dashboards are empty and I always asking myself is something really bad happening or I'm. Just looking at the brunt thing, yeah.

A

B

Think this is a very good starting point for understanding. What's going on, because all the other, you know the other networks or you know what to expect the dashboards for day. You can't use them. There are too complex or to make into the details. You have to be an expert not only of the thing that you are looking at, but also at the how the matrix was built so yeah. So the one.

A

So one of the things that I that I didn't touch on was that I also have alerts that tell me when I'm not getting in load data anymore and because we have a fixed expectation of what the alert of what all the metric data is. We can alert on it not being there and so in a deployment. Recently, we lost.

A

We lost on a psychic um data and here oh, it's back, okay, so that's kind of weird! So.

C

I need to I need to drop off. I just wanted to give you a like a highlights to to Alessio and Myra before I leave, and you can continue if you want to the reason. I, don't know if you realize that now, but the reason why I asked you to be present and try to understand this is you're not as serious you're, not holding a pager. It's not your job to necessarily like jump on all of these alerts.

C

What is your job, though, is to help out here series once they finally get to understand what's happening in where they were looking at and I want you to be the ones also to understand. You know it in case. There is something like this happening: how to like, consume this data and also, if at any point in time, you get like a bit of free time to try and teach yourselves on, like hey I'm, noticing that this is a problem.

C

I'm gonna improve the product itself so that it can actually help out the s-series understand what's happening a bit better. So that's that's the angle of why the two of you are in this Andrew. Thank you. So much for doing this, I really really appreciate it. I learned a lot still from it. So I'm gonna put this on on youtube. If you don't mind, if you mind, send me a message and I won't I guess: I need to drop drop off so I'll catch. You all later right.

B

You are muted Andrea. ah There.

A

We go you just want to quickly show if you want to so. The other thing I've been doing is I've stopped using like refiner to edit dashboards, because it's it's like a horrible way and you like each everyone's overriding one another and nobody knows like why a change was made or what stopped working and so now, like all the dashboards that I'm building I'm, pretty Integra finer itself into the run books repository and it uses. This thing called JSON it and JSON.

A

It is like adjacent templating library that Google builds and they use it for a lot of stuff, and so imagine like Jason, with functions right, and so you could say like. Instead of having like a hash, you just say get the the value of this hash from this function, and then the the guys from the graph on a team have built something called graph on nets, which is basically Cora, fana library for JSON, it's and and what that means is that all my dashboards are in here and it's not just horrible Griffin and Jason.

A

It's kind of like a little bit like kind of ordered, and so you know, I've got all my notes in here and I can do all of that and then the thing that I love about this approach is like. Maybe you didn't know it just maybe did but wasn't the same like colors and everything, and if I want to change the color I go to that function or that thing and I change.

A

The cut like the color for critical things is that and I changed that I commit my change and automatically deploys its secret fauna and all the dashboards are got like. The new color like the color for critical, is X and like I. Do that with a bunch of stuff, and it's like a much better way of doing this. So if you want to contribute to that or if you like, want to add new dashboards like this, is how I'm trying to do this you'll actually see these dashboards.

A

If you try to edit them they're, not editable um I mean you can you can override that? But oh yeah cool I think that's it for me. Unless you got any more questions.

B

No I know I. Thank you for your time for explaining this yeah.

A

Yeah awesome I'll show you this there's there's one other thing with this era: budgets that I'd like to show you the sunset but I can I can I can show to you. Another time is basically this over here is a recording of how much time we're spending underneath the SLO for that sub. For that thing, and so we can start sending attribution to teens and saying you know the get early service is, you know not reaching its era budget?

A

In fact, the giddy service is pretty good, but you know that we now have that metric and we can compare different services to their SR lows and say whether or not they're there any good they.

B

Have a good good bad just because the attribution is easier. It's a separate service, but if you were.

D

B

The web or API and then attribute to manage profile.

A

That's the hard thing: yeah cool, okay, thanks guys, have a good rest of your day with you bye. Thank you again.