Description
Hordur and Andrew discuss how Auto DevOps can be better monitored using the key metrics framework used for monitoring the components of GitLab.com.
This follows an outage in the feature: https://gitlab.com/gitlab-org/configure/general/issues/9
B: Yes, so let me just open the ticket again, yeah. So what happened recently was that Auto DevOps had kind of a full outage of the deployment pipelines, and we didn't notice until, like, 24 hours after; it had been going on for 24 hours, and it was only a customer that escalated it to us, and yeah.
A: That's basically a map of all the services in the application, and everything that we monitor should map back to something in the Service Catalog. Sorry, go on. We need to create a new thing in the Service Catalog; I've got a pretty good idea where, I think, but there's nothing for this in there yet. So we've basically got in here, we've got a bunch of teams, which we don't really use as much, it's a bit out of date, but then the really important part is the services.
A: Yeah, yeah, okay. So I mean we could always just come up with an arbitrary sort of dashboard, but it'll maybe become apparent why I like categorizing it this way. What we've been trying to build up over the last few months is, for each of these services that we've got, a set of key metrics, and those metrics are the number of requests per second, um...
A: ...the number of errors per second, and then what percentage of the requests complete within a satisfactory threshold, like, amount of time, right? And the reason why we do it that way, and not say "what is the p95 latency of this endpoint", is because then, you know, if I tell you that 99% of the requests are coming in within a given amount of time, you know that...
A: ...that's probably okay; if I say it's 50%, you know it's probably bad. But if I tell you that the p95 is three seconds, you don't know, unless you have more context, whether that's good or bad. And so, you know, with a ratio (what's also cool is that it's normalized, yeah) with an apdex score we can be talking about Redis, where we can be talking about microseconds or milliseconds, or Gitaly, where you can be talking in, you know, seconds, and we're on the same scale.
A: It's basically a percentage, like what percentage of requests are completing, and so we have an apdex score, we have operation rates, and we have the error rate, and what that gives you for each service is a dashboard with values like this. So these three values here are the key metrics for the web service.
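For illustration, the three key metrics described here map onto Prometheus queries roughly like the following; the metric names (`http_request_duration_seconds`, `http_requests_total`) are placeholders rather than the exact series recorded for GitLab.com.

```promql
# Operation rate: requests per second over the last 5 minutes
sum(rate(http_request_duration_seconds_count{job="web"}[5m]))

# Error rate: 5xx responses per second
sum(rate(http_requests_total{job="web", status=~"5.."}[5m]))

# Apdex-style ratio: share of requests completing within a 1s threshold
sum(rate(http_request_duration_seconds_bucket{job="web", le="1"}[5m]))
  /
sum(rate(http_request_duration_seconds_count{job="web"}[5m]))
```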
A: Today, I've noticed... let's wait for it to come up here... you can see that yellow line over there is the apdex score. So, you know, there are these spikes; that's what I've been investigating all day. Something's causing us to drop down to only 75% of requests completing within the acceptable amount of time, so that means 25%...
A: It was actually really, really poor, and we got a kind of directive from senior management that 95% of CI jobs should start within one minute of being scheduled. And so what we could say is that our apdex threshold is one minute, and then we say what percentage of the requests complete within one minute, and as long as that's above 95%, then we're meeting the goal.
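Expressed as a query, that goal might look something like this; `job_queue_duration_seconds` is a hypothetical histogram of the time between a job being scheduled and being picked up, assumed to have a bucket at 60 seconds.

```promql
# Share of CI jobs starting within 60s of being scheduled;
# the directive described here is for this to stay above 0.95.
sum(rate(job_queue_duration_seconds_bucket{le="60"}[5m]))
  /
sum(rate(job_queue_duration_seconds_count[5m]))
```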
A: What's actually really interesting is that when we started doing this, the value was so poor that we actually had to set it to 50% just to get it to stop alerting, and just the fact that we've got that, that we've been monitoring it and actually keeping an eye on it and it's become a thing, we've been able to improve it really, really quickly, and so I actually intend to increase this.
A
This
dotted
line
is
the
SLO,
that's
like
the
rates
of
which
me
that
that's
acceptable
rate
and
I've
been
meaning
to
bring
this
up
to
about
80
percent.
We're
still
having
a
bit
of
a
few
drops
every
now
and
again,
actually
4
a.m.
every
morning
and
there's
a
reason
for
that.
But
you
see
the
spike,
and
so
we
keys
are
gonna.
Do
the
same
thing,
sorry,
yeah
yeah,.
A: We started off by doing this manually, and so Prometheus has a thing called a recording rule, and what we'd say is: create a recording rule that records the apdex score for this service. I wanted to change it slightly, to kind of add a little bit more complication, but also to explain it a bit better.
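A minimal sketch of such a recording rule, reusing the same hypothetical request-duration histogram as above; the recorded metric name is illustrative, not the exact one used on GitLab.com.

```yaml
groups:
  - name: service_apdex.rules
    rules:
      # Precompute the web service apdex so dashboards and alerts
      # can reuse the ratio without re-evaluating the raw histogram.
      - record: gitlab_service_apdex:ratio
        labels:
          type: web
        expr: |
          sum(rate(http_request_duration_seconds_bucket{job="web", le="1"}[1m]))
          /
          sum(rate(http_request_duration_seconds_count{job="web"}[1m]))
```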
A: So you can see here, this is a definition of a service. It's the web service, it's in the sv tier, which is just a breakdown that we have in Prometheus, and then we say we consider that 95 percent of requests should meet their apdex threshold; if it goes below that, we'll alert, and the same with the error rates.
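The "alert when it drops below 95%" behaviour could then be a plain alerting rule on that recorded series, along these lines (names and thresholds are illustrative):

```yaml
groups:
  - name: service_slo.rules
    rules:
      - alert: WebServiceApdexSLOViolation
        # Fire when fewer than 95% of web requests meet the apdex threshold.
        expr: gitlab_service_apdex:ratio{type="web"} < 0.95
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: Web service apdex has dropped below its 95% SLO
```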
A: This is kind of a detail of apdex, but with the apdex score we say a satisfactory score is any request that completes within one second, and a tolerable score is any request that completes within 10 seconds, which is pretty slow, but you'll tolerate it. So anyway, this is how we define the apdex for Workhorse. This is the request rate: again, we just give it a metric, like, that's a Prometheus metric, and some filters on that metric, in this case.
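With a satisfactory threshold of one second and a tolerable threshold of ten seconds, this is essentially the classic Apdex formula. Following the pattern shown in the Prometheus histogram documentation, it would look roughly like this (histogram name and bucket boundaries assumed):

```promql
# Apdex = (satisfied + tolerating/2) / total.
# Buckets are cumulative, so the le="10" bucket already includes the
# le="1" bucket, which is why the two are summed and halved.
(
  sum(rate(http_request_duration_seconds_bucket{job="web", le="1"}[5m]))
  +
  sum(rate(http_request_duration_seconds_bucket{job="web", le="10"}[5m]))
) / 2
  /
sum(rate(http_request_duration_seconds_count{job="web"}[5m]))
```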
A: We've got this kind of band of what we consider the normal range; it uses statistics to work out plus or minus three sigma. But what actually happened was that exactly a week ago we changed the ingredients of this requests-per-second rate, and so that's why we have this change here.
A: From that we get the rate and the errors, but we don't really have a latency score for that, because those are long-polling requests. They all take pretty much fifty seconds, so it doesn't make sense, in the case of polling, to be saying we want the request completed within this amount of time, because it'll complete when it gets a job, and so it's not within, you know, our control how long that will take.
A: It depends on workloads, but we do have a request rate and an error rate for the polling service. And then for the shared runner queues, that was the one where we got the definition from senior management, where they said that, you know, we want jobs to start running within 60 seconds, and so we said the satisfactory threshold is 60 seconds.
A: There's no tolerable threshold; we just say there's only one threshold, that's 60 seconds, and then, you know, we also get the request rates and the error rates, and that generates everything else that we need. This will also generate alerts, and it'll also generate alerts if the metrics stop flowing.
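The "alert if the metrics stop flowing" part is commonly done with `absent()`; a sketch, again with a hypothetical metric name:

```yaml
groups:
  - name: missing_metrics.rules
    rules:
      - alert: SharedRunnerQueueMetricsMissing
        # Fire if the queue-duration series disappears entirely, e.g.
        # because the application or an exporter stopped emitting it.
        expr: absent(job_queue_duration_seconds_count)
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: Shared runner queue metrics have stopped flowing
```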
A: Like, if somebody breaks the application, we still get alerts for that. So all of this stuff kind of flows out of this file; everything kind of comes from this. And so what I would imagine we need is at least a request rate and an error rate for Auto DevOps, right? I don't know if there's anything that you would consider to be a latency score, like a...
A: Obviously, if you can't, you can't, but it is problematic. I'll give you an example: with the registry, the Docker registry that we have, there's a bunch of stuff that a user can do that creates 500 errors, and that really plays havoc with all of our alerting, because, you know, we say for 4xx errors, that's the user; we're not going to wake someone up at 3 a.m. to look at this, because it's a user's problem. But 5xx...
B: That one I am not sure about; like, I can't think of any way that we can really distinguish it, except we could at least maybe flatten it, or try to report it by... so, do something; we could probably do something to have each project not have an outsized impact. So if one user is breaking the whole thing, that would still only show up as one error in, you know, the time frame it happened. Does that make sense? I'm not sure Prometheus offers that, though; we might.
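As noted, it is not clear Prometheus supports this directly, but one rough approximation, assuming the error counter carried a per-project label (which may itself be too high-cardinality in practice), would be to cap each project's contribution before summing:

```promql
# Cap each project's 5xx error rate at 1/s before summing, so a single
# noisy project cannot dominate the service-wide error rate.
# registry_http_errors_total and its "project" label are hypothetical.
sum(
  clamp_max(
    sum by (project) (rate(registry_http_errors_total{code=~"5.."}[5m])),
    1
  )
)
```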
A: ...that threshold a bit higher and then get alerted. The other thing I didn't really explain is one of the things that's really advantageous about having the RPS, right, the requests-per-second rate: because of that, if the rate goes outside of this boundary, right, we will automatically get an alert on it. So if the rate is what you kind of expect... and this rate, by the way, is...
A: ...it's kind of based on a seven-day cycle, so we have, you know, Monday through Friday we always get a certain type of traffic, and then weekends are generally quieter. I'm just loading seven days now; it might take a while because...
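A rough PromQL version of that "normal range" is a weekly mean plus or minus three standard deviations around the recorded request rate; `gitlab_service_ops:rate` stands in here for whatever recorded rate series is actually used. Evaluating over seven days of samples is also why loading such a panel can take a while.

```promql
# Upper edge of the expected band: weekly mean plus three sigma.
# (The lower edge is the same expression with "- 3 *" instead.)
avg_over_time(gitlab_service_ops:rate{type="web"}[7d])
  + 3 * stddev_over_time(gitlab_service_ops:rate{type="web"}[7d])
```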
A: ...you know, through the floor, and so had that been something we were paying attention to at the time, we would have seen that. You know, we obviously had bigger problems than that, but we would have got an alert about, you know, where the web RPS was in this case. And so that could be something you get for free. So presumably one of the things that happened during that incident was that the rate of Auto DevOps jobs dropped to zero, or close to zero, and you would have got alerted on that.
B: Yes, yes, so we do. We do review apps.
A: A little bit. This is something that I would really want to explore a bit further, because we could be posting the metric directly, you know, as the pipeline is doing its thing, but then you're stuck with exactly the data that's in that particular pipeline, whereas we could also be doing post-aggregation of data from the pipelines table, where you could say, per project, you know, we had this many failures in a minute. Yeah.
A: Cardinality, but there's a... you know, put that in structured logging and that would be great. But I mean, where you record that value, you know, whether that's recorded in the runner... I know very little about the architecture of your system, but it kind of feels to me like it would be better to have that in the main application and then kind of exposed from there, right? Yeah.
B: So it probably should be in the Rails application, for this particular thing, and you say structured logging is... so that's actually one of my questions: how much should be going into Prometheus, how much should be going into structured logging, and in both cases, what is our budget, like, for collecting this data? Because obviously this data is not free. No, yeah, so.
A: Obviously you have routes, and we have a few more of those, but because of those everything else has to be smaller, because we have so many routes and you're dividing everything by that. And then on the structured logging side, I would say, if this is something that you would find, like...
A: For me, structured logging is about user events, or events, and whether it's going to be useful for analysis at a later stage. Like, are people using Auto DevOps, is their success with Auto DevOps going up or down, or, you know, another might be: if the first time someone runs Auto DevOps it fails, what's the likelihood of them using Auto DevOps in a month's time? And then you can kind of divide the data that way. Those are all things, like, those are all based on events.
A: So if it's something that you're not going to care about from a business point of view, or a growth point of view, or a diagnostics point of view... like, I've been spending the last two days looking at all of our log data and trying to weed it out. Sometimes it's difficult to gauge that, but generally, if it's something that's been driven by a user doing something, and that thing has started or finished, then I would record it as a structured log; the other stuff, less so.
A: So, you know, we've got things in our logs like every time something goes and checks whether the database connection is still valid, and that's fine for a certain type of log, but getting recorded in structured logs that might be retained for six or nine months, it doesn't matter. What I'm interested in is things that have been caused by users doing something, you know, events. I don't know if that's a nice answer, but that's sort of how I feel about it.
A: So Bob's working on something at the moment which is called application context, and it will basically give you this stuff, right? So, the context in which you're running: who's the user, what's the project, maybe what's the IP address, one or two other things, and then those plus an event. Ideally, what I'd like to do is have all of that kind of just going out to structured logging, so you could say, you know, this person did one invocation of Auto DevOps and it was successful.
A: Yeah, and really for me here, if the logs are not structured I'm kind of not going to use them, because, you know, we just get too much data to deal with in unstructured logging, and obviously for something like Gitaly we pretty much have one log line per Gitaly call, and then we can aggregate those up in Kibana.
B: So back to my question about cost. What would help me a lot, because I would obviously want to just collect everything, because, you know, that makes my life so much easier. But do we have any rough idea: if I'm going to be collecting a metric, let's say a single metric with a single label, do we know roughly how much that costs, like, what is reasonable for a project that is a small fraction of our pipelines?
A: A lot of this thing... the problem is that we've got a lot of servers, and each server has a label, right, and then we have a lot of routes, and then we have a lot of controllers, and then, you know, for the histograms we have a lot of histogram buckets, and so when you add all of those up it just kind of grows out of control. But if you look on here, there's nothing in here that's really sort of terrible, right?
A: Like, yeah, you know, we've got action controller, and so, you know, just action controller and fully qualified domain name multiplied out kind of takes you to pretty much fifty thousand. So, you know, in your case I guess you probably wouldn't need that many values; it's kind of like deploy versus build, success versus failure, mm-hmm. And then sort of what you should try to do, the pattern is: if you keep a histogram for how long something runs for, don't include all the other labels on it.
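A sketch of that pattern: keep the duration histogram label-light, and put the higher-cardinality breakdown (success versus failure, deploy versus build) on a cheap counter instead. The metric names here are made up for illustration.

```promql
# Histogram kept label-light: only the bucket label, so its cardinality
# is roughly just the number of buckets.
histogram_quantile(
  0.95,
  sum by (le) (rate(auto_devops_pipeline_duration_seconds_bucket[5m]))
)

# The outcome breakdown lives on a separate, cheap counter.
sum by (status) (rate(auto_devops_pipelines_total[5m]))
```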
A: No, so one of the problems is that the binding between the alerts and the dashboards and the application metric names is obviously soft, right? There's no compiler checking that everything's the same, and so generally when people change those values there can be a great deal of pain, especially if you lose...
A: ...all your metrics and you don't get the alerts, and things go awry, and so I prefer to keep things as they are. This name, grpc_server_handled_total, is kind of a standard name. We use it on our Gitaly routes, but Thanos actually uses it internally as well.
A: You know, you often also see http_requests_total, which is another name, but in some cases you'll also see gitlab_workhorse_http_requests_total. I mean, there are some naming conventions in Prometheus, and it's well documented, you can just look it up on the website, but mostly, if it's a counter it should end with _total, yeah. And then, you know, this part is kind of the protocol, sort of the subsystem, and then the name itself, but there's not really kind of...
A
We
know
we
don't
have
any
plans
to
kind
of
like
reunite
this,
because
it
would.
It
would
be
a
very
painful
thing
to
do.
What
I
would
rather
do
is
kind
of
focus
on
on
this
metrics,
catalog
and
kind
of
like
these
are
like
and
have
this
as
a
way
of
saying
these
are
all
the
really
important
events
and-
and
this
is
a
map-
and
then
you
know,
maybe
in
future
we
could
even
slip
this
in
in
the
CI
test
and
make
sure
that
the
application
is
actually
publishing
these
things.
Catch
breaks
there
anywhere
sure.
A: ...suffixes that with _bucket. And take a look at the metric naming: one of the things that people sometimes do when they come from other metric systems is that they concatenate what should be label values into the metric name, and that's a big no-no in the Prometheus world. So, you know, people might say... like, even successes and failures, to me, should be something that's in the, sorry, in the label, and not in the metric name.
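In other words, something like the following; the names are illustrative, but they follow the conventions mentioned (outcome as a label value, counters ending in `_total`):

```promql
# Anti-pattern: outcome baked into the metric name
#   rate(auto_devops_deploy_success[5m])
#   rate(auto_devops_deploy_failure[5m])

# Preferred: one counter with a _total suffix and the outcome as a label
sum by (status) (rate(auto_devops_deploys_total[5m]))
```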
A: So the next step of that is a bit paused, because we've got some problems with Sidekiq Prometheus at the moment, it's not working very well, but effectively everything's attributed in that way. The next thing that we're going to do is start enforcing the same attribution on alerts and on services as well, to some degree, so that we can set it at least on the components, so that we can say, like, whose is this? It's still going to be very difficult; there's a lot.
B: So one thing I noticed with the Auto DevOps incident was... so it was an incident for us, and I handled it, but I felt I was lacking a process, right? Because it completely fell out of the normal incident handling process, and I've followed our GitLab incidents before, and the way we do that is quite nice, but I didn't...
A: ...when you have ongoing incidents. But I think what I'd really like to see, and this isn't your fault, I don't think it's well documented, is something like: you automatically go to the production issue tracker and you create an issue there, and we've got scoped labels, and the labels are scoped by the service. It could be tricky, because I guess it would be the web servers, but maybe what we need is feature-category scoped labels as well, so we can say this incident was related to this feature category.
A
You
know
in
obviously
not
in
all
cases.
Can
you
do
that,
but
in
some
cases
you
can
and
then
and
then
actually,
even
if
it
was
an
incident
that
the
the
production
on
call
wasn't
involved
in
you
know
it
was
it
was.
You
know
it
becomes
part
of
the
record
of
of
of
incidents
yep
you
know
and
having
it
having
it
in
there
cuz.