From YouTube: Scalability Team Demo - 2021-05-20
B: Yes, thanks, so, yeah.
B: I'm in the middle of doing a little investigation, and I'm still sort of stumbling around trying to understand what I'm doing, because in my experience that's how investigations work. You can go back and say "we set out to do this experiment, we did it, and this was the outcome", but in reality you don't know what the experiment is when you start. Or you sort of have a rough idea of what the experiment is, but you don't know exactly how to run it.
B: We are doing this work to have fewer queues, because it's better for Redis throughput, and we have another project on the horizon: to have multiple Redis servers, have different deployments of the application, and have Sidekiq talk to different Redis servers. But, like, how soon will we need this? Because the idea is that if we have a more efficient Redis because of having fewer queues, we can process more jobs. But how many more jobs?
B: So, yeah, I should try to talk you through what I'm doing. I have a single Redis server, which is the same specification as what we use for Sidekiq right now. That's a c2-standard eight-core machine on Google Cloud. So that is the Redis server, and I now have two virtual machines.
B: So right now there are two machines that are simultaneously producing and consuming jobs, and the good news is that, with a queue-per-shard configuration, I now have it sort of... I can...
B: At that point I was using one big machine as the workload generator; now I'm using two machines. And I still don't understand well enough why at one moment the limit is 32,000 and at another moment 17,000. But the way I'm observing it now, and this is one of the things I learned:
B: I got Prometheus working on the Redis server, and I'm just using the Redis exporter, which is integrated by default, so that's very nice. The graph I ended up using is the one that shows rates per command, because that gives you a fairly good indication of what's going on. The problem with the load-generating part is that it's a lot of different processes, and if you have to scrape all of them with Prometheus you get a very complicated Prometheus configuration. It's much simpler to have Prometheus scrape the Redis server. And I just restarted the simulator.
B: And so what we're seeing here is something like thirty-four, thirty-five thousand push operations, and around 17,000 operations that correlate with jobs being consumed. I actually modified the workers so that they ping Redis, so the pings correspond to jobs that are actually running.
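A minimal sketch of what that worker-side ping could look like, assuming a standard Sidekiq server middleware and the redis-rb client; `PingMiddleware` and the connection setup are illustrative, not the actual benchmark code:

```ruby
require 'sidekiq'
require 'redis'

# Illustrative middleware: issue one PING per processed job, so the PING
# rate on the Redis exporter's per-command graph tracks real job
# throughput rather than queue bookkeeping operations.
class PingMiddleware
  def initialize
    @redis = Redis.new # assumes the default Redis connection
  end

  def call(worker, job, queue)
    yield # run the actual job
  ensure
    @redis.ping # one PING per job, whether it succeeded or failed
  end
end

Sidekiq.configure_server do |config|
  config.server_middleware do |chain|
    chain.add PingMiddleware
  end
end
```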
B: So this shows a job rate in the order of 17,000. And one weird thing that happened here, because I left this running overnight, was that the whole thing tanked and another command went up. So let me just... like, this was running well, and then all of a sudden job processing tanks and these ZADD and ZRANGEBYSCORE things go up. And I realized that the load generators generate logs, and the disks were full.
B: Yes, yeah. Earlier on I thought I'd found a very good result, but then also all jobs were failing, and that also correlated with the ZADDs and ZRANGEBYSCOREs because, yes, that's the retry set. So I'm still trying to get a good grip on it, but yeah. If I zoom out a bit here, I had a long spell of running at 17k jobs per second, so that's good. And, yeah, here there's a spell where it says PING at 22,000 per second.
B: So I did something right there, but I'm not quite sure what I did right. But even if... yeah, 17,000 and 22,000 is still a lot more than what we process today, because our job rate is more like in the one to two thousand range.
B: So it looks like a single Redis server... this was a very long-winded way of saying that it looks like a single Redis server can process, like, deliver about ten times as many jobs as we're processing right now.
B: Let's see if I have... this is a flame graph I captured just now. It's a little distorted, because you can see the background save in here, and that is not part of the single core; that runs on a different core. But if you ignore that, you see that a very large part is libc write.
B: Well, if you look, so this is 900 samples... 920. libc write is 1,000 samples, libc read is 700 samples, so that means 1,700 samples in I/O and 900 samples... or what is this? This is 586.
B: I think it uses a fairly small payload, so we could make the payload bigger, of course, if we're getting concerned about I/O. I can also try turning on threaded I/O for the Redis server and see what happens there.
B: But, yeah, I guess the other part that I failed to mention is that if we don't do this queue-per-shard thing, then we stall way earlier, around three to four thousand jobs per second, so that...
A: So you're saying that what you've seen in the tests is that without single-queue-per-shard we stall between three and four thousand per second, and with single-queue-per-shard we stall at about 22,000.
B: Yes, 22,000 was the best number; right now I'm stuck at 17,000. Yeah. And another thing to keep in mind, and I was talking about this with Sean, is how to interpret this, because we're getting alerts at way lower levels. Right now we're in the 1.5 to 2 thousand range, and the experiment suggests we could do double the number of jobs.
B: You can push systems harder in an experiment than you can in real life, because your model isn't perfect. And another thing I'm wondering is... I don't know if our concern about Redis for Sidekiq is mainly about CPU saturation, or about things where we see users being impacted, which would mean job delivery taking too long. Because it can be that there's an operating range where our CPU alerts constantly go off but users don't notice. And users don't care if our CPU alerts go off; users care if jobs get delivered fast enough. So I guess another way to say that is: if I observe 22,000 or 17,000 jobs per second as the saturation number, we'll probably never want to run at that, because at that point the CPU would be at 100%, right?
A: Yeah, but I think what I was struggling with when I commented on the issue yesterday was that it was hard to understand whether 22,000 jobs per second was something that we could practically take into account when talking about this. But I think what you're saying now is that, like, on the test rig, it's the difference between...
B: And this is the part I'm still honing in on, so I want to have...
B: I want to come up with a fair experiment where I can say: this is what we get without queue-per-shard, and this is what we get with queue-per-shard, and have a good apples-to-apples number to compare. Because I think the outcome of the experiment would be a multiplier, saying, like, we...
B: We are stalled here, and once we're on queue-per-shard we see this multiplier at the saturation point. And then the other part of my argument is that if at 100% CPU we can do this, then at 80% CPU we can probably do 80 percent of that. So, like, if we're happy at 80% CPU, then we're doing, I don't know, 2,500 jobs... and, yeah, so to scale it... that suggests that it would need to be scaled down. Yeah.
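To make that headroom arithmetic concrete, under the linear-scaling assumption and using the best observed saturation number above purely as an illustration:

$$\text{throughput at } 80\%\ \text{CPU} \approx 0.8 \times 22{,}000 \approx 17{,}600\ \text{jobs/s}$$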
D: Yeah, the way I was thinking of this was: if we do queue-per-shard, we expect the CPU utilization to go down, obviously, but we also expect it to grow slower, which is what Jacob's demonstrating here, right? That's why we get so much more headroom: because of the much slower demand on CPU time as we increase job throughput.
B: Yeah, another thing that I find interesting is that, just judging by command rates, a lot of the work comes from the semi-reliable fetcher. There are different ways to approach this thing, and there's sort of an optimal way to do the reliable fetch that we can't do, because we have multiple queues.
A: I know that I've asked this before, but please refresh my memory. I know the semi-reliable fetcher is something that we wrote, but why did we need to write that? Because we had so many queues?
B: Yeah, and... well.
B: Well, I don't really know the answer. I can make up an answer, because I've looked at the difference between the two implementations, and what I noticed is that...
B: The idea of reliable fetch is that you can't drop a job on the floor if the Sidekiq process crashes. The problem right now is that if a Sidekiq server accepts a job, it's no longer in Redis; it's in the Sidekiq process. And then, if you take the Sidekiq process away, the job is lost.
B: So jobs need to move from the incoming queue to the in-progress set, or thing, and there's an atomic instruction for that in Sidekiq... sorry, in Redis, which is perfect, because then you can never lose the job: by the time you get it, Redis has already put it on the other thing. But that atomic instruction only works...
B
If
you
pull
from
exactly
one
queue
and
if
you
pull
from
more
than
one
queue,
you
need
to
ex
issue
that
instruction
for
every
queue,
so
we
would
have
to
issue
it
400
times
and
the
code
we're
using
the
reliable
fetcher
gem
does
something
naive
where
it
waits
five
seconds
in
between
doing
that,
so
it
it
would
400
times
try
to
safely
fetch
from
one
queue
and
then
wait
five
seconds
and
then
try
again.
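A minimal sketch of the two fetch shapes being contrasted, using the redis-rb client; the queue names and timeout are illustrative, and this shows the underlying Redis primitives rather than the reliable-fetcher gem's actual code:

```ruby
require 'redis'

redis = Redis.new

# Optimal reliable fetch, only possible with a single queue: BRPOPLPUSH
# atomically moves the job into a working list, so a crash between the
# pop and the push can never lose it.
job = redis.brpoplpush('queue:default', 'queue:default:in_progress', timeout: 2)

# With many queues there is no single atomic instruction, so a naive
# reliable fetch has to issue the move once per queue (400 times here):
QUEUES = ['queue:a', 'queue:b'] # ...imagine ~400 of these
QUEUES.each do |q|
  job = redis.rpoplpush(q, "#{q}:in_progress") # non-blocking per-queue move
  break if job
end
```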
B: ...the queues are empty, yeah. So it's a peculiarity of...
C: No, but they do mention you can only use a handful of queues. So if, like, we need to do this operation 400 times, or whatever you said, with no jobs in those queues, that's wasted time. If you need to do that six times, that's not as good as one, but it's not as bad as four hundred.
D: Yeah, I think they've recently updated their wiki page, or I misquoted it before, because it does now say... I can check that, because it's a wiki page... it now says a handful of queues per process. Which, you know, we have: different processes running different sets of queues. It's just that the catch-all one runs on a ridiculous number of queues at the moment.
B: Yeah. Anyway, what I'm trying to say is that...
B: It could be that there's also some extra room to be clawed back just by taking a closer look at the reliable fetcher, but we're not even close to that right now, because right now we need to do something about the queues; we're just stalled on the number of queues.
C: What are you going to try next to make the experiment more...
B: So, and I guess today I was trying to use a 60-core machine and run lots of Sidekiq servers on it. But then, I don't know, you start running out of file descriptors and other weird things happen, and then the system misbehaves for the wrong reasons.
B
So
now
I
am
hoping
I
can
generate
enough
load
with
two
machines.
Maybe
I
need
three
three
machines
at
some
point.
Do
I
need
to
remote
control
all
of
them
over
ssh
with
a
script
or
do
I
need
to
change
the
interface
of
the
script
where
you
need
to
press
enter
three
times
now?
So
it's
I'm
still
sort
of
feeling
my
way
around.
B: Yeah, there's a little bit too much copying in that repo; I'm trying to reduce the number of copies, yeah. I really would like to have this wrapped up by the end of the week, so that's sort of the rough time box I've given myself.
C: So the situation right now is: we have two middlewares, one that we use to record the metrics we use for the service SLIs (so the web, API, and Git services' SLIs), and then another middleware that does a measurement for the error budget Apdex. So we record the same duration twice, with different labels. In the first one, the one for the service metrics, we didn't record an Apdex measurement, so we didn't record the duration...
C
If
the
request
raised,
I
don't
know
if
that
means
that
we
didn't
record
it,
because
I
don't
know
if
that
app.call
thing
in
the
middleware
would
return
normally
with
the
status
500
or
if
that
would
blow
up
and
not
record
so
if
it
would
fall
back
to
the
insure
block
or
not
yeah
right
now,
I've
got
a
merger
quest
out
to
make
both
of
them
do
the
same
thing,
but
I
yeah
I
like,
I
think,
that's
a
good
first
step,
but
then
I
would
like
to
talk
about
what
we
want
that
to
be
actually.
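A minimal sketch of the question being raised, as a plain Rack middleware; the metric call is a placeholder, not the actual GitLab middleware. With this shape, the ensure block records the duration whether @app.call returns a 500 normally or raises (in the latter case, status is nil):

```ruby
# Illustrative Rack middleware: record a duration even when the
# downstream app raises, by measuring in an ensure block.
class DurationMiddleware
  def initialize(app)
    @app = app
  end

  def call(env)
    started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    status, headers, body = @app.call(env) # may raise instead of returning 500
    [status, headers, body]
  ensure
    elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
    record_duration(elapsed, status) # status is nil if @app.call raised
  end

  def record_duration(elapsed, status)
    # placeholder for the real metrics call
  end
end
```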
F: So, when you listen to lots of the conference talks... there's a really good one called "How to Measure Latency" or something like that, by a guy named Heinrich; I'll find it after this. You'll hear he's very specific: he always says "valid requests", and that's basically 2xx and 3xx, and not 4xx and 5xx. And so we've always known that we don't have that, but I guess the main reason is the cardinality of including the status on a histogram.
F: Categories, you can just... yeah, yes: valid and invalid. But we lost Jacob. The thing that I was going to say is that, actually, from an operational point of view, and just knowing the way that our systems in particular fail, I think that having that sum of invalid, or at least the histogram of the invalid things, is still really useful. Just take the problem that we're seeing at the moment with that one endpoint...
F
It
returns
a
four
something
and
it
takes
30
seconds,
and
we
don't
know
why,
and
if
we
didn't
have
a
histogram
of
that.
That
would
be
unfortunate,
I
think
so.
I
think
yeah
having
having
it
as
a
as
a
you
know,
not
excluding
it
but
having
it
as
a
as
a
boolean
or
very
low.
Cardinality
label
would
probably
be
a
good
approach.
B: I mean, well, there's also the classic failure of timeouts: you have 502s after 60 seconds. Those are a combination of bad requests and bad latency.
D: So each request could attain one point. The denominator is just one per request, and the numerator is the number of points we actually got. So at the moment we say (and I'm going to use "valid" and "invalid", but our definitions of valid and invalid don't match the ones Andrew mentioned): fast valid gets one point, slow invalid gets zero points, slow valid gets half a point, and fast invalid gets half a point.
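A minimal sketch of that scoring scheme exactly as described; the helper name and the sample data are illustrative:

```ruby
# Illustrative scoring: each request contributes 1 to the denominator;
# the numerator is the points it earns.
def request_points(fast:, valid:)
  return 1.0 if fast && valid    # fast and valid: full point
  return 0.0 if !fast && !valid  # slow and invalid: no points
  0.5                            # slow valid, or fast invalid: half a point
end

requests = [
  { fast: true,  valid: true  },
  { fast: false, valid: true  },
  { fast: true,  valid: false },
]
score = requests.sum { |r| request_points(**r) } / requests.size.to_f
puts score # => 0.666... (2.0 points earned out of 3 possible)
```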
D: I'm just using "point" to mean, like, this is one... yeah, it goes on what's on top of the fraction. What fraction? The fraction is the total availability score, right? So, like, the total number of... let's say, for the requests. Yeah, so say you have 100 requests that all succeed in good time; then you have 100 divided by 100, so you have 100%. But...
C: But we do the same thing for our SLA, like for our availability... service availability.
F: Instead of... yeah. So it's... because we have two different gates that people go through, right, and obviously it's just a different denominator, so it kind of comes out slightly differently.
F: It is easy to calculate, but I think it has benefits as well, and I'll tell you the main benefit. If we were doing it the exact way of, like, the SLO book and everything like that, we'd get one SLO, and that represents errors and latency, right? And I think that we have different failure modes when we have 500s and when we have things slowing down, even though they are sort of both bad user experience.
C: I linked it just now, in the demo doc: that's the equation we're talking about, and we're using the same one for both.
F: Yeah, we could merge them into one gate, right, and the gate is either zero or one, and you score one if... if the server fails with a 500, or your request takes longer than a certain amount of time, and that's basically bundled up. If you get a 400, it doesn't even count in the proper model, right?
F: If we wanted to go the whole hog, I think the only way you could really do that is by changing the code so that, effectively, we have a count that we count up on every request that's not a 4xx (we count up at the bottom), and then on every request that's not a 4xx, that takes less than a certain amount of time, and that's not a 500...
F: ...we count up one on the top. And that's the only way, and that's, like, a Ruby piece of code, and that's the only way you could really do it in Prometheus, effectively, for latency and valid requests, if you wanted to stick to the proper, proper definition. But I don't know if we want to do that. Yeah.
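A minimal sketch of that counter pair, assuming the prometheus-client Ruby gem; the metric names and latency threshold are illustrative:

```ruby
require 'prometheus/client'

registry = Prometheus::Client.registry

# Denominator: every request that is not a 4xx.
total = Prometheus::Client::Counter.new(
  :slo_requests_total, docstring: 'Valid (non-4xx) requests'
)
# Numerator: non-4xx requests that were fast and not a 5xx.
good = Prometheus::Client::Counter.new(
  :slo_requests_good_total, docstring: 'Valid requests that met the SLO'
)
registry.register(total)
registry.register(good)

THRESHOLD_SECONDS = 1.0 # illustrative latency threshold

def record(status, duration, total, good)
  return if (400..499).cover?(status) # 4xx: excluded entirely
  total.increment
  good.increment if status < 500 && duration < THRESHOLD_SECONDS
end
```

On the Prometheus side, the SLO ratio would then be the rate of the numerator counter divided by the rate of the denominator counter.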
B: I'm also a little bit lost on what problem we're trying to solve. I guess... I mean, what Bob was saying: that we should treat 500s and exceptions the same way. It shouldn't depend on where our middleware sits in Ruby whether something gets counted as a bad... as a 5xx, right? Because if something turns it into a 5xx before the middleware, then it shouldn't fly by; and if it's an exception, like, that ordering shouldn't matter. But that's straightforward.
D: Because I wonder if we should then just change this, put that in the docs, and say to people: this is... well, maybe not exactly like that because, like Jacob said, half a point is kind of confusing, but yeah, something like that. Something that says: these are the axes your request is judged upon, and, you know...
F: So, just... I haven't read through this whole thread, but what's a "handled error", sorry?
D: So this change makes it so that any error that bubbles up, like I mentioned earlier, any error that bubbles up out of Rails... well, out of the application, and gets handled at the Rails or Rack level, gets zero points, even if it's fast. No... is that right, Bob? Or does it just not get recorded? It doesn't get recorded, so that bottom table is wrong.
C
So,
where,
where
we're
at
now,
for,
if
something
raises
in
the
code,
we.
F: One out of two, because in that other one it would actually be, like, half out of... sorry, one out of two, or one, or...
B: Yeah, the bug is very clear; like, it shouldn't... yeah. We have code at the level of ApplicationController that... like, where does the Gitaly 503 come from? I think that's in ApplicationController. Yes.
C: There's a thing for that, like a mapping you can write out. I've looked into it with Mario, who left the company like a year ago, I think, but that doesn't exist.
C
That's
very
close
to
the
question
I
want
to
get
answered
here
like
right
now
we
have
a
thing
where
only
a
certain
type
of
500
gets
handled.
Should
we
change
both
of
those?
What
do.
C
Certain
type,
because
certain
type
as
in
render
503
versus
raise
something.
B: I mean, think about the rest of...
C: We know what we have to do, because we just need to move them. We just need to move... there I go again, because I just moved something. But, like, we just need, in the Rack request middleware, the one that we use for the entire observability stack, to move the Apdex measurement to the bottom and check the status code there. Yeah, so we record... no, actually: do we want to record it or not? I still haven't had an answer to that, actually.
F: I also think it might be really helpful just putting, like, good request / bad request on it, without all the different status codes, or maybe the status code class. That's what Google do a lot: they have 2xx, 3xx, 4xx, etc. And then, putting that on the latency, we can kind of progressively bring it in, right? It's not like a... you know, we can have two different recordings; we'll...
C: Yes... no, I don't know, I don't think so. No, we just have the method (GET, POST) kind of thing.
C: Everything that counts, like everything that counts for infrastructure, we're doing the bottom thing, so...
F: Measuring and all of that stuff is kind of... you can see a lot of other people who have the same kind of discussions, which is kind of nice to see, how things shake out.
D: Oh, sorry, I mean, yeah. Well, the error budgets are documented on the "dashboards for stage groups" page, right? It's not just about error budgets, so that's what I was talking about there. It's, like, you know, this is where we lead into talking about what the metrics that we use on GitLab.com are to developers, so, yeah.
C
The
page
that
I
linked
could
have
also
could
mention
this
as
in.
If
your
request
is
an
error,
it
will
not
count
as
an
uptext
like
there
will
no
not
be
an
aptx
up
like
in
the
the
equation
that
I
linked
there
and
we
could
add
that
to
the
dot
to
the
dot
there.
Yes,
I
was.
F: So, like, I really think it's important that all of the different definitions are as close to being in line with one another as possible. Like, the ones that we use for service monitoring and the error budgets are kind of... you know, we actively try not to diverge those too much, and where they do diverge, we bring them back in line, because it's just cognitive overhead when...
F: Like, maybe let's get this done here, but maybe a longer-term thing is to, like, actually start introducing another metric that is, you know, basically a one-and-a-zero counter, and then the flexibility you get there...
F: You know, this was actually in that SLO monitoring v2 proposal, but the proposal there is that you actually give the teams the ability to define those thresholds themselves, you know, with a... with a thing, and then it's actually in the code, as opposed to it being on... yeah.
B: Right, and then you could even do things like calculate the num...
B: ...general model of it.
C: Short term, I'm not going to change a lot, just change it so that we don't care anymore how an exception was raised, how an error, how a 500 was...