From YouTube: Scalability team demo 2020-07-17
A
Cool, thanks everyone for joining the scalability demo. We're a bit of a small crowd today, but Jakub, you have an item on the agenda.
B
Yes. It's unfinished, but I think there's already lots of food for thought and things I want to share. Let me share my browser, so it's easier to know what we're talking about. I'm working on an issue where we're trying to add latency metrics to all the Redis servers, because we're focused on Redis observability right now, so I really only care about this top row of the dashboard.
B
We have three distinct Redis services and they each have their own dashboard, but they look almost the same, and up until last week there wasn't even an error ratios panel here. For context, what we expect to see is four panels: latency (or apdex), error ratio, requests per second, and saturation. The error ratios are super boring: there are no errors.
B
This was already discussed last week, which is awesome, but I still think it's useful to have the panel. Because if you know how Redis works, or how we use Redis, then maybe it's obvious that there are no errors. But if you don't know that, it's not obvious, and this makes it obvious, and it makes it uniform. So I still think there's value in knowing this, but yeah.
B
So now I'm working towards the fourth panel, and that is latency, and that one is less rosy than the errors, because it turns out that latency isn't all that great, or it's not what I expected it to be. And now I need to digress into another specific technical aspect here, which is that we need to measure these things on the client side.
B
Redis is meant to be very fast, and if it logged everything it's doing, it would slow down just by having to do all that logging. So you get only very limited information out of a running Redis server about what's going on. It's great that it's fast, but that limited information is a challenge.
B
So in this particular case we decided to measure on the client side, and that is a fairly sane approach, because GitLab is mostly a monolithic code base and almost all the Redis use is in the monolithic part of the code base. So if we measure Redis there, then we probably see 99% of what we do with Redis anyway, so we might as well say that this is how the Redis server is doing.
B
That's a practical solution, but yeah, some things look slightly different. One thing I found surprising is that, in my mind, Redis is supposed to be fast, and it is, if you think about what the Redis server has to do. But if you think of the bigger picture, of the experience from the side of a Redis client, it turns out that Redis is slower than I thought.
B
So I was thinking a Redis request should take like one millisecond, and five milliseconds would be really slow for Redis. That was my expectation. So what I did is we created a histogram in the Rails code base. Actually, this code is new and wasn't there yet. The slowest type of request we could observe was 100 milliseconds (these are seconds), so anything slower than 100 milliseconds...
B
...I thought was so terrible that we don't even want to think about it. But it turns out that if you look at 99th percentile latency, it's just constantly above 10 milliseconds. And another thing I had to learn here, which makes sense in retrospect, is that the Prometheus histogram can only come back with the value of the largest bucket: any observations that are higher than the largest bucket value get reported back as that largest bucket value.
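(A minimal sketch of that clamping behaviour, using the Python prometheus_client; the metric name and buckets here are made up for illustration, not the real GitLab ones.)

```python
from prometheus_client import Histogram

# Hypothetical client-side Redis latency histogram; buckets are illustrative.
REDIS_CALL_SECONDS = Histogram(
    'redis_client_requests_duration_seconds',
    'Time spent on Redis calls, measured from the client side',
    buckets=[0.001, 0.005, 0.01, 0.1],  # largest finite bucket: 100ms
)

REDIS_CALL_SECONDS.observe(0.002)  # lands in the 0.005 bucket
REDIS_CALL_SECONDS.observe(0.750)  # only lands in the implicit +Inf bucket

# When the p99 falls into the +Inf bucket, PromQL's
#   histogram_quantile(0.99, rate(redis_client_requests_duration_seconds_bucket[5m]))
# returns the upper bound of the largest finite bucket (0.1 here), so every
# observation slower than 100ms reads back as exactly 100ms.
```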
B
So all this tells us is that there's a bunch of calls that took more than 10 milliseconds, and yeah, the 99th percentile is always above that. Except for this one thing, which is the shared state Redis. So sometimes this is fast, and then you get 0.06, so that is six-ish milliseconds... that's not really... yeah.
B
Right, this is again at 99 here, but it's dipping to values like 97, and then this cache instance is more like 98.6, and then here it dips again to 97.4. And this is the Sidekiq Redis, and this dips to 85, meaning that 15% of requests at this point took longer than 10 milliseconds from the client side. So I've been having a discussion with Andrew about this: this is not usable for our latency monitoring, and we just need to own up to the reality that, from the client side, these things are slow, and we need to have buckets that allow us to reflect that.
C
Yeah, I mean, the thing to keep in mind as well is that that's obviously including the network round trip and the data transfer, and we do some rather abusive things with our Redis, with very big objects that we shouldn't really be storing in Redis. Traces come to mind here.
B
Yeah, exactly. This is one of the things I learned as we worked on Redis: people always say Redis is single threaded, but it's really the part of Redis that looks at the request and writes the response somewhere in memory that is single threaded. It then tells Linux to send that response back to the client, and that then gets done by the kernel in different kernel threads. So yeah, it can take much longer because of the data transfer as well.
C
They say you'll get much higher throughput because of that as well. So there's more to come. That's great.
B
What we're discovering here is exactly how bad things are on the client side. And what I would really have wished for today is that I could show you this dashboard with a fourth panel here, and I could say: look, this is how we're doing. And I can't, because I had to go back and look at the buckets. Which dovetails with another interesting topic that I'm just going to mention for a moment. This came up...
B
Sean brought this up a while ago, and then Andrew also brought it up, and it's sort of going around that we have too many latency buckets on our histograms, and this is flooding our Prometheus servers. And I took this as an exercise to say: let's have very few buckets in the first merge request. There are four buckets here, and this is a very small number by GitLab standards.
C
And I wonder whether this would be something that would generally be a better way to go. Because you can see the pain that this, you know...
C
I think you're right here. Also because we've kind of decided we're not going to use latency histograms in this way, because they don't work for us, because we have other cardinality problems that are coming into play, like the number of servers. We're saying we want three buckets, but a lot of clients are going to say: what kind of buckets are these? We're not going to get any resolution.
C
Our histogram estimations are going to be terrible, and so maybe it makes sense to extend out what we did on Gitaly to a broader audience.
C
Also, if anyone else is using the latency apdex scoring, every time we change these buckets we'll probably break theirs, which won't please them very much. No.
C
Yeah, I mean, we should never say that the metrics are stable, because that would really hamstring ourselves. But it is a thing to think about.
B
Yeah, but this is a mistake waiting to happen. And just to spell it out for those of us who are not super familiar with this: the apdex panels we have in the metrics catalog are standardized. This is the system Andrew has been building, and they all use the same formula, the same Prometheus query, and these do counts on exact bucket values.
B
So if you move the bucket boundaries... for computing percentiles from histograms, moving bucket boundaries is irrelevant, because the quantile calculation just takes whatever buckets exist and computes the value out of that.
B
But the way we do apdex, we care about actual bucket values, and with this slow process, where we change the buckets, merge it, wait for it to be deployed, it can be that we find out the next day we broke our alerting, and then we need to go back and change it again. And if this was in configuration, then we'd be able to fix a mistake like that much quicker.
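(To make that coupling concrete, a minimal sketch of an apdex-style query of the kind being described; the metric name and the 0.1s boundary are hypothetical, not the actual metrics-catalog definitions.)

```python
# The query hard-codes the literal bucket boundary in the `le` label, so it
# only works while a bucket with exactly that boundary is being exported.
APDEX_QUERY = (
    'sum(rate(redis_client_requests_duration_seconds_bucket{le="0.1"}[5m]))'
    ' / '
    'sum(rate(redis_client_requests_duration_seconds_bucket{le="+Inf"}[5m]))'
)

# If a merge request changes the buckets to, say, [0.05, 0.25, 1], the
# le="0.1" series silently disappears, the numerator matches nothing, and
# any alerting built on this query breaks -- noticed only after the deploy.
print(APDEX_QUERY)
```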
C
Yeah, there's another sort of related issue that I ran into this week, which is that I realized we can treat the Google load balancers that we have in front of a lot of our services in the same way that we treat HAProxy, as a kind of external ingress, but we kind of count it as part of another service. So we do that for registry and stuff, and one of the nice things about the Google load balancer...
C
...but the problem is that they use a purely logarithmic calculation for the bucket sizes, and so the bucket sizes have got these really ridiculous names, like 10.000423895, ten-digit-long latency values. And so I was trying to figure out: is there a way that we can do it without those exact values? And the best I could come up with was a regular expression, which is probably worse than nothing.
C
No, but in jsonnet, the satisfied threshold is a float, but we can change that.
C
So there's one more little extension to that, which is one of the interesting things we could potentially do, but we'd have to change the Prometheus client that we use in Rails a tiny bit. Because one of the things I'd really like to do is put those thresholds into the application, and not into, you know, a separate place, so we only have a single source of truth, right? And one of the things we could do is...
B
No, I want to come back to apdex, but finish your thought first. Okay.
C
So because then what you can do is you can expose that on the latency histogram, but only on the special thresholds, and then you can basically sum those and divide them by the infinity bucket, and you don't have to know what the values are. You just say: sum the thresholds and divide that, and then you get the apdex, and then it's all managed in one place. Yeah.
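(If I follow, a sketch of that idea might look like the following; the metric name and buckets are hypothetical, and the real change would live in the Ruby Prometheus client rather than Python.)

```python
from prometheus_client import Histogram

# Hypothetical: the application declares its apdex thresholds as the only
# bucket boundaries, so the query side never needs to know their values.
REQUEST_SECONDS = Histogram(
    'hypothetical_request_duration_seconds',
    'Request duration with apdex thresholds as the bucket boundaries',
    buckets=[1.0, 5.0],  # satisfied and tolerated thresholds
)

# A dashboard can then sum *all* finite buckets and divide by the +Inf
# bucket, without hard-coding any `le` values, e.g.:
#   sum(rate(hypothetical_request_duration_seconds_bucket{le!="+Inf"}[5m])) / 2
#     / sum(rate(hypothetical_request_duration_seconds_bucket{le="+Inf"}[5m]))
# Dividing by 2 works out because summing both cumulative buckets counts
# satisfied requests twice and tolerated-only requests once -- exactly the
# 100% / 50% apdex weighting.
```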
B
That makes more sense, because now, even if we push the bucket definition into config, then we still have Chef somewhere, and then we have the runbooks somewhere else, and we need to make sure we use the same numbers. So yeah, that would be smarter. What I wanted to ask about apdex is...
B
I've also been looking at the second Google SRE book, the SRE workbook, where they talk about alerting and these sorts of things, and how to define SLOs. And it seems like apdex... intuitively, I'd say the simplest thing is the number of requests that are below a threshold, but it's actually slightly more complicated than that. The way I understand it is... what does it actually mean? Because it's the average of the satisfying...
C
If it comes in less than or equal to tolerated, you give it fifty percent, and that's the weird thing, but other than that you treat it as a latency error. But a lot of our apdex thresholds don't have the tolerated, so they've only got a satisfied, and there are various technical reasons why it's like that, and for those requests it's effectively a straight error budget, you know.
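(A worked example of that scoring as described, with invented numbers.)

```python
# Hypothetical one-minute window of requests against a 1s satisfied and a
# 5s tolerated threshold (cumulative counts, as histogram buckets report them).
total = 1000       # all requests
satisfied = 900    # completed within 1s
tolerated = 980    # completed within 5s (includes the satisfied ones)

# Apdex: satisfied requests count fully, tolerated-but-not-satisfied
# requests count half, anything slower counts zero.
apdex = (satisfied + (tolerated - satisfied) / 2) / total
print(apdex)  # 0.94

# With only a satisfied threshold it degenerates into a straight error
# budget: the share of requests over the threshold are "latency errors".
print(satisfied / total)  # 0.9
```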
B
So basically... and it's much easier to do. So exactly, my question is... I think the original apdex was even more complicated.
B
...requests. But thanks for that explanation. Why do we need the more complicated version, instead of just saying satisfied or not?
C
So really what it is, is just kind of sticking to the standard. But I tend to leave out the tolerated, because of exactly the same thing that you see here: it's much easier to cognitively process. It's like: this is a pure error budget, in the same way the errors are. But what you'll find is that, for a lot of our services...
C
...at the moment, we kind of need that half score in order for them not to look really bad.
B
99... But this is very interesting, because we had the same problem in the discussion about how to set these Redis buckets. The status quo is that things are as fast as, however, they are, and we're trying to come up with SLOs. And I think the way I look at it now is that we probably want to have SLOs that capture what is going on now, and say: this is reasonable, even if it's not. And...
B
...at least then, if something slows down and becomes even worse than it is now, we know about it. And then, exactly, the future step is that we tighten those SLOs and say: actually, we want Redis to be twice as fast, and we're going to start measuring whether we're there, and once we get there, we lower the alerting threshold.
C
That's exactly it. And all of the thresholds that we've got are based on what was available, right? What you're doing now is manipulating the buckets, but almost everything else is like: this is the best bucket we've got, so we're going to use that, and over time we will start improving that. Sorry, you had something to say, Marin?
A
Yeah, I wanted to actually ask, because in my head I thought that the reason why we wouldn't want to use tolerated in the apdex calculation is because, like you said, it's cognitively easier to grasp, but I also don't know...
C
Okay, okay, just checking.
A
But what I was trying to understand is: doesn't that also mean that you are forcing yourself towards a higher standard, because you don't have that in-between state? So why is that bad?
C
So it's not bad, and that's something we would like to get to. So say at the moment we have a one second and a five second threshold for Workhorse (I'm just going to use hypotheticals), so the choices are: okay, we're going to drop the five second and just use one second. Then you might find, and this is all hypothetical...
C
And, you know, the thing is, if you go back six months, when I started writing all of this stuff, I was kind of like: if I go through and change every exporter, every application, every histogram, this is going to be years of work. So let's use what we've got and then iterate on that. But I don't think that those are the, you know...
C
But the thing is, there's one part where they're still the same, and that is they still use the same thresholds for the apdex scores. And if we want to do this properly, we've got to break those apart as well, because if we go and tighten all these thresholds up, it's going to screw up the contractual ones without any change in the actual latency.
B
So that makes a lot of sense. But especially when you're talking about SLOs and SLAs, and maintaining a more complex set of thresholds with higher business consequences if we mess them up, it seems like it's even more important to have a system that is sane to work with, something that's maybe a bit more friendly to work with than what we have now. Because right now we have buckets that are hard coded or that don't make sense, and we have runbooks that need to match things that exist in the buckets, and we use apdex to create in-between thresholds.
C
So there are several approaches we can take. The first thing is, I think, we've got to split this. It's actually not as difficult as it sounds, because we only need this now because the SLA metrics are only recorded on the key services, and that's git, web, api, registry and ci, I think. Are there other key services?
C
A bigger part, yeah. But what I'm saying is, it's actually a smaller subset of the services, so it doesn't really matter as much for those ones, but we could go with those and define separate thresholds for them. The other way that you could do it, which is a little bit more... I think it might be too confusing for people, but it's almost on a per-request basis: in the application, we have a counter, so effectively...
C
...the counter is: did this request complete within the tolerated... within a reasonable amount of time? And then effectively you divide that by the total number of requests and you get the same thing, but then you can vary it per endpoint, and again you're pushing it back into the application, and you're not relying on the histogram infrastructure in order to build up that score. Effectively, it's just an error counter. It's an error counter!
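(A minimal sketch of that per-request counter idea; the metric names, endpoints, and thresholds are hypothetical.)

```python
import time
from prometheus_client import Counter

REQUESTS_TOTAL = Counter(
    'hypothetical_requests_total', 'All requests', ['endpoint'])
SLOW_REQUESTS_TOTAL = Counter(
    'hypothetical_slow_requests_total',
    'Requests that exceeded their latency threshold', ['endpoint'])

# Per-endpoint thresholds live in the application: a single source of truth.
THRESHOLDS_SECONDS = {'projects#show': 0.5, 'api/v4/jobs': 2.0}

def record(endpoint, started_at):
    """Count the request, and count it again as a latency error if slow."""
    REQUESTS_TOTAL.labels(endpoint=endpoint).inc()
    if time.monotonic() - started_at > THRESHOLDS_SECONDS.get(endpoint, 1.0):
        SLOW_REQUESTS_TOTAL.labels(endpoint=endpoint).inc()

# The latency "error rate" is then simply
#   rate(hypothetical_slow_requests_total[5m])
#     / rate(hypothetical_requests_total[5m])
# with no histogram infrastructure involved, and the threshold can vary
# per endpoint.
```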
B
That might have the advantage that, if we want to head towards a future where we can break things down by feature category, you might be able to see... if you...
B
But like, the current breakdown of web and API requests is quite coarse, and if we want to have a situation where we can say (I don't want to pick on any feature category, but some feature), we want to be able to say: hey, compared to your goals, your feature is often quite slow in production, and maybe you need to spend time looking into that.
B
No, I think, from an optimization point of view, it makes some sense, because you're only counting the things that you need to count, and it will probably scale better in the end. But it's highly specialized to what we're doing, and it might be, like you say, more confusing to people, if it's not histograms but our own "the one part about histograms we care about" counter.
C
Yeah, I mean, you could just say that we make it like Gitaly, where you can turn off the histograms, or you can turn them on and use your own buckets, but effectively, for the service level metrics, we're just counting an error rate. Yeah.
B
Yeah, it's interesting. My main takeaway is that the current system is too fragile, and we need to do something, and this would be one way of dealing with it. And your other idea, of configuring buckets and saying these are the...
C
The only risk I have with that is, you know... for me, and I don't know if this is a real risk or not, we just have to test it: Prometheus is quite delicate about label matching. And I don't know, if you had a histogram made out of, say, 10 labels, and eight of those labels matched on everything except le, and then two of them, you know, say they have the is-apdex-threshold value...
B
For the benefit of the people who don't have this in their heads, let me spell out what I think we're talking about. Prometheus, in its internal data model, has histograms, which are things you just set up with buckets, and you say: I saw a number, and it puts it in the right bucket. And these get exported as a series of individual counters; that is sort of how the wire format works. When you have a Prometheus client, you define a histogram...
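(For reference, a sketch of what that wire format looks like, with a hypothetical metric and the Python client.)

```python
from prometheus_client import CollectorRegistry, Histogram, generate_latest

registry = CollectorRegistry()
h = Histogram('demo_seconds', 'Example histogram',
              buckets=[0.01, 0.1], registry=registry)
h.observe(0.05)

print(generate_latest(registry).decode())
# The one histogram is exposed as individual cumulative counters keyed by
# the `le` label, plus a sum and a count (alongside HELP/TYPE lines and,
# in newer client versions, a _created timestamp):
#   demo_seconds_bucket{le="0.01"} 0.0
#   demo_seconds_bucket{le="0.1"} 1.0
#   demo_seconds_bucket{le="+Inf"} 1.0
#   demo_seconds_count 1.0
#   demo_seconds_sum 0.05
```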
B
The other thing I...
C
Just... yeah, but...
C
Yeah, what do you mean, you don't support... weird. Yeah, so I agree. And so maybe then going with the... effectively, the latency error counter is the other thing that's kind of interesting about that approach...
C
...which is that, at the moment, and I don't want to jump too many steps ahead, but it's something to think about: at the moment with apdex, we always score it so that good is 100% and bad is zero percent, and then with error rates, good is zero percent and bad is a hundred percent. And if we just started treating a latency error as a different type of error (I'd still keep them as two separate dimensions on the dashboards, for example), then effectively good is zero on both of them, right? And then, if you...
C
But in fairness, the system that we've been using kind of predates the metrics catalog by at least a year and a half, maybe two years. We've been using apdexes for all of this stuff, and it doesn't break very often, if at all, and we do have alerts to make sure that if one of those series disappears, we get an alert on it. And the only one we get that for is the errors, but that's a different thing. So it doesn't...
B
It's hard to make changes to, and we were...
B
...saying that we are taking the average of two types of latency error rates, or latency non-error rates, to compensate for not having the right buckets, because editing all these things is such a pain. So yeah, the real problem is that fine-tuning the system has a lot of friction.
C
Yeah. And then, you know, using a separate counter, where you can just plug the threshold in on that concept... the other thing that's interesting about it is that you could then potentially (and this is just ideas, ideas are free)... one of the things you could potentially do is, you could say the threshold is one second, and then in the application they could say...
B
Ideally... The other thought I had, in favor of your latency violation counter idea, is that clearly using histograms correctly is hard, and this allows us to move away from histograms, and we can make it clear what these things are for. These things are for alerting, and this is our latency alerting infrastructure, and if you want to know how fast things are, that's not what this is for.
C
I think that was a good first step, but ultimately it's something we want to move away from, towards the violation counter. But, you know, if we'd gone to that as the first step, it would have been way too much, and everyone would be like: what the hell is this thing?
A
I would like to interrupt you, if possible, and ask a question that is going to seem like a very amateurish attempt at being in the conversation, but I'm sorry, you kind of went way too far, for me at least. I want to ask a question about the Redis latency metrics that you're working on, and the fact that you said we're actually not measuring what the Redis server is doing, we're measuring what the Redis client is doing within the GitLab code base, right?
A
What kind of information will we be able to extract when we see that things are going sideways? Like, how are you going to use that information to resolve whether a code change created something, or Redis's network is failing, or whatever?
C
This is what we've been doing for this dashboard for quite a while, like at least nine months, and this value over here actually comes from something called rails-sql. So rails-sql has got an apdex and an RPS, and it doesn't have an error rate, I don't think; we don't do any sort of error tracking there. We probably should, but we don't at the moment. I think it wasn't immediately apparent, and there have definitely been times... I'm just trying to find...
C
...a good... I mean, I'm trying to remember. I think in November we had a bunch of problems with Patroni, and the signal was really useful to have. Obviously the downside of it is, and I had to explain this on a few incident calls, that this thing is reading as Patroni, but it could equally be a bad application...
C
...you know, like a poor application change, right? And so there is a little bit of extra cognitive overhead for the person on call, because it's saying Patroni (let's just call it PgBouncer, because Patroni is a terrible name for the service), this service is running slowly, but actually, because we're proxying the data from a different service...
C
...it could be something in between the two. It could actually be, you know, PgBouncer. It could be the application itself. It could be any number of things that's causing this. So there is a bit of cognitive overhead, but at the same time it's also super useful, because without it we wouldn't have that sort of latency information, and it is a very useful signal, even if it's, like, flawed in a way.
B
Trying to remind myself to also come back to Marin's question: I'm reminded of working on Gitaly, where we would have latency information, and that doesn't tell you so much, and then I would be able to go into ELK and look at access logs, and then you can find out where it's coming from. And we're just not in the same situation with Redis, because there is no access log. And so the question is: the latency number says something is bad, now how do we find out...?
B
We do have some access-log-type information, because the Rails access log prints the total time spent in Redis calls. So if a particular thing is hogging Redis, then you can find it in Kibana, based on that, in the Rails logs. So yeah.
A
The reason why I'm asking this is because, if I think about the work that the CI/CD team is doing right now, to enable traces to go through object storage, or to go to object storage, what they're going to end up doing most likely is leveraging Redis for this: they're going to buffer things and, like, post them there. And I don't know how they're going to make that performant before they upload it to object storage, and theoretically that one can impact us significantly with this number.
A
But how would we find out when they enabled that, or when they turned it on? Like, how will we know that it was that, and not...
B
...five other things? Yeah: the Rails application log. Because of the work we've been doing in this epic, we have the total number of Redis calls, we have time spent in Redis calls, and we have bytes sent to Redis and bytes returned from Redis, per Rails request or per Sidekiq job. So if you have a Sidekiq job that is pumping a ton of data into Redis or out of Redis, that will show up in those logs, and you can aggregate that by controller and action. So you can see: hey, this controller...
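(As a sketch of how those log fields could be used; the file name and field names here are illustrative, not necessarily the exact GitLab log schema.)

```python
import json
from collections import defaultdict

# Aggregate structured Rails/Sidekiq log lines by controller#action,
# summing the per-request Redis fields mentioned above.
totals = defaultdict(lambda: {'calls': 0, 'duration_s': 0.0, 'bytes': 0})

with open('production_json.log') as f:
    for line in f:
        entry = json.loads(line)
        key = f"{entry.get('controller', 'sidekiq')}#{entry.get('action', '')}"
        totals[key]['calls'] += entry.get('redis_calls', 0)
        totals[key]['duration_s'] += entry.get('redis_duration_s', 0.0)
        totals[key]['bytes'] += (entry.get('redis_read_bytes', 0)
                                 + entry.get('redis_write_bytes', 0))

# The heaviest Redis users, by total time spent in Redis:
for key, t in sorted(totals.items(), key=lambda kv: -kv[1]['duration_s'])[:10]:
    print(key, t)
```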
A
Okay, so I have one ask for you, Jakub. We are at time, so I have one ask for you, now that you explained this so well: could you please add this to the epic, as in, what kind of benefits did we actually get from the work we've been doing? Because I've been digging a bit this week into the work that is being done, and I understand, like, when I read the issue...
A
If nothing else, put it as a comment in the epic discussion section and get Sean to edit it. But, you know, it can be more than that, because I think there is more than what you just said. There are, like, a lot of really useful things that we did; it's just that I can't explain them in, like, one high-level overview.
B
Yes, I feel your pain. But from my point of view, these are the most important ones: that we have the rich data in the Rails access logs, so we can point fingers at requests, and that we are getting some more signal on whether the Redis service is healthy or starting to slow down.
B
What is the most important thing? I have a take on that, but that is very personal, and it pains me to say it, but it might even be that the things I worked on look more important to me, so maybe I'm very biased and wrong. But I can still give my opinion, so I'm going to put my opinion on the epic and say: I think this is awesome, that we can do these things now. And maybe someone else on the team will come in and say: well, actually, this other thing we worked on is also great. But that...
A
...try to explain, from the high level, why I think this is kind of important to actually note down with every epic that we are doing (and Rachel is going to get a task to ensure that we keep doing this): we know that we improved the platform over the past, let's say, year.
A
We know that there were, like, a lot of efforts that pushed us over the line to now say that we can target that 99.95 reliably.
A
All of these small things, under air quotes, matter a lot. Because if you think about us being in the incident calls a year and a half ago, we would spend nine hours in an incident call trying to discover what is happening and not finding it out. When you have information like this, you spend five minutes.
A
Instead of nine hours, right? Yeah, you get that information, which means that you can quickly find the culprits, which means that you can quickly resolve it, which reflects in the SLA in the end, right? So I want to be able to present that the scalability team is doing this type of work, and even if it's not our key performance indicator, this is the performance that the team is putting out there.
B
Yeah, this is exactly how I would... What I think should be the ultimate value that we're providing, or one of the ways we are providing it, is: how long does it take to find out what the problem is in an incident, and, in the case of the Sidekiq queues, if something is wrong with the Sidekiq queues, how long does it take for them to go back to normal?
B
I presume that this is one of the areas where you made a big difference with the Sidekiq work, so that makes total sense to me, and it aligns with how I imagine things work. The one thing I want to say is that, as someone who is not in incidents, I can say I think this helps, but I don't feel like I know this helps. Like, will this type of thing now get resolved faster? Have we communicated this well enough? Can we measure whether it gets resolved faster?
A
That is all really hard to measure as well, because you can give the perfect tool to someone who doesn't know how to turn it on, and it doesn't really matter then, right? But ultimately, if we jump into calls (when I say we, I mean all of us in this call and watching this recording), it means that we can actually achieve something.
B
Yeah, I'll sit down and write a comment with my personal take on what I think are the highlights of what we just gained.
A
Perfect, thanks for sharing, Jakub. I think this was a good discussion, even though there were parts of it that I didn't even follow.
B
I was doing my best to interrupt the discussion sometimes and try to keep it from going out of control, but...
A
Yeah, it's fine, but I got my two questions in that I actually wanted to ask, so that's good. Cool. Well, I think we can end the call here, and we'll chat all later. I hope you have a good weekend.