From YouTube: Scalability Team Demo - 2021-09-22
A: Okay, recording. So what I wanted to talk about: when we introduce the new SLIs, we'll allow people to set their own target durations for requests. We don't want what we have now, where everything that takes less than five seconds is okay. There needs to be a good reason to set a five-second threshold on an endpoint, and I'm trying to work out what kinds of endpoints those are.
A: So far, the one criterion I've come up with is simple: there needs to be a reason to do it. The endpoint needs to sometimes be slow, we need to be able to explain why that is, and there needs to be no real thing we can do about it. One example, in my mind, is GraphQL.
A: Yeah, because people fill the request with whatever they want. We will probably define other things differently, like the queries that we built for our own frontend; those need to meet different standards. I saw Paul was working on ways of identifying those with the operation name and that kind of thing, and we'll probably set different targets there.
B: Yeah, I think so, because there are just a lot of questions around that which don't really apply to any other endpoint, so let's ignore that one. Okay: for increasing a duration threshold, there should be solid reasoning as to why a duration should be higher than one second.
A: Yeah. For example, the search controller endpoints that are using Elasticsearch or something: are those okay? We know that those don't spend time on the CPU, so they're not really hindering other threads in the same process from doing work, and we're probably fine with Elasticsearch.
B: On the database side, we have a connection pool with a limited number of connections available, and we don't want to saturate that pool because one process is hogging a connection for a long time, because it's taking a long time on a particular query. I realize that practically the answer is no there; I'm just saying, hypothetically, it sounds like that would apply equally to the database, and to Gitaly and Redis for that matter, whether or not it would be spending over five seconds anyway. Yeah, exactly.
B: This search, then: those counts loaded asynchronously, but I was able to, you know, go and click on this, and then they'll probably load asynchronously again, or maybe they're cached.
B: When I change tabs, you can sort of see them loading there. Obviously in this case they're fast, but if they were over five seconds, I think that's still bad, right? Yeah, over five seconds.
A: That's the first group that asked for it, but one of the questions that came up is: you've set the target for everybody to be one second, but there's no way to change that right now. So we're building a way to change it, but we want to keep that under control. So I don't know which endpoints will want it.
A: There's also the problematic thing of the SLO that doesn't apply to their budgets, where everybody's only looking at their budgets and not at the service availability stuff.
A: It's very easy for us to say we're going to tighten up the threshold on that Git request endpoint because it needs to be fast, say Git authentication, since people don't want to wait for that, and it's already fast, it's already meeting its SLO. So we just set that, and that's it.
B: So yeah, I think I would be tempted to say we don't come up with a set of rules now; we just take these on a case-by-case basis. For instance, for the search controller counts, we just say: okay, we're going to do that, we're going to try it out. We might move it back, we might tweak what the slow threshold is, but we're going to do that for now.
B: Let's see what happens, and then if we get more requests we can just say whether we approve them or not. Once we've got a few, we can ask: is there something in common between these? Do we think all of these are valid? I think it's hard to come up with a rule when we don't have examples.
C: Hello! Good morning, or good evening for you.
B: I mean, yeah, my inclination would be to lazily evaluate it, basically: okay, you stage groups, go fix this, or make a case to us as to why this should have a five-second threshold. For some of these, the top few, a five-second threshold wouldn't actually help. But okay: you make the case for us why they should have a five-second threshold. Once we've got a few of those, then we say: okay.
B: This is the group we're using for deciding; otherwise we just have to decide whether each case makes sense. I think we might be going too early with the idea of coming up with a rule for this, when we don't know what people are going to ask for, and we don't know why, because we can make a rule based on what we know now and someone could make a perfectly good case for something that doesn't match that rule.
B: Yeah, so, you know, why don't we just let them do that? Let them make the case in the first place, until we have a way of making it a rule. Hopefully we don't need a way of making it a rule at all, because we only need to do a few.
B: Yeah, I mean, I think maybe for some of them that might not be the case. I'm just trying to have a look there and see if that's a good example; this query is linked from... Maybe the POST to artifacts might be a good example: that's not an interactive command anyway, it's run by CI, and it will have higher tail latencies because it's pushing a lot of binary data. I guess that would be a potential case for that one. Yeah, that's one of the things I mentioned as well.
C: We're not in control of the completion duration. Generally, that's what we've been treating as exceptional conditions, which seems related to the problem.
A: Yeah, that's excellent stuff, and in the end that's the whole reason why we want to push the threshold into the application: to allow people from product to say, no, nobody's waiting for this, this can be slow. And as you mentioned, Matt, this is a good example, and it's an upload to somewhere else, so it's I/O, which means that sometimes the lock will be released anyway, because we're waiting.
B: Yeah, things like the blob controller are interesting to me, because we also see, you know, that will be Gitaly, but it will also be highlighting, and you can make the case a couple of different ways there. But I think...
B: Sorry, Matt, you're very quiet, I'm not sure why, but we can't hear you.
C: Okay, thanks! So, yes... I forget what I was going to say. I'm sorry, it'll come back to me. Sorry, go ahead.
A: One more thing that I wanted to bring up again, because I think we will need to come up with an answer for it: right now it's very easy to explain error budgets, because it's just success rate versus failure rate, and that's easy to explain. It gets harder once we turn that into the way we do the general SLA dashboards, with minutes where things are good and minutes where things are bad, and then taking averages of that, and so on.
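[Editor's note: a minimal sketch of the two accounting styles being contrasted here, a request-based error budget versus the minutes-based availability used on the general SLA dashboards. The symbols are illustrative, not the team's exact definitions:]

$$\text{budget spent} = \frac{\text{failed requests}}{\text{total requests}}, \qquad \text{availability} = \frac{\text{minutes where the SLO was met}}{\text{total minutes in the window}}$$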
C: Yeah, that could be. So, I remembered what I wanted to say earlier: sometimes there can be interaction effects between requests, and so an individual request isn't necessarily responsible for its own slowness. One of the ways to identify that is to look at the distribution; not exclusively the long-tail distribution like what we have here, but also looking at, like, the...
B: Yeah, I think that's a fair point, and I think that's also a good reason to be lazy about this and let the stage groups make the case for the threshold, because they can say: well, we've looked into this endpoint, and actually it's slow because of other threads in the same process, or whatever.
C: You know, in addition to this (this is a fantastic display), I think maybe, alongside the count column, we could also have a count of requests that exceeded whatever threshold we're talking about, like the five seconds we're given.
C: Oh, and change the label, just for clarity.
B: Yeah, it's something, because it's a very subtle thing to spot here. There you go; that wasn't too bad, actually. I don't think there's an easy way to do slow count divided by count without taking us out to another program, but we can sort of eyeball it. So projects is, what, one in four to five that are over the threshold, whereas events has a very low portion over the threshold.
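[Editor's note: the ratio being eyeballed is just the share of requests over the target duration; with the figure mentioned here, roughly one in four to five for the projects endpoint:]

$$\text{slow ratio} = \frac{\text{slow count}}{\text{total count}}, \qquad \text{projects} \approx \frac{1}{4.5} \approx 22\%$$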
B: Yeah, in fact that seems like, what's that, a twentieth? No, a fiftieth. So yeah, if we picked a slightly lower percentile, we wouldn't even have seen that one here. In terms of... okay, yeah, the slightly lower percentile is here already. Sorry, I mean: because a fiftieth of requests are over one second, if we took the 98th percentile, that would probably be pretty much bang on one second.
A: Yeah, but that's not an SLO, like... No, no.
B: A tail: that's a case where the tail latency is high, but the median latency, and again you can see the median latency, is pretty good. The projects one seems more like it would be, not exactly a normal distribution, but something with a, you know, chunkier tail.
C: That project events one, the top one there, clearly doesn't represent a large... If you were scoring this in terms of how many requests caused a user to be unsatisfied, like the traditional Apdex definition, then this wouldn't be a high-value target, because there are only 19,000 requests out of clearly over a million requests summed across all these endpoints. But that outlier is outrageous: 15 seconds.
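[Editor's note: for reference, the traditional Apdex score mentioned here, with a target duration $T$; requests under $T$ count as satisfied and those under $4T$ as tolerating:]

$$\text{Apdex}_T = \frac{\text{satisfied} + \tfrac{1}{2}\,\text{tolerating}}{\text{total requests}}$$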
B: It times out very often, but yeah, which is why I was wondering about the offset pagination as well. And this also, this also...
B: ...second, it doesn't matter, it's fine. So, okay, what else have we got here? Again, the Pages API is even more extreme than that, because it's got 20,000 out of 4 million, and its 99.5th percentile is in fact below one second anyway.
B: Yeah, anyway, that's why I would like to push this onto the stage teams at first, so they can look at each endpoint they own and ask: okay, is this... And what would trigger you to say no? Well, for something like the events one, I would be tempted to say no, because it is actually pretty close to meeting this SLI; if you look at the distribution, it's pretty close. It's not!
A: But wait, no: this is the SLI for the SLO for API, which this is handling, and the duration there. The 99.5th percentile is within two seconds, so that would still require an increase from the one-second default.
B: To five? No, no, all I'm saying is that I think it's easier for the development team to get a handle on the events API, because, what is that? I can't do math in my head.
B: Two percent of requests are exceeding it, as opposed to the API v4 projects endpoint, the top row, where 25% of requests are exceeding it. I don't think that's a theoretical argument; I think that's just a practical argument. If you have 25 percent of requests... And we know that the projects API supports a bunch of different options, same with the group projects one, where, you know, you can sort it this way, you can include these fields.
B: That's very different from the project events one, where it's already at two percent. Pages is even closer, because that's, what, twenty thousand out of four million? Oh well, actually, its 99.95 is fine; it's only at 99.8 that it gets bad. Those are the ones I would be tempted to say we won't increase the threshold for, because you're already... yeah. So this one, this one is interesting.
B: Yeah, but I think as well there's a case that that's more of an interactive request, even though it's an API request, because this is made by GitLab Pages when you visit a GitLab Pages site, and since they're static pages, they shouldn't be taking five seconds. So there's also, yeah, this...
B
Basis
for
that
one,
but,
like
you
know
this,
is
this
because
of
the
usage
pattern
and
like
maybe
the
dispatch
one
is
kind
of
in
between,
because
I
think
that's
triggered
when
you
scroll
on
the
diffs
page
on
the
merge
requests.
So
you
know
maybe
there's
a
little
bit
more
tolerance
than
for
the
first
page
load,
but
it's
still
interactive
but
yeah.
B: My temptation is still to just start by saying, you know: have the stage group make a case. We can say what exceptions we've added before and why, but also look at... and then also make a call, like...
B: ...a histogram they can plug their caller ID into and show what the distribution is. I mean, the Kibana histograms are very good, though. But if they can show what the distribution of response times is as well; because you can see from the summary statistics we have here, again: projects, 25% are exceeding that threshold.
B: You know, the pages one, it's one percent exceeding, or just over one percent. Sorry, not pages, that's tags; pages is about half a percent exceeding the threshold. Those are going to be very different histograms, and there will be very different paths to getting them to meet that threshold. If the stage group can't do that, okay. The other thing, and this might be going off on a bit of a tangent, from what I remember...
B: Yeah, exactly. If you have an endpoint that has a million requests per day and one that has 10,000 requests per day (I know that's a bad example, those aren't quite the right numbers, but you get the idea), if you have one that completely dominates your request count and is always fast, then it almost doesn't matter what your other endpoints do, because when you aggregate up by stage group, you will probably still meet the threshold.
A: Because most of your things are fast. Likewise, if you have a thing that has a million requests, and just because it has so many requests it has 10,000 slow ones, that's still pretty good. But if you have one below that with 5,000 slow ones out of only 10,000...
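[Editor's note: a worked version of the aggregation effect being described, using the round numbers from the discussion:]

$$\frac{10{,}000}{1{,}000{,}000} = 1\%\ \text{slow}, \qquad \frac{5{,}000}{10{,}000} = 50\%\ \text{slow}, \qquad \text{combined: } \frac{15{,}000}{1{,}010{,}000} \approx 1.5\%\ \text{slow}$$

Rolled up per stage group, the dominant fast endpoint hides the badly behaved one.]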
C: I mean, do you think it's really important to give some guidance about how to prioritize these? Because if we present a case where you need to have a certain percentage of endpoints per stage group that meet the target response time, that gives a huge incentive to prioritize whatever's easiest to improve, rather than improving the end-user experience.
C
Goal
of
the
of
of
scoring
based
on
aptx
is
to
improve
the
end
user
experience
so
so
prioritizing.
For
example,
a
high
high
throughput
api
endpoints,
where
end
users
aren't
actually
waiting
for
the
response
will
improve
the
stage
group's
score
without
actually
improving
user
experience.
So
I
think
we
should
explicitly
state
that
the
goal
here
is
to
improve.
You
know
things
that
end
users
will
perceive
as
being
fast
yeah.
C
Yeah
that
that
example
for
the
gitlab
page's
api
call
is,
I
think,
fantastic
because,
like
you
know,
kind
of
the
the
initial
gut
reaction
is
going
to
be
well.
Api
calls
are
generally,
you
know
not
a
thing
that
humans
are
waiting
on,
but
in
this
case
it
certainly
is
I
mean
it's.
It
makes
a
good
case
study
is
all
I'm
seeing.
B: No, for sure. I think the examples we discussed there are probably reasonable cases to start with. Like, you know, the artifacts POST is probably not that important to end-user experience. I'm going...
A: ...to bring up the three examples. I think I'm going to bring them up in the issue for now, though I don't know how to translate that into documentation. The three examples that I take away are these. There's the internal Pages one, which should probably have a tighter threshold, because people are actually waiting for it, and it should not get a looser threshold.
A: Yeah, yeah. Then there's the projects endpoint, which has a million ways of calling it, and some of them are just more work to do and therefore slower, and there's not a lot we can do about that. And then there's the artifacts upload, which is moving data around: it's going to be slow, and there's also not a lot we can do about it, but it's also not very bad, because it's not occupying the worker the whole time while it's doing that, I think.
A: Well, we have three cases there, if we can correlate requests that we get from stage groups to one of them. There was some stuff that I worked on with the people from the registry as well, where the requests were slow, but they were sending three gigabytes' worth of data across the wire; that's not going to be done in a second.
B: Let me just quickly share this stuff about the rate-limiting Redis instance. We're doing another functional partition of Redis. We currently store rate-limiting data for the two different ways we do rate limiting. One is the third-party Rack::Attack library, which operates at the request level, before the request is even passed to Rails. And then there is what we call application rate limiting, which happens inside a request: we can say, if you're doing something expensive, like an export...
B
Each
user
can
only
do
one
export
per
period,
and
so
we
have
to
two
things:
there
we're
going
to
use
this
new
redis,
both
of
them
like
the
cache
it
can
be
configured
as
an
re
liu,
because,
it's
rate,
let
me
see
information,
it's
generally
very
short-lived.
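[Editor's note: a minimal sketch of the fixed-window style of application rate limit being described, one expensive action per user per period, using the go-redis client. The key names and limits are illustrative, not GitLab's actual implementation; the point is that every key expires with its window, which is why this data is short-lived and safe on an LRU-evicted instance:]

```go
package ratelimit

import (
	"context"
	"time"

	"github.com/redis/go-redis/v9"
)

// Allow increments a per-user counter in Redis and reports whether the
// action is still within the limit for the current window.
func Allow(ctx context.Context, rdb *redis.Client, userID string, limit int64, window time.Duration) (bool, error) {
	key := "ratelimit:export:" + userID

	count, err := rdb.Incr(ctx, key).Result()
	if err != nil {
		return false, err
	}
	if count == 1 {
		// First hit in this window: start the clock. The key disappears
		// when the window ends, so nothing here is long-lived.
		if err := rdb.Expire(ctx, key, window).Err(); err != nil {
			return false, err
		}
	}
	return count <= limit, nil
}
```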
B: Most of what I've been doing on this so far is adding the ability for the application to configure that, which has mostly been about learning how this already works. So we add a class; yeah, we add a class. I'm just going to share this. This is the current state of this file, the Redis configuration doc: it's docs, but it's not actually on a docs site.
B
This
is
mainly
used,
I
think,
for
source
installs,
I'm
not
100
sure
what
the
audience
is,
but
we
have
very
detailed
documentation
on
each
redis
instance
and
there's
some
slightly
confusing
parts
like
redis
cache
default
support,
6380
on
localhost,
redis
queues,
defaults
to
6381
and
credit
shared
state
defaults
to
6382.
So
this
is
only
if
you
have
no
configuration
files
at
all,
so
this
is
apparently
supporting
some
very
legacy
use
case.
What
I've
tried
to
do
here
is
simplify
this.
B
So
when
we
add
a
new
one,
we
don't
have
to
copy
and
paste
all
of
that
stuff
and
we
can
hopefully
avoid
making
the
many
many
copy
and
paste
errors
that
I
made
when
I
was
working
on
this,
so
we
just
say,
like
you
know,
you
have
an
instance.
It's
called
this.
It's
going
to
look
in
this
environment
variable
based
on
this
name
or
it's
going
to
look
in
this
file
or
if
it's
available,
it's
going
to
look
in
this
file.
B: If there's no URL, it's going to use localhost and this port; that's the basic idea. These are hard-coded in the application, and previously the trace chunks instance didn't have a fallback, sorry, a default, at all. So if there was no Redis configuration, the trace chunks Redis instance wouldn't work, which I think is fine, because that's optional. I have no idea if that is a case we need to support or not.
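[Editor's note: a sketch of the resolution order being described: a named environment variable, then an instance-specific config file, then a shared config file, then a hard-coded localhost default. The naming scheme, file paths, and helper here are hypothetical, for illustration only:]

```go
package redisconfig

import (
	"fmt"
	"os"
	"strings"
)

// Instance describes one functional Redis partition, e.g. "cache",
// "queues", "shared_state", "trace_chunks", or "rate_limiting".
type Instance struct {
	Name        string
	DefaultPort int // used only when nothing at all is configured
}

// URL resolves the connection URL in the order described above.
func (i Instance) URL() string {
	// 1. Environment variable derived from the instance name
	//    (hypothetical naming scheme).
	if url := os.Getenv("GITLAB_REDIS_" + strings.ToUpper(i.Name) + "_URL"); url != "" {
		return url
	}
	// 2. Instance-specific config file, e.g. config/redis.cache.yml.
	if url := readURLFromFile(fmt.Sprintf("config/redis.%s.yml", i.Name)); url != "" {
		return url
	}
	// 3. Shared config file covering all instances.
	if url := readURLFromFile("config/resque.yml"); url != "" {
		return url
	}
	// 4. Legacy fallback: localhost with a hard-coded per-instance port.
	return fmt.Sprintf("redis://localhost:%d", i.DefaultPort)
}

// readURLFromFile is a stand-in: a real implementation would parse the
// URL out of the YAML file at path, returning "" when it is absent.
func readURLFromFile(path string) string {
	if _, err := os.Stat(path); err != nil {
		return "" // file missing: fall through to the next source
	}
	// YAML parsing omitted in this sketch.
	return ""
}
```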
B: But if it is a case we need to support for self-managed instances, then we do need that default for the rate-limiting one, because it's much more likely that rate limiting is turned on than trace chunks: trace chunks needs a feature flag and rate limiting doesn't. And when we move sessions, that'll be even more important, because literally every GitLab instance uses sessions, so there's no way we can get away without it. So that's the basic idea.
B: With the port, I mean, it's hard to even notice that it's a thing: you need to literally have no configuration files whatsoever, which is unusual. I don't even know how you would have Redis running on those ports in that case, but apparently it's a thing. The other thing I noticed, and this is just a weird thing, but I figured it was worth cleaning up...
B: ...in the GDK, the GitLab Development Kit, we set different Redis databases for development and test for each instance, so they act as isolated instances. That's helpful because, (a), you could potentially have a key clash, although you shouldn't; but also, (b), before, when we just used the same database on the same instance, you'd be logged out every...
B: ...test run, yeah, development and test. If you run the tests, they clear Redis: that logs you out, that deletes all your Sidekiq queues, that does everything. And the database numbers in these configuration files, which again I'm not sure how useful they are, were different from the ones we use for the GDK. But I figured, as they're example files, I can just make those match up. So I'm basically just trying to make that a little bit easier in the future.
B: Adding a new instance isn't hard, it's just tedious, because there are a bunch of little paper cuts like this that will trip you up, and it's mostly pretty rote. But as we're going to be asking the memory team to add the one for sessions, it seemed like a good time to take a bit of time to...
A: ...do that. Jacob already did a lot of work when he added the trace chunks one, to make it a little bit more manageable code-wise to do that. But apparently he missed the port bit.
B
Yeah
well
exactly
so.
This
is
the
thing
like
yeah.
Basically,
I
submitted
this
at
marcia,
but
I
was
like
this
looks
pretty
good
to
me
and
yeah.
I
was
like
well
you've
copy
and
pasted
the
wrong
thing
here.
You
copy
and
paste
the
wrong
thing
here.
This
port,
you
know,
comes
from
here
and
I
was
like
okay
right.
Let
me
go
fix
that
and
then
he
spotted
a
bunch
more
things
and
then
I
was
like
okay.
Let
me
go
fix
that
and
he
was
off
today.
So
I
was
like
okay.
B
Let
me
let
me
take
a
step
back
here.
Instead
of
playing
with
them
all
and
see
what
see
what
I
could
actually
learn,
because
this
is
clearly
not
as
straightforward
as
I
thought,
and
it's
not
going
to
be
straightforward
for
the
next
person
to
do
it.
So
that's
where
I
am
with
that
and
then,
while
I
also
have
that
thing
that
the
future
has
been
reasons,
so
people
started
to
figure
out
where.
B: Just for context: Sidekiq has a client middleware and a server middleware. Client middleware runs when you schedule a job, and server middleware runs when the Sidekiq worker process picks up the job. Previously I had added something to the client middleware for feature categories, so that we would get the correct feature category in the logs, but then that didn't work for metrics.
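[Editor's note: Sidekiq's middleware is Ruby, but the mechanism being described, stamping metadata on the job at enqueue time and reading it back when the worker picks the job up, looks roughly like this language-neutral sketch in Go. All names here are illustrative, not Sidekiq's API:]

```go
package jobqueue

// Job is the payload that travels through the queue.
type Job struct {
	Class    string
	Args     []any
	Metadata map[string]string
}

// ClientMiddleware runs in the web process when a job is scheduled.
// It stamps the feature category so logs and metrics downstream can
// attribute the work to the right stage group.
func ClientMiddleware(job *Job, callerFeatureCategory string, next func(*Job)) {
	if job.Metadata == nil {
		job.Metadata = map[string]string{}
	}
	job.Metadata["feature_category"] = callerFeatureCategory
	next(job)
}

// ServerMiddleware runs in the worker process when the job is picked
// up. It reads the stamped category and would expose it to logging and
// metrics for the duration of the job.
func ServerMiddleware(job *Job, next func(*Job)) {
	category := job.Metadata["feature_category"]
	_ = category // e.g. attach to a scoped logger and a metrics label
	next(job)
}
```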
B: So now I'm going back to: do I need to have this in two places? But I deliberately deleted the client middleware part because it was causing problems with the metrics, so I'm having difficulty getting the logs and the metrics to line up, and Bob's going to have to review that when I get to it. At the moment I would say the metrics are right and the logs are not what I want, which is the better way around, because the logs are more transient. And also the met...
B: The logs are wrong in a way that makes them higher cardinality, which is okay for logs but wouldn't be okay for metrics. When I say wrong, I mean there are two reasonable positions you could choose; it just happens that they don't line up. So yeah, I need to fix that, but I don't really have anything to show over there, because I haven't figured out how to do it.
C
For
the
for
the
the
transitional
state,
when
we
begin
to
roll
out
the
oh,
my
gosh,
I'm
blanking
out
the
topic
we
were
just
talking
about.
C: Right now, what's its name... yes, the rate-limiting things, I'm assuming. So I wanted to check my assumptions here. I'm assuming that we're not going to make any attempt to transfer state information from the current rate-limiting state when we roll this out; we'll just start with a clean slate, so everyone's, you know, consumption rate goes back to zero. I think.
B: Yeah, so really, if you're using a feature flag for that, you'd really be struggling to notice the difference either way. I think the longest interval we have for an application rate limit is on the order of single-digit minutes.
B: With those, I think the current plan, which I need to discuss with Craig because I haven't gotten around to writing it up, is that we won't attempt to transfer data or do multi-writes or multi-reads, but we will feature-flag them, so that we don't switch them both over at the same time. So we can switch just Rack::Attack, or just the application rate limiting.
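[Editor's note: a sketch of the cutover shape being described: no data migration, just an independent flag per subsystem choosing which Redis pool to use. The flag wiring and names are hypothetical:]

```go
package cutover

import "github.com/redis/go-redis/v9"

// Pools holds the old shared instance and the new dedicated
// rate-limiting instance.
type Pools struct {
	SharedState *redis.Client // current home of rate-limiting data
	RateLimit   *redis.Client // new functional partition
}

// The two selectors are flipped independently, so Rack::Attack and
// application rate limiting can be cut over one at a time. Counters
// simply restart from zero on the new instance, which is acceptable
// because the longest window is single-digit minutes.
func (p Pools) ForRackAttack(useNewInstance bool) *redis.Client {
	if useNewInstance {
		return p.RateLimit
	}
	return p.SharedState
}

func (p Pools) ForApplicationLimits(useNewInstance bool) *redis.Client {
	if useNewInstance {
		return p.RateLimit
	}
	return p.SharedState
}
```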
B: Okay, but, like you said, it's very simple to reason about. Yeah, yeah, as opposed...
B: ...to trying to cut that data over, which we will need to do for sessions. Yes, and that will probably need to be a multi-read, multi-write; yes, and also potentially creating the sessions instance based on a secondary, the backup, from... well, I don't even know, I haven't thought about it that far.
C: Yes, exactly, because we're not under attack most of the time. We're not under an abusive workload most of the time, and we're not going to roll this out while we're under a massive attack. So, you know, we're giving someone a free pass for a few minutes to, you know, double the number of requests they're making to some endpoint that would otherwise be limited, and that seems okay.
C: Yeah, that's a great question, though. I guess, because it seems like a complex scenario that would be easy to get wrong, I would probably still lean towards not bothering with the migration this time, because we can get away with it. I mean, that's if we were talking about a rate per hour; if it was a rate per day, then I'd be a little bit more hesitant about skipping that step.
B
I
would
get
back
to
matt's
point
from
earlier
about
like
we're
not
under
attack
most
of
the
time.
We
would
do
it
when
we're
under
attack.
So
we'd,
probably
just
do
that
at
quiet
periods
like
not
even
a
quiet
compared
to
under
attack,
just
a
quite
a
low
traffic
period
in
general-
and
you
know-
hopefully
we
do
that
in
a
quiet
hour,
so
that
by
the
time
the
hour's
up
we're
good
yeah.
B: I think, for this one, I mean, it's just so much simpler to implement, to roll out, everything; there are far fewer failure modes this way, and it fails open, which is also useful for users. Yes.
B: Absolutely, yeah. Thanks for bringing that up, Matt; I do need to actually write that up for Craig, because that's what I've been thinking about and haven't gotten around to writing down.
C: ...more, I'm not awake enough to talk about the things that are kind of top of mind right now, so maybe next time. There's been some very interesting discussion around Gitaly's fork lock, or rather the Go standard library's fork lock, and how that interacts with Gitaly's child process caches, like the catfile cache. But again, I'm not...
C: I haven't had my tea sink in yet, so I'm not quite alert enough to have a proper discussion about it, but maybe just kind of a toe dip, if that's okay.
C: So, as I'm sure you both know, we've got this kind of fairly complex system of maintaining a pool of child processes, these git cat-file processes, that Gitaly will preserve. Rather than allowing them to be purely ephemeral processes that satisfy one gRPC call, it will keep them around in anticipation of potentially having another request for the same repository...
C: ...that's compatible with that process. This saves us roughly 50%: about half the time, we don't have to fork a new process. And there's been some very, I think, healthy discussion about whether or not this historical decision from a few years ago is still the right choice in terms of the complexity-versus-benefit trade-offs. One of the things...
C: So that's one discussion. And there's a separate discussion that I feel is about to converge with it, where, as part of incident response follow-up, Igor and I and a few others have been, kind of intermittently over the last week, digging into some concrete examples of long-tail response times; the FindCommit endpoint was one example.
C: There were a few other endpoints that initially we were investigating separately, and one of the common attributes that we've uncovered is a burst of CPU contention. It's a little hard to talk about this without a visual aid, but let me just describe it in broad strokes. There are two layers of scheduling that matter in this context; what I'm working up to is the Go fork lock.
C: In addition to that, Gitaly has an additional layer, what's it called, the spawn token, which imposes an additional rate limit, and I'm not entirely sure how beneficial that is, because there is already this underlying fork lock. But what I want to talk about is this global mutex. It's a reader-writer lock, so it's effectively a semaphore that guards whether or not the given Go process, Gitaly in this case, is allowed to create a child process.
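[Editor's note: the lock being described matches syscall.ForkLock in Go's standard library, a sync.RWMutex where syscalls that create file descriptors take the read side (so close-on-exec can be set atomically) and fork/exec takes the write side, serializing concurrent spawns. A minimal sketch of that mechanism plus the kind of wait-time measurement discussed next; the wrapper is illustrative, not Gitaly's actual code:]

```go
package forklock

import (
	"os/exec"
	"sync"
	"time"
)

// spawnLock models syscall.ForkLock: readers are fd-creating syscalls,
// and every fork takes the write lock, so bursts of spawns queue here.
var spawnLock sync.RWMutex

// SpawnWithWaitTime runs a command and reports how long we waited just
// to acquire the spawn lock, the quantity the traces showed reaching
// many tens of milliseconds under contention.
func SpawnWithWaitTime(name string, args ...string) (waited time.Duration, err error) {
	start := time.Now()
	spawnLock.Lock() // queue behind in-flight forks and fd creation
	waited = time.Since(start)
	defer spawnLock.Unlock()

	// The real fork+exec happens inside the standard library, which
	// takes syscall.ForkLock itself; this outer lock is only a model.
	return waited, exec.Command(name, args...).Run()
}
```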
C: It turns out that in at least some of the cases (we're not sure what the percentage is, but in at least some traced examples) we find many tens of milliseconds of waiting for that lock to be acquired. That implies either that there's a burst of demand for that lock, where there are many... sorry: each Gitaly process has 32 OS threads that many goroutines can be scheduled onto. I started to talk about this earlier, but there are two schedulers that matter. One is the kernel scheduler, which decides which tasks get scheduled on CPU, and the other is the Go runtime scheduler, which schedules goroutines to run on those long-lived OS threads, the Ps in Go's scheduler model.
C: So effectively Go can say: okay, goroutine A, which perhaps wants to acquire this fork lock, you're allowed to start running on this designated OS thread. The kernel will then schedule that OS thread, which has entered the runnable state from the kernel's perspective, to run on a CPU, maybe, sometime. Here's the catch: by default, the Go runtime is going to allocate 32 OS threads on a machine that's got 32 cores. But we know that on our Gitaly boxes the Git child processes consume a significant percentage of CPU time, sometimes more than Gitaly itself. So when we have CPU contention at the host level, at the kernel level, those OS threads that the Go runtime is managing for Gitaly don't actually get access to the CPU as often as they think they do; they're contending for CPU, at the kernel scheduler level, with all of the child processes that they're trying to manage. We're not sure how big of a deal this is, but it has to have some influence; whether it's a major influence or a minor influence is an open question. And by influence I mean this: the question we're trying to answer right now is what is influencing the duration of that lock contention for this fork lock, and, again in broad strokes, there are kind of two angles that can influence it.
C: One of them is: was there a burst of requests for fork locks, where many goroutines want to fork a child process at roughly the same time? Or do we have a stable rate of incoming requests for that lock, but the lock duration has increased? There are a few ways we can differentiate between those two cases. In the latter case, where the lock duration is longer than usual...
C: ...this is one of the ways that the CPU contention we were talking about a couple of minutes ago may be coming into play. So this is very much an open research topic, and we're kind of putting time into it as, you know, time allows; it's not a primary focus for anyone.
C: I have a couple of ideas for next steps, and I know Igor has a couple of ideas for next steps as well; I think we haven't talked about it in a couple of days, but I feel like this is a worthwhile thing to look into. I'm not leading up to anything, by the way; I should have said that to start with. This is just kind of rambling about an interesting problem.
C: I think this has relevance in terms of long-term planning, for two reasons. One, the outcome of this analysis of, essentially, lock contention over the ability to create child processes has, I think, a meaningful impact on the discussion of whether or not having a cached pool of child processes, like the git cat-file cache, is or is not still a good design choice.
C
Cache
is
or
is
not
a
good
design
choice
still,
because
we
we
know
that
we're
we're
cutting
in
half
the
the
number
of
times
we
want
to
do
those
forks,
and
I
guess
the
there
are
a
few
ways
to
kind
of
address
that
if
we
don't
if,
for
any
reason,
we
we
significantly
increase
the
demand
for
forking
child
process
like
like,
for
example,
removing
the
git
file
cache.
C
There
are
a
couple
ways
we
can
address
that
the
the
kind
of
conceptually
simplest
one
being
spread
the
work
across
more
piddly
processes,
which,
in
my
view,
means
spread
the
repos
across
more
deadly
nodes.
There
are
a
few
trade-offs
in
in.
Let
me
leave
that
aside.
I
guess.
C
The
other
thing
I
wanted
to
mention
is
so
the
the
question
about
the
the
benefit
of
having
of
having
a
pool
of
cached
child
processes
that
are
potentially
usable
is,
is,
I
think,
influenced
by
this
by
this,
by
what
we
uncover
here
with
the
four
block
contention
and
the
other
piece
is,
is.
C
Being
aware
of
the
cause
of
the
contention,
if
it
turns
out,
for
example,
to
be
contention
at
the
cpu
scheduling
level,
then
we
can
do
something
about
that,
with
no
code
change
and
and
and
without
having
to
move
reapers
around
by,
for
example,
explicitly
telling
the
go
runtime
to
continue
to
to
provision
32
os
threads,
but
but
resizing
the
giddily
hosts
to
have
say
50
or
100
more
cpus
available,
so
that
there's
less
contention
at
the
at
the
colonel's
scheduler
level.
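[Editor's note: the two knobs in that idea are independent. GOMAXPROCS fixes how many OS threads may execute Go code at once, while the host's core count is what the kernel scheduler divides between Gitaly and its Git child processes. A small sketch using the real Go runtime APIs, with illustrative values:]

```go
package main

import (
	"fmt"
	"runtime"
)

func main() {
	// By default GOMAXPROCS equals the number of visible cores, so a
	// 32-core Gitaly host runs up to 32 threads of Go code while also
	// competing with Git child processes for those same 32 cores.
	fmt.Println("cores visible:", runtime.NumCPU())
	fmt.Println("GOMAXPROCS:", runtime.GOMAXPROCS(0)) // an argument of 0 just queries

	// The proposal sketched in the discussion: pin Gitaly to 32 even on
	// a bigger box, leaving the extra cores as headroom for the child
	// processes at the kernel scheduler level.
	runtime.GOMAXPROCS(32)
}
```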
C: Does that make sense? Without reducing capacity in terms of scheduling goroutines, because that also could just change the bottleneck. Is the number of child processes limited somehow?
C: Oh, that's a great question. We're not running in... yes, there is a hard limit on the number of child processes, but we're not anywhere close to it.
B: It's more the creation rate. That's okay, yeah. I've seen a couple of issues around this, so forgive me if I'm covering some things, but I did see one where Patrick was mentioning that we always spawn two processes as part of the cache, even in the cases where we only use one, and then a refactor he did, to try to lay the groundwork to remove that, ended up with a goroutine leak. So clearly the code itself is quite hard to reason about and modify safely. Yes; I guess that's the challenging aspect of it.
C: Yes, I think I also read about that, although probably not in great detail. Patrick, I think, was one of the people saying that this catfile cache is quite complex, and that it adds a significant maintenance burden to the code. And I love that we're talking about, you know, considering the trade-offs of jettisoning a complex piece of code if it's not providing enough value to justify it, and I kind of suspect that we may know more now than when we added that code.
C: In terms of the contention: personally, I still want to understand the interactions between our Gitaly-specific spawn token mechanism and the fork lock, which I suspect is kind of an implicit additional concurrency-limiting factor.
C: At the time we introduced this... yeah, it's, you know... and the Go runtime changes over time as well, I mean.
C: So it may be obsolete at this point, and we can't...
C: So, in terms of next steps, I feel like a discrete, helpful question to answer is differentiating between those two cases: do we have a spike in demand for this lock, or does the lock duration increase, when we have one of these events where there's significant contention and it affects the gRPC response time?
C: I kind of had that in the back of my mind earlier in this talk, when we were looking at the response time distribution on a per-endpoint basis: we can have these kinds of subtle interaction effects. That's obviously not the only way it can happen, but because it was kind of top of mind after looking at it in the last couple of days...
C
It
was
hard
not
to
think
about
that
in
in,
in
the
context
of
trying
to
evaluate
long
tail
response
times,.
C
That's
that's
again
not
leading
up
to
anything
I'm
just
kind
of
using
about
the
topics
we've
talked
about.