From YouTube: GitLab Production Incident 5158 Corrective Actions
Description
Stan, Matt, Andrew, Jason, Marin and others discuss some corrective actions following on from a production incident: https://gitlab.com/gitlab-com/gl-infra/production/-/issues/5158
A: There are a lot of fixes we need to make, and some are more urgent than others. From my point of view, the first thing we should do is act on the suggestion of moving the authorized projects workers to their own queue, or rather their own Sidekiq shard. That will give us immediate, temporary relief while we talk about the other options. The long-term, really big project is getting rid of CarrierWave; Lester has been saying this for two years, and we have to do it, because the current situation is bad. In between those, I would say we should also fix the reactive caching issue, because it has basically masked all of the alerts: we've been ignoring the Sidekiq thread-contention alerts for a month because of this known reactive caching issue. I've seen people comment on the issue for that saying it's going to be really hard, but Stan suggested quite a simple solution.
A: Another thing I would do, and this wasn't actually a GCS issue, but it might have helped us if it had been, and it would have pointed us in the right general direction, is something we talk about every time we have a GCS issue: creating a service for GCS in our service catalog and metrics catalog. It would basically have two SLIs, one for Sidekiq object storage and the other for registry object storage, and perhaps Pages as well, though I don't know whether we have metrics from Pages for its GCS operations. Then, if any of those things go south, we can go and look there. If it really were GCS, we'd see all three of them dropping; in this case it would only have been the Sidekiq one, so we'd know something else was the problem. That should be quite easy to do.
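As a rough illustration of that idea, here is a minimal sketch, expressed as a plain Ruby hash purely for readability (the real metrics catalog in the runbooks repository is defined in Jsonnet), of a GCS service definition with the two SLIs described above; the field and SLI names are assumptions for the example.

```ruby
# Illustrative pseudo-structure only; names of SLIs and labels are assumed.
GCS_SERVICE = {
  type: "gcs",
  tier: "stor",
  service_level_indicators: {
    sidekiq_object_storage: {
      description: "GCS operations performed by Sidekiq jobs",
      significant_labels: %w[bucket operation],
    },
    registry_object_storage: {
      description: "GCS operations performed by the container registry",
      significant_labels: %w[bucket operation],
    },
    # pages_object_storage could be added if Pages exports GCS operation metrics.
  },
}
```

With something like this in place, a drop confined to the Sidekiq SLI while the registry SLI stays healthy points away from GCS itself, which is exactly the triage signal described above.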
B: Right, and on that front: I know our GCS support person noticed the transfer times taking longer, and I didn't know how to get at that data; I was looking at the console. Is that metric available somewhere? I'd love to at least have it reported, because it would help us distinguish when it's GCS.
B: Well, the other issue, I think, is to investigate authorized projects and figure out whether we can make it much more efficient. I think you talked about the database bloat issue, but I feel like there may be some quick wins in either not updating everything or, at least, not redoing work when you've already updated a user's authorizations.
B: You don't need to schedule another job for that user, because a lot of these jobs are redundant. If we have 7,000 members in a project, you don't need to schedule the refresh 100 times for each of them; you can just do it once. It's bad enough doing it once, but 7,000 jobs is much better than 700,000 jobs.
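As a sketch of the kind of deduplication being suggested, here is roughly what it could look like in the worker itself, assuming GitLab's ApplicationWorker deduplication attributes are applicable to this job; the strategy and options shown are illustrative, not a record of the actual change.

```ruby
# Hypothetical sketch: skip re-enqueueing an authorized-projects refresh for a
# user while an identical job is already pending. Assumes GitLab's
# ApplicationWorker deduplication support; options shown are illustrative.
class AuthorizedProjectsWorker
  include ApplicationWorker

  feature_category :authentication_and_authorization
  urgency :high

  # Jobs with identical arguments scheduled while one is still waiting to run
  # are dropped instead of being enqueued again.
  idempotent!
  deduplicate :until_executing, including_scheduled: true

  def perform(user_id)
    user = User.find_by_id(user_id)
    user&.refresh_authorized_projects
  end
end
```

With something like this, a membership change fanned out across 7,000 users still enqueues at most one pending refresh per user, which is the reduction from 700,000 jobs to 7,000 jobs described above.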
C: No, they got pulled into that project, right? Yeah.
B: Are we capturing that list of to-dos or issues? We could attach them to the rapid action epic. I'm just writing notes, and then we'll try to find all the issues and link them to whatever epic we create. Okay, cool, where are the notes? Sorry, oh, I'm just taking the minutes; I don't know if anybody else is taking those. I just wanted to get this out there right now, and then we can put it in the chat.
B: So, I guess what I heard Andrew say was that, number one, we want to break up the Sidekiq pools, that is, move some workers to a different pool. That sounds like it's pretty much an infrastructure task. Are there folks in development we should be involving in that piece? No? Okay. And then there's removing CarrierWave, which I just looked up; that's the file upload library, yeah.
B: CarrierWave is a much longer-term project, yes, and we already...
B: You can change that behavior. The trade-off is that you might save the entry in the database before the file persists in object storage, so there's some danger of something appearing to be there when it isn't really there yet. But I think the infrastructure risk is the bigger problem, so we could change that mode; for example, we could just do it, live with the consequences, and see what happens.
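Purely as a hypothetical illustration of the mode change being discussed, assuming a CarrierWave-mounted uploader whose default store callback runs inside the saving transaction; the model, uploader, and callback names are made up for the example and are not GitLab's actual implementation.

```ruby
# Hypothetical: persist the database row first, then store the file in object
# storage after commit, so the upload no longer holds the DB transaction open.
class Attachment < ApplicationRecord
  mount_uploader :file, AttachmentUploader

  # Assumption: CarrierWave registers an after_save "store_file!" callback that
  # would otherwise run inside the transaction. Skip it and store after commit.
  # The trade-off discussed above: the row can briefly reference a file that
  # has not finished uploading yet.
  skip_callback :save, :after, :store_file!
  after_commit :store_file!, on: [:create, :update]
end
```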
D: And that protects us against any cause of the delay, whether it's GCS or thread contention. It's a more general mitigation: effectively, it tries to minimize the duration of the leased DB connection from PgBouncer. That's great, that makes sense. Yeah.
A: I think another thing this illustrates is how problematic that Ruby thread-contention saturation metric can be, so we should maybe make it a higher priority. At the moment I think it only goes to Slack; it doesn't page. The problem is that it will alert a lot if we turn it on now, so we should fix that first.
D: Think about it: the thread contention is a major contributing factor to this incident, but the resource that we saturated, in terms of the side effects spreading to all Sidekiq jobs rather than just the ones with the...
D: Yeah, I'm kind of wondering: what do you think about making saturation of the PgBouncer connection pools a higher-level alert? Because I think that's...
A: It would get lost in the flood of alerts, then. We've got four different alerts that are all very similar: we've got the replicas and the primaries, and then Sidekiq and the synchronous web path. I think the problem is that the asynchronous one, the Sidekiq one, is saturated so much of the time, mostly because of ArchiveTraceWorker, that we might have to keep it as a lower-level alert because...
A: ...it fires a lot because of another known issue that we want to fix. Okay, so yeah; but if it's not that again, then it is a critical situation.
A: The Sidekiq shard, you mean? Yes. Basically, the way it works at the moment is that we have different shards...
A: But in terms of the Ruby thread saturation, and the other...
A: ...we basically exclude that one from the alerts, and then, while that's getting fixed, we act if we see saturation alerts for anything else. One of the problems is that we've just been ignoring those alerts because of one known bad queue, and if we put that queue in the sin bin, then we'll actually get the alerts saying, hey, the urgent-other queue is totally saturated.
C: Okay, so that sounds great. Sorry...
C: I really hate to interrupt this discussion, but it would be good to know whether we can take action right now on separating the shard out, isolating it. We have a call in half an hour to monitor the situation, and it would be best if we could split off and start doing that work now, so that we can stand down that call if possible.
C: To be clear...
C: Correct. Skarbek, how about you and I split off to a separate location? We can do a call, an issue, or whatever, and start working on that. Matt, would you be willing to do a sanity check? I think Skarbek will need some help, because the selectors are hard to get right.
E: So yesterday we saw this problem where one of the shards went down to its minimum pod scale because it was struggling to do this work: since we scale on CPU, and the CPU usage of that shard was so low, we just scaled down to the minimum. If the same holds for this new shard, we'll kind of magically avoid creating havoc across all of Sidekiq, and instead it'll be limited to just this one shard, the sin-bin shard, so to speak.
F: Okay, we can address that. As John has pointed out, though, when it comes to the CPU and giving that shard its own pods, that doesn't directly solve the problem of PgBouncer being overloaded by a given shard, or shards plural. So we may need a small, dedicated PgBouncer for the problematic ones. The good news is that while we would need to put that into place, configuring it for the application consumers is actually very easy.
A: Yeah, so my take on that is that we don't have a good grasp at the moment of our PgBouncer pool sizes. Effectively, we're taking 500 connections into our Postgres primary and dividing that up between the synchronous and the asynchronous pools, and the maths doesn't add up at the moment.
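As a rough illustration of the kind of audit being described, here is a minimal sketch, with made-up numbers, of checking whether the PgBouncer pool budgets actually fit within the primary's connection limit; the pool names, sizes, and process counts are assumptions, not production values.

```ruby
# Illustrative only: do the per-pool budgets, multiplied by the number of
# PgBouncer processes serving each pool, fit under the primary's limit?
MAX_PRIMARY_CONNECTIONS = 500 # assumed budget on the Postgres primary

pools = {
  web_api: { pool_size: 60, pgbouncer_processes: 3 },
  sidekiq: { pool_size: 90, pgbouncer_processes: 3 },
}

total = pools.sum { |_, p| p[:pool_size] * p[:pgbouncer_processes] }

puts "Possible server connections: #{total} / #{MAX_PRIMARY_CONNECTIONS}"
puts(total > MAX_PRIMARY_CONNECTIONS ? "over-committed" : "headroom: #{MAX_PRIMARY_CONNECTIONS - total}")
```

With these particular made-up numbers the total comes to 450 and there is headroom; the point of the audit is that nobody currently knows whether the real numbers land over or under.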
A: Frankly, I don't know if you've ever looked at this, but it's pretty hairy, and that's with two pools; if you add a third, that's a good project, but I don't think it's a project we do right now. It needs proper consideration and better monitoring and planning around it, because I suspect that if you actually laid it all out, we're either quite far over-committed or, worse, probably quite far under-committed and actually throttling ourselves.
A: If you look at the number of PgBouncers and whether we're over-committing or under-committing PgBouncer connections to Postgres, I think it's just been done on an ad-hoc basis: we've made a change to PgBouncer here, changed the pool sizes, then made a change to Postgres over there, and I don't think any of those numbers add up at the moment. It's definitely something we need to fix, but I think it's a case of "here be dragons".
D: I agree with you, but I guess my point is that the fact that web and API were not affected by this regression is entirely because we have separate connection pools for them. If that had not been the case, this Sidekiq regression would have taken the whole website offline for hours.
D: Exactly, exactly; that's why we introduced it. So I think, in the same way, even if we don't get the numbers exactly right, there's a large margin for error if we steal some of the connections that are currently budgeted to the Sidekiq pool and give them to a separate one.
A: CarrierWave was the thing that saturated PgBouncer, but the thing that caused the problem was the authorized projects worker, which wasn't actually consuming that many connections. So that's also something you'd have to take into consideration if you did that; you'd have to do some sort of audit. Or, you know what you could do: we have the Sidekiq attribute tags, so you could tag all of the CarrierWave Sidekiq jobs with a carrierwave tag, and then we could use a selector for that.
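A minimal sketch of that tagging idea, assuming the `tags` worker attribute is available and that the fleet's queue selector can match on it; the tag name, worker, and selector expression are illustrative assumptions.

```ruby
# Hypothetical: mark workers that store or fetch files through CarrierWave so
# the Sidekiq fleet can route them with a selector such as "tags=carrierwave"
# (the selector string is shown only as an illustrative comment).
class ProjectExportWorker
  include ApplicationWorker

  feature_category :importers
  tags :carrierwave # assumed tag name consumed by the routing selector

  def perform(project_id)
    project = Project.find(project_id)
    # ... builds the export archive and uploads it via a CarrierWave uploader.
  end
end
```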
D: Right, and I think you, or Stan, said earlier that it is possible to configure CarrierWave not to hold the DB transaction open while it does that fetch, so some of those job classes may already be doing that. Am I being too...
B
Hopeful
here,
yes,
the
only
one
that
said
export
only
the
project
export,
because
people
were
saving
like
gigabytes
of
files
and
we
were
timing
out
because
there's
a
there's,
a
trade
out
there
right,
because
if
you
commit
the
issue.
D: ...there's a race condition between the two. If we commit early, could we... I mean, there are a few ways to address that, but none of them are trivial.
A: But if we can get the things that cause the front-end contention on the threads into their own sort of shard, where they won't do that to the CarrierWave workers, you know, if we keep those away from the CarrierWave...
D: Okay, okay, all right. I'm sold on this as mitigation for the trigger condition of the last two days of incidents. I do still think it's important to address the connection pooling, because that's a more general class of problem, and it is the resource that caused this to spread across a broader scope.
D: Yes, I think so: splitting what is currently a shared pool used by all Sidekiq job classes into separate pools. I like your idea of having a separate one for CarrierWave, but if we fix CarrierWave, Matt...
A: Yeah, yeah, but I do think, and this is something that's always a low-grade worry for me, that doing a proper audit of all of those, the sum total of connections that we hold in PgBouncer, sounds like the kind of project we need to do as well. And then I...
D: I agree. I think that's a very important question to ask, and I'd be happy to take that on today, if that's useful. I think, which...
D: Sure, yeah. I'm assuming that's more work than splitting the pool, and if that's a bad assumption, then we shouldn't bother splitting the pool; we should just focus on fixing CarrierWave. I was going to add, though, that we don't necessarily have to quarantine all job classes that use CarrierWave. We could potentially just focus on the ones that tend to be very high volume, because the...
B
Ci
phase
has
to
happen,
so
you
could
argue
that
let's
focus
on
the
ci
ones
for
now
as
kind
of
a
mitigation,
but
you
know
obviously
most
of
our
customers
are
not
going
to
be
thinking
about
this
complexity
of
you
know
another
sidekick
pool,
so
we
probably
want
to
try
to
avoid
like
painting
solutions
for
ourselves
that
other
people
can't
benefit
from
one.
A: One of the things that might actually help there, and I'm just going to use hypothetical round numbers now, is this: say we had 10 connections for the scary CarrierWave workers. Then we could say the HPA only scales up to something like 15 concurrent Sidekiq jobs for those workers, and everything else will start queuing rather than timing out waiting for a database connection.
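A back-of-the-envelope version of that headroom check, using the hypothetical numbers from above; the per-pod concurrency and replica ceiling are extra assumptions added purely for illustration.

```ruby
# Illustrative arithmetic for the quarantined shard, not real settings.
dedicated_pool_size = 10 # assumed PgBouncer connections reserved for these workers
max_replicas        = 3  # assumed HPA ceiling for the shard's pods
threads_per_pod     = 5  # assumed Sidekiq concurrency per pod

max_concurrent_jobs = max_replicas * threads_per_pod # => 15

if max_concurrent_jobs > dedicated_pool_size
  waiting = max_concurrent_jobs - dedicated_pool_size
  # Up to `waiting` jobs queue inside this shard for a connection instead of
  # saturating the shared PgBouncer pool and timing out elsewhere.
  puts "Jobs that queue for a connection at peak: #{waiting}"
end
```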
F: Two things. First, I think we should be tagging the jobs that use the problematic code with the attribute, as Andrew suggested. I do agree with Matt, though, that taking all of CarrierWave and moving it over is problematic. If we have a couple of specific problematic background workers, the ones using CarrierWave, we can split those off specifically and limit them in the most generic and straightforward way, as Andrew has pointed out: just give them a dedicated set and keep their HPA low to make sure they can't max things out on their own.
F: I have no idea what any of that is anyway, so I'm not sure whether we actually need to do the PgBouncer work. Now that I've heard further discussion and caught up on some more details, Andrew rightly points out that if we just limit the problematic set of workers to a maximum number, they can't saturate PgBouncer, because they won't be able to open enough connections to saturate it. That actually somewhat reduces the complexity of putting something in place to mitigate the problem we have, because it's a code issue.
D: ...the size of the DB pool. That's certainly safe with respect to avoiding catastrophic saturation conditions, but under normal conditions the connection pool is not the constraining resource, so I think that would reduce our throughput for those jobs as well, right?
A
We'd
have
to
be
very
careful
and
figure
out
what
the
normal
you're
on
youtube
matt,
but
yeah
we'd
have
to
make.
A: ...we have enough headroom, and we don't know that, because, especially with the volume of traffic that goes to ArchiveTraceWorker, if we constrain that, the queue will just grow very quickly. Yes, so you definitely want to have headroom in that, yeah.
D: Agreed. I was just thinking out loud here, trying to play devil's advocate, if that's the right term, and thinking about other edge cases. Okay.
F: But yeah, we need room to stretch, but we don't need to put the ceiling so far above our heads that we're jumping up and down on each other's shoulders before we hit it. We need to be watching the metrics for what's causing the actual load and what the known pain points are, so that we know when we're hitting the tipping point where we've got these backed-up logs due to GCS being slow, thanks to CarrierWave.
A: In this case, though, we had that alert and we'd been ignoring it for a month because of a different piece of infradev technical debt. So we do have the alert; that's why I was talking about putting that reactive caching worker into a sin bin. Or, I mean, we could just change the alert so that it ignores anything from reactive caching... oh no, we can't, because it's per shard, not per worker. But we could...
A: I have to go because I have to say good night to my kids, so I'm gonna jump off. Thanks, everyone.
B: So I guess, who can we get on this? This issue was created two months ago and it's in the backlog, and I got confused with the SAST one, with the JUnit one, so this is actually a different problem. But Sam, I think you're on this call. Is he? Are you still on the call, or did you jump off? Okay, I guess I'll bring him up live.
C: I think this is something in Wayne's territory, so it might be worth bringing Wayne in first.
C: I also need to drop off; it's 8 p.m. here. Skarbek, ping me if you need any support; I'll have my phone with me.
D: The work on the Redis single-queue project is... oh, he's gone; I'll ping him offline. John, why...
D: So this probe, actually, Stan: this was the probe I was showing right before we started recording, the one that scans through the list of all of the job entries in each queue. We turned that probe off, I think, a day or two ago, and we're ready to enable the new probe, so I wanted to...
B: I guess Andrew raised this earlier: have we reviewed that it's safe to do that now? What are we doing differently, then? Yeah.
D: I think we benchmarked this at, at most, two milliseconds of CPU time per minute, which seemed reasonable to me. But yeah, I just wanted to see whether that was still something we were going to do today, especially since we have the rapid action work to focus on as well. So I'll ping the EOC about that and see if they're comfortable with it. John...
E: It would be good to ensure that I'm operating against the right queues. So far, while this meeting has been taking place, I've been trying to figure out which shards all these queues operate on, and Andrew provided me with a feature category that I could use. Okay...
E: I'm thinking about creating a shard that explicitly uses that feature category in its query. Unfortunately, some of these queues run on catch-all, some of them appear to run on another shard, and so on; these queues are spread across multiple shards. So it looks like urgent...
E: So I'm working on a merge request; you create one for staging and one for production. I just need assistance in making sure that I'm doing the selectors correctly. Right, okay, so...
G: Tag Craig as well, because he'll be online in the next hour or so, so he'll see that coming in when he gets on.
D: I was gonna say, I thought we were mainly looking to move just the authorized projects worker to an isolated shard.
E: In the merge request, that makes the query slightly smaller. Well...
B: And I think that might be the one that's mostly used, you know, whenever you share a project or create a new project. I haven't checked.
B: Well, we can look at our Sidekiq logs and tell you. Yeah, yeah. Hey, could I jump in with a quick question unrelated to what y'all are talking about: the root cause, what's actually causing the underlying issue? Is there a good summary of that? I'd like to be able to communicate that out to others. Sure.
B: I mean, Andrew summarized it at the beginning of this call, but do we need to summarize it in the issue now? Yeah, I might not have been on the call at that point. Sure.
B: I kind of have a sense of it, but if someone wants to either point me at it or just summarize it real quick, I'd be willing to type it out and capture it someplace. So...
D: So, Matt, do you want to take a shot at this? Sure, yeah. So we've already got a comment in yesterday's incident issue to the effect that the connection...
D: Sorry, which connection pool? This is the database connection pool that all Sidekiq jobs share. But the new piece of information that folks uncovered while I was asleep is this bit about the thread contention on, oh gosh, what is it... So, having a flood of these...
D: ...these authorized projects jobs was producing thread contention among the Ruby threads on our...
D: ...in our threads, yeah, exactly. And this was implicitly delaying other jobs that were running on other threads from handling their storage IO, and as a side effect of that kind of recurring delay, they would hold their database transactions for a much longer period of time. Longer leases mean the pool becomes saturated, and that's what led to the propagation effect.
B: Okay, gotcha. So basically: a flood of authorized projects jobs, possibly triggered by a single customer, creates thread contention in Ruby, which then starves the IO, since CPU cycles are needed for that; the DB transactions slow down, the pool becomes saturated, and it's a bad time all around.
B: Yep, okay, awesome. I'll capture that in the issue and share it in the channel, because I know folks are interested. The other thing they'd be interested in is what exactly we're doing, and I think, Stan, you've got a good list there that we can update as we go. So I'll stop talking.
D: Thanks for bearing with me. No, that's fine. And so the working theory is that by moving these authorized projects workers to a separate Sidekiq shard, they will no longer compete with all of the other job classes that run on the existing Sidekiq shards, and therefore they're much less likely to induce the contention that was indirectly producing all of the side effects.
D: And, just to be clear, they're not doing anything wrong. It's just that the way they're trying to use the product is not a way that we've made the product scale properly. So...
G: My meeting is about to start, the one where we all go and join the incident room. I'm just watching the Postgres overview for idling transactions, and that's not happening yet. So do we all just want to jump over to that Zoom? We can stay muted, screens off, and do whatever we need to do, but keep charging forward, Skarbek, with what you need to do. I just thought I'd check whether we're okay dropping this call and joining the other one.
G: The Zoom chat, sure. It's also in the topic of the main Slack channel, not the agenda, but...