From YouTube: 2023-03-16 Scalability Team Demo
Description
No description was provided for this meeting.
A
So the first thing I want to show: Vasily is working on some stuff regarding knowing how Redis is used by stage groups, and he's added a kind of wrapper around Rails.cache that can be used as a drop-in replacement, but when initializing it you need to pass in a feature category, call site, that kind of thing.
So that means, let me share my screen. Wait, let me first clean up my screen a bit and then share.
A
So that we can have cache metrics. This shows cache hit ratios by feature category and call site, so then we can see where things are coming from and how things change.
This is a first step; he's going to be adding that everywhere. The idea is then also to build maybe structured keys, so when we do a scan of the Redis key space, we can see what a key belongs to, to reuse that kind of information in the key. But that's kind of far out.
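The "structured keys" idea mentioned here could look something like the sketch below: encode the attribution metadata into the cache key itself, so a later SCAN of the Redis key space can map each key back to a feature category. All names (`StructuredCacheKey`, the `cache:` prefix, the field order) are illustrative assumptions, not the actual GitLab implementation.

```ruby
# Hypothetical sketch: embed feature category and call site in the key
# so a key-space scan can attribute keys without extra lookups.
module StructuredCacheKey
  PREFIX = "cache".freeze

  # Builds a key like "cache:source_code:protected_branches:project-42".
  def self.build(feature_category:, call_site:, key:)
    [PREFIX, feature_category, call_site, key].join(":")
  end

  # Reverse operation, used when scanning the key space for attribution.
  # Returns nil for keys that don't follow the structured format.
  def self.parse(raw_key)
    prefix, feature_category, call_site, key = raw_key.split(":", 4)
    return nil unless prefix == PREFIX

    { feature_category: feature_category, call_site: call_site, key: key }
  end
end
```

The `split(":", 4)` limit keeps any colons inside the application's own key intact, so parsing stays lossless for keys that follow the convention.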
A
We can also add more metrics here. Right now it's just cache hit and miss that we measure; we also measure the duration of the cache generation. So if it's a cache miss, how long does it take to generate? That's in there. Other things that we could add: there is the size of the thing going into the cache, but there's some difficulty there regarding compression and whatnot, yeah.
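The wrapper being described, a drop-in `fetch` that records hits, misses, and generation duration labelled by feature category and call site, might be sketched roughly as below. The class name, metric names, and the plain-hash backing store are assumptions for illustration; the real implementation wraps ActiveSupport's cache and reports to the metrics system.

```ruby
# Minimal sketch of an attributed cache wrapper: counts hits/misses and
# times block execution on a miss, keeping attribution labels around.
class InstrumentedCache
  attr_reader :metrics, :labels

  def initialize(backing_store, feature_category:, call_site:)
    @store = backing_store # anything hash-like: responds to key?, [], []=
    @labels = { feature_category: feature_category, call_site: call_site }
    @metrics = Hash.new(0)
  end

  # Drop-in replacement for Rails.cache.fetch(key) { ... }
  def fetch(key)
    if @store.key?(key)
      @metrics[:hits] += 1
      @store[key]
    else
      @metrics[:misses] += 1
      started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
      value = yield
      # On a miss, record how long the value took to generate.
      @metrics[:generation_seconds] +=
        Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
      @store[key] = value
    end
  end

  def hit_ratio
    total = @metrics[:hits] + @metrics[:misses]
    total.zero? ? 0.0 : @metrics[:hits].to_f / total
  end
end
```

Because the labels are fixed at initialization, every call site gets its own hit ratio for free, which is exactly the breakdown shown in the demo dashboard.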
A
Yeah, so that's what Vasily is working on; I'll link the epic from the doc later. This is part of getting some attribution on cache usage, on Redis usage, related to stage groups.
B
Yeah, I guess it depends how compressible it is, but none of the things we're looking at here actually relate to the size anyway.
Right, and these are all pretty good. While we're looking at here, I've still got my tab on the Thanos tab. So what that tells me is that the content one has a very high hit rate and it's not very expensive to regenerate, whereas the protected branches service thing has a lower hit rate but is more expensive to generate, yeah.
A
And the thing is, it's a drop-in replacement for Rails.cache, well, for any ActiveSupport cache thing, whatever it is. So the idea would be to start using that instead. I don't know if there's something magical we could do that makes everybody use it, but this was easy to do because we already separated off the Redis repository cache, so yeah.
B
I think Andrew might actually be typing the same thing, or maybe it's a related point. Can we figure out how much attributed and unattributed cache usage there is by endpoint, like in the logs or something, and then sort of go back to stage groups and say, hey, you know, Source Code makes a bunch of requests that have cache attribution and...
A
It looks like Rails.cache.
E
My question is really: this is great, but unless we can use it in a way that we can drive change with it, it's not going to be as useful as it could be, right? So how do we use this? Is there a way that we can include this in a budget? Have you thought about that? And I'm sure there...
A
So that's the end game, but there are currently two ideas floating around on how to do that. One is measuring the throughput there with the custom cache, and the other is periodically scanning the key space, attributing it to feature category, and then including it in the report that Sean and Stephanie are building.
B
Just as a side point, another way we can actually use this is with MRs, right? Because I've had this in the past, where I've been reviewing an MR and someone added a cache, like a Redis cache for something, and I'm like, yeah, but what's the usage pattern of this endpoint? Is it called a lot with the same data, or is it called for different data all the time?
B
What's the actual expected hit rate of this? It's quite hard for someone to answer that, and it's also quite hard for us to answer after the fact how useful that cache is. But if we can put it behind a feature flag, or just measure it after rolling it out, then we can say: oh, we can see the hit rate for this specific cache, it's like 30%, and it takes no time to generate, so it's absolutely kind of pointless.
A
That's what we could do with the metrics already? Yes.
B
Exactly, that's what I mean, sorry. This is one way we could use this already, but that's in a non-structured way. That's not aggregating data; that's just saying, on a case-by-case basis, I want to know how effective my change has been or is going to be, and this is a way of showing that.
A
Yeah, so do we want to have that in something like error budgets that the teams are already looking at, or do we want to have a new thing?
E
And it's always those kinds of parts. The problem is that you've got to come up with a single number where you mix in, well, this is how much of this there is. Like with Postgres, right: this is how much autovacuum this group is generating, this is how much CPU.
E
This is how many connections. And then you've got to come up with a way of mixing all those things together into some sort of metric, and that's the really difficult part, because I think a lot of people will just think, well, it's not really a real number. It's a synthetic number, which it is, but you've got to make it so that it kind of reflects reality in some way. Yeah.
B
And we also can't measure Redis CPU. I know that for the cache most of it's going to be GET and SET, but we can't measure Redis CPU with it because it's a client-side metric, right? So if it's something like the issue we had with SMEMBERS or whatever before, we can't measure that in this way, unfortunately, yeah.
B
If it's called a lot, we don't capture that. But I think this is sort of down-the-line stuff; I think initially it's quite useful just for poking around manually, if nothing else. Like I said, the two hit rates you showed here are what, around 85% and around 98%, which both sound okay.
B
But it would be interesting to see that across a wider variety of cache identifiers and see: are there some that look really low, and if so, what are we doing with that cache? Why are we doing it? Although, I mean, Jacob's been working on a cache thing for something slightly different, and there are a lot of variables and there's not necessarily one answer.
D
Well, I think if you have a new data source, it's also okay to start with ad hoc analysis, learn what you can from the data source and how you can use it, and over time maybe realize things. Like with the Redis optimizations, I think it's become very clear that just the sheer volume of requests is often the most important, useful thing to optimize.
A
What's interesting about this as well is that we slot it in between. That means we can add more things to it. Right now we're just adding metrics, because that was the easiest thing to do at first, but we could start writing stuff out to files, whatever we want, now that we're in between, yeah.
A
Right now, it's for two endpoints that Vasily just had.
D
Well, that is another interesting problem to look at: how to get everybody to go through this mechanism.
A
GitLab-wide, yeah. We were thinking, because as we've seen from the cache instance, most things are Rails.cache.fetch, and it's a drop-in replacement for that. So my idea would be to drop it in by just doing a find and replace, let the specs call out if it doesn't work, have it with nothing set to start with, and then start attributing like we did for controllers and Sidekiq jobs and all that stuff.
E
Am I hearing it right that Rails developers aren't suggesting monkey patching the underlying Rails.cache, or whatever cache?
D
Ones it knows about. But Andrew, the problem as I'm hearing it is that we need to say somewhere who the owner is, so.
E
Okay, right. I mean, the other way that you could do it, and I don't want to spike things up too much here, but: if you're entering a block of code where the ownership of everything inside there, Postgres, Redis, changes to a different team, you could almost have a yield block or something where...
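The yield-block idea being floated here can be sketched with a thread-local ownership context that a block temporarily overrides and then restores. The module and key names below are illustrative assumptions, not GitLab's actual Labkit/ApplicationContext API.

```ruby
# Hypothetical sketch: a nestable, thread-local ownership context.
# Anything instrumented inside the block reads the current category.
module OwnershipContext
  KEY = :ownership_feature_category

  def self.current
    Thread.current[KEY]
  end

  # Override the feature category for the duration of the block, then
  # restore the previous value even if the block raises.
  def self.with(feature_category)
    previous = Thread.current[KEY]
    Thread.current[KEY] = feature_category
    yield
  ensure
    Thread.current[KEY] = previous
  end
end
```

The `ensure` makes nesting safe, which matters for exactly the case discussed next: a controller sets an outer owner, and inner code (say, a repository class) overrides it for its own section.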
A
Yeah, yes. Because you would add a block then in something; the repository class is maybe not a good example, but then we'd start, for example, at the controller, and as we see now, not everything within that controller should be owned that way.
A
Yeah, and there's also the thing that in some cases we push something onto the context, but it's not in a yield block, and then...
E
Yeah, like, say for notes or for to-dos or something; you add to-dos from all over the show. I don't know if that's a good example, but when you enter that code, it kind of, yes.
D
I think this is a fundamental problem for what we're doing with error budgets and our monolithic code base. I also remember from reading the SRE book that they were saying people are on the fence, or divided, on how to attribute resource usage.
A
No, I don't think it is, because we're going in a direction where we have these teams, we call them stage groups, and they're building their thing, and now we have teams on the other side, in reliability, that are the stable counterparts or whatever. Then it's going to be a collaboration: we know that this is for this feature, and then it's up to a small group of people to decide who's going to act on whatever it is.
D
That's not what I meant. We were talking about protected branches, and you're saying lots of different features use protected branches. So whose budgets or spends does a use of protected branches count towards?
D
If so, protected branches would be a Source Code feature, but if somebody other than Source Code does something that needs a protected branch, then we're saying that feature category's stage group is accessing the cache.
A
If pipelines need protected branches, to know whether the branch the pipeline is running on is protected or not, then they need a way to figure out if this branch is protected in an efficient way, and it's Source Code that's responsible for making sure that they can do that efficiently. That's my thing.
D
Yeah, and context objects get set at the controller level, on the outer layer, once something comes in. So they are never quite the right thing, because you then go through everything and you touch things that are owned by different teams at different layers of the call stack. But...
A
These things can be narrowed down. As Andrew mentioned, now we just set it on the outside, on the controller or the API endpoint or the Sidekiq job, but we could set it again, override it, for certain pieces of code. For example, when you do something inside the repository.rb class, it's owned by Source Code. So if you call out...
D
Yeah, but what you're then recreating is the call stack. We use thread-local storage to remember the context of the request, and then you have a long call stack, you go all over the place, and the Ruby language knows where you are in the call stack and what file you're in. I mean, it's about files, really: we want to attribute things to files, and files are owned.
D
Okay, or methods in files, but those are all things that are part of the language, of the call stack. So it seems like if, all the time when we're calling functions, we are maintaining this parallel call stack for our own observability, then we're slowing down Ruby and we're sort of recreating the structure of the code.
A
Did we pick the current number? The current number was picked because we used to call out to Thanos for every run, and then we added the cache, and then we added the more reliable cache. You remember the data warehouse in, what is it, one of the package components, where we upload the zip every time to GitLab.
D
Oh right, that crazy idea. I was selling it.
A
I guess because the CI cache might, at random moments, not be available, and then we'd have to refetch everything again.
A
We do that manually and it's stored in here.
A
But yeah, once you've extracted it, we now have about a year's worth of data.
A
Soon we'll have more, but I think a year might be good enough to show. I'm not hearing any objections. It does compress everything a little bit, but.
B
Well, the doc that you linked, it's like outlier support, so you're like, oh great, it can handle outliers, and what it says is basically: remove them. Once you've removed them, it will ignore the empty data points. That's how it supports them, but you have to remove them yourself, which from our perspective is the easiest thing to do as well; we don't have to do anything fancy with them.
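The "remove them yourself" approach described here can be sketched as a pre-processing pass: blank out points outside an interquartile-range fence, leaving nil gaps that the forecasting tool (Prophet, in this discussion) treats as missing data. The 1.5 × IQR fence and the helper name are conventional choices for illustration, not anything Prophet-specific.

```ruby
# Mask outliers in a time series before forecasting: values outside
# [Q1 - fence*IQR, Q3 + fence*IQR] are replaced with nil gaps.
def mask_outliers(series, fence: 1.5)
  sorted = series.compact.sort
  q1 = sorted[sorted.length / 4]
  q3 = sorted[(sorted.length * 3) / 4]
  iqr = q3 - q1
  low  = q1 - fence * iqr
  high = q3 + fence * iqr

  # Keep in-fence points; turn outliers (and existing nils) into nil.
  series.map { |v| v && v.between?(low, high) ? v : nil }
end
```

This keeps the series length (and thus the timestamps) intact, which matters because the forecaster aligns values to dates rather than to positions in a compacted array.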
B
Exactly. We know that, oh hey, everything blew up, so we don't need to consider that in the forecast if it wasn't germane to future trends. I mean, we do need to be a bit careful with that, because we need to be clear on the distinction between something that was a genuine outlier and something that we wish hadn't happened but was based on real usage or whatever, so yeah.
E
So when we get to two or three years, are we going to turn on annual trends?
B
That would help with the seasonality, right? Because I think part of the problem with the seasonality thing is that Prophet can detect these now, I think, but we don't like the...
B
Yeah, it would be nice to have metrics that have been around for that long and are stable, that mean the same thing for that long as well, which we do have quite a few of, so yeah.
A
I
think,
as
long
if
we
change
them,
like
we've
done
before,
we've
worked
with
the
V2
thing
and
so
on
to
have
it
separate
I
think
we
should
figure
out
a
way
to
not
do
that,
so
we
can
have
the
same
name
and
then
we
can
add
the
change
point
or
like
have
it
detect
a
change
point
but
yeah
for
now.
Let's
just
start
with
a
Year's
worth
of
data
and
see
where
that
gets
off.
The
last
thing
I
wanted
to
call
attention
to
was
the
postgres
primary
CPU.
The
issue
is
linked.
There.
A
People are working on it, but I'm just spreading the word a little bit, because this is something to worry about, and it's often not an easy thing to resolve. I don't have anything more than that to say about it. Okay.
F
That is a pretty short one, so there's still room for more points. I think in the last two weeks, I was explaining the mystery error with Redis cluster rate limiting. The patch was very simple, so I brought two issues to discuss with the maintainers, and I think they were pretty excited; they're pretty much all right with accepting it for version five, because that's the one being maintained now, but not 4.8.
F
The maintainers said that they are not open to backports. In that issue I had all the GitHub links. So yeah, I think it's a little bit about the monkey patch: you can't really put it in and then enjoy the code changes on the gem side. But the good news is, for Redis v5, when we upgrade, we can enjoy that change without having to do any patch of our own. I don't think we'll be getting it soon, though, because we are kind of blocked on 307, followed by Sidekiq 7.
F
Yeah, so other than that, not much for me. I could show the patch, because it's pretty simple, but I think in the previous demo I've shown a bit of it already, the version five patches; there's a select modification. So yeah, take over.
D
Thanks. I was on call as IMOC for a couple of days this week, and I was paged on Tuesday, as Andrew said, for an incident where we spent a long time trying to understand what the problem was, and there's one thing that maybe this is a good audience for.
D
Something I found confusing, and sorry, I didn't prepare this, but I'll just share my screen and quickly show it because that works better: we were getting an alert from patroni main, or rather this apdex thing was alerting, and we really had a problem on the CI cluster, but patroni CI doesn't have an apdex.
D
So
I'm
curious,
why?
Why
isn't
there
an
aptx
here.
E
The way that it's built is interesting because, sorry, just before you move on: does that mean that we have a lot of transactions that are being held open across both of the Postgres instances, and then if one of them slows down, the other one basically struggles because there are lots of client transactions? Is it something like that?
D
I
am
not
sure
I
was
in
the
incident
for
three
hours
and
by
the
end
of
that
it
was
still
not
clear
what
was
going
on
and
I
hand
it
over
to
the
next
iMac
and
I
haven't
caught
up
yet
on
what
the
cause
of
what
was
really
going
wrong.
But
it
took
quite
a
long
time
to
figure
out.
D
And there was a different incident yesterday, so I also didn't read up on what happened yesterday. But going back to Tuesday.
D
Another thing that was confusing, and I just want to highlight it here because people know about Sidekiq, is what the Sidekiq database load balancer does if it can't find a secondary that is up. Sidekiq database load balancing works by putting an attribute on the job that says: I want to see these WAL offsets. If it can find these WAL offsets on a replica, then it's going to use that replica and not the Postgres primary, and that's good, because then you're not using the primary.
D
The
problem
is
that
if
this
job
gets
picked
up
by
the
psychic
server
process-
and
it
doesn't
see
these
offsets,
it
uses
an
exception
to
force
a
psychic
retry,
because
the
I
think
the
idea
is,
if
we
wait
a
while,
maybe
the
replicas
have
caught
up.
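The mechanism being described can be sketched roughly as follows. The job carries the WAL location it needs to see; each replica is asked whether it has replayed past it; if none has, an exception forces a Sidekiq retry. Class and method names are illustrative, and LSNs are simplified to integers; the real GitLab code compares actual Postgres LSNs.

```ruby
# Raised to make Sidekiq re-enqueue the job, hoping replicas will have
# replayed the WAL by the next attempt.
class WalLocationNotReplayedError < StandardError; end

# job:      a hash with the "wal_location" the job must observe
# replicas: array of { name:, replayed_lsn: } stand-ins for connections
def pick_replica(job, replicas)
  wanted = job.fetch("wal_location")
  caught_up = replicas.find { |r| r[:replayed_lsn] >= wanted }

  # No replica has caught up: force a retry instead of silently
  # falling back to the primary.
  raise WalLocationNotReplayedError if caught_up.nil?

  caught_up
end
```

The subtlety discussed in this incident is that this same exception path also fires when the replica check fails for unrelated reasons, which makes a connection problem look like replication lag.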
A
Yeah, the delay there is the better option, rather than using the primary, but it's still not good and we'd still want to know.
D
Yeah. So that was very confusing, and then the other fun one was...
D
This class, this method, is where we check if a replica can be used, and notice this rescue here.
D
So
when
this
rescue
happens,
this
method
returns
false
and
if
all
the
replicas
return
false,
then
the
database
load,
balancing
code
and
psychic
raises
this
exception
of
I
want
to
be
retried,
so
what's
happening
here
is
that
we
were
masking
real
errors
and
instead
saying
oh,
there
is
a
database
load
balancing
problem,
because
the
problem
wasn't
even
that
the
replicas
weren't
up
to
date.
D
There
was
something
else,
I
think
the
the
the
thing
was
to
saturated,
like
all
the
the
back
end
connections
between
PG,
Bouncer
and
but
the
the
postgres
surface,
where
constantly
in
use.
So
we
were
getting
one
of
these
connection,
errors
which
are
which
include
PG
error.
D
So
this
is
the
idea
of
just
masking
all
of
these
and
instead
saying
well,
let's
retry
the
load
balancing.
It
is
not
good
for
discovering
what's
what's
wrong,.
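The masking problem can be illustrated with a pair of sketches: a broad rescue turns every failure, including real connection errors, into "replica not usable", while a narrower rescue only treats the expected error as lag and lets genuine errors propagate. The exception names are illustrative stand-ins, not GitLab's actual classes.

```ruby
class ReplicaLagError < StandardError; end  # expected, retryable
class ConnectionError < StandardError; end # real problem, should surface

# The problematic shape: any error at all becomes "not usable",
# so a saturated PgBouncer looks identical to replication lag.
def replica_usable_broad?(check)
  check.call
  true
rescue StandardError
  false
end

# The narrower shape: only the expected lag case returns false;
# anything else propagates and shows up in error tracking.
def replica_usable_narrow?(check)
  check.call
  true
rescue ReplicaLagError
  false
end
```

With the narrow version, the Tuesday incident would have surfaced the underlying connection errors directly instead of presenting as a generic load-balancing retry storm.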
A
Yeah, there is a connection error in there, yeah.
D
I'm not even sure we should be catching these, so.
E
But do we not want to just do that by getting Sidekiq to repost the job and retry it? That's...
D
Yeah, I don't know why I'm doing this to myself, but at least I get quick answers to why things work the way they work if I ask. Okay, I guess I need to make corrective actions now, but thanks for looking at this with me.
E
Yeah, but how do we, you know... we're not going to do another functional decomposition.
E
So what's the... I mean, I know that there are some fairly noisy pieces, like the container scanner or the license scanner or whatever it is. But in fact, do we have good accounting of exactly what's in there, how much of our resources those sorts of features are consuming? Is there something where we sort of...
A
We've done a few things that Nikolai from postgres.ai pointed out, some kind of report that highlights some expensive queries, but I think that resulted in two things that will have been done. But I don't know, we don't...
E
I think, you know, when infradev was happening, one of the things that really helped is if you could go there and really point and say: here is a big problem. The scalability team is quite uniquely placed to do that: really put it on somebody's lap and say, this needs to be fixed straight away, and then everyone can rally around that. And maybe it's up to the scalability team to help with, I don't know.
D
So
one
thing
I
found
strange
or
or
frustrating
in
in
the
that
first
incident,
I
was
in
I
mean
apart
from
I,
was
the
eye
marker
I'm
not
supposed
to
investigate,
but
even
if
I
hadn't
been
the
iron
mock,
I
I
was
would
have
been
pretty
helpless,
yeah
analyzing,
what's
going
on
with
postgres
I,
don't
have
the
expertise
to
look
at
a
melting,
postgres
server
and
see
what
it's
basically
doing,
but
how
how
many
people
know
how
to
do
that?
D
How
easy
is
it
to
do
that,
so
how
many
yeah,
if
we
can
find
things
and
then
we
like
Andrew,
says
we
can
hand
them
to
development
and
say
Here's
the
thing
that
is
inefficient
or
is
clearly
costing
a
lot
and
it's
something
done
about
them.
But
how
do
we
find
them
who's
able
to
find
them,
and
we
train
more
people
or
Empower
more
people
to
find
them.
E
I'm
curious,
you
know,
there's
all
that
stuff.
A
while
ago,
around
marginalia
plus
tying
that
to
statement
IDs,
there's
a
there's,
a
doc
in
the
Run
books
called
postgres
mapping,
something
like
that
which
describes
like
with
people
using
that
sort
of
information,
the
marginalia
and
the
and
the
and
the
PG
stat
statements.
And
if
you
can
kind
of
tie
all
those
things
together
now
and
it
gives
you
a
much
better
view
of
what's
Happening
inside
posters.
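The Marginalia-style attribution mentioned here works by appending a structured SQL comment to each query, which can then be matched against normalized entries from pg_stat_statements to map a query back to an endpoint. The helper names and field names below mirror the "key:value,key:value" comment style but are illustrative assumptions, not the exact GitLab comment format.

```ruby
# Append a Marginalia-style attribution comment to a SQL statement.
def annotate_query(sql, application:, endpoint:, correlation_id:)
  comment = "application:#{application}," \
            "endpoint_id:#{endpoint}," \
            "correlation_id:#{correlation_id}"
  "#{sql} /*#{comment}*/"
end

# Recover the attribution fields from an annotated statement, e.g. when
# correlating slow-query logs with application endpoints.
def parse_annotation(sql)
  match = sql.match(%r{/\*(.+?)\*/\s*\z})
  return {} unless match

  match[1].split(",").to_h { |pair| pair.split(":", 2) }
end
```

One caveat relevant to the pg_stat_statements side: Postgres strips comments before normalizing queries into query IDs, so the comment is visible in logs and pg_stat_activity, and the mapping doc describes how to join those sources back together.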
A
It's
still
a
manual
thing
and
I
I,
don't
know
off
the
top
of
my
head
to
figure
out
the
query
IDs
that
are
causing
a
problem
because
it.
D
Okay,
but
nobody
in
the
internet,
myself
included,
knew
that
that
was
there
or
how
to
find
it
or
to
start
using
that,
and
eventually
people
started
people
who
knew
enough
started
to
look
in
the
right
places.
D
I'm not in a lot of incidents, but what I found interesting was that the initial reaction here was more along the lines of: we need to increase the limits on PgBouncer, or we need to tweak some infrastructure knobs. Because I'm a backend developer, my first thought is: the application is doing something bad; we need to find out what bad thing the application is doing. Yeah.
D
What I wanted to ask is: for the people in here who observe more incidents, and I think there are some, how often is it a problem you solve by turning Postgres knobs, and how often is it the fault of the application, roughly?
B
This
is
another
thing:
I
mentioned
to
Jacob
is
I,
think
we've
gone
from
the
same
few
people
being
like.
Basically,
every
incident
call
which
is
bad
to
having,
certainly
on
the
internet
manager
side
like
a
huge
voter,
which
means
that,
like
like
Nick
upset,
he
doesn't
have
much
experience
to
be
an
assistant
manager,
even
though
he's
been
on
the
rotor
for
quite
a
while
because,
like
you
know
and
I'm,
not
saying
people
to
have
more
ships,
I,
don't
I,
don't
have
an
answer
to
this.
D
Well,
my
question
is
my
instinct
is
to
say
we
need
to
spend
more.
We
need
to
invest
in
understanding
how
the
application
is
driving
postgres
and
if
we
can
find
wins
there,
then
we
can
buy
Headroom
on
postgres,
but
I
don't
know
enough
about
the
actual
practice
of
what
goes
on
with
postgres
in
the
application.
If
that's
true.
E
So
so,
going
back
to
your
your
point
about
like
infrastructure
people
looking
at
infrastructure
knobs
to
turn
like
I,
think
that's
where
the
scalability
team
are
quite
uniquely
placed
because
they
can
kind
of
translate
between
the
infrastructure
world
and
the
application
World
much
better,
and
they
can
say
you
know
sure
we
can
change
this
PG
bouncer
connection
tool
but
like.
Why
do
we
fix
this
thing
in
the
application,
because
there's
a
client
transaction,
that's
holding
on
to
a
transaction
for
10
minutes
or
whatever
it
is
and
I
mean?
E
Maybe
one
of
the
things
to
do
is
to
start
saying
that
you
know
we
need
to
help
out
on
more
like
if
things
are,
if
the
temperature's
rising
and
things
are
kind
of
getting
more
scary.
Maybe
it's
time
like
we
become
more
involved
in
those
things
and
actually
because
I
don't
know
about
you,
but
I've
I've
been
very
disconnected
from
from
incidents
and
gitlab.com
for
a
while
I,
don't
I,
don't
know
what
they're
fighting
and
how
they're
fighting
it
and
but.
E
And maybe it's something where we should be suggesting application changes as well in some places.
E
Yeah, but I mean, common sense sort of dictates that if you have 500 database connections, and you have something that's taking a very long time, it doesn't matter how you shuffle the pools around for those database connections. There are still only 500 database connections, and the consumption is the problem, right? You've got a stream, but yeah.
B
Well, yeah, I think it's pretty clear that the Postgres saturation is the key issue. Like you said, Bob's right to keep raising it up, because, I mean, Hercules and I noticed the side effect of it when we were looking at just some endpoint that's called a lot, and we noticed that it got slower during some periods. There's no reason for the endpoint to get slower, but it turns out that the connection pool was slightly more saturated during that time.
A
I,
like
things
are,
things
are
moving
on
that
issue
and
people
are
doing
stuff.
The
thing
that
I
haven't
that
worries
me
is
that
I
don't
know
how
much
of
an
effect
they
will
have
like
I
saw
Nikolai
mention
an
issue
about
a
query
that
I
know
of
so
I
picked
the
right
teams
with
suggestions
on
how
to
fix
it.
But
I,
don't
know
if
that's
going
to
be
a
10
drop
in
CPU
utilization
or
not.
This
is
going
to
be
visible
at
all
or
not,
but.
D
We
got
better
at
this
with
redis
I
feel,
like
I
I
know:
we've
taking
shots
at
Red.
It's
like!
Oh,
let's
optimize
this
and
optimize
that
and
then
some
things
did
almost
nothing
and
some
things
paid
off
and
I
feel
like
over
time.
We've
gotten
a
better
handle
of
what
what
is
going
to
pay
off,
make
a
difference,
but
I
I,
don't
know
who's
been
doing
this
with
the
database
or
who
is
that
knowledge
has
been
doing
it
for
long
enough
to
to
recognize
the
good
opportunities.
B
I mean, for Redis that's famously been Matt, but yeah.
E
We have these reports, effectively linting reports, and no one looks at them. There are literally thousands of items in them, and it's this feature that GitLab's got, and every time you run a pipeline it's generating more of these results. We're not using them, and it's unusable because there's too much information in there, and I wonder how much of this is related to that as well.
D
I want to react to this, because this is a good example of something that looks fishy or that might be an opportunity. But the problem is, we have limited capacity to solve problems, so we need to pick the most impactful ones, and even if this is an impactful one on Dedicated, that doesn't tell us that it would be impactful on .com.
E
Yeah, but what I'm thinking is, if there are lots and lots of teams that have all got this turned on by default... I'm just trying to find a good example now, but actually, interestingly enough, the one I looked at has got zero, so maybe I'm being a bit unfair. Maybe it's already been improved. But let's just go take a look at this one. Yeah, I don't know, it sounds like we need to look into this a bit more.
E
I can't actually find the type of issues that I was complaining about, so maybe it's gone away already.