GitLab Scalability Team Demos, 10 Nov 2022

Previous Meeting Next Meeting

⏯

youtube image

►

From YouTube: 2022-11-10 Scalability Team Demo

Description

No description was provided for this meeting.
If this is YOUR meeting, an easy way to fix this is to add a description to your video, wherever mtngs.io found it (probably YouTube).

A

Good um marker would you like to take us through the first item.

B

Yeah sure, uh okay, let me share my screen.

B

Yeah so I guess it's just a sharing for what happened with. uh We had some kind of regression on the sidekick for self-managed users, so opening up the issue here. So what happened and two months ago back in 15.4 release uh we're trying to default the routing rules whereby you wanted to change the default from you know: Cube worker configuration to just two cues one is default and one is mailers for typical assignment users. So um we did.

B

This was safe back then, because um our the two groups default in the psychic site are still stuck anyway. um So we went ahead and deployed this um then, starting from last week, the support Engineers paint us. Basically, ever since the release, there were a lot more than this tickets regarding the back performance.

B

So what happened was that it's pretty pretty much quite simple, so um imagine you would have this Q groups. They already defined this previous to 15.4. So what happened? Then this process only um processing or listening to certain cues except default.

B

So then, what happens for this guy case is the only running process is just this one, so the other two would just be idle. So that's what happened when they start to notice their sidekick latency is getting higher yeah, so I've listed. These are some of the cases, so this was fine and this is even more serious where you only have one process instead of seven or eight here yep.

B

B

um Yeah so basically I thought I mentioned the um the root causes just because they have um Define a custom, Q groups, and then we unknowingly assume that many people wouldn't use Q selector anyway, um so that was the miscommunication there and then um yep. So then, after they upgrade to 15.4, all the jobs are pushed to default queue and then some of the processes inside click will become idle yeah.

B

So actually the fix is actually pretty simple. If they uh they managed to um contact our support. We already had five support. Engineers to there are two ways so first they can still keep their q selectors, but then we can kind of override the routing rules to become. Basically all jobs uh will be routed to the name queue again, so it'll be back to the 400 plus queues.

B

So this will be the previous date other than that we can start to address customers instead of using queue selector. We can now use routing rules. So quick example: if they have something like this previously, then we can address them. We can change the Q groups to the in the routing rules format. So then they can Define. For this skill attributes it will be routed to this urgent other queue and so on and so forth.

B

Then, in the Q groups they can simply just Define the specific cues they want to listen to for each process yeah. So these are for the affected customers, but we also noticed that, from the support tickets, there will be a few more customers that have yet to upgrade to 15.4, and there are some big names out there. So um we don't want to have this breaking changes for them as much as possible. So our plan is that in 15.6, the upcoming release next week, I mean two weeks or three weeks from now.

B

We would be reverting the default routing rules after that.

B

um It would also like to add this logic in the Omnibus and chat so that we set this um default routing list for back of the 400 plus queues if they haven't set the routing rules and the queue groups only consist of stars. So this, basically they are default, Q groups per se.

B

So the whole point is that we don't want to mess up with their product rules the default routing rules and whenever they already touched the Q groups, then we also want to backboard the first revert here to also 15.4 and 15.5 in case the, for whatever reason they couldn't upgrade to 15.6 yep and that in 15.7 we want to introduce back the default tracking rules so basically and revert the reverted change yeah.

B

So yeah, that's about it. This is just a very quick sharing.

C

Mark I don't know if I've made something there and thanks for sharing by the way that was the nice detailed walkthrough of the process and as of 15 7 in does that mean? We can then continue and go ahead and deprecate the old Behavior, or are we waiting on customers or users to make changes um to their configuration.

B

We would still need the first officially deprecate in the documentation, so we would still expect customers to still use the Q selectors before any major release yeah. So that's what I would expect. Okay.

B

So yeah I, don't think by our discussion with Sean. It wouldn't be removed anytime soon for the queue selector yeah.

C

Okay makes sense.

A

A

Are there any further questions on that one, or should we move over to Eagle.

A

All right, Eagle, all yours.

D

Yeah so, uh admittedly, a bit of a click-plated title there um there were there's been a whole bunch of efforts around efficiency in kubernetes and um some of the challenges that have been made had kind of unintended consequences, uh in particular putting more load on some of the shared resources um and just to show some specific cases of that.

D

um So I guess, one of the main drivers has been an increase in number of pods for our web service deployments.

D

um So that's API, git and web and uh there's been sort of a few unexpected knock-on effects from this change, um and so there were a few things around uh packing our workloads too tightly and sort of reaching CPU level CPU saturations, but the one that I wanted to focus on is um PG balancer.

D

So we had an incident actually two incidents related to PG, bouncer client connection limits. Let me see if I can find the second one uh I think it was this one. Yes, so we have one or Max Clan connections on our replicas and we have one for the same thing on the primary. Our future bouncer setup is a little different depending on whether you talk to replicas or the primary.

D

If you're talking to replicas, the PG bouncers are co-located on the replica boxes themselves, so they share the box with postgres, whereas for the primary we have dedicated PG bouncers.

D

um In any case in in both of those situations, we were approaching limits on the same uh metric, um which is client connections.

D

So PG bouncer takes incoming connections from all of the clients, so that's going to be mostly sidekick and web services, um and these are a lot of connections because we have you know one or maybe even more than one connection per pod coming in and so it sort of does a fan in and then multiplexes that onto much fewer back-end connections when it talks to postgres itself, um and that's because postgres has kind of a pretty hard limit on how many connections it can support.

D

So we need PG bouncer to to do this job for us.

D

But peachy bouncer, each PhD bouncer process can only uh handle a certain number of incoming connections, and so as we going back to this as we increase the Pod count here we're putting more connections into our PG bouncers and, let's see yeah, we can see a chart here.

D

So this is, uh let me open the actual chart if I can get it to load and it looks like it's not loading, not sure. What's up with that, um but it's uh it's! It's our saturation ratio for how many connections PG bouncer can handle, and we were awfully close to the Limit. Actually I think we even hit the limit in the end, um which then caused some some user-facing impact because they get 500s because the pods can't connect to postgres.

D

So this is not good and there's a little bit of tuning and tweaking that we can do to this limit.

D

um But the broader question is.

D

We kind of it we in scalability uh um projecting our our usage over time and sort of capacity planning and when we make changes to how the infrastructure runs, um that can actually drastically change the the usage and um I've been sort of thinking about. How do we approach this in a better way, so that we can better predict the outcome of those types of changes right so that when we make a change like that, where we're seeking to gain more efficiency, what were the other consequences of that be?

D

Will we create more load on some other Upstream systems? This is going to put more load on PG bouncers. It's going to put more load on postgres and.

D

And I'm not faulting the the engineers who were working on those changes because I think it's more a question of do. We have for the mental models for that and do we have the tooling to maybe kind of try and visualize that and.

A

Can I ask a question so.

D

A

Ahead, so the the increasing the pods increased. How many connections needed to be made um is increasing the Pod, something that an engineer chooses to do, or is that, like an automated thing that that just we're in a certain level of usage is achieved, it automatically creates new pods.

D

Yeah, so uh we use a horizontal pod, Auto scalar for all of our web service workloads, and so that means we configure. We've got kind of two variables that affect how many pods we get.

D

um We have the utilization Target, which we set on the horizontal puddle to a scalar, and we have the CPU requests and CPU requests kind of affects. How big is each pod and the utilization Target is relative to how big it is. How much do we fill it up um and both of those sort of and indirectly based off of the dynamic Behavior results in getting more or fewer pods, um and so maybe to to illustrate this, because that's kind of the main thing I wanted to show uh so I've been putting together.

D

This is still kind of early stages, uh but I kind of try to put together a diagram to show some of these interactions, um and this is sort of a a model. If you will um so, we can sort of look at this and and try and reason through some of the scenarios, um there's a good chance that there's stuff missing from here right. uh This is this is an abstraction. This is sort of simplified, but let's say we want to increase efficiency.

D

um And so maybe we raise the target CPU percentage so.

A

It was previously when you say, efficiency.

D

A

Want to increase the efficiency of.

D

The efficiency of resource utilization, okay on our Fleet and effectively reduce the number of resources that we need so operational expense, I guess, is sort of what it comes down to in the end right.

D

um So if we were to say, increase this utilization Target, that means uh we drive each one of these parts harder they they have now. Each one has more work to do.

D

um Presumably you know each one of them will be able to actually do that work, and so, if, if that works out, then same incoming load, the same incoming requests that we get will be spread across fewer pods, and so we can see you know, CPU pod percentage wise goes up.

D

Pod count goes down, however, this has other potential consequences, because now each of these parts is running hotter, so that can increase sort of the contention on the host um and it can also affect the contention of uh the Ruby Global VM, lock, um because we're now so Ruby Global VM lock is based on how many processes you have per pod. That's the Puma was tunable, and so we didn't change this.

D

So this state game, but this went up, which means the CPU to Puma ratio, changed in a way that each Puma worker is now doing more work and that then drives this contention metric. So you know it's complex right. It's kind of the point that I'm trying to make um and uh and it can be kind of tricky to really predict what exactly the effects are. Even with something like this. We can sort of try and reason through it, but chances are we'll get it wrong.

D

So I think there's also uh an element of um scientific method involved in making these types of changes.

A

And everything that you're describing here sounds very similar. I mean it's similar in terms of class of problem to when we were trying to decide I think it was. Was it the connection pool size for the database and it felt like we would do a certain amount of reasoning through the problem we would set.

A

It we'd get it wrong, and then we would do another set of reasoning and then set it a different way and see and see what that did, and it seems like what you're talking about here is is the same class of problem. Where there's only so much reasoning, we can do and we're going to have to set them and see what happens to see how that actually plays out when it's on production.

D

Yeah I mean I think we can also use those experiments to refine our models. So we can, you know we can potentially get closer to being able to actually predict it. um But yes, I think it's very much in the same realm. What you described well.

E

uh I kind of copy of a production environment where we could learn synthetic tests uh with load. Would that be a reasonable approach or are we do or is the volume and number of moving parts so massive that it becomes impractical.

A

I think the problem with with completely replicating a production environment is also cost I mean the production environment is, is rather large and if you were running, if you had an exact copy of that running synthetic tests on it, I think it would. It would become quite expensive quite quickly.

E

D

Yeah I think it's also just technically pretty challenging to do like you. You either do synthetic float and then you're likely not going to match what is actually happening in production or you try to replicate real production traffic, and then you have the question of how do you deal with States changes and rights because there's a kind of tricky to replicate in a way that you can really replay it properly because it gets sort of out of order effects and whatnot, so I'm I'm a bit of a skeptic when it comes to that approach?

D

D

um One approach that I do think has also some limitations, but can be pretty effective, is um doing it on a subset of making changes on a subset of the fleet and that's what we're doing right now. So we have the three zional clusters and we're making the change in only one of the three um there's still potentially some sort of interplay between them and so I I. Don't think this solves necessarily for all combinations of uh changes. It could still kind of tip over once.

D

We once we roll out to everything, but it gives us at least a bit of an idea of how um how the changes are behaving in a way that sort of mitigate some of the risks.

A

Yeah- and that seems reasonable.

D

I, don't think it's if it's being.

E

A

A controlled way plan for you know if we see these certain things happening when we're making these adjustments, then we we alter the plan and change it in that way to keep the system safe, but it seems, like um you know, doing these experiments on production seems a reasonable thing to do.

A

um It seems like it's the easiest way to see exactly what will happen on production is to is to change it just doing it within the wrist within a risk. Tolerance.

C

Yeah I think that's.

E

C

A

C

Failure thing that we've we've discussed before, but obviously controlling failure at the same time, ego I, don't know if this um is what you were describing in a drug diagram or something different, um but is it? Is it possible or feasible that, like our Auto scaling strategy is dependent on certain hard limits in the system um and I? Guess what I mean by that?

C

It's a common problem right that when you're scaling, you basically move like a bottleneck further down further down the stack and is: is that something that that's possible so that our like the pods in this case, wouldn't kind of grow out of control, because they're aware of, like certain static, hard picked limits that exist in the system.

D

I think it's a really tough one to solve in in such a dynamic environment,.

D

I mean the the question is: what do we do if, if the pods are at 100 utilization and so I guess what one answer to your question is? Yes, we have some controls on how far we scale out so for for the horizontal pod, Auto scalar. We have Min and Max replicas, and so that that place is an upper bound. It's not necessarily in directly informed by what Upstream limits might exist.

D

I mean we can sort of model those Upstream limits and say this uh depends on number of Parts times number of Puma workers per pod, and then we set the limit accordingly. So you know that that's something that we could sort of semi-statically calculate and then set um but I think, even if we were to do that, it I mean it just means that the pots won't grow to a certain like above a certain limit, but once we reach that limit, those pods are still they're still having a bad time right.

C

D

Yeah I guess: there's still one note.

C

That we don't necessarily know the relationship between kind of pots and every other part of the like web, for example, in every other part of the system. Sorry Rachel or.

A

What I was going to say is I. Guess it's the problem with having uh like a limited arrangement with the database, because there's going to be that there is a hard limit, because there's only one writable database and if you scale beyond what that database can handle everything's going to have a bad time, but at some point like at some point that limit gets reached and it's the case. Well, what do you do at the point that that limit is reached?

A

That's it. That's.

E

C

A

I suppose come.

C

Back to like saturation forecast, try which, as I, think like the demonstration, it's like that's the way that we work and it works until it almost doesn't right. When you, you kind of get a change that happens so quickly that the system starts behaving in a different way, which invalidates a forecast that we've made previously.

D

Yeah so I think um one other sort of orthogonal axis here is the the isolation side.

D

um So if we think about sort of uh isolation patterns and patterns for handling overload, the the two that come to my mind are bulkheading and um circuit breakers. So bulkheading is effectively functional, partitioning or sort of partitioning by some kind of failure domain or some kind of group that you want to isolate.

D

um And so, if we think about the database, you know if we had um maybe dedicated PG bounce uh notes or dedicated PG bouncer pools. We can allocate them with fine-grained and say this uh group of PODS can now use up to this many connections and if it tries to use more it'll break but it'll, because it's a shared resource, it will not affect the rest of the consumers. So there's sort of that that kind of overload protection there and then circuit breakers are really for.

D

If we do reach overload, how do we deal with that in a way that we can recover from it? And um and so it's more about detecting that the Upstream system is not able to respond and kind of having the clients back off, and um so that can also help kind of stabilize during situations where that overload does occur.

D

Yeah, that's that's more or less what I wanted to share I! Don't have any real conclusions just yet, but um it's something I've been thinking about and I I do think it's kind of relevant to what we do in in scalability. So yeah I wanted to share that.

A

E

Thanks for taking.

A

A

Just heading back to the uh agenda: I, don't see that anything else is there yet um is anyone? Is there anything else anyone would like to demo or chat about.

A

Alrighty well, thank you so much for joining the call. Thanks for the conversation, um I'll upload, the video and I hope you have a good rest of your day, but.