From YouTube: Scalability team demo - 2021-04-01
Description
See also Craig's demo: https://www.youtube.com/watch?v=NuamleKHRDA
A
So the first item on the agenda for this week is Craig, who has been doing some experiments with Sidekiq Redis. The Redis that we use for Sidekiq has been very close to CPU saturation for a while. Last week Igor, Matt and Bob got us back to, what was it, like 90-odd percent, Bob? From like 99 percent.
B
A
From, from saturated to just not saturated. Yeah, so we still need to make more improvements there. So what Craig's been working on this week is setting up just a couple of VMs with a Redis server and then a client that sort of acts like our application, but doesn't actually do any work in the Sidekiq jobs.
A
It still schedules jobs, and the jobs sleep for an amount of time that roughly matches the distribution of how long our jobs take overall. Then you can treat those like shards, like the shards we have, but with just one box doing all the client work, to try and reproduce what we're seeing in production.
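A minimal sketch of the kind of harness described above, assuming a local Redis reachable through redis-py, with invented queue names and an invented duration distribution standing in for the measured one; this illustrates the idea rather than Craig's actual setup:

    import json
    import random
    import time

    import redis  # redis-py

    r = redis.Redis(host="localhost", port=6379)

    # Invented stand-ins for the Sidekiq shards; the real experiment mirrors production.
    QUEUES = ["catchall", "memory_bound", "urgent_cpu_bound"]

    def sample_duration():
        # Placeholder distribution; the real harness samples from measured job durations.
        return random.expovariate(1 / 0.3)  # mean ~300 ms

    def schedule(n):
        # The "application": enqueue jobs that only carry a sleep duration.
        for _ in range(n):
            job = {"class": "SleepWorker", "args": [sample_duration()]}
            r.lpush(f"queue:{random.choice(QUEUES)}", json.dumps(job))

    def work(queue):
        # The "worker": pop a job and sleep for the requested time instead of doing real work.
        while True:
            item = r.brpop(f"queue:{queue}", timeout=1)
            if item:
                _, payload = item
                time.sleep(json.loads(payload)["args"][0])

    if __name__ == "__main__":
        schedule(1000)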
A
So it's not identical, but it's a good way to test out things that are harder to test in production, or would take more work to even be able to test. So he's recorded a video about that. I haven't watched it yet. I'm not sure it's a great idea to watch it in the demo, because the demo will be going on YouTube and the video is also on YouTube, so it seems much more efficient to watch that separately. Has anybody watched it?
A
So far, no? Okay, so in that case I guess I can jump to his conclusions once I find them.
A
So, basically, we had a couple of speculative ideas, but there were two basic options we figured we had, both of which require a reasonable amount of effort. One is what we were calling Sidekiq zonal clusters; essentially that's sharding, although shard is kind of an overloaded term in the Sidekiq context. It means having multiple independent Redis Sidekiq instances, which should work because Sidekiq doesn't guarantee ordering. We do have some places in the application where we look at...
A
We store things on the Sidekiq Redis which need to be global, but we could equally store those on the persistent Redis, which would remain global under this proposal. So, Sean.
B
One thing that I was thinking of in this context, or maybe we should talk about it later, but if we are going to move those things to a Redis instance, should we move them to a different Redis instance, just to not add on to the pile that's...
A
A
There was an incident from three weeks ago or so; when we say Redis in our services, that means the persistent Redis, and you know, we store a lot in the persistent Redis. We can shard that as well, I guess. So there are a couple of options there.
A
One is to use Redis Cluster, where it handles that for us, but we haven't got any experience of doing that. Another one is to do one manual split now. And obviously, if we're adding Sidekiq stuff here, although we already have some Sidekiq stuff here, because I think this is the...
A
I can't even remember what this one is, Bob, maybe you remember, but yeah. Basically, the idea was we could either split out sessions or the database load balancing stuff from the persistent Redis, and then the next step after that would possibly be to move one of those to Redis Cluster, at which point we can rest a little easier. But yes, Bob, you're right: if we're solving the problem on the Sidekiq Redis by pushing more stuff into the persistent Redis...
C
But as long as we can keep whole functional units on one Redis, it doesn't really matter which Redis they're on.
A
C
B
A
Yes, I mean, if we have to do it twice, we will do, but ideally we would just pick the right place in the first place. As far as I'm aware, most of the stuff where we deliberately add something to the Sidekiq Redis doesn't need to be on there; it's just on there because it's logical, because it's related to Sidekiq. But we don't do a lot of... I was worried we did.
A
We would do more of this, but we don't do a huge amount of poking at Sidekiq's internal state directly, except for a couple of cases which I'll mention in a second. So yeah, sorry, one of the options was what we call zonal clusters. The idea was that we could have a Redis in a Kubernetes zonal cluster and Sidekiq instances in that zonal cluster, and I think API nodes are going into zonal clusters soon. Jarv, is that right? Yeah, they'll be going into zonal clusters soon.
A
Yeah, so we'd have Sidekiq nodes in there as well, all of the Sidekiq shards, so catch-all, memory-bound, etc., and they'd have their own Redis. And while we were talking about that, we realized that Redis doesn't need to be in Kubernetes for that, which probably simplifies the operational work there. So we could do that, because we might never want to put Redis in Kubernetes. Certainly it would probably be further down the line for data stores, I'm guessing. Right, Jarv?
D
A
Sorry, that's what I was trying to get at: there's no need to couple a Sidekiq zonal cluster in with the API nodes, or to Redis being in that zonal cluster. Redis can still be on a VM; it's just a different Redis to the one that we're using now. So we'd have two Redis Sidekiqs on VMs, and some Sidekiq nodes in one zone, or all Sidekiq nodes in one zonal cluster, use that one, and every other Sidekiq node, either on VMs or in Kubernetes, uses the other one.
A
C
It could also mean two different things, because one thing it can mean is to reduce the number of jobs per second, and another thing it can mean is to reduce the number of clients connected to Redis, and that depends a bit on which it is. I think the original zonal cluster proposal would have all queues exist in each cluster, completely separate. So then the number of clients goes down and the number of jobs per second goes down.
C
But if you do a functional split, then all clients... say the post-receive queue is on a separate Redis. Then all clients would still have to connect to the post-receive queue. So then, yes, you've reduced jobs per second on that one, but you haven't reduced the number of clients. Strictly speaking, those are two independent variables.
A
Yes, yeah, that's right. And the assumption we were going for with this was that it would be a full split, so we'd have a full set, everything would be processed in both, and there's obviously an open question there about what we do with the queues that are only processed on VMs because they need NFS, etc., etc.
A
But that's the basic idea, because that's essentially shared-nothing as far as Sidekiq Redis goes, which is nice. So the other option was something we talked about just over a year ago, when we were wrapping up our work on Sidekiq, because we have about 400 queues that we listen to in our application, and Sidekiq recommends no more than a handful of queues.
A
So consider that we have, I think, seven shards, maybe one more if you count the VM catch-all as distinct from the Kubernetes catch-all. What if we had essentially a queue per shard? We could do the same kind of job splitting we do now, where some workers go to one shard and some workers go to another shard, and we can control that without application changes, just through configuration. That's quite a big job, but it is very attractive, and from Craig's results...
A
It will lead to quite a big improvement, because... well, I think Jacob can explain why sticking with a single Redis instance but drastically reducing the number of queues improved things so much. Jacob, do you want to go through that? Sure, I'll stop sharing and you can see.
C
Yeah, it's easier if I share. Let's see, prepare my browser here... there, flame graph.
C
This is a flame graph I captured earlier in the week on the Redis Sidekiq primary. I don't think there was something specific going on there; I just wanted to look in detail at something I noticed in multiple flame graphs, which is that these things, blockingPopGenericCommand and handleClientsBlockedOnKeys, are very big.
C
That's where a lot of the time goes. So what I did then is look up what those things do, for instance blockingPopGenericCommand. If you look... oh, sorry, I tried to get to the tab so that this Zoom bar comes down and I can't click a tab... blocking pop, here we go. So the problem with these is that they start here, right, you see; that's the first instance of the problem.
C
So for 360 queues it looks at whether the keys exist, and if the key exists then it tries to get work off the key and returns early, and if not it goes into this blockForKeys function, which is in here, and again that does a loop over all the keys. So both of these functions... that's roughly the call stack of blockingPopGenericCommand.
C
So this whole chunk is proportional to the number of keys, where keys are queues. So if we go from having 360 queues on catch-all that this loop is going through, down to one, then these things would shrink. And the same argument goes for handleClientsBlockedOnKeys, because that one is also in here.
C
Clients... so that loops over the ready keys, and then, yeah, the interesting one here, sorry, I didn't super prepare this, you can see here that it spends most of its time unblocking clients. So I need to look for unblockClient and blockClient, and you then end up in unblockClientWaitingData, and that one is again iterating over... well, it needs to destroy all of them, see this thing here.
C
The client stores the list of keys that it's blocked on, so that is 360 keys for catch-all, and then it iterates the dictionary, looks all these things up and needs to update them.
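To make the shape of that concrete: every blocking pop from a Sidekiq process names every queue that process listens to, so Redis walks that whole key list on each call, and walks the client's blocked-key list again when unblocking. A rough redis-py illustration, with invented queue names and the counts mentioned above:

    import redis  # redis-py

    r = redis.Redis()

    # A catch-all Sidekiq process today blocks on roughly 360 queues in one BRPOP call.
    many_queues = [f"queue:q{i}" for i in range(360)]
    # Under the single-queue-per-shard idea it would block on just one.
    one_queue = ["queue:default"]

    # Server side, blockingPopGenericCommand checks every listed key before blocking,
    # and handleClientsBlockedOnKeys / unblockClientWaitingData walk the client's
    # blocked-key list again, so the work scales with len(keys) multiplied by the
    # number of blocked clients, even when the queues are empty and no jobs run.
    r.brpop(many_queues, timeout=1)
    r.brpop(one_queue, timeout=1)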
C
A
Okay, so Marin, the basic idea here is that... you don't have to explain it for me, I'll rewatch this recording.
A
The CPU usage on Redis, in the section we're looking at in the flame graph, is proportional to the number of clients, essentially, multiplied by the number of keys that each client is listening to. We have different clients that listen to different numbers of keys, but we have a bunch of catch-all clients that listen to about 350-odd keys.
A
So if we can get that 350 down... well, this is the next part. The single-queue-per-shard work is quite big to get to the end state, but I think we can get most of the benefit by focusing on getting catch-all to a single queue per shard, because, for one, it's the biggest in terms of number of queues; there are 350, it's got a ridiculous config, as Jarv knows, and...
A
We already listen to the default queue, so this is the only shard where we wouldn't actually need to add a queue to listen to. Adding one isn't a huge deal considering we already have 400-plus, but it's a neat thing operationally, because we could start making this config change to say some work starts going to the default queue, and not make any other change to our catch-all nodes: they could continue listening to the same set of queues and they would still pick up that work. And in terms of the total, we'd go from listening to about 420-odd queues across all shards to about 50-odd.
A
So it's basically... you know, it's an order of magnitude just from focusing on that shard. And Craig did an experiment, which I've linked to in the agenda doc, where he tried that option. He tried single queue per shard, and also just regular, what we have now plus catch-all as one queue essentially, and we got about three quarters of the gains.
A
So in his case he saw a 30% drop in CPU usage, compared to 40% if it was one queue per shard everywhere. That seems like the best place to start, in terms of: it's going to be the quickest to implement and it's going to get us most of the gains. That doesn't mean we should stop; once we get to that point, we should then seriously evaluate.
A
A
A
Yeah, and the other nice thing about that is that most of the catch-all queues are boring, and some of them don't even do any work. But not doing any work still contributes to that flame graph Jacob showed, because we still block clients on each of those queues that do nothing. And in the past we have removed some queues from our processing.
A
F
A
B
I looked it up. No, cron job queues are going to be super quiet; there are jobs that run once a week, that kind of stuff.
A
Right, that's a good point. So over a day that I looked at in Prometheus, we had a maximum of about 190 queues processing over any given minute in catch-all, but that doesn't mean they're the same 190; I was just looking at the peak. So, like Bob said, if you have cron jobs that run once a day at different times, you'll have a different cron job in each of those. But yeah.
A
Yeah, exactly. And sorry, Marin, did you want to ask your question?
F
A
F
The usage one doesn't do the work, because we run this manually in production, in the database console, or the Rails console rather. So we have a developer who is dedicated, every week, to going in there, running a script, watching it run for two days, I think, and then seeing if something fails. So that's...
C
A
A
The other case is that there's a worker that we accidentally don't do anything with anymore, but we forgot to delete it. That's technical debt; it shouldn't cause us a performance issue like it is at the moment. So we would like to relegate that back to technical debt status, instead of technical debt plus contributing to Sidekiq-related CPU saturation.
A
A
I haven't written it up, but there are a couple of complications, and I'd just like to talk through those quickly, because I'm sure there will be more, but I'd like to talk through the ones I know about. One that I'm hopeful is mostly tedious work, rather than anything particularly complicated, is that our monitoring is by queue, not by worker. Our dashboards are by queue, not by worker, but we do have the worker label on our metrics.
A
So for most of those we can switch over to worker: we switch the recording rules to worker, and the dashboards and the alerts to worker instead of queue. Because if everything's in the default queue, and whoever is on-call gets an alert about the default queue, that tells them basically nothing, because there will be 350 workers sharing that queue. So that's not going to be helpful to anybody.
A
The one thing we can't get easily with that is queue depth, because at the moment we can say the queue for merges is 10 deep and the queue for post-receive is five deep, we get that from GitLab exporter anyway, but not if both of those are in the same queue. So I guess we could inspect the queue and see what the depth is for each worker, but I don't know, I haven't really thought that one through yet.
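One way per-worker depth could be recovered from a shared queue is to scan the queue and bucket payloads by worker class. A rough sketch of that idea, assuming Sidekiq-style JSON payloads with a "class" field and a local Redis; the queue name and batch size are invented, and this is only an illustration, not an agreed design:

    import json
    from collections import Counter

    import redis  # redis-py

    r = redis.Redis()

    def depth_by_worker(queue="queue:default", batch=1000):
        # Approximate per-worker depth by walking the list in chunks.
        # The queue keeps moving while we scan, so this is only an estimate,
        # and LRANGE over a huge list is itself O(N) work for Redis.
        counts = Counter()
        total = r.llen(queue)
        for start in range(0, total, batch):
            for raw in r.lrange(queue, start, start + batch - 1):
                counts[json.loads(raw).get("class", "unknown")] += 1
        return counts

    if __name__ == "__main__":
        for worker, depth in depth_by_worker().most_common():
            print(f"{worker}: {depth}")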
C
B
C
The observability story... it has implications for observability, and that is, yeah, one of the areas where most of the work would be.
B
C
B
F
A
And so, like I said, some of these observability things... anything that's exported by the application has both queue and worker, so we can mostly switch those over to workers. They can tell us what rate a worker is processing at, what rate we are queueing jobs for a worker, what the error rate is, etcetera. We can map those all over fairly easily.
A
What we do need to look at is any other metrics we get, which are probably from GitLab exporter, which is probably using the Sidekiq API, which is based on queue, not on worker. It will still be useful to know what the depth of the default queue is, and I'm not saying we should replace those, I'm saying we should add, just to be clear, and also because we'll have the separate queues for the other shards.
C
So what Bob just touched on is that there is a conceptual downside to having fewer queues, which is that by having more queues we can, more or less, have fairness guarantees.
A
Yes, it's like passing it on the command line: you tell a Sidekiq process, listen to these queues. So that's the natural unit, and I think it was probably a reasonable decision when we did it, and it's also easier to add queues than to remove queues. So that explains why we've not done this before.
A
You know, it's harder to go in the opposite direction, and that's why I'm using default as kind of a trick: because we already have the default queue, we don't have to remove queues. We just have to shift work to default, and then, when we're confident that's working, we can stop listening to the other queues.
A
So we can do it in a safer way, because obviously any major change to our Sidekiq config is risky, we could lose jobs, and the same applies to the zonal clusters work, of course. Anything else?
D
Yeah, just one comment about that: we should probably prioritize the known queues in catch-all that don't rely on NFS to begin with.
B
D
To do, yeah. And then the other thing, I'll just echo the concerns about increasing the blast radius for failures. Right now our queues are fairly isolated; if there's a problem, say a queue depends on an external dependency and it starts to back up, it's limited to that feature. And even in catch-all I think we have maybe some external dependencies.
D
D
A
Sorry, I completely missed a part at the top of the doc, and I'll just explain that, because I don't think it's going to completely answer your question, but I think it is going to partially answer it. The idea here is that there is a configuration that lets you move which queue a worker puts its jobs in. Each worker has its own queue name, and the idea is that the configuration would let you say to that worker: don't put it in your own queue, put it in the default queue.
A
You know, we have the option to go back to having a dedicated queue for... I don't know, say we put authorized projects in default and then we decided, oh no, this is a terrible idea; we can put it back into its own queue under this model. We're not actually deleting queues at this point, I guess; we are just shifting. Yeah.
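The routing described here amounts to an enqueue-time lookup: an ordered list of rules mapping workers to a target queue, with the worker's own queue as the fallback, so a config change alone can send work to default or back again. A hedged sketch of that shape, where the rule format, worker fields and names are all invented for illustration:

    # Ordered (matcher, target_queue) rules; first match wins.
    # A target of None means "keep the worker's own queue".
    ROUTING_RULES = [
        (lambda w: w["name"] == "AuthorizedProjectsWorker", None),  # pin this worker to its own queue
        (lambda w: w["shard"] == "catchall", "default"),            # other catch-all workers go to default
    ]

    def queue_for(worker):
        # Decide, at enqueue time, which concrete queue this worker's jobs go to.
        for matches, target in ROUTING_RULES:
            if matches(worker):
                return target or worker["queue"]
        return worker["queue"]

    # A catch-all worker now enqueues to "default"; removing that rule
    # (a config change, not a code change) sends it back to its own queue.
    print(queue_for({"name": "SomeWorker", "queue": "some_worker", "shard": "catchall"}))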
C
One of the questions I have here, and Jarv, you maybe know more about that, is how often do we... So the current system is very flexible: a lot of stuff can be in the queues and you only make the choice of where it goes in the server config. So we have a lot of flexibility, but I don't know if we use it. And if you start sorting jobs to different shards before they go into their respective queues...
C
It's much harder to reorganize those queues, because you'd have to, I don't know, run some Rails console script; I think we sometimes do that, right, where we pop through a queue and we toss out jobs or something. But it's about deciding early, deciding eagerly. The current system is lazy, but that concentrates a lot of work on Redis; if we have an eager system, then it's more efficient, but you're also making these choices earlier.
D
It's not something we do often, and I think we basically rely on having lots of these queues and isolating workloads based off of queue. If we change that model, then we would definitely want to improve our management there.
C
So in the example of authorized projects, where it creates a crazy amount of jobs: if we sort jobs early, like Sean said, it should still be a tweakable model. I guess we still keep the selectors, so we could deploy a config change, maybe we would even be able to instantiate a new shard or just a separate queue, where we say, okay, this stuff is a problem right now, it goes in the other queue. But, of course, that only changes jobs that are submitted after the config change.
C
It doesn't change a problem where jobs are already in the queue, like when we have a hundred thousand jobs in the queue and we need to do something about them. So does that matter practically, or is that not something... Sorry, I'm restarting my question. I'm just double-checking how we use this flexibility.
C
C
Exactly. And certain things can be reproduced: in the current model, we could deploy a config change and say, stop processing authorized projects because everybody else is starving for CPU time. But that is a very clunky way of making a change, and I don't know if we even do that, or if we should be doing things like that.
C
But even if we have a model where we're sorting early, we could still do these sorts of things with a server-side middleware, where a job comes in and we have some sort of dynamic config that says: oh, this is the authorized projects worker, I'm going to re-queue this on a separate queue, like an overflow shard or something, and I'm going to redirect this, because this should not bother anybody else. So, yes.
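A sketch of what that kind of server-side diversion could look like: a check consulted before running each job, backed by a dynamic list of worker classes to divert. Everything here, the queue name, the set of workers and the function shape, is invented for illustration rather than an existing middleware:

    import json

    import redis  # redis-py

    r = redis.Redis()

    # Dynamic config: worker classes that should be diverted right now.
    OVERFLOW_WORKERS = {"AuthorizedProjectsWorker"}

    def maybe_divert(raw_job, overflow_queue="queue:overflow"):
        # Called server-side before executing a job: if the worker is on the
        # overflow list, push the payload to the overflow queue and skip it here.
        job = json.loads(raw_job)
        if job.get("class") in OVERFLOW_WORKERS:
            r.lpush(overflow_queue, raw_job)
            return True   # diverted; the caller should not run this job
        return False      # not diverted; run the job normally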
A
There's a small delay in processing for jobs that need to be re-queued, but obviously, if you're making jobs be re-queued, then you probably want a delay.
F
F
F
F
That would be something, a change not only of configuration but a change of architecture, of how we do these things. So, like, is this...
F
F
C
B
Jobs per second is where people are... So Mike mentioned, in the issue that got created when we were discussing the timeout thing, something like 25,000 jobs a second. In an issue, so...
A
So my initial instinct there, Marin, is that if we can make the horizontal sharding story work, then that is probably the next step after we've reduced the number of queues to whatever level we determine. But I did use that magic word "if" at the start of that. And there are some cases, like the admin APIs we have, like the runbooks we have for deleting jobs from a queue, that don't work.
A
Yes, because then everybody has to know about all of them. The nice thing, but also the trap, with the other split is that you can say an application instance only needs to know about one Redis Sidekiq; but then it only knows about one Redis Sidekiq, so it can't see what's going on on the other instances. But hopefully we can think about these questions over the time that we work on this as well.
C
Craig said the same as you, Marin, by the way, at the bottom of his comment.
B
C
This is his conclusion: reducing the number of queues seems like the biggest win, but that's a one-time thing we can do, and we need to think longer term about what we can do about growth.
A
A
A
I also think some of the answers to this question will be affected by the other work that's happening now. I know we're looking at not relying on a single global database if we can, which is its own can of worms, but we probably don't want a completely different model for how we manage our database to how we manage our jobs. Well, we probably do want different models, but what I'm saying is, if the database ends up being sharded by... it's likely...
C
A
Then we would probably do the same for Sidekiq, or at least look to work with that. Which, you know, is a very, very vague answer, but...
F
A
Because, and I think this is another advantage of doing this first, however we shard in future, reducing the number of queues will increase our headroom on whatever shards we end up with, because we won't be listening to a bunch of queues. So we'll have a bunch more space available to us. And also, this one's harder to do if we have started sharding already, because...
A
D
From my perspective, I would prefer to hold off on sharding Redis by cluster, just because my main concern with that activity is our inability to drain an entire cluster, which is something we take advantage of now.
D
If we don't have horizontal scaling for Redis, then when you offset the load, like if you drain a cluster, that's going to put pressure on the other Redises and could cause a problem. So I would much prefer we focus on reducing the number of queues, if we think we can get some CPU savings there.
D
I'd rather do that first, personally, just because, especially with the migrations that we're doing now, we're still kind of in the early phases of feeling out Kubernetes, and we've used draining clusters quite a bit since we've done the migration. This will probably stabilize over time, though.
D
A
F
D
I think the problem is also, even if it's in separate clusters, you would still have a one-to-one relationship between cluster and...
F
F
The purposes are mixed within that cluster, so that you can... yeah, I see what you mean. Okay.
A
Would this be a fair way of framing part of the concern there: if we have two zonal clusters, then draining one of them is a big problem. If we have 10, then it's probably not as big a deal, because the other nine can pick up the rest of that. So if we're at two, zonal clusters don't buy us as much as we'd hope.
D
Yeah, I think if we had ten zonal clusters, we probably would be able to absorb the additional load, but I don't know if that would happen, right?
D
A
D
Right, and we have three right now. We could conceivably increase to five, maybe, but I wouldn't imagine going beyond that, though. Yeah.
A
C
Oh, I wanted to briefly mention that the pack-objects cache is very close to being enabled, but it's basically waiting for some Omnibus code to reach production, and the deploy failed two days in a row. So we're hoping; every day I'm now looking to see if it reaches production, and Matt Smiley is poised to make the change.
C
C
And one thing that was interesting, slightly scary for me: I happened to see an incident yesterday about Apdex SLO violations on file-04, which is not a server I often hear about, but I just peeked over people's shoulders, like, oh, that looks, that sounds, related. And I was just looking at the feature flag, so I could see the cache keys for the pack-objects things it was looking up, and there were crazy numbers of repetitions.
C
So everything looked, again, like this is the classical CI problem. It would have been really fun to turn it on in the middle of that, if I'm right that that was it, but yeah, they did something else to make it stop. But it's just a reminder that it's not a problem that's unique to one area.
F
A
Yeah, so just quickly: Bob mentioned a really good complication yesterday, which is that one nice thing about the queue-per-worker model is that with a mixed deployment like we have now, canary can be scheduling jobs into queues that Sidekiq doesn't know about yet, for workers that Sidekiq doesn't know about yet, because they're in their own queue and Sidekiq won't be listening to that queue.
A
That's fine: if you schedule some jobs from canary that don't exist in production, they just don't get processed until production Sidekiq is updated. But in the new model, if they go to the default queue, they will fail, because we can't find the constant for the worker, because it's not been deployed yet. So I think there are a couple of options there. One would be to have a very specific config that we update every time, to not move those workers to the default queue until they're ready. I don't like that.
A
Another is to have some kind of custom retry. Ideally we would just use Sidekiq retries for this, but a while ago we set our default retries to three, which happen in about two minutes; Sidekiq defaults to 25 retries that happen over the course of a couple of weeks.
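For a sense of why three retries burn out in minutes while Sidekiq's default 25 stretch over weeks: Sidekiq's documented backoff grows roughly as the fourth power of the retry count plus a small constant and some jitter. A rough calculation that ignores the jitter, so treat the exact numbers as approximate:

    def total_retry_window(retries):
        # Sidekiq's backoff is approximately (retry_count ** 4) + 15 seconds, plus jitter.
        return sum((count ** 4) + 15 for count in range(retries))

    print(f"3 retries : ~{total_retry_window(3)} seconds")           # on the order of a minute or two
    print(f"25 retries: ~{total_retry_window(25) / 86400:.0f} days")  # about 20 days, roughly three weeks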
A
B
F
A
C
Then it should get more retries, yeah.
A
A
Basically, I don't think the retries are anywhere near the top of our priority list in general; it's just a bit frustrating that they're limiting us in this particular case.
A
I think those were the only two complications, or the only three complications, I guess, with the one that Jacob mentioned about the workload shifting, that I wanted to talk about on the call. So: the metrics and observability, the workload shifting, and how this works with our deployments. Did anybody else...?
B
B
It kind of ties into the metrics and observability, but sometimes we... like the workers I brought up: we have this self-throttling thing that workers do, where they check their own queue size. Yes, so we will have to re-implement that using something that can be shared, that can be, say, shared state. But you know what I mean.
A
B
Yeah, and I'm not just talking about the limited-capacity things that I was working on with the CI and Package teams, but also global search does something like that, yeah. I linked the workers that I...
B
A
So for global search we can kind of punt on that, because we're not proposing to rename their queues to default just yet. But it's definitely worth thinking through, so thanks for bringing that up, because there were a couple that you mentioned that would go into default, and we would need to handle that somehow.
C
I just want to emphasize: it's really nice that we can try this all out while only focusing on part of the queues, even with this lopsided thing where default is so huge, because we'll discover so many problems that we don't know about yet, and learn how to deal with them, and have some flexibility in dealing with them, or not.
B
Yeah, one question that I had in the doc was, as you mentioned, I do like the idea that we can just move some stuff over to default and then remove some queues already. But I'm wondering, if we're going to be planning for that, is this a configuration thing, or is it going to be a hard-coded thing, like our workers opting into it?
A
Initially, everything will continue to use its own queue, and we will allow you to opt in through configuration, saying workers matching this queue selector, or this worker selector, I guess it would be now, or, yeah, virtual queue selector, whatever we call it, go to this concrete queue instead.
A
C
It's important, I think, to do this whole thing in a way where it's configurable and we don't change everybody else's config while we figure out what works well. Yeah.
A
A
Exactly, because migrating for us is basically just going to mean: listen to those queues until they stop processing, at least initially, then stop listening to them, and maybe do something special for scheduled jobs. That's kind of it. Whereas if we need to migrate it for self-managed users, then we actually need to put in the work there. But if we're specifically focused on gitlab.com, and specifically on the catch-all shard, then we don't need to do that. So, yeah.
C
A
A
All right, if nobody has anything else, I'm going to wrap this up, and then I'm going to watch Craig's video about his demo, his experiments, because that's super interesting as well. And yeah, have a good day.