From YouTube: 2022-12-01 Scalability Team demo
A: I wanted to talk a little bit about some work that we've been doing that led to this graph. This graph is the Tamland forecast for redis-sidekiq, the primary's CPU, and we have a very nice forecast, which is actually a different issue. I keep wondering when Tamland will catch up to the fact that it's stable at 40, but it still doesn't.
A: But what I want to talk about is how we got here and what happened, and this was kind of bottom-up. There were a couple of optimizations we found, and we just started doing them, and then at some point I realized I should make this an epic, so there's actually an epic for it, but it's just to put those two issues together. I think there are some interesting lessons here, or I learned some interesting lessons from this, and I want to share those.
A: The two issues in the epic are "do something about sidekiq-cron" and "do something about duplicate jobs", because they both use Redis a lot. This stood out from looking at the Redis command duration counters, which show time spent in the main thread of Redis, in the request handlers. For redis-sidekiq, those were showing that the top command was EXEC.
A: That's this one, and then the green one is HGETALL. When I saw that, I thought: that's funny, because neither of them is about pushing jobs, and redis-sidekiq is for pushing jobs to Sidekiq processes. So that looked funny, and I think Igor ran a traffic capture to figure out what the HGETALLs were, and that turned out to be the fault of sidekiq-cron. Then Alejandro worked on changing how sidekiq-cron does its polling. Sidekiq-cron runs cron jobs, and every Sidekiq process periodically checks the list of all cron jobs to see if they need to run, and that code was written in a silly way where we would poll way too often.
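To make that concrete, the shape of the problem looks roughly like this; a minimal sketch in Python with redis-py, where the hash key, interval, and helper are invented stand-ins rather than sidekiq-cron's actual schema:

    import time
    import redis

    r = redis.Redis()

    POLL_INTERVAL = 1  # seconds; polling this aggressively from every process is the bug

    def maybe_enqueue(name, spec):
        pass  # hypothetical helper: parse the cron spec and enqueue the job if it's due

    def poll_cron_jobs():
        # Every Sidekiq process wakes up, fetches the ENTIRE schedule hash
        # (one HGETALL per poll), and checks whether any job needs to run.
        while True:
            schedule = r.hgetall("cron_jobs")  # illustrative key name
            for name, spec in schedule.items():
                maybe_enqueue(name, spec)
            time.sleep(POLL_INTERVAL)

With N processes, that is N / POLL_INTERVAL HGETALLs per second against one Redis, each serializing the whole hash on the main thread, so polling less often (or adding jitter) shrinks the load directly.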
A: The actual rate of these HGETALL commands was very low, but they were quite expensive, and they happened too often. So when you see this drop, that is where the HGETALL commands went away.
Then the other thing: the EXEC calls were caused by the way the duplicate jobs middleware is using Redis. We had a couple of false starts with feature flags or making changes, but we iterated a couple of times, and from this point on that workload is gone.
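For context on why EXEC shows up at all: dedup middleware of this shape typically does its Redis work inside a MULTI/EXEC transaction, so every single job push pays for one EXEC. A minimal sketch of the idea, with invented key names and TTL rather than the actual middleware's schema:

    import redis

    r = redis.Redis()

    def push_unless_duplicate(job_id: str, ttl: int = 300) -> bool:
        # One MULTI/EXEC round trip per job push: claim a dedup key with
        # SET NX so concurrent pushes of the same job collapse into one.
        with r.pipeline(transaction=True) as pipe:  # wraps commands in MULTI ... EXEC
            pipe.set(f"dedup:{job_id}", 1, nx=True, ex=ttl)
            claimed, = pipe.execute()
        return bool(claimed)  # False means an identical job is already queued

    print(push_unless_duplicate("project-42:refresh"))  # True the first time
    print(push_unless_duplicate("project-42:refresh"))  # False: duplicate suppressed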
A: Some of you knew this was going on at the time, so it's not news. What I found interesting here is that of the two, the duplicate jobs change seems to have had a bigger impact. If we look at the CPU graph: October 25th, roughly, that's about here, is when the HGETALL change happened.
A: So here we were peaking at 60, and here we started peaking at 55 or something, and then duplicate jobs went away and we went to about 40. So this was roughly a five percentage point drop, and this was roughly a 15 percentage point drop, so this drop was bigger, even though, looking at the graphs, you wouldn't expect one to be three times bigger than the other.
A: I actually think the answer is different, because here I have a graph of the request rates, and here you can see that on October 25th there's a drop in the request rate because of the HGETALL change, but the drop because of duplicate jobs is much bigger.
A: Because here we see that the total Redis request rate is peaking at around 20, and here it was around 40. Exactly: because every job does this, that's why there were so many requests.
So even though these graphs were about the same height (this is the duration again), the green graph was caused by a relatively small number of requests that took a lot of time, computation time, on the main thread; but because the number of requests was so low, the per-request overhead was also low. And these were not necessarily that computationally expensive, but there were just a whole lot of them, and they were individual requests.
C: One more thing I think we could look at is the CPU utilization on the main thread, broken out by mode, so that we can see user and system time and see if we see that corresponding drop in system time.
A: So, this... Igor knows this very well. What is weird about this is that most of it is I/O: this is write I/O, this is read I/O, and the actual computational workload, what you think Redis is doing, or what Redis is doing in user space, is this small part. The graph I have here, which shows the durations, is only measuring in this block here, but it's blind to what's happening here and here, and what is happening here and here is proportional to the number of requests we do. So another way of explaining this is to say that, because the duplicate jobs workload also did a lot of requests, it accounted for a large part of this and this, so it shrunk this tower and this tower.
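A toy model of that accounting, with invented numbers: the duration counter only sees handler time, while the read and write I/O on the main thread scales with request count, so a many-small-requests workload can dominate the real CPU cost while looking identical in the duration metric.

    # Toy model (all numbers invented): main-thread cost per request is
    # read-I/O + handler + write-I/O, but the duration counter only
    # measures the handler part.
    IO_OVERHEAD_US = 10  # per-request parse/read + reply/write cost, microseconds

    workloads = {
        # (requests per second, handler microseconds per request)
        "hgetall (few, expensive)": (500, 400),
        "exec (many, cheap)": (20000, 10),
    }

    for name, (rps, handler_us) in workloads.items():
        measured = rps * handler_us                  # what the duration counter sees
        total = rps * (handler_us + IO_OVERHEAD_US)  # what the CPU actually pays
        print(f"{name}: metric sees {measured / 1e6:.2f}s/s, actual {total / 1e6:.2f}s/s")

Both workloads show about 0.20s of measured duration per second, but the many-small-requests one costs roughly twice as much real main-thread time.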
A: Thank you. For me it was interesting: looking back it seems obvious, but it was an interesting reminder of the different effects, and also of how to choose what to work on.
B: Do you know about when we introduced the, the cookie... no, not the cookie as such, but the separate WAL locations, where we store WAL locations next to the duplicate jobs key? And do you know how come we didn't notice that at the time?
A: Okay, I need to fill in the rest of what we're talking about here. The duplicate jobs middleware was originally written to set just one or two keys, and then at some point another team worked on database load balancing for Sidekiq, so now every job carries WAL locations in its metadata.
A: The idea is to send as many Postgres queries to secondaries as possible, and for that you need to know, in the context of a job or a request, how up-to-date your Postgres has to be.
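A hedged sketch of that mechanic in Python; the payload shape and function names are illustrative, not GitLab's actual load-balancing code. The job records the primary's WAL position at enqueue time, and a replica is eligible only once it has replayed past that position:

    def parse_lsn(lsn: str) -> int:
        # Postgres LSNs like "0/16B3748" are two 32-bit hex halves.
        hi, lo = lsn.split("/")
        return (int(hi, 16) << 32) | int(lo, 16)

    def can_use_replica(enqueue_lsn: str, replica_replay_lsn: str) -> bool:
        # A secondary may serve the job's reads only once it has replayed
        # the WAL at least up to the position captured when the job was pushed.
        return parse_lsn(replica_replay_lsn) >= parse_lsn(enqueue_lsn)

    # Illustrative job payload: WAL location captured at enqueue time.
    job = {"class": "SomeWorker", "wal_locations": {"main": "0/16B3748"}}

    print(can_use_replica(job["wal_locations"]["main"], "0/16B3800"))  # True: caught up
    print(can_use_replica(job["wal_locations"]["main"], "0/16B3000"))  # False: use primary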
So this was added later to the duplicate jobs middleware, and this was where most of the extra requests came from. I don't remember when the code was added, but another funny thing here is that the load balancing was actually not being used, because there was a bug in how the data got consumed.
A: So we still paid the overhead of the load balancing code in the duplicate jobs middleware, but we were not using it. We only found the bug that caused load balancing not to work because Bob remembered it and pointed me at it, and I happened to see it because I ran into more or less the same bug when working on this project.
A: I don't know when that code got added, and I have one graph up with a one-year view. Another thing you can see here is that for duplicate jobs we used shared state, and that was a bad idea. The funny thing is actually that here we were not doing duplicate jobs, and here we are doing duplicate jobs, so optimizing them was worth it. But where in this period the load balancing got added, I don't know; I don't see an obvious jump. It might have been longer ago. Yeah.
B: The reason I ask is that ideally we'd respond to things like that quicker. Normally we would have seen that because capacity planning work pointed it out, or something else.
A: I don't know; how long have we had capacity planning like this?
A: Yeah, we can probably... I could look into the commit history for when the database load balancing got added to the middleware.
A: Okay, thanks everyone. I was looking forward to sharing this.
B: Do you want to touch on your last question: should we remove the plumbing that we did? Because we already tried to move this workload once; we moved it to shared state, and that was a bad idea, because shared state was already more saturated...
A: I can illustrate that for you, Bob, if I share again. If we go to Tamland: this is where we moved duplicate jobs to shared state, and we made plumbing for that, to be able to move the workload. You can see this is the second half of July, and then if we go to redis-persistent, we're here, and that was bad. I wonder if this is because we did duplicate writes, because it starts earlier.
B: Did we decide that we're not going to do this? Because it was my understanding, even from what you just said, that we would expect redis-sidekiq to only do Sidekiq things, like the things that come from the Sidekiq gem, yeah, on the Sidekiq Redis instance.
A: Because duplicate jobs would have to work correctly across Sidekiq clusters, so they couldn't be...
C: Yeah, so I've been working on the redis-cluster cookbook this last week, and we have a first working setup on pre, so I just wanted to give a quick demo of that.
C: Here we go. The naming scheme that we've gone with, and we can change this if needed, is redis-cluster-ratelimiting: this is the type, the name of the service so to speak, and then shard-01, and that goes 01 through 03. Each shard has three nodes inside it, so there are sort of three zones per shard. This is running Redis 7, a custom build of Redis 7, and we've implemented a gitlab-redis-cli helper script that finds the right secret, so that you don't always have to remember this stuff; so we can do a PING.
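Outside the helper script, the same smoke test in plain redis-py would look roughly like this; the host name is an invented placeholder following the naming scheme above:

    from redis.cluster import RedisCluster

    # Invented placeholder host following the naming scheme above.
    rc = RedisCluster(
        host="redis-cluster-ratelimiting-shard-01-001.example.internal",
        port=6379,
    )
    print(rc.ping())  # True if the cluster answers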
C: One new addition is that this actually supports multiple users; this was added in Redis 6. So we can say ACL WHOAMI: this is the console user. And I think there's ACL LIST, which, I'm not sure, if this just leaked our passwords, maybe?
C: Yes, we can rotate it if needed, but yeah, in any case, I'll rotate it after this call.
C: So yeah, we've got a console user, a redis-exporter user, and a replica user that's used by the other Redis nodes to talk to this Redis node. So we get a little more fine-grained access control, but what I actually think is more useful than that is, potentially, the ability to rotate these passwords, which is currently pretty difficult to do, and also attribution: I don't know if all log lines include this information, but the hope is that if something weird is going on, we may be able to tie that back to a user.
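The ACL features mentioned here, approximated in redis-py; the user name mirrors the demo, and the rotation calls are a sketch of what Redis 6 ACLs allow rather than the team's tooling:

    import redis

    r = redis.Redis()

    print(r.acl_whoami())  # e.g. b"console": the user this connection authenticated as
    print(r.acl_list())    # one rule line per user; passwords show up as SHA-256 hashes

    # Rotation sketch: add a new password alongside the old one, roll the
    # clients over, then remove the old password.
    r.acl_setuser("console", enabled=True, passwords=["+new-secret"])
    r.acl_setuser("console", enabled=True, passwords=["-old-secret"])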
C: So then we can run some Redis Cluster specific commands. Here's --cluster info: you need to give it a node, a Redis Cluster node, so I'm just going to give it the same one; I could also put localhost here if I wanted to. The display here is a little weird: in some cases it uses IP addresses, and in some cases it uses host names.
C: This is just redis-cli being picky about when to show host names and when not, but behind the scenes it's using host names for all of this stuff.
C: Yeah, and so we can see there's a single key so far. Let me try and remember what the command was... I think it was redis-cluster... well, actually we can call --cluster help. This should tell us: there's a call command, and we can give it host and port and then a command followed by arguments.
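For reference, rough redis-py equivalents of those redis-cli --cluster subcommands, assuming a reachable cluster node; the CLI's output formatting differs:

    from redis.cluster import RedisCluster

    rc = RedisCluster(host="localhost", port=6379)

    # redis-cli --cluster info <host:port>  ~  CLUSTER INFO against one node
    info = rc.cluster_info()
    print(info["cluster_state"], info["cluster_known_nodes"])

    # redis-cli --cluster call <host:port> PING  ~  run a command on every node
    print(rc.ping(target_nodes=RedisCluster.ALL_NODES))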
C: I'm not sure I want to do that, because then I'm going to have to remove it, and removing...
C: Yes, that seems a little scary, yeah.
C: Yeah, a to-do for me: figure out passing arguments to this correctly. I mean, I did quote it here, so there is some additional thing that needs to be fixed.
C: Yeah, I'm going to fix it later, but in any case: I can do a GET on a key that I know exists, and we can see that most of the hosts responded with MOVED, then the slot, and then the owner of the slot, and one of the hosts actually had that key and gave it back to us. And then we can say INCR to increment the request count.
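That MOVED response is what any cluster-unaware client sees when it asks the wrong node, and what a cluster-aware client follows transparently; a small sketch, with the key name echoing the demo:

    import redis
    from redis.cluster import RedisCluster

    # Cluster-unaware client pointed at one node: if the key's hash slot
    # lives elsewhere, Redis answers "MOVED <slot> <host:port>", which
    # redis.Redis surfaces as a ResponseError.
    node = redis.Redis(host="localhost", port=6379)
    try:
        node.get("request-count")
    except redis.exceptions.ResponseError as err:
        print(err)  # e.g. MOVED 12539 10.0.0.3:6379

    # Cluster-aware client: follows the redirect transparently.
    rc = RedisCluster(host="localhost", port=6379)
    rc.incr("request-count")
    print(rc.get("request-count"))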
C: And we can sort of see that counting up, and yeah, that's pretty much it; that's kind of where we're at. I guess the next step that we're working on right now is getting the clients configured to actually talk to this instance, so that's going to be Redis... sorry, GitLab Helm chart updates.
B: How do you want to see observability, dashboards and stuff, on a Redis Cluster?
B: Because I got bogged down on the name: so you've got the cluster name in the beginning, and then shard one, two, three... yes, or zero to two, I don't remember.
B: So I'm wondering: should those be called shards or not? Because there's also an issue where we want to combine all of our Redis instances, and each thing that we currently call a Redis instance would be a shard. So we'd have the service "redis", and the shard is rate-limiting, Sidekiq, cluster, yeah.
B: The alternative I mentioned while you were talking is using redis-cluster as an alternative service, with each shard in there, and then that dashboard needs to be different, so it accommodates the different shards like that.
C: Right, but then we need an additional deployment label or something like that. Yeah, I think the easiest thing to do right now, and most consistent with what we're doing, is redis-cluster-ratelimiting as the type, the service name, and then repurpose the shard label with a numeric shard. That's kind of what I'm proposing at this point.
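Spelled out as hypothetical metric label sets (not taken from either dashboard), the proposal is roughly:

    # Today: each Sentinel-era Redis "instance" is its own type/service.
    current = {"type": "redis-ratelimiting", "shard": "default"}

    # Proposed: one service per Redis Cluster; the shard label becomes numeric.
    proposed = [
        {"type": "redis-cluster-ratelimiting", "shard": "01"},
        {"type": "redis-cluster-ratelimiting", "shard": "02"},
        {"type": "redis-cluster-ratelimiting", "shard": "03"},
    ]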
A: Is it possible to... I don't know how to fix this, but the way the presentation suggests it, these are three separate things that happen to be collaborating in a Redis Cluster. Although, from what I know, within Redis Cluster it's actually nine things, and if you let Redis Cluster do its thing, they would move all over the place, so you're using a config setting to prevent that.
A: But if we think of them as three separate things, we could also have three separate dashboards, and then they would sort of look like Sentinel deployments.
C: So what I would like to see, and maybe Bob is going to yell at me because it's hard to do, is to have a shard selector on the...
A: I wonder if we can do both. So what Bob means by shards is the Redis Sentinel deployments we have now, and what Igor means by shards is the individual masters within this one Redis Cluster deployment. But if we say that we treat these individual masters and their replicas like individual things, and we just forget for a moment that they're joined together in a Redis Cluster, then you're both talking about the same thing again.
B: This conversation is hard, but I think I agree with Igor for now. We have some things that are already correctly using shard, and I'm thinking about Sidekiq: that's a deployment, and each deployment is a different shard, and right now the only way we can see the different shards is because we've adjusted the Sidekiq dashboard to include some panels and because we link to different dashboards. But it would be way handier if every Sidekiq shard had the same kind of SLIs, which is like throughput of jobs and so on.
B: But that's what the shard is in Sidekiq. And now what Igor is proposing, and I'm kind of on board because that's where we are now, is adding a new redis-cluster-ratelimiting service and using the three groups of three Redis servers below that as shards. So every Redis Cluster service would have three shards.
B: At least three, yeah. At least three shards, got it. And for that you're going to add the shard label on top of the service dashboards, and I would reuse that for Sidekiq. So go ahead, go for it. Okay.