From YouTube: Redis Sidekiq Scalability Experiment Demo
Description
A quick run-through of the Redis Sidekiq scalability test harness from https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues/956
It is crude, but has given us good numbers.
Hey, this is a quick demo of the work I've been doing with the Redis Sidekiq experiments. The goal of this is to get to a base load where we can see roughly production-level performance out of Redis in the smallest environment we possibly can, and then experiment with the various mitigations that we've proposed, planned, and reasoned out: to see which ones are actually going to have an impact, to confirm that they will, and to see what the scale of that impact is.
The full details are on scalability issue 956; I recommend you go and look at that if you have any questions. I'm going to skim over this here very, very quickly. Let me finally share my screen; I want to share this one here.
Okay, right. Noting a few things: I have three instances, a Redis node, a client node, and a Prometheus node. The Redis binaries, as I told you, are simply binaries copied from production.
Compiling was a pain, it wasn't worth it, and it's actually slightly nicer to be really sure these are the same binaries I've got from production or staging. So I've got 5.0.9, that's the older one; 6.0.10 straight up; and 6.0.10 patched, which is the one with our BRPOP fix that Igor and Matt worked on together.
The configurations are basically from production, very lightly modified, just where necessary: setting directories and a simple password, so I don't have to care about that particularly much. Cool.
So that's the Redis node: we select, with a little bit of shell script, one of 5, 6, or 6-patched, and then a run script picks the matching configuration file and just runs it. So that's done and out of the way, no problems.
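As an illustration only, the selection wrapper could look something like this in Ruby; the binary names and config paths are my assumptions, not the harness's actual layout.

```ruby
#!/usr/bin/env ruby
# Hypothetical version of the selection script: pick one of the three
# production Redis builds and exec it with its matching config. Names
# and paths are illustrative, not the harness's real layout.
BINARIES = {
  "5"         => "bin/redis-server-5",
  "6"         => "bin/redis-server-6",
  "6-patched" => "bin/redis-server-6-patched",
}.freeze

version = ARGV.fetch(0) { abort "usage: run <#{BINARIES.keys.join('|')}>" }
binary  = BINARIES.fetch(version) { abort "unknown version: #{version}" }

# Each build runs the lightly modified production config for that version.
exec(binary, "conf/redis-#{version}.conf")
```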
On the client, this is a little bit more complicated: we have created the worker classes.
So these are not exactly a one-to-one mapping of the code base; it's weird in terms of the classes and namespaces and which queues they get put into. What matters is that we've got a class per queue and that they do something. The something is a sleep of random duration: if we look at the application worker, we've basically got job durations pulled out of Kibana at various percentiles, ten-percent intervals plus p95 and p99, and we randomly choose one of those.
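To make that concrete, here is a minimal sketch of what one of these per-queue workers might look like. The class name, queue name, and duration buckets are illustrative stand-ins, not the real Kibana-derived figures.

```ruby
require "sidekiq"

# Hypothetical duration buckets standing in for the percentiles pulled
# out of Kibana (ten-percent intervals plus p95/p99); these numbers are
# illustrative, not the real production figures.
DURATION_BUCKETS = [0.02, 0.05, 0.1, 0.2, 0.4, 0.8, 1.5, 3.0, 6.0, 15.0].freeze

class PostReceiveWorker
  include Sidekiq::Worker
  sidekiq_options queue: "post_receive" # one class per queue

  def perform
    # "Do something": sleep for a randomly chosen production-shaped duration.
    sleep DURATION_BUCKETS.sample
  end
end
```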
So what this means is that the jobs will be flowing through at a good rate, and it mimics production to some degree. I mean, the bottom ten percent will take only 0.02 of a second, and we've got to be able to throw a lot of jobs through. But this gives us a reasonable layout, a reasonable shape, for those jobs, for how long they actually take. We ignore the last one percent, above p99, because it gets up to something like 1,500 seconds.
Some of these jobs take 10 to 15 minutes to run; that's not interesting at the scale of what we're trying to do. We're looking at the high-throughput stuff; a single Sidekiq worker sitting there doing nothing for 30 minutes is not relevant. So those are the workers, ready to run. We run them with a little Sidekiq wrapper: it's literally Sidekiq itself, configured. We just require all the files and point it at Redis. We must use the semi-reliable fetch.
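Enabling that looks roughly like the documented setup of the gitlab-sidekiq-reliable-fetch gem; treat this as a sketch, with the Redis URL and password as placeholders.

```ruby
require "sidekiq"
require "sidekiq-reliable-fetch" # require path assumed per the gem's docs

Sidekiq.configure_server do |config|
  # Placeholder connection details for the client node's configuration.
  config.redis = { url: "redis://redis-node:6379", password: "..." }

  # Semi-reliable fetch is the BRPOP-based mode used in this experiment.
  config.options[:semi_reliable_fetch] = true
  Sidekiq::ReliableFetch.setup_reliable_fetch!(config)
end
```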
This is interesting: we'll be able to do this again when we get past this to the single queue per shard. We'll be able to test with the fully reliable fetch in this environment and see what it does, because it uses a completely different Redis command.
Yes: RPOPLPUSH instead of BRPOP. So we set that up, and that's the simplest thing.
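In terms of raw Redis commands, the difference between the two fetch modes looks something like this with the Ruby redis client; the queue and list names are illustrative.

```ruby
require "redis"

redis = Redis.new(url: "redis://redis-node:6379") # placeholder URL

# Semi-reliable fetch: block on the queues until a job arrives or the
# timeout expires; the job leaves Redis the moment it is popped.
queue, job = redis.brpop("queue:default", "queue:post_receive", timeout: 5)

# Reliable fetch: atomically move the job onto a per-process working list,
# so a crashed worker's in-flight jobs can be recovered and requeued.
job = redis.rpoplpush("queue:default", "working:worker-1")
```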
Right, and you can run it with this, so we literally just run it. Oh, I've also copied in basically most of the GitLab environment from production.
It's worth noting, just in case you're not fully familiar, that catchall has two fleets of workers, VMs and Kubernetes. The VMs are running the workers that we still think might, or do, need NFS; the Kubernetes ones are the ones we have validated to explicitly not require anything any more. They are two discrete sets of workers for the catchall shard, the one with no resource boundaries. So that runs Sidekiq.
So with this run-multiple-sidekiqs script I can run a whole bunch of them in the background. With all of those running, it looks a little like this: I will run them on the catchall VM shard, 56 of them. I grabbed those numbers literally from our current production counts of running workers across VMs and Kubernetes, pods and processes and all that, so we try to get the same number of worker threads from Sidekiq talking to Redis.
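A minimal sketch of what run-multiple-sidekiqs could look like in Ruby: the 56 matches the catchall VM count mentioned above, while the concurrency, the Kubernetes fleet size, and the paths are assumptions.

```ruby
# Start a fleet of Sidekiq processes in the background, mirroring
# production worker counts. Only the 56 comes from the video; the rest
# of these numbers and paths are placeholders.
FLEETS = {
  "catchall-vm"  => { processes: 56, concurrency: 15 },
  "catchall-k8s" => { processes: 56, concurrency: 15 },
}.freeze

FLEETS.each do |name, fleet|
  fleet[:processes].times do |i|
    pid = spawn("bundle", "exec", "sidekiq",
                "-c", fleet[:concurrency].to_s,
                "-C", "config/sidekiq-#{name}.yml",
                out: "log/#{name}-#{i}.log", err: [:child, :out])
    Process.detach(pid) # leave it running in the background
  end
end
```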
I think this is okay in terms of what we are trying to test here. Every time I think about it, it bounces back and forth in my head, but I think for Redis it doesn't particularly matter for performance whether we are putting work into one queue or into 20 queues.
I think that's just because it's single-threaded and it's only one data structure; there's no locking to worry about. So I think that's okay. I'm going to have to get a little bit further than that to do some of the other experiments we want to do, so we will see what happens then, but I'll be putting that off, because it's not quite the tests we're really interested in.
The generator itself is very crude: basically, I just stuff a whole pile of jobs in, and these numbers here are the production job counts.
So I do that. I've stuck with that and haven't gone back and re-edited it, because it can generate production-like load, and then we just schedule the work in here. In practice we find we actually need to run this generator twice, and that's still barely keeping up; so it's really cramming work in there as fast as it can, two or three times over, to get up to base load.
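As a sketch of the generator's shape, assuming hypothetical worker names and counts rather than the actual production job counts:

```ruby
require "sidekiq"

# Cram jobs into every queue as fast as we can, in production-like
# proportions. Worker names and counts are placeholders.
JOB_COUNTS = {
  "PostReceiveWorker"     => 50_000,
  "PipelineProcessWorker" => 30_000,
  "WebHookWorker"         => 20_000,
}.freeze

JOB_COUNTS.each do |worker_class, count|
  # push_bulk enqueues one batch per round trip, far faster than
  # calling perform_async once per job.
  Array.new(count) { [] }.each_slice(1_000) do |args_batch|
    Sidekiq::Client.push_bulk("class" => worker_class, "args" => args_batch)
  end
end

# Run this script two or three times concurrently to reach base load.
```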
So that's the base load. We also have, for example, the one-queue-per-shard variant, from the one-queue-per-shard experiment. The only difference is this here: instead of simply calling perform on the class, it sets the queue explicitly. It overrides all the other logic that would select the queue and just uses a static name there. And then we have slightly different config files for Sidekiq, where it just listens on that one queue.
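At enqueue time, the difference between the two variants is roughly this, using Sidekiq's standard set API and reusing the hypothetical PostReceiveWorker from the earlier sketch; the queue name is illustrative.

```ruby
# Base-load experiment: each worker class enqueues to its own queue,
# as configured via sidekiq_options on the class.
PostReceiveWorker.perform_async

# One-queue-per-shard experiment: override the queue at enqueue time so
# every job lands on one static queue, regardless of class. The queue
# name "catchall" is illustrative.
PostReceiveWorker.set(queue: "catchall").perform_async
```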
So, to see it in action: Redis is running up here. We want to start the workers, so I will start the workers. We will do the full original load, so we run that there, and I'm tailing all the logs over here. This takes a minute or so to start up; I'll just run top here.
At the moment Ruby is still starting up, doing its thing; you're probably terrified by that, but it's fine. Now we wait for these logs to... there we go, they're all started up. Load goes up, and we will see in here... here we go: this is our base load, gradually rising. Just drop it down. So, our base load. This is also with the BRPOP timeout set to five seconds, which is what we're running in production; two seconds is worse than this. So, very good: we get up to about 30.
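As a back-of-the-envelope check on why the timeout matters, here is the idle-load arithmetic; the thread count is an assumption derived from the fleet sketch above, not a measured figure.

```ruby
# Idle BRPOP arithmetic, purely illustrative.
threads = 56 * 15   # assumed processes * concurrency for one fleet
timeout = 5.0       # seconds; the production BRPOP timeout

# Each idle Sidekiq thread re-issues BRPOP once per timeout interval,
# so the idle command rate scales inversely with the timeout.
puts threads / timeout  # => 168.0 BRPOPs/second at 5s
puts threads / 2.0      # => 420.0 BRPOPs/second at 2s, hence 2s is worse
```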
This might stabilize in another few seconds... yeah, it just rises at the start, when everything's connecting; that must be catching a few jobs, probably from my previous tests. Cool. There you go: that's the startup time, and it stabilizes at about 12, which is consistent with our previous experiments.
Yeah, that's running once, just showing you how long it takes to schedule all those queues: about three seconds, though it varies a bit, three to four seconds. Then we will run two; just trust me that this is a good thing. So I can see all the workers doing their thing; I'll just cancel that for a second. So yeah: 1.374 seconds, quite a few of those like that, and so on and so forth. 761, yeah.
So those are busy, busy doing things, and we can see what's happened over here: two minutes, and there we go. With two generators we can get our Redis up to about... actually, you can see where the first one started, and then where I started the second: about 90, which is actually pretty close to what we're seeing in production right now at our peaks. There's lots of variation here, and this is not entirely accurate, but it's the right order of magnitude, and in fact probably closer than that.
So I'm very happy with this, because now we can run all the other experiments, which you can see all the results of, and see what happens to that number, like dropping it down from 90 to 50, which is awesome. I hope that makes sense. You can get access to this; it's not well set up, but if you hit me up I can drop your SSH key on there, and we can run experiments and do this again in the future, which is awesome.