From YouTube: Scalability - Redis BigKeys analysis demo
Description
A quick view of recent work by the Scalability team in viewing/analyzing the biggest keys in our persistent redis, giving the background to the problem, showing the tooling, and some initial interesting findings.
Issue: https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues/360
Hi, welcome to a demo of the big keys analysis that I've been doing. To recap: we don't know for certain what we're storing in Redis. We can reason from the code, as much as you can reason anything about a large code base, but there could be bugs, or we could just be misunderstanding the full practicalities of what it looks like when we run certain code.
There is no substitute for actually looking at the data itself, so we thought we could start by finding the biggest things to see if anything stands out. The redis-cli tool, part of the Redis distribution, has a --bigkeys and a --memkeys option on the command line. They scan through the key space, look at each key, and find the biggest key of each type. The runtime is linear in the number of keys, which is pretty efficient.
--memkeys shows the same information except by memory usage: the biggest keys in bytes rather than by number of elements. We were not 100% sure that these would be the same thing; the biggest keys by element count might not be the biggest keys by bytes. Maybe there would be a list with 10,000 entries, which would be the biggest thing in --bigkeys, but they were all small, and then another list with four entries that were all huge, which would be the biggest in --memkeys. It turned out that, so far, the biggest keys by element count are also the biggest keys by bytes.
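The distinction between the two metrics can be sketched with a toy example. This is pure Python, no Redis required, and the key names and sizes are made up: a list with 10,000 tiny entries wins on element count, while a list with four huge entries wins on bytes.

```python
# Hypothetical scan results: (key, element_count, total_bytes).
# In real Redis these numbers would come from SCAN plus per-key
# size commands; here they are invented for illustration.
keys = [
    ("analytics:events", 10_000, 10_000 * 8),     # many tiny entries
    ("project:blobs",    4,      4 * 5_000_000),  # four huge entries
]

# --bigkeys-style ranking: by number of elements.
biggest_by_elements = max(keys, key=lambda k: k[1])
# --memkeys-style ranking: by bytes of memory.
biggest_by_bytes = max(keys, key=lambda k: k[2])

print(biggest_by_elements[0])  # analytics:events
print(biggest_by_bytes[0])     # project:blobs
```

In our data, as it happens, both rankings agreed; the toy data above is constructed so they deliberately disagree.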
This is not invariant, though; it could change as we eliminate some of our outliers. Maybe we will find something else. It's worth noting a couple of things. This mode of redis-cli is not particularly special: it's just using the Redis protocol to scan the key space. It does a scan with the SCAN command, queries the keys for type and size, and does the analysis. So we could plausibly write our own tool to do the same, or something slightly different; if we wanted to analyse things in more detail, or slice them by something else, it's not impossible.
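A homegrown version of that loop might look like the sketch below. The `FakeRedis` class is a stand-in for a real client (redis-py, for instance, exposes `scan_iter`, `type`, and `memory_usage` methods with roughly these shapes); everything here is illustrative, not our production tooling.

```python
class FakeRedis:
    """Tiny in-memory stand-in for a Redis client, for demonstration only."""

    def __init__(self, data):
        self._data = data  # key -> (type_name, approximate_bytes)

    def scan_iter(self):
        # A real client pages through SCAN cursors; here we just iterate.
        yield from self._data

    def type(self, key):
        return self._data[key][0]

    def memory_usage(self, key):
        # Corresponds to the MEMORY USAGE <key> command.
        return self._data[key][1]


def biggest_per_type(client):
    """One pass over the keyspace, keeping the largest key of each type."""
    biggest = {}
    for key in client.scan_iter():
        ktype = client.type(key)
        size = client.memory_usage(key)
        if ktype not in biggest or size > biggest[ktype][1]:
            biggest[ktype] = (key, size)
    return biggest


client = FakeRedis({
    "session:abc": ("string", 512),
    "session:def": ("string", 9000),
    "queue:jobs":  ("list",   40000),
})
print(biggest_per_type(client))
# {'string': ('session:def', 9000), 'list': ('queue:jobs', 40000)}
```

Swapping `FakeRedis` for a real connection is the only change needed to run this against an actual instance, with the caveats about scan consistency discussed below.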
There's nothing special! It's also worth noting that it is just a scan, so it takes a while to run through, and it may miss keys that are added or removed quickly during the run, and the keys reported may be gone by the time it finishes. It's still good enough for this purpose, as we will see. So I will show you a little Redis that I happen to have ended up with on my machine here.
This is a snapshot of the shared-state Redis from a while ago; there's nothing sensitive in it. If I run redis-cli with --bigkeys, it scans through, printing a bunch of sampling information as it goes, and the interesting thing is the summary, where we see that it scanned 313 thousand keys and found the biggest key of each type.
The biggest list by entries, the biggest hash, the biggest string, and so on, and there's a little summary printed, which is quite useful. --memkeys does exactly the same thing (it's a shame you can't mechanically run them in the same operation, but that's life), and the interesting thing is that it finds the same keys. But look at this biggest entry: that's huge!
So what we do is run this periodically. Another script runs it out of a systemd timer and service, currently every hour, and at the moment we're only running it on the persistent, shared-state Redis; the cache is going to come tomorrow, and we have to be a little careful about that. We parse this output, filter out the useless bits, keep the useful bits, turn it into JSON, and chuck the JSON into a GCS bucket for later analysis.
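A minimal sketch of that parsing step is below. The sample text mimics the per-type summary lines that redis-cli --bigkeys prints; the exact wording and quoting vary between redis-cli versions, so treat the format (and the key names) as assumptions for illustration.

```python
import json
import re

# Sample of the summary section that redis-cli --bigkeys prints.
# The precise format is version-dependent; this is an approximation.
sample = """\
Biggest string found 'session:abc123' has 45000 bytes
Biggest   list found 'queue:jobs' has 105000 items
Biggest   hash found 'project:stats' has 4200 fields
"""

LINE = re.compile(r"Biggest\s+(\w+)\s+found\s+'([^']+)'\s+has\s+(\d+)\s+(\w+)")


def parse_summary(text):
    """Extract the useful bits into a dict keyed by Redis type."""
    report = {}
    for match in LINE.finditer(text):
        ktype, key, size, unit = match.groups()
        report[ktype] = {"key": key, "size": int(size), "unit": unit}
    return report


report = parse_summary(sample)
print(json.dumps(report, indent=2))
# From here the real pipeline uploads the JSON to a GCS bucket, e.g. via
# google-cloud-storage's Blob.upload_from_string in a scheduled job.
```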
So if we run this viewer, it shows the latest report it has got by default, and the left and right arrow keys will scroll you through time, so we can go backwards in time. If I scroll all the way back, keep an eye on those keys up here, right up here, these key names: what we'll find is that they don't change throughout this entire week. The keys that were the biggest at the start of the week are the biggest at the end of the week. That's interesting, and probably not good.
Also, what on earth is going on here, with a hundred and five thousand fields, and this dependency list? I have opened issues for both of those two cases, one for the session key and one for the counter; feel free to go and have a look at those if you want. The job one is interesting in how static it is the whole time. The static ones are as interesting as the ones that are changing.
These are all outliers. This job one is sitting there with the same number of items the whole time; something's stuck there. Now, I would have assumed the same thing was happening with the session (obviously I expected that), but it is actually the same session ID throughout the entire time, so that one's an issue as well. And it's big; that seems a bit big for my liking. Maybe there's something we're stuffing in there that we shouldn't be stuffing in there.
This one was actually just an artifact: something used to be pointing at this Redis and isn't anymore, and we never cleaned that out. When we finished the migration, it never got cleaned up, so one might imagine there would be some remaining bits in it, but it looks like they're probably not interesting. One other interesting thing: if you look at the size of this, it changes between two different values, so that suggests that something is still writing to it.
It's possible that there's some indeterminism in the calculation of the memory usage. These totals at the bottom are not actually as interesting as you might think initially, but they do give you an idea of where most of the storage is: it's mostly in strings, half as much in sets, and about 13% in lists. We will graph these eventually in Prometheus, and that may be interesting.
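Since the plan is to graph these totals in Prometheus eventually, one simple route is the text exposition format. The metric name and the byte totals below are made up for illustration; this is only a sketch of what exporting the per-type totals could look like.

```python
# Hypothetical per-type byte totals, as reported in the summary.
totals = {"string": 6_000_000, "set": 3_000_000, "list": 800_000}


def to_prometheus(totals, metric="redis_keyspace_type_bytes"):
    """Render totals in the Prometheus text exposition format."""
    lines = [f"# TYPE {metric} gauge"]
    for ktype, nbytes in sorted(totals.items()):
        lines.append(f'{metric}{{type="{ktype}"}} {nbytes}')
    return "\n".join(lines)


print(to_prometheus(totals))
```

A node_exporter textfile collector, for instance, could pick up a file written in this format alongside the hourly report.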
We will basically start nibbling away at these items. We'll work on these keys on the list one by one, and as we get rid of them, we'll reveal more, and we will get to a point where we've revealed everything that's interesting and we're down to the weeds. I think it's interesting that something appears to be wrong with every single one of these.
It's something worth at least looking at, and that's good: it's exactly what we were hoping to find, although I'm surprised at which ones; I didn't expect something to be wrong with everything. It has also already suggested some interesting avenues for aggregation. I am very interested in knowing what the distribution of sizes of sessions is: are there outliers where there was something too big in there? We can work through that.
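One way to attack the session-size question is a simple outlier check over the sizes gathered from the reports. The sizes and the threshold rule below are illustrative assumptions, not measured data.

```python
import statistics

# Hypothetical session sizes in bytes; one suspiciously large entry.
sizes = [700, 820, 650, 900, 760, 810, 45_000]


def outliers(values, k=2.0):
    """Flag values more than k sample standard deviations above the mean."""
    mean = statistics.mean(values)
    stdev = statistics.stdev(values)
    return [v for v in values if v > mean + k * stdev]


print(outliers(sizes))  # [45000]
```

A standard-deviation rule is crude when one value dominates the sample; a percentile or median-based cutoff would be more robust for the real distribution.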
We'll see what happens. We may find that once we've got rid of a few, we are down in the weeds and we don't actually care about being more fine-grained; we'll iterate on that and see what comes of it. If you've got any questions or want to contact me to talk about this, probably the best way is to hit me up in the scalability channel on Slack, and we can discuss it there. Thank you for listening.