Description
We discussed the details of a proposed query to store root-namespace storage statistics in the database: https://gitlab.com/gitlab-org/gitlab-ce/issues/62214
A
Okay, we are recording, so thank you. Thank you for joining this call. The thing I would like to discuss is what we are doing about this performance problem in the aggregation of storage usage for namespaces. The thing is that we have this... let's start with sharing the screen, maybe, so that we can look at it.
A
We started discussing doing this namespace migration for tagging the root namespace, but in the meantime my plan was to start working on the next part of it, and I said: I just need the aggregated data. I can just do the slow query on the object and then, when we are ready, I will replace it. So this was my starting point, but then, because I was reviewing my report, I said: is this really too slow to fetch fresh data? So that's why I brought this up now.
A
Looking at the things here, from what I understand: exclusive means only this node, inclusive means this node and everything below it, and this is just the rows. The "x" is whether the planner underestimated or overestimated the number of rows returned by it. So that's the thing, and then the number of rows that it returns. Okay, so from what I understand here, this part, line number six, is taking a lot of time.
B
Yeah, so the thing here is: the reason you see these loop counts is that the second node is a nested loop, and in this case, I believe, that is because with the group by you essentially loop over every row and aggregate them. So in this particular case, most of the time is spent just fetching that data, because summing the actual numbers is very fast.
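The shape of that aggregation, fetch all per-project rows, group by root namespace, and sum the size columns, can be sketched in plain Ruby. This is only an illustration: the Struct and field names stand in for the real `project_statistics` columns, and the real work happens in SQL.

```ruby
# Illustrative sketch only: the real aggregation runs in SQL. The Struct
# and its fields are assumed stand-ins for project_statistics columns.
ProjectStat = Struct.new(:root_namespace_id, :repository_size, :lfs_objects_size)

rows = [
  ProjectStat.new(1, 100, 10),
  ProjectStat.new(1, 250, 0),
  ProjectStat.new(2, 40, 5)
]

# Group by root namespace and sum each column -- this is the cheap part.
# As noted above, in the real plan most time goes into *fetching* the rows.
totals = rows.group_by(&:root_namespace_id).transform_values do |group|
  {
    repository_size: group.sum(&:repository_size),
    lfs_objects_size: group.sum(&:lfs_objects_size)
  }
end

puts totals[1][:repository_size] # 350
```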
B
So normally, for me, a query that is slow is something that takes, let's say, more than 20 milliseconds, and that's a very small number. But that's mostly because when we look at web requests we typically have 50 to 200 SQL queries, so if every one takes more than that, it adds up very quickly. And I think for a while we have had that as a sort of per-request goal.
A
The suggestion that I made here as an improvement is to follow the same pattern that we are doing for the project statistics, which is: get leases on Redis, 15-minute leases (we can change the number), so update now and update at the end of the lease, and for everything in between, don't update. So the data you have can be current, or like 15 minutes old.
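That lease pattern, refresh at most once per interval and skip everything in between, can be sketched with a minimal in-memory stand-in for the Redis lease. In GitLab the real lease is Redis-backed (a key with a TTL); the class name and the 15-minute interval below are just illustrative.

```ruby
# Minimal in-memory sketch of a refresh lease. The real pattern uses a
# Redis key with a TTL; names and the 15-minute interval are illustrative.
class RefreshLease
  def initialize(interval)
    @interval = interval # seconds the lease is held
    @taken_at = nil
  end

  # Returns true (and takes the lease) only if no lease is active.
  def try_obtain(now = Time.now)
    return false if @taken_at && (now - @taken_at) < @interval
    @taken_at = now
    true
  end
end

lease = RefreshLease.new(15 * 60)
refreshes = 0

# Three refresh attempts in quick succession: only the first one runs;
# the others are skipped because the lease is still held.
3.times { refreshes += 1 if lease.try_obtain }

puts refreshes # 1
```

Once the interval has elapsed, the next `try_obtain` succeeds again, so readers see data that is at most one interval old.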
B
And so I would say, probably for now, I think this is good enough, because I don't think anybody can come up with a better way. I think the only better way is the approach we initially thought of, where it's incrementally updated per project, but that gets super complicated.
B
I expect we might have to tweak the intervals a little bit and stuff like that. I personally would be fine if we just did one update per day, but I think there are some people, like Sales and such, who might not be happy with that. I would probably start with updating less often rather than more often, simply because, let's say we deploy this and we update the statistics at most once every 10 minutes, yeah.
A
Let's take, for example, our gitlab-org namespace, because we use it a lot here.
B
Basically, yeah, I think an application setting might be overkill, because I would prefer that we provide a value that works for pretty much everybody, and frankly, we're probably the only ones actually going to use this. And I don't think we can set values in feature flags and retrieve those; they're all boolean, as far as I know.
B
Let's say we enable this, it trickles up, and we say once per hour, and then we find out: oh, you know, even that is too much. Ideally, we want to be able to change that setting very quickly, without having to go through the deploy process. So I think if we put it in a YAML file, we probably have to start messing with Omnibus, which is not ideal, and if we do an application setting, we have to create a UI field and stuff for something that people will probably never change.
A
Good, so let me check. I actually think that this solves all the questions that we have. Oh no, because we can add this, maybe Mayra already did, I don't remember, because we discussed it: we have the updated_at in the root, in the aggregated statistics. So, from the UI point of view, we can say this is the number, but it was updated at that hour, because we...
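The UI idea here, show the cached total together with its `updated_at` so users know how fresh the number is, might be rendered by something like this hypothetical helper (the method name and output format are assumptions, not the actual GitLab UI):

```ruby
# Hypothetical helper: renders a cached storage total with its refresh
# time, so the UI can say "this is the number, as of that hour".
def storage_summary(total_bytes, updated_at)
  gib = total_bytes.to_f / (1024**3)
  format('%.1f GiB (updated at %s)', gib, updated_at.utc.strftime('%H:%M UTC'))
end

puts storage_summary(2_684_354_560, Time.utc(2019, 6, 12, 14, 30))
# => "2.5 GiB (updated at 14:30 UTC)"
```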
B
I think this query, you know, since it doesn't run that often, should be fine. And should we have cases where we run this too often or too many times in parallel, you can always increase the interval, or maybe spread it out more, where you say: by default we enforce the interval for everybody, but if you have this many jobs, we might increase it for some. You can go very far with that. Okay.
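The spreading idea mentioned at the end, a default interval for everybody that is lengthened for namespaces scheduling many refresh jobs, could be sketched like this. The thresholds and multipliers are made up purely for illustration; nothing like this is specified in the discussion beyond the general shape.

```ruby
# Hypothetical scaling rule: enforce a base interval for everybody, and
# lengthen it for namespaces that schedule many refresh jobs. The numbers
# are illustrative, not a real GitLab policy.
BASE_INTERVAL = 10 * 60 # seconds

def refresh_interval(jobs_per_hour)
  case jobs_per_hour
  when 0..10   then BASE_INTERVAL      # default for everybody
  when 11..100 then BASE_INTERVAL * 2  # busy namespaces wait longer
  else              BASE_INTERVAL * 6  # very busy namespaces wait longest
  end
end

puts refresh_interval(5)   # 600
puts refresh_interval(500) # 3600
```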