From YouTube: 2023-04-13 Scalability Team Demo
B: Yes, is this in Prometheus or Thanos? Prometheus.

A: Yeah, anyway, I was also going to check whether CPU or memory got hit quite significantly during this period. If it didn't, then I guess it makes sense to try a longer interval for the recording rules.
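For context, the "longer interval for the recording rules" refers to a Prometheus rule group's evaluation interval. A minimal sketch, assuming standard Prometheus rule-file syntax; the group name, rule name, and expression here are hypothetical, not the team's actual rules:

```yaml
# Hypothetical rule group with a longer evaluation interval, sketched to
# illustrate the trade-off being discussed (not the team's actual config).
groups:
  - name: capacity_planning
    interval: 5m   # evaluate less frequently than the global default to reduce load
    rules:
      - record: instance:cpu_utilization:rate5m
        expr: 1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))
```

Lengthening `interval` reduces how often the expression is evaluated, which is why checking CPU and memory first makes sense: if the rule evaluation isn't what's loading the server, a longer interval buys little.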
B: So here's the issue that I'm talking about. This is basically about preventing integer overflows, like for primary keys or whatever, in the database. We have a column with type integer, but then we add a lot of jobs, so it could run out when we hit the ceiling. So we want to monitor that and replace the type if necessary. We have that in Tamland for, yeah, basically a very select number of tables, so the database team is working on expanding that to all tables, which is great. They're proposing a system of alerting and, yeah, notifying people of that. That is not Tamland, and it works a little bit differently, and I think Roger Wu had a proposal here. I think we should maybe get these things aligned, so we can use one thing.
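The ceiling being discussed is the maximum value of a 32-bit integer column. The kind of saturation check described can be sketched as follows; this is an illustrative sketch, and the function names and the 50% threshold are assumptions, not Tamland's actual implementation:

```python
# Sketch of an integer-primary-key saturation check (illustrative only,
# not Tamland's actual code).
INT4_MAX = 2**31 - 1  # ceiling of a 32-bit "integer" column: 2147483647

def saturation(current_max_id: int, ceiling: int = INT4_MAX) -> float:
    """Fraction of the key space already consumed."""
    return current_max_id / ceiling

def needs_migration(current_max_id: int, threshold: float = 0.5) -> bool:
    """Flag a table for an integer -> bigint migration once past the threshold."""
    return saturation(current_max_id) >= threshold
```

In practice the current maximum id would come from something like `SELECT max(id)` against each monitored table, and "replace the type" means migrating the column to a 64-bit integer before the ceiling is reached.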
C: Yeah, I think one of the things I've been talking with Roger about is how we actually tie some of the things they want to do together. Because some of the data we're presenting, like the table size and costs and things like that, I think that's really great and a great start.
C: But how do you tell the story to some of the other folks at the company, maybe not completely engineering-focused, of why that matters? And how do you connect the database team's, you know, kind of abstract "tables will be 100 gigabytes and no bigger" to the metrics we can expose and show to users? If we have that, it's great, I think.
C: And, you know, any alerting we can add to make those things actionable is great. And I think part of the problem here, right, is that the other functions, the other teams, don't know how much stuff we already have.
B: We have a lot of stuff already, but we don't know what they actually want, because we built something that we are currently using. So like here, they're proposing four different thresholds; we currently have two. And they're proposing a notification, issue, whatever, when a threshold is breached, and we do a prediction of when a threshold is breached. So, like, yeah.
C: Yeah, well, I think that's one that's on my list to go back and look at and just respond in that thread, but we do need to bring those things together. I think we have quite a lot of agency with what we've done already that seems to work, and I think the value-add for the database team will be tying some of those things together.
C: So if we can use the stuff we already have in Tamland to notify, or make those things actionable when they need to be actioned, I think that's probably where we want to head overall, right: presenting users with actionable information, as well as the kind of prediction. Because with, you know, the primary key overflows, if it stops working, it's a bit of a hard recovery.
C: Yeah, and then I think it's kind of the database team's problem to solve, right, in terms of: you know you are going to overflow, so what steps do you take about that? So there's probably some work there to figure out how to align those things best, and that's on my to-do list.
B: If you need anything from me, let me know, but I think, yeah, we can figure something out. But we should also be mindful of what teams actually want. We have defined two thresholds now, and we allow customizing them; everybody can actually customize them if they want. But we only have two thresholds, so maybe they want something more, or more types of actions. I haven't read it entirely through, but right now we just create an issue.
B: Yes, I think that's fair. Most of the time, like historically, when we talk about alerting, it's always been: alert the SRE on call to do something, because something is burning down. And now we want to move that to the left some more, and then it's not always the most urgent thing they should look at. So yeah.
C: That makes sense. And I think, if we had one complete approach, and you know I'm not sure it's sensible to ever expect to get to a point where there is just one approach, we could just put it in there. But maybe we have to take a bit of a custom approach, see what works with the customers and what they engage with, and then make some decisions about whether to invest more heavily in that kind of learning approach, yeah.
B: And one thing that is difficult about this, specifically for capacity planning, is that we actually have three stakeholders. For example, with the database: we've got us, building the capacity planning framework to do predictions, and we made assumptions about what people want to be notified about.
B: We've got the database team, who are responsible for the resource, and they also want to serve their customers, which are the users of the database, and they want to help them, yeah, do the right thing. So that's actually three levels that we need to get through here.
C: One thing, just to run the topic by the database team, that might be interesting: I think there's a case for service owners to have slightly different prediction levels or timelines. One of the things I've talked about with Roger and with a couple of other folks is that, for pieces of our core infrastructure like the database that we need to vertically scale, they can see the benefit of having quite a long prediction out.
C: You know, 12 months to 18 months, to say: am I at least trending on the right track? Because a database upgrade is a substantial piece of work; I want to be able to plan that early in my planning cycle. And I don't know how hard that is, given all the Tamland stuff is plotted on the same graph all the way down, right.
C: I think you stopped sharing. But for some of those service owners, if it's a matter of extending the prediction a bit for them, I think that would be a quick win and a good way to get buy-in, to show that we care about, you know, what they want.
B: We have the separation by service here now. These are all the same graphs, but, and I like that, it's the same routes, because you can correlate things, right? Like if you look at, yeah, here: async primary pool, and CPU, for example, they could have...
C: And I think it's very correlated with the work that is required to shift direction substantially. So, you know, on Postgres that's significant; maybe for a microservice that's much easier, so you don't have to predict as far out, yeah.