GitLab Scalability Team Demos, 15 Jul 2021

Previous Meeting Next Meeting

⏯

youtube image

►

From YouTube: Scalability Team Demo - 2021-07-15

Description

No description was provided for this meeting.
If this is YOUR meeting, an easy way to fix this is to add a description to your video, wherever mtngs.io found it (probably YouTube).

A

Right we're a little bit late and we had to scramble to add things to the agenda. But let's start with the the deduplication of that q, I'm going to clean up my screen a bit.

A

um So we had an incident um yesterday or I don't know the beginning of the week um and what happened was a bunch of jobs got dropped on them on a sidekick queue, and that sounds very familiar familiar to incidents we had last year and the reason this started happening where's the incident issue. I want that open as well.

A

Just a second, it's really unprepared.

B

I'm also just adding the links to the agenda for people to see.

A

Where is the epic.

A

uh Let me share my screen now.

A

So during the incident we saw, this is the comment I was referring to. That made me bring up the the issue. So what happened? Was there was a user that did some some transfers of projects or something like that, something that requires recalculating authorizations and and that scheduled a whole bunch of jobs um at the same time that we that sidekick would need to get through.

A

We had this pro this problem in the past and and like a workaround, for that was like make sure that these jobs are idempotent and de-duplicate uh the jobs when they're being scheduled at the same time. So if a job is in the queue for a to recalculate, the authorizations of the user, and we try to enqueue another job for the same user, we'll just skip the the next one and rely on the single job that is um already in the queue, because that would be the same.

A

The same work when we started moving loads from the primary database to the replicas from for sidekick and we disabled deduplication for that job, because the jobs could count on the replica being up to date and that might be different for the job that is scheduled. Second in line.

A

So, as you can see on this graph on my screen here and we dropped all of these jobs. That would that had this, uh this feature to read from the replica enabled and we dropped all these jobs on the queue and we saw that there were like yeah a whole bunch of them that were duplicates. That's the purple line, and then the this other colored line is all the jobs that were the first one for a single user to be scheduled.

A

So if we were able to deduplicate jobs still and then that would have like we've, this would be the the amount of jobs that we had to process, but instead we were doing the sum of both lines. And so that's why I suggested um to manoy who, who added this um this worker in the first place uh a way we can have our cake and eat it too, and by using a different way of deduplication, uh come on.

A

By not using the the replica feature that checks the times that that checks, if a replica is up to date, because we know we schedule these these jobs in the future like a little bit. um So we can actually still duplicate them in this case, and I suggested a different way to make the queries go to replica without having to check the replication delay of the replica.

A

Does that make sense to everybody.

A

Yeah, regarding that, I just um think monoi on the on the issues, since I was involved in the murder quest and so on, and I added it to the rapid action, but I'm not quite sure that's the correct process to do that.

A

The graphs show me that it's definitely worth doing and stan also also mentioned it in the slack tread yesterday. So I feel comfortable asking money to do this. Anybody.

B

Yeah, I think I think, since it's I think, since it's already part of the work that's been identified, it's uh it contributes to helping resolve that um and I think because it's like it's framed in the sense of the six things that that were already recommended. So I think that that's that seems appropriate and it looks like it will have a significant difference on the on the work. Yeah.

A

Especially since we we know this worker already, we know the work to be done. We've been through this and part of that's kind of on me, because I did review that merger quest and uh yeah. I didn't think that um I didn't connect the dots that this is the same, the same work, even um even if it's scheduled differently so yeah.

B

Oh, it's been challenging because there's been quite a lot of change on those workers already, like they've, been making such good progress about improving the state of the workers. um I think it's just. We can't expect to catch absolutely everything, but at least we know that this is another thing that can be done to try and resolve the problem. That's happening there.

A

There is also an issue open for allowing deduplication for all kinds of jobs that have this data. Consistency delayed set on them, but this is a quicker win for a job that we know schedules a lot of duplicates in some cases. So um I think it makes sense to prioritize that one in it like a special case.

B

A

Anything else awesome thank you for watching.

B

This not that I can think of but I'll check on I'll check later how the progress is going here and make sure that people are aware of it.

A

The other thing I wanted to maybe bring up, but that's more of a like a fun project for me to do.

A

What was the epic.

A

A

I think I already added to the agenda doc.

B

A

Right um the fun like um the recently, we started um calculating error budgets for feature categories and stage groups, and this has been great because um people are starting to improve their features and what I think is even more cool is that people started to contribute to the sli's themselves in the run box repository in the um service catalog.

A

So this only currently only works for groups that have like a feature category that is mapped to to a service. For example, the the pages um slis already have a feature category marked on them and that contributes to their budget of the package package group release group yeah. The release.

C

Group, I think.

A

And so they are interested in improving the visibility of that service and and they are defining slides and improving sli that we already have by changing, adding metrics changing the the service definition together with us. This is not something that's currently possible for everybody that is contributing to the to the rails, monolith because and yeah.

A

That's like a single service and several feature categories come out of that, and what we're talking about in this project, five to five is making a generic way for people to add metrics that would feed into their error budget and allowing them to customize it on the fly. The first one that we would be working on is the request duration, which is currently capped at one second, um so a request, that's faster than one second gets. It gets a point for a good app text.

A

The other ones get the other ones get binged for that, but that doesn't apply for like that. That's not correct for all uh requests. For example, getting a jwt token should be way faster than one second and, on the other hand, getting the trace of a job or a long pole or whatever can be slower because people, the users, aren't waiting for it.

A

So in this project we would allow um the stage group to define these thresholds themselves on on endpoints, but while we're doing that, we would also set up a framework for them to add new slis uh to the services we have, and so we can. We from the infrastructure side of things can monitor things that users care about, because product knows what users care about and that's the gist of it well just kind of a ramble.

A

I think it's going to be a fun project to do and we might be able to reduce some of the metric cardinality that we have, because I'm thinking of using mostly counters for that stuff, instead of histograms with several buckets uh yeah.

B

Yeah- and I already see that there's comments on there um from uh people on the the product development side, so I'm I'm, I'm also quite interested to see how we can get this one going. um While it's not just about um you know agreeing that some people can have a slower response time, because the customers don't mind so much about it.

B

It's also about balancing that with what the infrastructure can actually cope with as well, but I think it's nice to at least be able to have that conversation with people, because again it just it drives home.

B

um It drives home more about how they're using the system, but we're able to engage in a conversation to be able to say well, you know we can increase it, but this is how we're going to have to do it, and this is what we can cope with, and this is what we shouldn't be coping with, um whereas at the moment there's not really much of that conversation, because it's a case of well, it's one second or it's nothing.

A

Yeah, I don't think that we're going to we're going to move the 10 second upper limit that we have on our current puma component sli for the web kit and api fleet. Like that's, always going to be a hard limit. Perhaps you can even lower it we'll see this and that's the thing.

C

I don't understand: how are we going to enforce that like if we are giving I'm up for giving this freedom just to be clear like? I think this is a good way forward, I'm just not sure I follow. How are we going to make sure that people don't go inside 10 seconds, I'm fine. This is the max, and so.

A

Like right now we have that request, duration, sli that doesn't get factored into the stage groups, because the cardinality would explode and we can't, um but that one has two thresholds, one second and ten seconds I think um we're going to add validation inside the rails application just like we have for future categories on on endpoints. If a feature category doesn't exist anymore, the test will fail um here. If you define a threshold that is longer than what we say it can be, then the test will fail.

A

That's my idea for enforcing that right now, but I haven't written that up in an issue. I wanted to get some input on the bigger picture first, but that's like a detail that I did mention in the epic like we can't just it's not a free-for-all. 60 seconds, yeah.

B

And the way- and I was also thinking that, in order to get a um an sli change sort of approved into the system that it's going to come through, like it's got to come through a conversation with us, so if people want to increase that, you know, let's talk about it, let's engage with them.

B

Let's see what is a reasonable value to set um for exactly that, like having everyone set it to 10. It's just not it's not what we want them to do. Yeah. I.

A

Think um so I think that threshold should go to us. There's also um like we're going to have to um figure out a way to to get this going. Perhaps we're going to do like we did for the feature categories and we do a first pass of the most popular things that we set thresholds on and that can then later be adjusted and all the rest just stays at the one second that we currently have, and if they want to change that, then it will have to go to review.

B

I was just almost going to set everything at one second and then, if they want to change it, because some people have highlighted that they want it to be changed. But not a lot of people have said that I know once we give the ability to do it, more people will, but.

B

Yeah, I thought there was a decent first pass is just that everyone's to one.

C

And um can we maybe consider this is just a suggestion without me, knowing any technical, technical details about like how this is going to be implemented, but I I kind of feel this needs to be attacked from two sides uh so from two levels: it's fine to do it in the application like you're, proposing bob. I think that's pro the right way to start, but I'm still concerned with you know the test will fail.

C

People will do random stuff with this at some point right, like I'm, not saying someone will do it right now, I'm saying we're gonna grow right, like it's things are gonna change, so it would be good if we have two levels of protection. So one is we set those goals inside of the rails app, but it would be also really good if we find a way for our infrastructure to automatically check right, like whatever is being input there. So on.

C

I don't know when we roll something out when we deploy something right like emits a metric and then alert or fail or prevent from the deploy from going forward. So, just to get to a point where we have you know self-serve, which is great but also a gate where we say that our infrastructure can't take more than whatever workaround someone might have placed there.

A

um Are you talking about this single sli that we're talking about uh the request, duration? Or are you talking about slis in general,.

C

Well, I'm talking about this one specifically now.

A

uh That's a very good point, I think at the first like we currently have that gate in the form of an in the form of the sli that we have. If you would see that too many requests would be slower on a deploy, we will just like qa will fail, hopefully on.

B

A

We'll see stuff going wrong, we'll stop the deploy and I don't think that's going to change like I think we're going to still have an overall thing. We don't want all of them. We will still grade requests in a general sense, but we'll have a better visibility into the features themselves. On top of that, so.

C

A

Think this will this is going to be adding more layers. It's not going to replace.

C

I almost feel like there needs to be a budget for this budget. um You know like if you say everyone has this much and you can go up to maximum of 10 10 seconds. Let's use that as an example, we'll add more things to the application, so this thing is going to grow, but our uh infrastructure supporting all of this is probably not going to grow at the same rate. So we need to say that you know if you have a bucket of 60 seconds total to spend, you can only have you know.

C

Six groups uh sets 10 seconds. Everyone else gets zero right like.

B

C

The level it's.

B

Almost like you're summing up all of the duration thresholds and saying well, the the total value has to be 500 and in order to get it over 500, you need to provide more infrastructure before it can be. We.

A

Also need to we.

B

A

To take like it's, not just the groups that are like some groups have 50 end points they need to spend they, they spread they bucket over the 50 endpoints. But what if one endpoint is super popular and others get hit twice a month like this all stuff to factor in which is already caught by a general request?

A

Aptx sli. So, um but it would be good to figure out a way to catch the stuff sooner like before they set the number.

C

Yeah, I can imagine a bug in in rails somewhere, and this doesn't get.

A

C

No, it's not possible right, but let's say in this impossible scenario: uh when a bug gets introduced, limits get you know, ignored or something like that. It will just I mean I'm hoping we're going to fail uh in early environments, but at the same time I think it's also impossible for that not to happen in staging right. So so it would be good to have like another layer of protection here or think about it. At least.

A

I'm going to take a note of that on the on the epic as well. I don't think like for this project. um It's going to matter, because this project is not about uh replacing what we currently have, which already guards us against that, but I think in the future we might do want to like unify both, and then we definitely need to think about that cool.

C

And naively, like I'm thinking of the world where the application is actually going to emit, metrics and inform the infrastructure underlying infrastructure, this is how you need to scale with kubernetes. A lot of these things are possible. um It's just a lot of these things are also hard to do so. You know like it will be outstanding if we get to that point at some time at some point in time,.

C

Cool thanks bob.

B

Is there anything else we want to want to demo this morning.

B

Awesome well, if not thanks, so much hope you have all have a productive rest of your day and for those taking friends of family day tomorrow. I hope you have a good day.