From YouTube: 2023-02-02 Scalability Team Demo
B
Yes, so this came up in kind of a fireside chat that Stephanie, Matt and I had a couple of days ago, where Matt and I had a pretty similar idea of how we thought MultiStore works for migrating data, and Stephanie refuted both of us: it doesn't work the way we thought it works.
B
So
the
the
mental
model
that
I
had
was
that
we
would
have
kind
of
a
four-phase
rollout
where
we
we
first
write
to
the
old
data,
store
and
read
from
the
old
data
store.
Then
we
enable
dual
rights.
So
we
read
from
old.
We
write
to
both
we
kind
of
leave
that
running
and
warm
things
up.
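A minimal sketch of that assumed four-phase model (names are illustrative, not GitLab's actual MultiStore API):

```python
# Hypothetical four-phase rollout: 1) read old, write old; 2) read old,
# write both (dual writes); 3) read new, write both; 4) read new, write new.
class PhasedStore:
    def __init__(self, old: dict, new: dict, phase: int = 1):
        self.old, self.new, self.phase = old, new, phase

    def write(self, key, value):
        if self.phase >= 2:       # dual writes begin in phase 2
            self.new[key] = value
        if self.phase <= 3:       # stop writing old only at the final cutover
            self.old[key] = value

    def read(self, key):
        store = self.old if self.phase <= 2 else self.new
        return store.get(key)
```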
B
And so you have no way of enabling dual writes without reading directly from the new data store, and you also can't independently say where the reads come from: the reads will always go to new and then fall back to old, which is pretty surprising behavior, or at least it was to me. This issue talks through some of the consequences of this design. I think for the original, initial migration that made sense and was fine, but as we're getting more latency-sensitive and also more consistency-sensitive workloads, it may be time to rethink the design, and there are a few ideas outlined in the issue.
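A sketch of that surprising read path as described (illustrative only, not the actual implementation):

```python
# With dual writes enabled, reads always try the new store first and
# fall back to the old store on a miss; there is no way to pin reads
# to the old store independently.
class FallbackStore:
    def __init__(self, old: dict, new: dict):
        self.old, self.new = old, new

    def write(self, key, value):
        self.new[key] = value
        self.old[key] = value

    def read(self, key):
        value = self.new.get(key)          # reads always hit new first
        if value is None:
            value = self.old.get(key)      # then fall back to old
        return value
```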
B
It might have been rate limiting, or it might have been... what was the other one, I think? Sessions.
B
That's part of it: so a general warm-up period and, I guess, generally speaking, a backfill of all the data if we need to do that, right? So if we do a proactive migration of data, which for, say, persistent or shared state we'll likely want to do, it would be something like that.
B
But I think the other piece in gaining confidence is to actually do some analysis on the data that was migrated. So you can sort of leave it running, and then you can snapshot both the old and the new and compare the data and ask: is this a strict subset?
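For instance, the comparison could be as simple as checking that everything in the old snapshot appears unchanged in the new one (an illustrative sketch, not the actual tooling):

```python
# Check that the old store's snapshot is contained in the new store's
# snapshot: every migrated key exists in new with the same value.
def old_is_subset_of_new(old_snapshot: dict, new_snapshot: dict) -> bool:
    return all(
        key in new_snapshot and new_snapshot[key] == value
        for key, value in old_snapshot.items()
    )

assert old_is_subset_of_new({"a": 1}, {"a": 1, "b": 2})
assert not old_is_subset_of_new({"a": 1}, {"a": 2, "b": 2})
```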
A
Too bad he's not here, because he was mentioning moving MultiStore into something that is configuration. So when we're working on this anyway, I think having the option to leave the reads where they are while having dual writes is something we need to consider in that design.
B
Yeah, so if any of you have any more thoughts on this, please do leave them in the issue.
A
Now that you mention it like that, I'm kind of surprised that wasn't the approach we went with. I was in the initial discussions, and then when I reviewed the merge request for the MultiStore, before it was ever used, I thought, yeah, that makes sense. But how you mentioned it... otherwise, I think, oops.
A
Yeah, so recently, I think Igor brought me into a discussion with the database team. They were looking into saturation points for integers running out, on tables that have an int rather than a bigint as their primary key, and they were already working down that list.
A
But they were surprised that we had notified them that they were across the threshold that they had in mind, because that's a saturation point that we do monitor in Tamland. And out of that, I restarted the work on using the soft SLO, the soft threshold, for capacity planning, and leaving the hard threshold for monitoring only. Let me show you where I got.
A
Yeah, 31 new capacity planning issues, but I think that's just a trigger for us to set the thresholds correctly. Currently we've just used the hard SLO, which means we would create a capacity planning issue as soon as we think we might alert the SRE on-call at some point in the future. But for a lot of saturation points we want to start work before then; specifically, the integer overflow one is months of work.
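As a sketch of the decoupling being discussed, paging could key off the hard threshold while capacity planning projects against the soft one (field names and numbers here are illustrative, not Tamland's actual configuration):

```python
from dataclasses import dataclass

@dataclass
class SaturationPoint:
    current: float           # current saturation, 0.0 to 1.0
    growth_per_day: float    # simple linear growth estimate
    soft_slo: float = 0.80   # capacity planning threshold
    hard_slo: float = 0.95   # alerting threshold (pages the SRE on-call)

    def should_page(self) -> bool:
        return self.current >= self.hard_slo

    def days_until(self, threshold: float) -> float:
        if self.growth_per_day <= 0:
            return float("inf")
        return max(0.0, (threshold - self.current) / self.growth_per_day)

    def needs_capacity_issue(self, horizon_days: float = 90.0) -> bool:
        # open an issue when the *soft* SLO is forecast within the horizon,
        # leaving the hard SLO purely for paging
        return self.days_until(self.soft_slo) <= horizon_days
```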
A
The problem is that the thresholds are coupled. We have the alerting threshold that is going to page an SRE on-call, and they need to solve a problem right now, especially for this one: if we saturate here, it is going to be loads of hurt, and I don't know what the easy solution is. So the main thing that I want to fix is to decouple that from the threshold that we set for, "hey, team..."
D
Yeah, I mean, those names, you know, you can blame me for them, but they never really made a lot of sense, like "hard" and "soft". Well, on the one hand it makes sense to use both, and I think it's a perfectly good change, but maybe we should just rename them to "alerting threshold" and "capacity planning threshold", because then it's one less thing that people have to understand.
B
The current behavior, before we change this to soft, is that it projects whether we would reach the alerting threshold in three months, right? Yes. Is that correct? Yes.
A
I think we need to go through some of those that have been set, because there are 31 issues, so 31 saturation points, maybe fewer, because there could be issues separated by service. So we just need to go through them and see. I wouldn't change them all, but just the ones where the change would create an issue; I don't think we should blindly set them all to match, in my opinion.
F
We'll share it and see what happens, because...
A
Some recordings that we use to display on dashboards and so on, so global recordings that happen in Thanos Ruler, don't match their actual source metrics, and this is an example. So here I'm not using the error budget metrics, because I was just looking at something different and this is what popped up.
A
So the green line here is the component ops rate recorded in the separate Prometheuses, and the global one is what we actually end up displaying in dashboards, and this is also the kind of recording that we use for error budgets. So the problem happens there as well, and you can see these huge drops that don't match what the Prometheuses show.
A
That's the theory. Yeah, I'm...
D
I'm sure it must be that, right? Like, it would be surprising if it's not that. Do you get logs for that?
D
Do you see... does the log not have the recording rule name in it, that you could kind of filter it down by, perhaps?
A
And there's one that I... go ahead, Andrew, go first.
D
Just before we go on, because I think it's a kind of important point: if you go read the Thanos documentation, it very much says only use the partial response strategy if you know exactly what you're doing, and, like, you probably don't want to do this. There's a lot of that kind of thing in the documentation, which is right. And when we were doing this, it was kind of like: our choices are either error or warn.
D
And if we had error, then... whereas this way, at least, if one Prometheus drops out, everything else still gets evaluated at the level at which it's at; the broken one just gets excluded. And it's better for us to still do the SLO evaluation than to not evaluate at all during that period. So it was kind of the least evil option available to us.
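A toy illustration of that trade-off (the strategy names follow Thanos's warn/abort partial-response idea; the data and function are made up):

```python
# With a lenient ("warn") partial response, the aggregation keeps going
# using whichever stores responded, so one broken Prometheus shows up as
# a dip in the global series. With a strict ("abort") response, the same
# failure means no result at all for that evaluation.
def aggregate(per_store_rates: list[float | None], strategy: str):
    missing = any(r is None for r in per_store_rates)
    if missing and strategy == "abort":
        return None                                   # whole query fails
    return sum(r for r in per_store_rates if r is not None)

rates = [120.0, 80.0, None]        # one backend malfunctioning
print(aggregate(rates, "warn"))    # 200.0: a sudden drop in the total
print(aggregate(rates, "abort"))   # None: no SLO evaluation this cycle
```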
D
So the thing is, the problem is that if we had, like, one Prometheus server malfunctioning, it doesn't even need to have any data on it, right? It just has to be malfunctioning; it doesn't even have to have this particular data on it. If one host out of hundreds (well, not hundreds, maybe many dozens) is out, then we don't get any metrics.
D
We should be investigating that. We should be making ourselves more immune to it, but we should also be investigating, because, you know, if it was a pre server, you wouldn't see much change, right? A pre server or a staging server would be a tiny little dip in the metric; it doesn't matter. But that's a big chunk of data that's falling out, and that means it's one of our proper production servers that's now becoming invisible for a five-minute period. So we should definitely investigate that as a separate issue.
A
Yeah, one other thing. This is not directly related, but it's a little bit about how my understanding of recording rules works: everything in the group gets evaluated at once, and we've separated the feature category metrics from the component ops ones, but both use, in some cases, the same SLI aggregation rules. You probably know what I'm talking about, I don't know if the others do, but those get recorded in a separate group. Would that also cause incorrectness?
D
Sorry, I was thinking... I don't think the group does more than dictate the pace at which they run, because they're always running in sequence inside a group, and then it goes back to the beginning and starts again. But I don't think, if it fails halfway through the group, that it short-circuits the rest of the rules. Or is that what you're talking about?
D
Yeah, the evaluation times, definitely. And obviously, say you've got it every 30 seconds: it's just a loop, right, basically a while loop. If it's running every five minutes and an evaluation takes three minutes, it'll run for three minutes, wait for two, and then start again. But if it runs for six minutes, then it will just run over and over and over, because that's just how the loop works.
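In pseudocode, that loop is roughly (a sketch of the described behavior, not Prometheus's actual source):

```python
import time

def rule_group_loop(evaluate, interval_seconds: float):
    while True:
        started = time.monotonic()
        evaluate()                        # run the group's rules, in order
        elapsed = time.monotonic() - started
        if elapsed < interval_seconds:
            time.sleep(interval_seconds - elapsed)  # e.g. ran 3m, wait 2m
        # if elapsed >= interval: start the next evaluation immediately,
        # so a slow group just runs back to back, over and over
```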
D
If you... but I don't think there's any atomicity to those groups. Okay, yeah.
A
Rules... let me show it in code, perhaps.
D
Yeah, the reason that's very specifically in the same group is so that you don't use stale data, because the order in which they're evaluated is always top to bottom. If they were on two separate loops, those could be kind of off-kilter from one another, where one is being evaluated immediately after the thing that uses it, and then you're always getting data that's one minute further out of date. So having them in that order kind of prevents that, yes.
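A tiny model of that ordering guarantee (illustrative rule names and values):

```python
# Within one group, rules run top to bottom each cycle, so a dependent
# rule reads the value its producer just recorded. If producer and
# consumer lived in separate groups, the consumer could fire right
# before the producer and always read data one interval stale.
def evaluate_group(rules, metrics: dict):
    for name, expr in rules:              # strict top-to-bottom order
        metrics[name] = expr(metrics)

metrics = {"raw:requests": 100.0}
group = [
    ("sli:rate", lambda m: m["raw:requests"] / 60.0),  # producer first
    ("slo:ratio", lambda m: m["sli:rate"] * 0.95),     # consumer sees fresh value
]
evaluate_group(group, metrics)
print(metrics["slo:ratio"])
```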
D
Actually, yeah, they're not in the same group, so those ones could be staggered, like, actually delayed, which is not necessarily good. Obviously, the longer the period over which you're expecting to use them, the less of a problem that is: if you're using it to evaluate over a 30-day period, then a one-minute delay on your metrics doesn't make any difference. But if you're using it to alert on immediate issues, then it can be a problem.
A
I'm going to... because it worked before, and now we're looking into it and I noticed this, so I'm going to postpone that for a while. Another question that I had here: we have these aggregations in Prometheus, and they have a type label in their aggregation. This means, if you specify a type label in the aggregation and no selector...
D
Yeah, yeah, so there you've got the same... I mean, the only other thing you can do is maybe double the speed at which you evaluate the intermediate ones and kind of, you know, hope for the best.
D
Yeah, because, I mean, you can turn them off; I think there's a flag. So if you're feeling adventurous, you could turn it off briefly and see how... I mean, I think the first thing we need to do is make sure that Thanos, and Thanos Ruler in particular, is firing on all cylinders and not like my car, and then, you know, maybe we also see if we could scale Ruler up horizontally.
D
But, I mean, you know, if you look at the documentation, they say you can scale these things horizontally. So have we just got one because it was the default, and is that something we should be thinking about? Then we get rid of all this complexity around these intermediate recording rules and the jaggies that they give us and all of that stuff, and instead of having one Thanos Ruler, we have 100 Thanos Rulers and we don't even have a problem anymore.
B
I was going to say, we have a similar issue with Thanos Compact as well, where we kind of...
B
...want to scale that out, if possible. So if we're going to think about how to structure one of the sort of monolithic Thanos components, we can maybe think about it in a bit of a broader way as well.
B
Okay, so the next item is something that Jacob brought up, which is related to how we manage our Redis, and in particular our Sentinel configs, on the VMs.
B
We try very hard to coordinate these reconfigures, and so we sort of get to repopulate the state. But there are certain edge cases where, if we were to nuke it on multiple Sentinels at the same time, then we would get into a really bad state, and Jacob built a test case for this, which I am going to share now.
B
So this test case has two Redises and three Sentinels, and it starts the Sentinels off with an empty config, where the master in the sentinel monitor line is mismatched with who is actually the master, that is, with which Redis does not have a replicaof line in its config. So I'm going to run this, and it's going to take a few seconds to start up.
B
So yeah, Sentinels take a bit to converge, but now we've got... the Sentinels don't detect the current state and say "we have one master and one replica, this is fine." Instead, they say: "oh well, we have a master, but it's not the one I thought it should be; that one is actually a replica."
B
"Let me perform a failover." But it can't do a failover, because it doesn't know about the replicas. Part of the state that Sentinel stores is known Sentinels and known replicas, and in this case my suspicion is that because it doesn't have any known replicas, it can't do the failover.
B
So what I changed in this test case is to actually write those known-replica and known-sentinel lines to the config file, and basically that's the proposed change to how we manage our Sentinel configs. So we can see how that changes the behavior in this specific case, and hopefully it's going to go a little better and not leave us in a state where the cluster is broken, because this situation is bad.
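As a sketch of that proposed change (the addresses and run IDs here are made up; the directive names are Redis Sentinel's), the config management would render something like:

```python
# Render a sentinel.conf that includes the known-replica / known-sentinel
# state, so a freshly (re)configured Sentinel starts out knowing the
# topology instead of an empty view of it.
def render_sentinel_conf(master_name, master, replicas, sentinels, quorum=2):
    lines = [f"sentinel monitor {master_name} {master[0]} {master[1]} {quorum}"]
    for host, port in replicas:
        lines.append(f"sentinel known-replica {master_name} {host} {port}")
    for host, port, run_id in sentinels:
        lines.append(f"sentinel known-sentinel {master_name} {host} {port} {run_id}")
    return "\n".join(lines)

print(render_sentinel_conf(
    "mymaster", ("10.0.0.1", 6379),
    replicas=[("10.0.0.2", 6379)],
    sentinels=[("10.0.0.3", 26379, "3f0a..."), ("10.0.0.4", 26379, "9c1b...")],
))
```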
B
We don't have a working Redis setup. So let's... I've checked out this new branch, which writes those configs to the sentinel.conf file.
B
And so, let's see what happens. This is also logging the Redis log lines, so it's a little more verbose than the example that we had previously.
B
So we have a primary and we have a replica. And let's see what happened. So again, Sentinel decides, okay, we should perform a failover, but this time it actually successfully performs the failover.
B
So there's a promotion, and let's see if the... yes, and so now we get into a really strange state where we've promoted redis-2.
B
But redis-1... or, I think we've promoted redis-1, but then one of the Sentinels tries to reconfigure redis-1 to be a replica of itself. And so we get into this weird state where it's like, "I'm trying to connect to myself, but for some reason I can't," and it keeps doing this for like 10, 20 seconds, at which point the Sentinels realize that there's no primary and initiate another failover. So they perform that failover, at which point we have a primary, but redis-1 is still trying to connect to itself.
B
So we can see... yes, it's still trying to connect to itself, so this loop is still ongoing. And then eventually, it takes about a minute, this Sentinel realizes: oh, we have this node that is trying to connect to itself; that shouldn't be happening, it should be a replica of this other node. And so it issues this fix-replica-config command, and now we're finally in a stable state.
B
So it's super weird and it's not ideal, but it does eventually converge. So I would argue that this is probably better than where we were previously, but it's still kind of scary that Redis has these weird edge cases.
D
Yeah, it reminds me of every time I see people using, like, a Chef template to manage the sentinel.conf, and you know that it's going to end in... I think the gitlab main Chef one does that, doesn't it? It's not... it's...
D
Yeah, which is terrifying.
B
We probably want to move this to Kubernetes. Hopefully the Bitnami Redis Helm chart actually does what I'm proposing to add to Omnibus, so there is some prior art there as well, really.