From YouTube: 2020-03-27 Turning off free pull mirroring in production
Description
Part of https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues/248
Related to https://gitlab.com/gitlab-org/gitlab/-/issues/10361
From what I remember (and these are somewhat out-of-date facts, but I think it's a reasonable approximation) about ten percent of projects are public. I wouldn't use that as a rule of thumb for what percentage of projects will be paying; I'm not saying what percentage of projects are paying, because I don't know. But, put simply, since about 10% of projects are public, public mirrors will still continue to work as well.
Myron pointed that out, and then I gathered some numbers, because most of the mirrors were actually public mirrors, and those are the only ones we're keeping an eye on right now, because of the weirdness with the plan ID and stuff. So if we're missing mirrors, that's the pain point: we won't get alerted if paying mirrors aren't being processed, which is not good. There's an issue for that as well.
So it counts the number of workers that are busy. We have concurrency times nodes times processors, so there's some total setup. Basically you could have, say, a hundred jobs able to run at the same time, and then it looks at how many jobs are running, and it's basically a percentage of that. So if you've got 500 workers and there are 250 jobs running, then you're at 50%. I spoke to Craig about this.
It's not super intuitive, but we sort of aim to be in the band of 23 to 43 percent, so basically 10% either side of 33%, and it's a little bit counterintuitive: if we had autoscaling on those groups, we'd probably aim for 50% and just keep it at that. But we run at 33% because it gives us some extra capacity; obviously, beyond that, the threads will slow down.
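As a rough illustration of that utilization math, here is a minimal sketch. The worker breakdown and band thresholds are illustrative assumptions built from the numbers mentioned above, not the actual production configuration.

```python
# Minimal sketch of the utilization calculation above. The worker
# breakdown and band thresholds are illustrative assumptions, not the
# actual production configuration.

CONCURRENCY = 25   # hypothetical threads per process
PROCESSES = 2      # hypothetical processes per node
NODES = 10         # hypothetical node count

TARGET = 0.33
BAND = (TARGET - 0.10, TARGET + 0.10)  # roughly 23%..43%

def utilization(running_jobs: int) -> float:
    """Busy workers as a fraction of total worker capacity."""
    total_workers = CONCURRENCY * PROCESSES * NODES  # 500 in this sketch
    return running_jobs / total_workers

def in_band(running_jobs: int) -> bool:
    """True when utilization is inside the 23-43% target band."""
    lo, hi = BAND
    return lo <= utilization(running_jobs) <= hi

print(utilization(250))  # 250 of 500 workers busy -> 0.5 (50%)
print(in_band(250))      # False: above the band; autoscaling would aim here
```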
The story is: we offered pull mirroring for free, for everyone, for I think a year or something, and this was an effort to do CI/CD for GitHub projects, and we always had a due date of the 22nd of March. And a month ago we started discussing whether we were going to end up doing it or not. Product wanted to postpone the deadline, but we in Scalability... you know as well how much grief pull mirroring has caused us.
So we spent the past month discussing it with Product and getting the approvals to actually shut this off. The decision was made to do that, but when we turned it on at the beginning of the week, or Sunday I think, we had an unintended consequence where everything was just stopped, basically. But the idea here is that only people who are paying customers and people who have open source projects will be able to use this. Not everyone. Cool.
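In other words, the eligibility rule reduces to a simple predicate. A minimal sketch, assuming a hypothetical Project shape (these field names are illustrative, not GitLab's actual schema):

```python
# Hedged sketch of the eligibility rule above: pull mirroring keeps
# working only for paying customers and public (open source) projects.
# The Project shape and field names are illustrative, not GitLab's schema.

from dataclasses import dataclass

@dataclass
class Project:
    visibility: str   # "public", "internal", or "private"
    paid_plan: bool   # whether the namespace is on a paid plan

def pull_mirroring_enabled(project: Project) -> bool:
    """Only paying customers and public projects keep pull mirroring."""
    return project.paid_plan or project.visibility == "public"

# A free private project should no longer be mirrored:
assert not pull_mirroring_enabled(Project("private", False))
assert pull_mirroring_enabled(Project("public", False))
assert pull_mirroring_enabled(Project("private", True))
```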
That's exactly it, pretty much. This is why we've been so heavy on this, DJ: every time we run any of that BPF tracing, it's random pull mirrors that no one's looked at in the last year. I haven't actually looked at that, but that's exactly what I would expect.
Yeah, this is kind of a challenging thing, because even if it's only selecting the right mirrors, it's very hard to tell; this is the thing that changes from point to point, so it's very hard to tell how many should be processed. Does anyone know how lumpy this is right now? Yeah, basically, because the first set might have just been a batch where there were only a few to be processed, and now there are more to be processed.
Before I made my changes, we had a thing where we had a maximum capacity we could process at any one time, and I think we were constantly at that capacity, whereas now we should be below it a lot of the time. So we could potentially pull more in, because of that capacity that we've got set, than we would have done before. So we might mirror some things more often than we did before, because we're not mirroring as many things as we were before, but I don't know how to verify that right now.
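A minimal sketch of that headroom effect, with illustrative numbers (the limit and job counts are assumptions, not the real scheduler's values):

```python
# Minimal sketch of the headroom effect described above; the limit and
# job counts are illustrative assumptions, not the real scheduler's values.

def mirrors_to_schedule(capacity_limit: int, currently_running: int) -> int:
    """How many new mirror jobs fit under the capacity limit this cycle.

    Before the change we sat pinned at the limit, so each cycle had
    almost no headroom; with far fewer eligible mirrors we sit below
    the limit and can pull in a bigger batch, so individual mirrors
    may be updated more often than before.
    """
    return max(capacity_limit - currently_running, 0)

print(mirrors_to_schedule(100, 100))  # old steady state: no headroom -> 0
print(mirrors_to_schedule(100, 30))   # new steady state: room for 70 more
```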
We should put effort into smoothing that out if we can somehow, because that sort of lumpiness has impacts downstream, right, on Gitaly and Postgres and Redis. And you know, if we can make that smooth, generally you'll get much better performance. How hard it is, I don't know, because I don't have any idea about that scheduling algorithm, but probably I also...
And this is a classic sort of "I'm going to say this, but I don't think we should actually do it now": I also wonder if, since we're processing many fewer mirrors, we could make the scheduling simpler, because, from what I gather, before it was basically a process of processing mirrors as fast as possible, up to our limit but not beyond it.
Yeah, it's how it responds to how many mirrors we are processing, I think.
But looking at the overdue graph, we should churn through that first and then get to a state where we have lower load, right? So it seems that the dip was, first, the rescheduled mirrors we weren't supposed to run, so we just dequeued those, and Sidekiq was very happy about it. But now we're just scheduling more stuff that we actually need to process, right?
But Sean, going back to your original point: I don't know this code, but it seems like everyone's a little bit wary of the scheduling code. It seems like it has very high cognitive overhead to actually understand what's going on in it, and maybe, to your point about simplifying it, that's a signal that it needs to be simplified.
Yeah, it's one of those things where the cognitive load is there because we made a lot of changes, and it's probably not a great idea to change it if you don't understand why we made the changes in the first place, unless something like this happens, where an external thing causes a rethink. But I think we could...
We could consider looking at that as, you know, just completely replacing it behind a feature flag: do the old capacity management, or do the new capacity management, from here on, because we have this external thing which should reduce the load. I'm just trying to figure out as well what's... because...
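A minimal sketch of that feature-flag idea, assuming hypothetical flag and function names (none of these are GitLab's real APIs):

```python
# Minimal sketch of the feature-flag idea above: keep both capacity-
# management paths and toggle between them. The flag name and functions
# are hypothetical illustrations, not GitLab's real APIs.

FEATURE_FLAGS = {"new_mirror_capacity_management": True}

def feature_enabled(name: str) -> bool:
    return FEATURE_FLAGS.get(name, False)

def schedule_with_old_capacity_management() -> None:
    print("scheduling with the old capacity management")  # stand-in

def schedule_with_new_capacity_management() -> None:
    print("scheduling with the new capacity management")  # stand-in

def schedule_mirrors() -> None:
    # Flipping the flag rolls back to the old behaviour without a deploy.
    if feature_enabled("new_mirror_capacity_management"):
        schedule_with_new_capacity_management()
    else:
        schedule_with_old_capacity_management()

schedule_mirrors()
```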
One thing I can do, which I'm going to try and do now, is try and figure out how many free private mirrors have not been updated since we did this. That doesn't tell us anything about what we should be seeing in the graphs, but it does tell us something about correctness, right? Like, is this actually working? So I'm going to try and figure that out right now.
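A hedged sketch of that correctness check; the table and column names are hypothetical stand-ins, not GitLab's actual schema, and the cutoff date is the 22nd of March mentioned earlier:

```python
# Hedged sketch of the correctness check described above: counting free
# private mirrors that have not been updated since the cutoff. The table
# and column names are hypothetical stand-ins, not GitLab's real schema.

import sqlite3

CUTOFF = "2020-03-22"  # date free pull mirroring was switched off

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE project_mirrors (
    visibility TEXT, paid_plan INTEGER, last_update_at TEXT)""")
conn.executemany(
    "INSERT INTO project_mirrors VALUES (?, ?, ?)",
    [("private", 0, "2020-03-20"),   # free private, idle since cutoff: expected
     ("private", 0, "2020-03-25"),   # free private, still updating: a bug
     ("public", 0, "2020-03-26")])   # public mirrors should keep working

# Free private mirrors untouched since the cutoff; a high count here
# suggests the shutoff is working as intended.
count = conn.execute("""
    SELECT COUNT(*) FROM project_mirrors
    WHERE visibility = 'private' AND paid_plan = 0
      AND last_update_at <= ?""", (CUTOFF,)).fetchone()[0]
print(count)  # -> 1 in this toy data set
```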
Yeah, my bigger concern right now, and this is because I changed the query so late, is whether it's correct now. So if we are currently not scheduling any mirrors that we shouldn't be, and we are scheduling all the mirrors that we should be, then we're fine: this is lower than it was before, we could just leave it running, and if it doesn't even out, we can look into that.
But if we are scheduling something that we shouldn't be, or worse, not scheduling something that we should be, then we need to fix the thing again, which is going to mean we have to do this whole thing again, potentially. If we're scheduling something we shouldn't be, that's not as big a deal, because someone's getting something for free that they shouldn't, which we can fix later on. But if we're not scheduling a mirror that somebody's paying for, then... yeah.
Basically, because, as far as we're aware, various people, myself included, have all looked at this, and as a result of that we are pretty sure that this is selecting the right mirrors. But, you know, maybe there was a bug in the old code that was scheduling some mirrors that it shouldn't have been. Sure.