From YouTube: Scalability team demo 2021-05-12
A
So yeah, there's only a couple of us, so like I said, I'll just do a kind of speed run of the two items I had, and then if that's it, it's a reasonably short video for other people to watch.
A
So, first of all: we're working on being able to migrate from one worker per queue to multiple workers per queue, so that we can drastically reduce the number of queues we listen to on Sidekiq in production. That should give us a big reduction in CPU saturation as well, because the current CPU saturation is basically some combination of the number of queues we listen to, the number of clients we have, and the distribution of work within those queues. Of those, the easiest one for us to affect downwards is the number of queues we listen to. Quang-Minh has done most of the work on this, in terms of allowing queues to be, sorry, workers to be routed to different queues. By default a worker will go to its own named queue as before, but you can set some configuration that says this job should actually go to this queue. The migration then becomes fairly simple to think about: you set those configuration changes, you wait for the old queue names to drain, and then you stop listening to those queues.
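For illustration, a minimal sketch of the routing idea as described here, assuming each rule is a [query, queue] pair evaluated top to bottom; the rule syntax, tag name, and `route` helper are hypothetical, not GitLab's exact implementation:

```ruby
# A minimal sketch of ordered routing rules: the first matching rule
# wins, and a nil queue means "keep the worker's own queue name".
RULES = [
  ["tag=needs_own_queue", nil],  # matched workers keep their named queue
  ["*", "default"]               # everything else shares one queue
].freeze

def route(worker_queue, tags)
  RULES.each do |query, target|
    matched = query == "*" || tags.include?(query.delete_prefix("tag="))
    return target || worker_queue if matched
  end
  worker_queue
end

route("post_receive", ["needs_own_queue"]) # => "post_receive"
route("chaos:sleep", [])                   # => "default"
```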
But there is a wrinkle, which is, oh, I need to remember what this set is actually called, oh gosh. So there is a wrinkle, which is this: in Sidekiq there are two sets, or there are actually four, but there are two that we care about, of jobs that are sort of global. We have a sorted set for scheduled jobs in the future, and we have a sorted set for jobs to be retried. There are also interrupted and dead jobs, but we don't care about those so much, because we don't really do anything with those on gitlab.com.
A
In those sets, the entries, gosh, this is a bad explanation. Those entries contain the queue name as part of the JSON payload. So if I take a look in here, you can see this one is going to the background migration queue, and this one is going to the background migration queue. So what that means is, if we just stopped listening to those queues and then something got popped off the scheduled set, it would go to a queue that we don't listen to, and hopefully we'd alert on that, but, you know, we'd lose it.
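For concreteness, Sidekiq stores the scheduled set as a Redis sorted set named `schedule` (the retry set, `retry`, has the same shape), where each member is a job's JSON payload and the score is when the job should be enqueued. A minimal sketch of peeking at it:

```ruby
require "json"
require "redis"

redis = Redis.new

# Each member embeds the destination queue name in its JSON payload.
redis.zrange("schedule", 0, 2, with_scores: true).each do |payload, score|
  job = JSON.parse(payload)
  puts "#{Time.at(score)}  queue=#{job['queue']}  class=#{job['class']}"
end
```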
A
First, so yeah, I'll just talk through quickly what it does. Basically, it uses SCAN, or ZSCAN because it's a sorted set, to step through the set, and it just uses the queue configuration that you've already set up, the worker configuration for the routing. So it just says: make the scheduled set's queue names and worker names match the queue names and worker names from the routing configuration.
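The actual rake task is in the merge request under review; as a rough sketch of the approach being described, with a hypothetical `route_for` standing in for the routing lookup:

```ruby
require "json"
require "redis"

# Sketch: walk the sorted set with ZSCAN and rewrite any entry whose
# queue name disagrees with the routing configuration.
def migrate_sorted_set(redis, set_name, &route_for)
  redis.zscan_each(set_name) do |payload, score|
    job = JSON.parse(payload)
    target = route_for.call(job["class"], job["queue"])
    next if target == job["queue"]

    rewritten = JSON.generate(job.merge("queue" => target))
    # Only re-add if we actually removed the old entry (more on this below).
    redis.zadd(set_name, score, rewritten) if redis.zrem(set_name, payload)
  end
end

migrate_sorted_set(Redis.new, "schedule") { |_klass, _old_queue| "default" }
```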
A
I was thinking about allowing arguments, but I think it's simpler to just say: you run this, and if your configuration is what you want it to be, then it does what you want it to do. There are 20,000 jobs in the scheduled set I have locally, and that took a couple of seconds. There are like 60,000 in production, but obviously nothing else is using this Redis at the moment.
A
I'm not even running Sidekiq right now, I'm just running Redis and a console, so it might take longer in production, but that's the basic idea. There's also an equivalent task for the retry set. So the idea there is, you add a couple of steps to that migration: set up the new configuration, run these tasks to migrate the scheduled set and the retry set, wait for the queues to drain, stop listening to the old queues. So yeah, that's the basic idea there, and that's in review now, so hopefully that can be merged soon. Yeah. Any questions on that?
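As an aside, the "wait for the queues to drain" step can be checked with Sidekiq's own API; the queue names below are placeholders:

```ruby
require "sidekiq/api"

# Before dropping the old queue names from the Sidekiq config, confirm
# the old queues are actually empty.
%w[background_migration project_export].each do |name|
  queue = Sidekiq::Queue.new(name)
  puts format("%-25s %d jobs remaining", name, queue.size)
end
```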
B
For that rake task, I'm curious how long the longest single Redis operation was. It wasn't the entire multi-second duration, was it?
A
Oh no, right, so yeah, sorry, I meant to explain that when I started saying it scans, and then I stopped. So yeah, good question. It uses zed-scan, or z-scan, let's just say ZSCAN, which is O(1) per step.
A
I think it then uses two O(log n) operations, which is the remove and the add. It's slightly annoying to me that you can't remove by key and by score, so I just have to remove by key, because obviously there could be several jobs scheduled at the same time and I might only want to remove one of them, and we can't edit it.
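In Redis terms, a toy sketch against a scratch Redis (key and payloads invented): ZADD and ZREM each cost O(log N), and ZREM matches on the member alone, which works here because the JSON payload is unique per job even when several jobs share a score:

```ruby
require "redis"

redis = Redis.new

# Two jobs scheduled for the same time: same score, distinct members.
redis.zadd("schedule", 1_620_000_000, '{"jid":"a","queue":"default"}')
redis.zadd("schedule", 1_620_000_000, '{"jid":"b","queue":"default"}')

# ZREM takes only the member; there is no "remove this member only at
# this score" command. That's fine: the payload is unique per job.
redis.zrem("schedule", '{"jid":"a","queue":"default"}') # => true
redis.zcard("schedule")                                 # => 1
```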
A
It does also check, because this set represents work that can be popped. There is the return result of what happens when we try to remove it, and it says: okay, if we tried to remove that and nothing got removed, don't add it back, so we don't double-schedule the same job. And yeah, I guess technically you might need to run that rake task twice; I forget what the SCAN guarantees are on, like...
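The guard being described, continuing the loop body from the earlier migration sketch (`old_payload`, `score`, and `new_payload` as defined there):

```ruby
# ZREM returns false if nothing was removed, e.g. because Sidekiq
# already popped the job mid-migration; in that case the rewritten
# payload must not be added back, or the job would be scheduled twice.
redis.zadd("schedule", score, new_payload) if redis.zrem("schedule", old_payload)
```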
A
Yeah, that makes sense. So yeah, that's the basic idea. I've written up some administrator documentation for this, but like I said, it's still in review. I also realized today that I should probably add this to the runbooks, because this is also in issues, but I think it's going to be useful if we put it in the runbooks, because then I could just be more specific about what we want to do on gitlab.com, rather than the general "you, as a generic GitLab administrator, this is what you do here". So yeah.
A
Documenting this like this is optional at the moment. The eventual end goal would be that we probably do something automatic, if this works well for us, but we could leave that for a while, because we still have the queue selector, and the idea is that we don't need the queue selector if we do this, because you will be listening to a handful of queues, and there will be these virtual queues, I guess, that are configured, you know, that are routed within the application itself. But it's not a high priority to move everyone over to that way of working, I don't think.
C
Just a quick, I don't know what to call it, then: reading the document that you placed in that merge request, it was nearly impossible for me to figure out whether I should run this or not, right?
A
Okay, so I've got the next thing as well, which is about how we actually roll this out. I spoke to Craig about this; we talked a bit asynchronously and then we had a chat yesterday evening my time, yesterday morning his time, and we wanted to clarify a couple of things. So, first of all, the scope of the project that we're working on at the moment: we said "catch-all" initially, which is kind of ambiguous, because we have two catch-alls.
A
We have one on VMs and one on Kubernetes, and they listen to different queues. We're making it explicit that this is for the Kubernetes one, the one that listens to more queues already.
A
I made a recent change with Skarbek that means that, by default, new workers go to Kubernetes rather than VMs, because obviously we don't want to add new stuff to VMs that potentially might not migrate to Kubernetes; that's kind of a nightmare. So it's much better if it goes there first, and of course, you know, migrations in general are going from VMs to Kubernetes, not the other way around. So we're making it explicit that it's about that.
A
Also, Andrew, you might remember a while ago we stopped listening to some queues on gitlab.com to get a small CPU drop on the Redis Sidekiq instance. There are like 400-odd queues, and we figured there were like 30 to 40 of them that we don't actually use in production, like some Geo queues, some of the chaos ones, and...
D
If I recall, we've got some funny alert that, if anything ever appears in those queues, will generate an alert.
A
Yes, yeah, that's the same thing, yeah.
A
Yeah, so this is now simpler. So, like, it's still quite a long selector, wait...
D
I'm not sure, are you intending to share your screen?
A
Sorry, yeah! I forgot to actually click the share button. So catch-all used to be, like, it was literally like 9,000 characters long. Now it's the concatenated list of the other shards, including, I just tagged all the workers that we don't currently run on Kubernetes, so we can put them there.
A
That's still not ideal, because it means in the application there's a thing that says "exclude from Kubernetes", and there's no actual reason to exclude most of those from Kubernetes, except that that was what we were already doing. So this needs to be a temporary situation, not a long-lived situation, but for now it makes that a lot simpler to reason about. And then we've also got an "exclude from gitlab.com" tag, but I think I might be in the wrong, yeah, I think I might be in the staging file here.
A
I can never remember which is which, but anyway, we added that "exclude from gitlab.com" tag to some workers as well, and again, the idea is that's temporary. I actually applied that to staging and production, and then it turned out, I should have realized, that we actually do use Geo in staging. So we were like: wait, why is nothing working?
A
Yeah, there we go. So the lower line is after the change, and the upper line is before the change; this is week on week. Sorry, so that's, you know, it's a drop. It's not really going to make any headlines, but we did save that CPU back.
A
So it's quite a nice way of having the first part of the rollout be something that would only affect staging anyway. I'm not saying we make the change on staging at the same time as production, but even if we did, it wouldn't actually be impacting production. So that worked out quite nicely. So yeah, the plan is we'll basically do those, we'll then define, sorry...
A
I should just keep sharing my screen instead of turning it off and on. So, the way the routing rules work is that they're global, and obviously, Andrew, you'll know this, because we talked about this like a year ago, but they're global. At the moment each Sidekiq shard only knows which queues it processes.
A
So this is a priority list, which means we'll have something like this initially, where we sort of define all these shards, and null just means "match the rule, but don't change the queue name". So we can have them all there, and then "exclude from gitlab.com" can go to default, because that's the way we can test on staging. And then, as we want to roll this out, we can just add selectors up here. We don't need to combine into one mega-selector; we can just add. So the plan there... oh, sorry, just...
D
Where is this config specified? Just so it can help me sort of, oh...
A
In gitlab.yml, in the Rails app; it gets there via Omnibus or via the charts.
D
Okay, thanks, yeah.
A
And yeah, so this would be the sort of initial state. We're routing the ones that don't do anything on production anyway, and everything else ends up in null, which means that, like, you know, the rest of this is essentially a no-op. It's just to make it clearer to people what's going on.
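To make the shape concrete (shown as a Ruby literal for illustration; in gitlab.yml this lives as YAML, and the selectors and ordering below are illustrative, not the production file), the priority list reads top-down, first match wins:

```ruby
# Illustrative initial state: route the queues that do nothing on
# production, let everything else fall through unchanged. nil means
# "rule matched, keep the worker's own queue name".
routing_rules = [
  ["tag=exclude_from_gitlab_com", "default"], # testable on staging first
  ["urgency=high", nil],                      # hypothetical shard selectors
  ["resource_boundary=cpu", nil],
  ["*", nil]                                  # catch-all: no renaming yet
]

# Illustrative end state once everything is migrated: delete the
# temporary lines and change the final nil to "default".
end_state = [["*", "default"]]
```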
A
After that, we can go to, you know, actually migrating workers that do stuff, and we're not set in stone on this, but one way we thought might be reasonably neat might be to do it by feature category, because we have some feature categories that represent some fairly low-volume workers. Then we don't get back into the situation where we have a bunch of things selected by name that we then need to untangle.
A
So that was sort of a nice thing; it's a shorter selector. And yeah, and then once we get to the final stage, we would be simplifying again, because we'd...
A
...select here, so that if it matches any of the other ones, we still don't.
A
I think that's kind of a nice model anyway, just in terms of, like I said, understanding what it does: you just read down until you find one that matches, and then stop. And it does mean that we can do things like this, where we can just add that arbitrarily in there, and we don't need to change everything else to fit. And then, at the end, we would delete this line, delete any feature category lines we had, and just change this null to default, and it should all work. Job done, so yeah.
A
The end configuration should be fairly simple. At that point we then need to decide if we want to go through and allow arbitrary queue names to push to these, too.
D
So, we might, and I think I know the answer to this, but you've probably thought this through a lot more: there's no sort of risk from the speed at which that config change is going to get rolled out? Across VMs and Kubernetes pods it's probably going to be quite vastly different, yeah.
A
Right, so that is actually a very good question. When I sort of skimmed over at the end there, I said this is fairly straightforward to do for the other ones. The thing with the other ones is, we would need to add a new queue and start listening to that queue before anything showed up in it, right? Because we'd need to say: listen to all the queues related to, say, project export, but also listen to this dedicated queue that will contain all the jobs for project export.
A
Once I've done this next thing, so we need to make sure we do that in two steps. For this one specifically, we picked the default queue, which is there and is never used for anything, but is already listened to by the catch-all nodes. So we're going to get that for free, and that's why we picked the default one, because, okay, yeah...
C
If it doesn't go that way for some unexpected reason, it might be valuable figuring out how to pair with Delivery to actually get the rest of the VM Sidekiq queues over to Kubernetes, because we don't have a blocker anymore for it.
A
Yeah. Craig and I were actually talking about that yesterday, because I think the Pages thing is, well, Jarv will know, because he's also on that issue: it's a temporary directory that happens to be on the NFS mount, but we don't think it needs to be on NFS. It's just that that's where it happened to be historically, so, in... or, no...
F
Well, no, we know, I mean, I think, but maybe you said this, Sean: it made sense for it to be there, because you're moving temporary files and you want to keep it on the same volume, right?
F
Yeah, yeah, so I think that was a smart decision at the time. It's just that we're still using this as temporary scratch space, so we're thinking about just reconfiguring the Pages root directory to be somewhere else other than the NFS mount, which will write the temp files to the root partition, where we have, on average, 10 gigabytes free. We're thinking that's okay. I mean, another option would be to expand the root volume temporarily. Yeah, I mean, it's either root or /var/log; those are the two options. We could...
F
We could create the temp directory, and the Pages root directory, in /var/log. It just feels weird. Maybe that's a better option, though, if this is only temporary anyway.
B
...both of the things, directly, yeah. I mean, they're both, neither one's an SSD or anything, and one is much, much larger than the other, and there's a meaningful consequence to filling root, and there's not to filling /var/log. So yeah, go with the safer route, even though the name...
A
Yeah, one thing on that move point, Jarv, that I realized after I posted that: it makes sense that, like, you know, if you want to move a file from temp to thingy, you know, you don't want to move...
A
Cool, yeah. So Marin, Craig and I did sort of discuss that briefly yesterday when we spoke, because we were like: the other thing is, once we've done catch-all on Kubernetes, the next best target is to stop listening to all the queues on catch-all VMs. One way to do that would be to give catch-all VMs its own queue and, like, migrate to this, but another one would just be to migrate...
A
I
say
just
it's
a
big
just
just
be
to
migrate
those
jobs
from
vms
to
kubernetes
and
get
this
as
a
result
so
yeah.
We
think
that
would
be
the
natural
next
biggest
impact,
assuming
this
makes
a
decent
impact
so
yeah,
I
think,
I
think,
carrying
on
with
the
kubernetes
migration.
There
makes
a
lot
of
sense
instead
of
avoiding
double
work.
Yeah.
C
Yeah, I might see if we can prioritize that sooner, because as soon as the API is in production, we could take a short break from the next service and see how the API behaves, and while we do that, just migrate all of the Sidekiq and be done with it; make it simpler for everyone.
A
So yeah, that was the Sean show, I guess, for the demo. Does anybody have anything else they wanted to talk about? Bob had something on the agenda, but he's not here.
A
All right, I'll upload this shortly. Thanks everyone, have a great day. Thanks, Sean. Sorry, Andrew, do you want to go?
D
No, no, no, it was something that just crossed my mind, but it was reminded by Bob's thing, and, does anyone know about... it's more a question than anything else.
D
Do we have any observability around SAML? Because there is a thing that's happening today: we had an incident, and I looked at something, and I saw that I actually blocked an account that wasn't causing the incident, but that's a thing for another day. What it actually is, is that every time this person uses the API, it uses 400 megabytes of memory, and Heinrich looked at the request.
D
He
said:
there's
nothing
in
this
request
that
uses
for
it
makes
a
gidley
call,
and
then
we
realized
that
the
customer
is
using
saml
and
it's
also
using
like
six
seconds
of
cpu
every
request,
which
is
really
crazy,
and
so
I
started
looking
around.
I
couldn't
find
anything
in
our
logs
and
I
couldn't
find
anything
in
our
metrics
for
saml
and
that's
kind
of
scary.
A
Yeah, I don't think, I don't know, I don't think we have anything like that. I do know that some of the responses for SAML, I think, can be big, but I don't know why that would happen when we make a Gitaly request. So yeah, no, but...
B
Was the SAML operation also correlated with the CPU burn you observed? Because that's an enormous amount of CPU.
D
Yeah, it's, I mean, it's up to seven seconds. So, we don't really know yet, and I haven't had a chance to look at it in much detail today yet, but the call itself is really basic, and actually the call fails, because there's no repository. So it's not, there's, like, literally, it should be a 404, but it's a 500, because...
D
What
you're
saying
yeah
yeah
yeah
the
giggly
response
is
like
not
found
and
and
that's
all
it's
doing,
and
so
the
there's
something
else
for
these.
The
only
other
thing
is,
I
think
it's
an
api
token,
but
I
don't
know
how
api
tokens
interact
with
saml,
because
presumably
you
get
the
api
token
from
gitlab,
not
from
saml,
but
it
probably
has
to
do
some
some
dance
there,
but
yeah
I
mean
if,
if,
if
their
saml
provider
is
giving
back
like
a
50-meg
chunk
of
jason
every
time,
we
call
them,
then
that's
not
really
that
good.
D
But
I'll
I'll
open
an
issue
about
it.
D
I mean, I think, and, you know, this particular customer looks like a really, they've got six accounts, and so they're probably using, like, I don't know, a SAML provider. But it would be interesting, especially now that we've already reached out to them to tell them that we've blocked their accounts, it would be, like, interesting to figure out what's going on there. But yeah, I'll open it up.
D
Obviously there's a lot of good, we get a huge amount of good, like all the mechanical sympathy logs, and we can pinpoint exactly, you know, which request is bad, we can do all sorts of good things. But, you know, do we just keep adding all of these things to the access log?
A
Yeah, the application settings table, yes.