From YouTube: 2021-02-04: Scalability Team Demo
A
There we go. Okay, so I have the first agenda item — and, oh, Jarv is also posting something interesting, but yeah. First, my item; we have to get through my item. First, I was thinking about the two projects that we're working on — more specifically, that I've been involved in recently — about Git performance, and about how to frame them: how to explain what we're doing and why we're doing it. Because I get the feeling that it's maybe nuts.
A
Well, it's one of those things where, to me, it feels a little bit like we started doing them because they were a good idea on some intuitive level. But it's important to also be able to explain why it's a good idea, even if it's after the fact. I'm trying to come up with that, and I'm going to run a sort of high-level explanation by you.
A
And I want to hear if it makes sense, or if it's convincing. So the gist of it is that gitlab-org/gitlab is hosted on a dedicated Gitaly server, file-canary-1, which is an unusual situation.
A
And even though it has a Gitaly server to itself, it presents a vertical scaling problem. What happened in December is that we looked at some incidents around this Gitaly server, file-canary-1, and we saw two opportunities to get more CPU headroom — meaning more vertical scaling headroom — and I think that is the overall theme of the two epics. So the one epic — it wasn't even an epic before, but I made it into an epic now, because I realized it had to be one.
A
Sorry about that. Yes — so I was trying to paint a picture of what the theme of these two Git epics is. One of them is about reference iteration, and we're in the funny situation that we already had our big win, because we applied the no-tags setting on CI for the gitlab-org/gitlab project, and that caused an unreasonable drop in CPU — which is a really good kind of unreasonable drop — and the work we've been doing since is to understand why that happened.
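
(For context: the custom CI setting referred to here can be applied per project in .gitlab-ci.yml — a minimal sketch, assuming the GIT_FETCH_EXTRA_FLAGS runner variable; the exact flags used on gitlab-org/gitlab may have differed.)

```yaml
# Pass --no-tags to the `git fetch` every CI job performs, so jobs stop
# fetching tag refs and the server stops doing that work on each clone.
variables:
  GIT_FETCH_EXTRA_FLAGS: --no-tags
```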
A
If we remove the no-tags setting on CI and nothing bad happens, that's a bit anticlimactic. But I think it is a good outcome, and I think it's worthwhile, because it's better if we can just handle this reference iteration stuff without any special assistance than if we need custom config settings on CI.

B
Are we going to notice for other Gitaly servers?
A
So I looked around a bit. I didn't look at all 60 of them, but I looked at a couple of busy ones, and they don't spend nearly as much time in reference iteration. It's also harder to tell, because if you look at the CPU profile you see a mix of traffic across different repos, and the reference iteration problem is bad for repos with lots of refs. A repo like gitlab-org/gitlab — GitLab creates too many refs, frankly, but it does, and we can't really do anything about it — and we've been doing this to the gitlab-org/gitlab repo for seven, eight years now, or nine, I don't know. So it's a bit special.
A
We have — I think Sean mentioned that we have a parallelism of 200 jobs or something, which is a bit crazy. And I know we're not the only company in the world with a massively parallel CI pipeline, but not everybody has that.
A
I mean, we won't know until it goes out, but at this point I don't expect the same dramatic change. The outcome I'm hoping for is two things: one, that we can turn no-tags off and nothing happens, which is good but anticlimactic; and the other thing I'm hoping for — and this is a bit knock-on-wood — is that we all notice.
A
I hope that we notice in our daily interactions with gitlab-org/gitlab that it's faster. Because if you do a fetch — when you do a git remote update or something — you also hit this, and then it's very noticeable: if you're human and you have to wait 500 milliseconds, or you have to wait one second, we are sensitive at measuring and experiencing those differences in latency. It's just that there's no graph for how happy Jacob is waiting for a git fetch.

A
Although there is a graph that measures part of the clone performance, and if I'm right, that will drop. But that's not really an infrastructure goal, because it's more of a developer experience goal. Well — having happy users is an infrastructure goal, but it's not like we can say: look, we cut the CPU by fifty percent on all these Gitaly servers, so we can scale them down and save X dollars a month.
A
No — well, the way I'm trying to frame it is: we discovered an improvement that's important for vertical scaling. A lot of repos don't have this problem, but we have it, so it is important that we solve it. It's also an infrastructure concern, because vertical scaling is our problem — otherwise we need to get a bigger server for the thing. We already solved it with some CI settings, but it's better if nobody needs custom CI settings, and that would be the conclusion of that epic.
A
Exactly. And I think, if you take a step back from infrastructure goals and look at company goals, it is also our goal that people just have a good experience with GitLab: they can throw whatever their workload is at it, and they don't have to do custom settings or special work to make it handle their requirements.
A
That too, yes. Ultimately, it's best karma to solve it in Gitaly itself. And, yeah — I guess the moment we pick up another repo, or another repo grows to the level of CI activity and number of refs that gitlab-org/gitlab has, we might run into a vertical scaling problem. And now we won't — at least not because of this thing, because we fixed it.
A
So I'm still working on how to say this in one sentence for an executive summary, but thanks for listening. And it sort of carries over to the other epic, where I will have to refine the same story — which is about handling the parallelism of CI clones by using a cache to de-duplicate them. Again, the use case we're targeting is highly parallel CI clones, so that's going to be effective for gitlab-org/gitlab more than for a repo that gets cloned three times an hour.
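
(For context: a toy sketch of the de-duplication idea — not Gitaly's actual implementation — where concurrent requests for the same repository await one shared in-flight pack computation instead of each running their own. A real cache would also retain finished packs for some window, which is the storage-cost question raised next.)

```typescript
// Toy request-coalescing cache: N parallel CI clones of one repo
// trigger a single expensive pack computation.
const inFlight = new Map<string, Promise<Uint8Array>>();

async function fetchPack(
  repo: string,
  buildPack: (repo: string) => Promise<Uint8Array>, // the expensive part
): Promise<Uint8Array> {
  let pending = inFlight.get(repo);
  if (!pending) {
    pending = buildPack(repo).finally(() => inFlight.delete(repo));
    inFlight.set(repo, pending);
  }
  return pending; // all concurrent callers share this one promise
}
```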
A
Although, given the nature of the cache, we don't really know yet how the storage cost is going to work out, and whether we can afford to keep a long window of cached data. Because if you have a repo that never gets updated but is expensive to clone, and every five minutes somebody shows up and clones it and it's still in the cache, then that is also a win. But maybe we can't afford to keep that data sitting in the cache for that long.
A
I don't know yet. Thanks for letting me ramble. And I'm looking at Andrew: the things I said about file-canary-1 and vertical scaling, and about gitlab-org/gitlab presenting a vertical scaling problem — is that more or less correct?
F
I think so. I wasn't paying a huge amount of attention — I'll just be totally honest — I was mostly listening, and it all rang true; there wasn't anything that jumped out at me. So yeah, I think so.
E
You know, one thing I was curious about: I saw dragons doing some recent analysis on the ci_builds table — when we'll hit the integer limit for the IDs and so on — and we're talking about breaking that table up. There was a chart of CI builds over time, which was nice: up and to the right, which is good. And I was curious about—
E
Yeah, but it's just interesting, because obviously it's grown significantly over time, and our usage has also grown over time — but presumably our usage hasn't grown as quickly as everybody else's. So how has our share changed over time? That's what I was curious about, because the smaller our share gets, the more relevant this stuff becomes for other customers as well. At the moment we are the problem customer a lot of the time.
A
I know it's already relevant for self-managed instances — I know this from back when I was on the Gitaly team, and even before that, from when I was in support — that people have CI setups that mean they need to somehow horizontally scale their GitLab installation to handle more clone traffic, and it's all because of CI. That has been a story I've been hearing for years, yeah.
A
I think this also addresses that. And this is one of the reasons why, when I realized this looked within reach — and I still think it is; I don't want to jinx it by saying it is in reach, but that's why I'm working on it — it seemed like such an obvious idea to me: from experience reports, from anecdotes, from hearing what self-managed people say, CI parallelism is a challenge.
A
And that sort of makes sense, because normal users only touch the surface of the repository, but the moment you do CI, you go into the depth of the whole size of your repository. So it's always going to be an expensive request anyway.
A
Oops — Jarv, you're next.
D
Exactly. I wanted to talk about this just because it has a scalability angle — apologies if I'm hijacking the meeting a little bit — but I wanted to share some of my findings from the recent WebSockets problem we had, where we had a thundering herd of connections after disconnects. An interesting thing happened: we issued a typical rolling reload of HAProxy, which causes the HAProxy process to fork.
D
The old process stops binding to its ports and just continues to process the connections that are in progress, and then after five minutes HAProxy kills the process. With WebSockets this behaves a lot differently, because these connections are long-lived: if you do the rolling reload of HAProxy quickly, you suddenly have a bunch of HAProxy processes all being killed at the same time, and thousands of clients reconnecting at the same time.
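
(For context: the five-minute kill described here matches HAProxy's hard-stop-after directive; a minimal sketch of the relevant configuration — the exact production value is an assumption.)

```
global
    # After a reload, let the old worker drain in-progress connections,
    # then kill it after five minutes even if long-lived connections
    # (such as WebSockets) are still attached.
    hard-stop-after 5m
```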
D
This caused a huge CPU spike on our WebSocket Puma fleet, which then caused a bunch of errors — thankfully just for WebSockets, which is not a big deal; it just means that your sidebar might not get updated. But I would say we discovered some interesting things here, and I just saw Heinrich say that there's a pull request to Rails that allows for better backoff on the client, which will hopefully mitigate this a bit.
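
(For context: a hedged sketch of the kind of client-side reconnect backoff such a change would provide — exponential delay with full jitter, so thousands of clients don't reconnect in the same instant. The function name and defaults are illustrative only.)

```typescript
// Delay before reconnect attempt N: exponential growth, capped,
// with "full jitter" (uniformly random in [0, ceiling)).
function reconnectDelayMs(attempt: number, baseMs = 1000, capMs = 30000): number {
  const ceiling = Math.min(capMs, baseMs * 2 ** attempt);
  return Math.random() * ceiling;
}
```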
D
I would say, in hindsight, it was definitely a mistake to have a small number of pods servicing the majority of the traffic. It means that whenever you terminate a pod — when we do an upgrade in Kubernetes, it brings up a new pod, waits for that pod to pass the readiness check, and then removes the old one. Well, of course, with WebSockets the connections are still connected to that old pod.
D
So as soon as that old pod gets terminated, all the connections to it go away, and all the clients just reconnect at once. I think one of the lessons here is that for these types of workloads we probably need more pods, to force a more staggered deployment.

D
That's pretty much it.
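
(For context: one way to express the "more pods, more staggered deployment" lesson in Kubernetes Deployment terms; the numbers are illustrative, not actual production settings.)

```yaml
# Fragment of a Deployment manifest: with more, smaller pods and a
# one-at-a-time rolling update, each step disconnects a smaller share
# of the long-lived WebSocket clients.
spec:
  replicas: 12
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
```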
A
Yeah — thanks. I guess this is maybe the only traffic where we have persistent connections like this? Because with Git over SSH, that's also a persistent connection, but somebody does a fetch, and after five minutes or so they're done fetching their data.
D
It's possible to hold up an SSH connection, and we always have these very long persistent connections — which is why we have HAProxy kill them after five minutes, because otherwise the HAProxy processes would just stick around forever: you can create a persistent connection and just leave it open. I think some—
A
—people do that as a performance hack, just for their own laptop experience: they're doing another git fetch, there's already a persistent SSH connection to their GitLab server, and they can just reuse it. I've seen blog posts about that years ago. But if that's the case, it would be a small number of users, because somebody has to manually set that up. With WebSockets it's more like CI: who knows how many web browsers are all running the WebSocket clients, and we're all going to reconnect.
D
Yeah, and these initial connection upgrades take up a lot of resources, whereas the WebSocket itself uses a very low amount of CPU. So it was interesting: we made the decision to isolate this service from web, so it has its own Puma workers, and in the steady state we're not consuming a lot of CPU.
D
So you don't need a lot of pods, but when you have a whole bunch of clients connecting at once, we just can't absorb it. It would have been different if we were serving Action Cable on the web fleet — we might have been able to absorb it, because we just have a lot more resources there; we wouldn't have noticed this.
A
I wonder — I don't know what the situation was with Gitter. I mean, this is a case where we control the clients, more or less, right? Because—
A
But if you have an uncooperative client, we could also do something like inject latency on the connection in Workhorse. Because if the CPU problem is in Puma, then Workhorse could see these things and say: oh, you're a WebSocket, so I'm going to make you wait 50 milliseconds, or some random amount.
F
I think just relying on the JavaScript is the simplest, least surprising solution to that. But the other thing that's really nice about the way we use WebSockets is that injecting that latency is not a big deal, because it's not like chat, where you're waiting for the next thing — if it's slightly delayed, it doesn't matter too much, which is quite nice.
E
Yeah — I think as we use this more, that might become an issue, but we can cross that bridge when we come to it.
E
There was an interesting suggestion from Craig in there as well: never keep the WebSocket connections open for that long on the client side — always reconnect after a certain amount of time. Because Heinrich noticed that, since we only have the three pods at the moment, the traffic doesn't end up very balanced after the restart: they all reconnect at one time, when not all the pods are available.
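
(For context: a sketch of that suggestion as described — proactively recycle each WebSocket after a jittered maximum age, so connections re-spread across pods over time. Names and values are illustrative only.)

```typescript
// Close the socket after roughly maxAgeMs (plus jitter, so clients
// don't all recycle at once); the client's normal reconnect logic
// then opens a fresh connection that can land on a newer pod.
function recycleAfterMaxAge(socket: WebSocket, maxAgeMs = 30 * 60 * 1000): void {
  const jitterMs = Math.random() * 0.2 * maxAgeMs;
  setTimeout(() => socket.close(), maxAgeMs + jitterMs);
}
```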
F
It's exactly the same as the conversation we had about recycling the database connections from the client — from the Rails application to PgBouncer — because if there's a deploy, it's the same sort of problem.
A
So do we have enough control over the JavaScript to do these things? Because, Andrew, you were saying making the client behave better is the obvious thing to do, but then we have to wait for a Rails change to make that happen. And that's the nice thing about having Workhorse: we have a server middleware layer where we can mess with things, and we don't need the cooperation of the client or anybody, because we sit there — we can do what we want. Yeah.
D
I was going to say: we could rate limit this at the edge, at HAProxy. We can rate limit new connections, and that would protect Rails from this. But I don't know if that's a good idea.
F
I would be very surprised if this is not something that's already in Action Cable — some sort of pluggable configuration or some sort of backoff — because everyone has this problem, and Action Cable is quite well used, as far as I understand. So I'd be very surprised.
E
Yeah — all I was going to say, building on that, is that it might be a bit fiddly to get the JavaScript from Rails patched. But the hardest it would be is that we create a fork of Rails, apply the patch to the JavaScript there, publish our fork of Rails as a gem, and use that — which we've done for other libraries. Obviously this is a big one, but you know.
E
I haven't actually looked at exactly how we'd be able to separate pulling in the assets from the Rails back-end code. But if it came to it, we could do it just by replacing the whole gem. So it should be possible.
A
Yeah — great that we figured that out. Thank you.
D
I don't think so. I mean, especially after this client change, it's hard to see anyone having the same problem — especially since it will be rare for people to have a separate fleet for WebSockets, right?
F
So I just went through an Action Cable tutorial that uses straight WebSockets, and what I was surprised by is that there's almost no layer on top of WebSockets — it's kind of raw WebSockets that Action Cable seems to be doing. So we might even just be using our own straight WebSocket, in which case we have total control over that. I don't know — there's a tutorial here of building an application that talks to Action Cable with a vanilla JavaScript front end, and Action Cable is super basic. It doesn't have the things that, say, the Bayeux protocol or SocketStream or whatever have — there's a whole bunch of protocols built on top of WebSockets, because a WebSocket is basically like a TCP connection, right? So it's pretty basic, and a lot of people have built protocols on top of WebSockets to give you a whole bunch of features that we take for granted these days.
A
Well, as long as you can — I guess it depends on who makes the connections on the client side. I don't know JavaScript frontend programming well at all, but something has to decide to connect, and—
A
Yeah — no, the idea of periodically disconnecting, I think that is very, very nice.

F
The only risk with that is that, because it's unlike anything else — you know, like Bayeux or SocketStream or any of those protocols — one of the things they have is: if you disconnect and reconnect, or you go through a tunnel on a train, when you reconnect on the other side you'll get your data. With a plain old WebSocket you don't have any of that. So disconnecting and reconnecting — obviously the likelihood is small, but any events that happen during that touch point you're going to miss. So the more you disconnect and reconnect, the more likelihood there is of that, right?
A
In the context of chat that's important, because you're missing chat messages. But I guess we're coming from a place where you have an idempotent communication mechanism anyway: if you miss an update, you still get the current state of the things that changed, since it's for live-updating a web page.
E
Yeah, there's another issue with that — and I don't think we're necessarily going with it as our first option anyway — which is what Jarv mentioned earlier: it's the initial connection that has most of the server load, and we would be deliberately doing more initial connections here.
E
So we'd have to trade that off. At the extreme point it's no better than just polling a JSON endpoint with no caching, because we'd just be doing all the initial work each time — well, it's slightly better than that, but you get the picture. So we'd have to trade that off and see what works.
E
I guess: have a good day, everybody, and I'll assemble the recordings.