From YouTube: Scalability Team Demo - 2021-03-25
B
Yeah, thank you. So I wanted to... well, I'm very unprepared again, and I found a bug in recording rules while I was preparing, or at least trying to, but anyway. So I want to show where I'm at with recording error budgets for stage groups. Let me share my screen.
B
This is the issue that I was working on. As you see, we started by making sure we've got all the source metrics in place. Those metrics are error rates, total operation rates, apdex success rates, and total apdex measurement rates, which we're calling "scores" for now, and then a mapping of feature categories to stage groups.
B
So we've got all those in place now, and next we're going to aggregate all of these metrics that we have for feature categories up into stage groups.
B
The meat of the changes here lives in the aggregation sets. These are some rates that we forgot, but this is the main thing: we're defining the stage groups aggregation set, and then later we're going to take the feature category aggregation set, use it as a source for this one, and then add the mapping. On top of that, I added this bit, and this is the thing that I would like some input on from all of you.
B
If anybody has a thought: so, this is the mapping. It contains the feature category label, the stage group label, and the product stage label that we want to have on the metrics that we'll be recording here, and the way I'm adding that is by joining it in a string like this.
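The join being described, attaching stage group and product stage labels from a feature-category mapping while aggregating, can be sketched in Python (the real rules are PromQL/jsonnet; the category and label names below are invented for illustration):

```python
# Hypothetical mapping of feature category -> (stage_group, product_stage).
# In the real recording rules this mapping is generated, not hand-written.
MAPPING = {
    "source_code_management": ("source_code", "create"),
    "code_review": ("code_review", "create"),
    "continuous_integration": ("pipeline_execution", "verify"),
}

def aggregate_to_stage_groups(per_category_rates):
    """Sum per-feature-category rates up into per-stage-group rates,
    attaching the stage_group and product_stage labels from the mapping."""
    out = {}
    for category, rate in per_category_rates.items():
        stage_group, product_stage = MAPPING[category]
        key = (stage_group, product_stage)
        out[key] = out.get(key, 0.0) + rate
    return out

# Invented per-category error rates.
rates = {
    "source_code_management": 0.2,
    "code_review": 0.1,
    "continuous_integration": 0.5,
}
print(aggregate_to_stage_groups(rates))
```

In PromQL terms this corresponds to a `group_left` join against the mapping metric before summing; the sketch only shows the shape of the relabel-and-aggregate step.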
C
Yeah, that's where I'm at, and I think I really need to go through this and mess around with it on my computer, because, yeah, I think that's the best way for me to give input into this. If you...
B
Okay, because you asked nicely: one thing that I did want to point out, that is a bit weird, that I just noticed before the call (which made me say that I'm not prepared), is that we want to record a success rate here, and for some reason we're summing the weight. So I need to see what's going on there, and whether that's a problem with other recordings as well. Yeah, that's where I'm at for now.
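The bug being described, a success-rate recording that sums a weight instead, is presumably a variant of the following distinction; a sketch with invented numbers of why a rate must aggregate numerator and denominator separately:

```python
# Per-component success and total operation counts (invented numbers).
components = [
    {"success": 90, "total": 100},
    {"success": 45, "total": 50},
]

# Correct: aggregate the numerator and the denominator separately,
# then divide, giving a true success rate for the whole group.
success_rate = (sum(c["success"] for c in components)
                / sum(c["total"] for c in components))
print(success_rate)  # 0.9

# Incorrect: summing per-component ratios (or weights) produces a number
# that is not a rate at all once there is more than one component.
summed_ratios = sum(c["success"] / c["total"] for c in components)
print(summed_ratios)  # 1.8
```

The same pitfall exists in PromQL: `sum` of a ratio-valued series is meaningless, whereas a ratio of two `sum`s is not.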
B
Stop sharing. I can't stop sharing.
B
The next one is Jacob, I think.
A
Okay, thanks, Bob. Now I'm going to start sharing. I have two items. One is about an experiment we did where we bypassed the CI pre-clone script.
A
Actually we don't, but this was the change issue for it. So, the pack objects cache we're building: we're not calling it "pack file cache" anymore, because pack files also are...
A
I don't even want to go into why it's confusing, unless somebody wants me to, but I think it's confusing to call it "pack file cache", so I'm calling it "pack objects cache" now.
A
Well,
the
thing
that
prompted
this
project
was
project
was
the
realization,
on
my
part,
how
important
the
ci
pre-clone
script
is,
which
is
a
custom
thing
we
have
for
gitlab
or
gitlab,
without
which
the
server
that
the
gitly
server
that
github
or
gitlab
sits
on
melts
down
because
of
all
the
ci
clones
and
we've
been
experimenting
with
the
cache.
The
back
object
back
objects
cache
in
the
past
couple
weeks
and
sort
of
as
the
final
experiment.
A
It
was
a
bit
tricky
to
enter
is
excited.
It
was
a
bit
tricky
to
figure
out
how
to
bypass
this,
because
at
first
I
tried
to
tweak
the
script
so
that
it
would
exit
without
doing
stuff
percentage
of
the
time.
But
I
ran
into
all
sorts
of
bugs
in
the
script
and
I
realized
after
a
while.
I
just
need
to
not
touch
this
script
because
it
works
and
the
moment
you
change
anything
it
might
stop
working
and
then
just
don't
do
that.
A
So
this
is
for
the
window
where
it
happened.
You
can
see
when
we
made
the
switch.
So
here
it's
around
50
megabytes
per
second-
and
here
we
are
in
the
200
to
250
megabytes
per
second
on
the
network
network
egress
rate,
and
the
reason
for
that
is
that
the
pre-clone
script
works
by
making
the
ci
runners
fetch
less
data
and
it
when
I
do
spot
checks,
it
looks
like
they
fetch
between
10
and
100
kilobytes
of
data
each,
and
that
is
great.
A
But if you don't do that and you do a partial clone, then you fetch 120 megabytes of data.
A
So each of these CI runners was fetching maybe a thousand times as many bytes, and that doesn't really add up, because 250 is not a thousand times 50; you'd expect more. But it does make sense that the network egress rate goes up a lot, because all these builds are now much bigger.
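The arithmetic here can be made concrete (all numbers are the rough figures from the discussion): per-clone bytes went up by a factor of roughly a thousand, but total egress only went up about fivefold, which is consistent with CI clone traffic having been a small slice of total egress before the change.

```python
# Rough figures from the discussion.
before_per_clone_kb = 100   # ~10-100 KB per CI fetch with the pre-clone script
after_per_clone_mb = 120    # ~120 MB per partial clone without it
before_egress_mb_s = 50     # total network egress before
after_egress_mb_s = 250     # total network egress after

per_clone_ratio = (after_per_clone_mb * 1024) / before_per_clone_kb
egress_ratio = after_egress_mb_s / before_egress_mb_s

print(per_clone_ratio)  # ~1229x more bytes per clone
print(egress_ratio)     # only 5x more total egress

# If CI clones alone had dominated egress before, a ~1000x per-clone
# increase would have blown the total up far more than 5x, so most of
# the 50 MB/s baseline must have been non-CI traffic.
assert egress_ratio < per_clone_ratio
```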
A
Well, if you scroll down, you see something interesting here. This is where we were bypassing the pre-clone script, and there are more open file descriptors, there are also more threads, and you can see here in this one that there are more upload-pack processes now.
My guess, slash hedge, at the explanation for this is that, because we're downloading more data, each upload-pack process runs for a longer amount of time, just because sending 120 megabytes takes longer than sending 100 kilobytes. That means all these processes that would normally finish one after the other will now overlap, so that's why there were more processes. But if you look here, the CPU does not look much worse, although there is that interesting spike there, and I'm not sure what that's about. But yeah, in the grand scheme of things, we would have had 100% CPU and incidents and all sorts of stuff going wrong if we had done this in December.
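The effect described, more concurrent upload-pack processes simply because each one now runs longer, is what Little's law predicts: average concurrency is arrival rate times average duration. A sketch with invented numbers:

```python
# Little's law: average number of in-flight processes L = lambda * W,
# where lambda is the arrival rate and W the average time each runs.
def concurrent_processes(arrivals_per_second, seconds_per_process):
    return arrivals_per_second * seconds_per_process

arrival_rate = 10.0  # invented: upload-pack invocations per second

# Sending ~100 KB finishes quickly; sending ~120 MB takes much longer,
# so at the same arrival rate far more processes overlap.
small_sends = concurrent_processes(arrival_rate, 0.05)  # 0.5 in flight
large_sends = concurrent_processes(arrival_rate, 5.0)   # 50 in flight
print(small_sends, large_sends)
```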
A
So the fact that this just held up is nice. And one more thing before I hand over the... stop talking. So that's the network transmit rate; it is very clear that it went up. This metric shows the number of bytes served by the cache, and I need to change that to a 10-minute window; then it looks the same. So that goes up pretty much as much as that.
A
There is a little bit of a difference. And then the last graph I wanted to show was the disk size of the cache. There you can see that here it dips to almost zero more often, while when we were doing the bigger clones it never dips to zero. So you can tell from that that there are more bytes in there, but I was bracing myself for a bigger impact on the number of bytes that go into the cache, and that was just, sort of, barely... well.
B
Could you, like... I'm trying to think about the first graph you showed, the one you didn't expect. What was it, the...
B
Could it be that the reason there isn't such a difference there is that, in the grand scheme of things, lots of other things, not CI, are also generating bytes from this project?
A
Yeah, but still, there would be... yeah, there is other traffic, clearly, and whatever CI is doing in terms of bytes apparently is not as much as the other traffic, because...
A
Yeah, yeah, I mean, that's the amazing thing about this cache. The cache is relatively complex, but that's because it tries very hard to do the work exactly once. If you naively build a cache, you can have two things that produce the same cache entry, and then one of them gets thrown away; it's built the way it is so that you never throw something away.
A
You only do the work once. But still, we're going from putting things into the cache that are 100 kilobytes to putting things into the cache that are 120 megabytes.
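The "do the work exactly once, never throw an entry away" property being described is the single-flight pattern. Gitaly itself is written in Go; this is a minimal thread-based sketch of the idea in Python, with hypothetical names, not the actual Gitaly implementation:

```python
import threading

class SingleFlightCache:
    """Deduplicate concurrent requests for the same key: the first caller
    computes the entry, later callers for the same key wait and reuse it,
    so the expensive work runs exactly once and nothing is thrown away."""

    def __init__(self):
        self._lock = threading.Lock()
        self._entries = {}  # key -> (done event, result holder)

    def get(self, key, compute):
        with self._lock:
            entry = self._entries.get(key)
            if entry is None:
                entry = (threading.Event(), [])
                self._entries[key] = entry
                leader = True
            else:
                leader = False
        event, holder = entry
        if leader:
            holder.append(compute())  # only the first caller does the work
            event.set()
        else:
            event.wait()              # everyone else waits for that result
        return holder[0]

calls = []
def expensive():
    calls.append(1)
    return "packed-objects"

cache = SingleFlightCache()
threads = [threading.Thread(target=cache.get, args=("repo.git", expensive))
           for _ in range(5)]
for t in threads: t.start()
for t in threads: t.join()
print(len(calls))  # the expensive work ran once
```

A naive cache would let several of those five threads compute the entry and discard all but one result; the per-key event is what guarantees exactly one computation.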
A
No, I am, I am happy. I think I am a little bit nervous about the egress rate, because...
A
Yes, exactly, and there is going to be a limit at some point on how many bytes we can pump out of that one Gitaly server. So maybe at some point we'll need rescaling just for the network pipe, but there's always going to be a bottleneck somewhere, and I guess, let's go ahead until Google tells us that the network is...
A
But there is a network card on that, on that Gitaly server... magic.
A
Yeah, I have no idea what will happen, but yeah. Yes, Bob, I am happy; it's just kind of amazing that...
C
How does this tie in with Praefect distributing reads?
A
Yeah, if it was sitting behind Praefect, then on each Gitaly replica... yeah, yes, that'd be a separate cache.
A
Yeah, and if there are projects where the number of concurrent clones is lower than the number of replicas, then you're not going to see a benefit from the cache. But with... yeah.
A
No, exactly, the hit rate goes down, but you still get the benefit of only one process doing that job: the deduplication of the processes for each individual replica, yeah.
A
I noticed something when I started writing the documentation for the feature, which is that, and this is probably not going to be true in the general case, we've been doing the experiments on the Praefect cluster and on canary, which both have very few repos on them. Most normal Gitaly servers have lots of different repos on them, but these are sort of, I don't know, test beds where we don't have a lot of different repos. The reason I'm saying this is that because there are not a lot of different repos, there's not a lot of Git data on disk in the grand scheme of things, and it all fits in memory, so we're not going to stuff the page cache full of Git repository files. Because the interesting thing that's happening...
A
...at the time, it's 100 kilobytes per second here, peaking, on disk reads, and in this graph we're serving over 200 megabytes per second. And just to be clear, this is a counter that correlates directly to read system calls, so we're asking the Linux kernel: please read these bytes from a file. So this is from the point of view of the read system call.
A
Praefect is still pumping out a lot of data here, right; I'm looking at the same servers. And the cache: in principle it's on disk, but in practice it's in RAM, because the Linux page cache keeps it in RAM. If we had put the cache in object storage, we wouldn't have had this effect, because there is no page cache for object storage; but because we're reading from disk, as long as there's enough RAM available, Linux puts the cache in RAM for us.
A
Yeah, it's quite... and I've been thinking about it more, because even if we were memory constrained, there's the effect that we have the producer writing into the file and all the consumers must read from the file. So all the pack objects data must go through the file in a way, but because of the page cache, if you have a reader that's right behind the writer...
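The producer-and-trailing-consumers-through-a-file pattern being described can be sketched like this (a simplification: in the real cache the readers are separate processes, and the point is that a reader right behind the writer is served from page-cache-hot pages rather than disk):

```python
import os, tempfile

# One producer appends pack data to a file; a consumer reads from its own
# offset, trailing just behind the writer. Because the writer's pages are
# still hot in the page cache, a reader this close behind rarely hits disk.
path = os.path.join(tempfile.mkdtemp(), "cache-entry")

writer = open(path, "ab")
reader = open(path, "rb")

received = b""
for chunk in (b"pack-header ", b"pack-data ", b"pack-trailer"):
    writer.write(chunk)
    writer.flush()             # make the bytes visible to the reader
    received += reader.read()  # reader catches up to the writer's offset

writer.close()
reader.close()
print(received)  # b'pack-header pack-data pack-trailer'
```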
A
Well, more data will be paged in from disk, but Linux will try to balance it somehow. The one thing that can happen that is less than ideal is that repository data gets paged out that then needs to get paged back in, but...
A
A page fault isn't about, like... the kind of excess memory, if you want, is all used for buffers, and that tends to shrink down, and that's not page faults, is it?
A
I think a page fault is when you want to read something and it's not in the cache, and you have to get it from disk, irrespective of what it's for. Okay, but, okay, Matt will help.
A
One thing I want to say here is that I'm not sure how much control we have. There are some tunables in the Linux kernel over how it chooses what to do with these buffers, but...
C
The main tunable I was thinking of was actually the amount of memory that we have, you know. Would it make sense in some cases to have extra memory on the machines, because, you know, they would maybe...
A
So, what's the next step? The next step is that we make the cache configurable via Chef, because right now it uses hard-coded config values and you can only turn it on and off with a feature flag. It should have a config value that says on or off per server, so I need to do a couple of merge requests for that, and write some documentation. But I think the moment it is configurable via Chef, I'm going to raise an issue to have it rolled out.
F
So yeah, sorry, I might be missing something, but is there a reason why we couldn't use the feature flag to just roll it out to a percentage of users now, in parallel?
A
If we wanted to roll it out now with the feature flag, we could; it's just that we don't have very fine-grained control with the feature flag.
A
One thing that I'm slightly concerned about is the file-hdd servers, because they have slower disks, and we don't have a way to say "this is on everywhere except on file-hdd". Right now we can only say it is on for projects, or for a percentage of projects, where the set of projects gets picked randomly by the feature flag mechanism, or it's on for a percentage of the time; but we cannot turn it on and off for individual servers.
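For reference, percentage-of-actors feature flags typically work by hashing a stable actor identifier into buckets, which is why the enabled set of projects is effectively random and there is no per-server dimension to target. A simplified sketch of that mechanism (not GitLab's actual implementation; the function name and hashing choice are invented):

```python
import zlib

def flag_enabled_for(project_id: int, percentage: int) -> bool:
    """Deterministically enable a flag for roughly `percentage` percent of
    projects by hashing the project id into one of 100 buckets."""
    bucket = zlib.crc32(str(project_id).encode()) % 100
    return bucket < percentage

# The same project always gets the same answer, so the rollout is stable,
# but which projects land in the enabled set is effectively arbitrary:
# there is no way to express "every project except those on file-hdd".
enabled = [p for p in range(1000) if flag_enabled_for(p, 25)]
print(len(enabled))  # roughly a quarter of the projects
```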
F
Right, so maybe, if you feel comfortable doing this, I would rather want to see this running in parallel.
F
If we can do a percentage of traffic, right: if you run this and we kind of observe it while the configuration is being written, we're going to get the best of both worlds. We might catch edge cases, and if we see edge cases, we can then stop the configuration work, because it's more important to actually fix the actual operations.
A
Right, yeah. So the idea of the experiments so far has been to look for edge cases, and at this point I think we're going to be fine with edge cases. As for the configuration work: the Gitaly merge request is already approved by one reviewer, and the omnibus merge request, I need to... I want to check it manually. But what I'm trying to say is that the configuration work is almost done, so I'm not sure how much time we gain by starting a percentage rollout.
F
The gem problem we're in, like, we're digging ourselves out of it, but apparently there is also a big OpenSSL vulnerability announcement today, so that's going to be fun. So, my point being: there are delays with everything else that is happening, and given that you actually have a tool to start rolling this out and see what kind of pressure it creates in the infrastructure, right, like, if we, in the situation where we are at now, can enable this and it still works and the platform works...
A
So I'm not blocked yet, but the moment I get blocked, I can raise an issue, because I've already started a bunch of things; it doesn't make sense to drop them if I can keep moving them along, right?
A
This is great. I think that... wait, now... that was it, unless... that was my part of the agenda. My part.