From YouTube: Scalability Team Demo - 2021-10-28
A
Right, I've got the first item. I wanted to talk a bit more about, I think, what I just talked about last time, but I've been away for a week, so I've not been working on this for that long. When we were doing the test to see what one queue per shard would look like, in terms of whether that would be a valuable project to do, Craig created a pair of instances (I think he actually created three) to simulate what we do with Sidekiq on our Redis instance, and then say: okay, if we have one queue per worker, and we have this many jobs for this many workers, and they take approximately this long per worker, what happens? And if we have one queue per shard, what happens? So I'm just grabbing that and updating it to see what happens with different scheduled set permutations.
A
B
A
So that's what's been taking me the time so far: figuring out whether my results are actually valid or not. I think I'm getting there, but I still need to tweak the actual total amount of traffic to get the right load in the first place. I'm gonna share... well, I'm actually running one now, so that's a good example.
A
So if I just spin this back a bit further... so yeah, one of the issues is that I run an experiment and I want to leave a gap between experiments, to, you know, make it clearer where the demarcation is, so we can ignore all of this to the left.
A
These three here are with the new Sidekiq 6 scheduler that we had issues with. So this is with no scheduled jobs, and you can see I've set the base load probably too high here, because this is one... this is with half of all jobs being scheduled, and this is with 100 percent of jobs being scheduled, and you can see there's no difference between those two, again probably because the underlying base load was too high.
A
What is interesting (I made this a stacked chart to make it clearer; let me make it unstacked) is that the user time was pretty much the same in both of those. It was the system time that went up, and if we look at the Redis exporter for those, we can see that it was similar to what we saw in production, where we have a huge amount of ZREM commands because of the way the Sidekiq 6 scheduler works. And so a lot of that... I haven't taken any profiles,
A
so I don't know this for sure, but a lot of that is probably due just to the additional network overhead of running.
A
A
This is, I think... this one was ramping up slower, so this was going: no jobs are scheduled, 30 percent of jobs are scheduled, 50 percent of jobs are scheduled, 70 percent of jobs are scheduled, and I was like: wait, I don't have enough headroom here to see the difference, so I'm gonna have to back out and reconsider. And the distinction here is that it's user time that's going up, not system time, so the blue peaks here. And down here we can see the rate of commands
A
we're sending to Redis is much lower. This is EVALSHA, this is the... so that's it. We should have as many ZADDs as ZREMs, basically, because ZADD means scheduling a job and ZREM means de-scheduling it, running it. So I'm just running one now as well.
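(As a rough illustration of the traffic pattern being described, here is a minimal Python sketch, assuming redis-py and invented key names, of a Sidekiq-6-style scheduled set: scheduling is one ZADD, and the poller pulls due jobs with ZRANGEBYSCORE and claims them with ZREM. This is not the actual test harness.)

```python
# Minimal sketch (not the actual test harness) of the scheduled-set pattern
# described above, using redis-py; key and queue names are invented.
import json
import time
import redis

r = redis.Redis(host="localhost", port=6379)

def schedule_job(job: dict, run_at: float) -> None:
    # Scheduling a job is one ZADD into the scheduled sorted set.
    r.zadd("schedule", {json.dumps(job): run_at})

def poll_schedule_once() -> None:
    # A Sidekiq-6-style poller fetches one due job and then tries to ZREM it.
    # Only the process whose ZREM succeeds enqueues the job, so a healthy run
    # shows roughly one ZREM per ZADD, and every extra polling process adds
    # more ZRANGEBYSCORE/ZREM round trips (the command-rate growth on the charts).
    due = r.zrangebyscore("schedule", "-inf", time.time(), start=0, num=1)
    for payload in due:
        if r.zrem("schedule", payload):        # de-schedule the job
            r.lpush("queue:default", payload)  # hand it to a worker queue

if __name__ == "__main__":
    schedule_job({"class": "ExampleWorker", "args": []}, run_at=time.time())
    poll_schedule_once()
```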
A
That's why this is changing when I hit execute, instinctively, because I want to see what's going on. So all I'm doing with that is prepping to see what happens when we change to only scheduling from certain processes, so not scheduling from as many processes as we have now. And I can also, which I haven't done yet, look at the... I've got the Sidekiq server logs available to me.
A
So I can look at those and see what that does to scheduling latency as well. But, like I said, most of my time on this this week has been spent finding that what I thought was valid actually wasn't valid, then thinking: right, I've totally fixed that, and then finding another problem. I think now I'm at a point where I don't have any problems, but I thought that on Tuesday and on Wednesday too, so who knows.
D
Is that including the change that Heinrich... I think Heinrich did the... well, I think you merged it, or did I merge it? I don't know. But the idempotent dropping of the item when scheduling, or what was it? The Lua script.
A
D
A
So that's EVALSHA here, so that's these.
A
So yes, that does include that. I was just showing the difference between the two schedulers initially, to show the initial difference, and now I'm looking at: using the scheduler we have, if we do that in fewer places, what happens? But I want to make sure my results are actually correct first, so...
A
If there's no questions for that, then I think it's Bob.
D
I wanted to show how we want to introduce the new customizable request duration Apdex SLI into error budgets. It's already in the services, so the services have that SLI now and we will use it for alerting and so on. That's already running, but we wanted to include those in the error budgets for stage groups, and we wanted to give people time to adjust their thresholds on the endpoints that they have before feeding that into the error budget, and that's mostly done.
D
How that'll work is: I've added one here... let's see, where's a stage group... package is a stage group, and I've added the key ignored_components. Components refers to the component label that we have on our metrics, and when you set that, that's going to result in these kinds of rules.
D
D
Yeah, so here we see the package group that has the rails request component ignored, as we see here, and we do that with a separate recording for each group, and we do that by selecting here and then negating here. I don't know if there's any thoughts about that, because the annoying thing is that now we have this separate recording for each group.
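(For illustration only, a small Python sketch of the idea just described: an ignored-components list per stage group turns into a recording-rule expression that negates the component label. The metric name and rule shape here are placeholders, not the real metrics catalog output.)

```python
# Hypothetical sketch of the "ignored components -> negated selector" idea.
# The metric name and rule layout are placeholders.
def stage_group_error_rate_expr(stage_group: str, ignored_components: list[str]) -> str:
    selector = f'stage_group="{stage_group}"'
    if ignored_components:
        # Negate the component label so ignored components don't feed the
        # group's error budget (service-level monitoring keeps using them).
        selector += f', component!~"{"|".join(ignored_components)}"'
    return f"sum(rate(sli_errors_total{{{selector}}}[5m]))"

# One recording rule per group is what makes this a bit unwieldy:
print(stage_group_error_rate_expr("package", ["rails_requests"]))
# sum(rate(sli_errors_total{stage_group="package", component!~"rails_requests"}[5m]))
```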
C
C
D
We want people... do we want to... So these are already included for service monitoring and so on. Those...
D
just stay the same; they are not included in error budgets, because there we wanted to have the, you know, the meta from... yeah.
C
C
C
D
Because some teams... well, right now we're waiting for teams to start setting thresholds. I merged the first merge request related to that yesterday, I think, and we want to give them time before feeding it into the budget, because otherwise, if we just turn it on, then everything's going to be red and people... right.
C
D
The reason I did it with these two, with this separation, so that feature category metrics stay the same and stage group metrics don't, is that I want to at some point build like a group overview dashboard, like we have the service overview dashboard that shows the metrics on the left, and then you can basically click components on and off using a template.
D
I'm hoping that's possible; in my mind it is. So then you could basically see what things would look like if you enabled or disabled certain components from your error budget, and then you can remove the exclusion in the teams yaml when you think that it's close enough. That's the idea.
C
No, it's not directly related to this, but something that I've been trying to do in the engineering allocation meeting, and I want to bring this up because you might have a better way of doing it, is that I've been trying to kind of connect the error budget to the user experience, every week.
C
What I do is I say, you know, these are the five violations, and it's pretty hard to violate now with the apdex as it is, but we still get a few. And what I'm trying to do is kind of build up trust that these numbers are real, and so I'll say: here are the violations, and this is the user experience that people have. You know, this is...
C
C
C
So we've got, you know, this report, which has got all the error budgets goodness in it, and then you'll get something like this Threat Insights one over here, right, which is a violation, and then kind of what I've been trying to do is actually work back from that to: hey, this is what the users are experiencing.
C
You know, and so in that case I could see immediately that the problem was pages, and then it helped me kind of go into pages. But obviously I've got a lot of tribal knowledge that lets me put together this query and get to here and then kind of go from there. The teams don't have that. So I was wondering: there's nothing on the stage group dashboards that has this breakdown, is there?
D
C
D
D
A
D
A
D
D
C
D
C
B
As people start using the error budgets, but also as we start using the error budgets and the data that's been collected, I think we're seeing that there are certain additional bits of tooling that would be helpful to have, and there's a couple of issues that need to be raised about that additional tooling and how it could be helpful.
C
Yeah, that attribution approach, right, where it's like a ratio on the number of requests, is also really helpful, because, the way that I was doing it there, you might get a feature category that's like 50 percent or something really poor, but it's only got like 20 requests a month. Whereas the attribution approach takes the total error budget and what percentage each subsection contributes.
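(A toy illustration of the difference being described, with invented numbers: ranking feature categories by their own failure ratio lets a near-zero-traffic category look worst, while the attribution approach weights each category by its share of the total error-budget spend.)

```python
# Toy illustration of the attribution idea; all numbers are invented.
categories = {
    # name: (requests_in_period, failing_requests_in_period)
    "threat_insights": (2_000_000, 8_000),
    "pages":           (  500_000, 1_500),
    "tiny_feature":    (       20,    10),   # 50% failure rate, negligible traffic
}

total_requests = sum(req for req, _ in categories.values())
total_failures = sum(fail for _, fail in categories.values())

for name, (req, fail) in sorted(categories.items(),
                                key=lambda kv: kv[1][1], reverse=True):
    own_ratio = fail / req                # looks scary for tiny_feature
    attribution = fail / total_failures   # share of the overall budget spend
    traffic_share = req / total_requests
    print(f"{name:15s} ratio={own_ratio:6.1%} "
          f"budget_share={attribution:6.1%} traffic={traffic_share:6.1%}")
```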
C
Yeah, it's like a really ugly query, yeah.
C
C
Every week I find like one or two new infradev issues, because I'm basically starting with the error budget and then going backwards from that to what's the cause of this, and there's always new things. And one of the PMs said to me: you just seem to be finding all these bugs that we didn't know about. And I thought that was such a nice comment, because we seem to be front-running our bugs, and it's purely down to those error budgets, right.
D
A
D
F
C
D
C
F
C
Run this query and it'll tell you. But it'll be much better when I...
C
D
When I review things that are related to stuff like this, I tend to drop the Thanos link in, like: when this is deployed, click here to see the effect.
C
G
Okay, so it's my turn. I'll share my screen. So Rob and I have been working on an interesting issue for the last two weeks, and this is about evaluating the different approaches to scale up our Redis instances. We are trying to capture the production traffic and then replay it as a load test, to test different approaches, different systems, to serve our production scale. And it is a little bit different from what we have been doing before, because we are trying to evaluate serving the data using Redis Cluster.
G
The key distribution is really important. So that's why we really want to capture the production level of load, and then we try to capture the key names, the data size, and the access pattern of each key, so that we can generate a realistic profile. And as we said, it's not only about the traffic or the request rate; it's all about the key distribution and the commands we use to access each key.
G
So one of the things we are trying to do is to sniff into the Redis instance and try to capture the Redis traffic, and there is already a really excellent guideline on how to do this. So basically, after we sniff into the Redis instance, we generate a pcap file with tcpdump, and then we use a simple script in the runbook to analyze the traffic, and then we can get a full list of keys by frequency and commands.
G
We are trying to run that against Redis, and Rob and I are working on this. On top of that, we got some progress: we refactored the script to capture not only the request but the response as well. The original script doesn't capture the response, so we have to match the request and response to analyze the data from the pcap file.
G
We split the file into a folder of smaller files, and each file contains some kind of raw data from Redis. This data is in the Redis protocol, and for each request we have two files, one file for the request and another file for the response, and for each we have an index file. And we have to parse the request file to get the...
G
...the response file. So we try to parse the request file here, with the RESP protocol, and then we try to look back into the response file, using the timestamp here to match the response. And after that we can generate something like this one. So basically, we try to reassemble all the pieces from this raw data back into the original data, and then we try to generate the key pattern files from that. So after this we get a file that looks like this.
G
So this file contains an array, and each item contains a hash of the key pattern. For each key pattern we have a value size, which is the size of the data we set to Redis, and a response size, which is the size of the response we receive from Redis, and some other data. An important one is the unique key frequency, which is the distribution of the keys within the pattern, and the total uses of the key.
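(A small Python sketch of how such a key-pattern file could be consumed; the field names follow the description above, value size, response size, unique key frequency and total uses, but are assumptions, as is the file name.)

```python
# Sketch of consuming the key-pattern file described above. Field names are
# assumptions based on the description, not the exact schema.
import json
import random

with open("key_patterns.json") as f:
    patterns = json.load(f)   # assumed: a list of {pattern: {...stats...}} entries

def weighted_patterns(patterns):
    """Yield (pattern, stats) pairs weighted by how often the pattern was used."""
    flat = []
    for entry in patterns:
        for pattern, stats in entry.items():
            flat.append((pattern, stats))
    weights = [stats["total_uses"] for _, stats in flat]
    while True:
        yield random.choices(flat, weights=weights, k=1)[0]

sampler = weighted_patterns(patterns)
pattern, stats = next(sampler)
print(pattern, stats["value_size"], stats["response_size"], stats["unique_key_frequency"])
```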
G
So from that data we try to analyze and get a big picture of what we are trying to do. I did some analytics on this: when we capture about 30 seconds of data of production traffic, we get about 600,000 requests, equivalent to about 20,000 requests per second, and then we get an overall picture of the Redis load profile and a profile for the Redis data.
G
We are trying to feed the file into a load-testing system, and we're using k6 (k6.io) right now to generate the load test. So basically it allows us to write the load test based on a set-up scenario. It has some... okay, with k6 the scenario looks really simple, like this: it allows us to declare the scenarios in JavaScript, and then we run the system against our Redis instances.
G
However, it doesn't support Redis out of the box, so we have to write a thin adapter on top of it in Golang, to plug into this JavaScript scenario file, and then we try to run that, and the result is interesting. So let's see the test... the script is really simple (well, it's not that simple), but we pass the key pattern file in there.
C
G
We try to translate the key patterns into different scenarios. Each scenario is, in essence, a combination of the command and the key pattern, and then we repopulate the data before we run the load test, and after that we apply the pattern to issue the raw command against Redis here. So we implemented a thin adapter in Golang; right here is the Redis adapter that takes our command and issues a real Redis command, and after that we just need to run the test file and it will perform everything automatically.
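(The actual setup uses k6 scenarios in JavaScript plus a thin Go adapter; as a rough stand-in in Python, this is the shape of the flow: repopulate data, then replay weighted command/key-pattern scenarios against Redis. All names and weights are illustrative.)

```python
# Very rough Python stand-in for the k6 + Go adapter flow described above.
import random
import redis

r = redis.Redis()

scenarios = [
    # (command, key template, weight) - in the real setup each scenario also
    # carries its own iteration rate.
    ("GET",  "cache:user:{id}", 70),
    ("SET",  "cache:user:{id}", 20),
    ("ZADD", "schedule",        10),
]

def populate(n: int = 1000) -> None:
    # Repopulate the dataset before the load test so GETs hit real values.
    for i in range(n):
        r.set(f"cache:user:{i}", "x" * 128)

def run_one_iteration() -> None:
    cmd, template, _ = random.choices(scenarios, weights=[w for *_, w in scenarios])[0]
    key = template.format(id=random.randrange(1000))
    if cmd == "GET":
        r.get(key)
    elif cmd == "SET":
        r.set(key, "x" * 128)
    elif cmd == "ZADD":
        r.zadd(key, {f"member:{random.random()}": random.random()})

populate()
for _ in range(10_000):
    run_one_iteration()
```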
G
So before we do anything, we populate the data and generate random keys based on the key frequency, and after that k6 runs with about a hundred clients, and then it will issue the commands at different iteration speeds. For example, we generate about 300 scenarios, and each scenario is a combination of a command and a key pattern here, and each scenario has a different speed.
G
G
G
Okay, so after running the load test, we can see that I'm issuing about 45,000 requests per second against my local Redis instance, and then I have some metrics for each command issued, and the distribution matches the one we captured in production. And for each command I have a set of metrics, a histogram metric here, to see whether our histograms are satisfactory compared to what we want when we scale the Redis instances up.
G
So we are moving into the last step of this issue: after that, we bring up different Redis instances and Redis Cluster, and run the load against them with different settings. So yeah, that's it.
E
That's really cool. One question I had: what kind of environment are we running these tests in? Do we provision dedicated machines, and have a separate machine for the client versus the server?
G
Yes, we provision some client nodes and run the load test against that. One good benefit of this approach is that, because all the load test is in code, we just need to define the scenarios and clone them again on different nodes and different instances; we don't need to do much manual work.
C
Just one environment question: I've used k6 before and I think it's a super cool tool. Is the JavaScript execution environment Node, or is it just like vanilla JavaScript?
G
C
E
So I've been working on a little script that basically runs through the install of the Helm chart and then does an incremental upgrade, and I've tried to match the configuration to what we have in production as well. So this is basically the script: it runs helm install and then helm upgrade and...
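(A minimal sketch of that install-then-upgrade flow driven from Python; the release name, chart, values file and tag are placeholders, not the actual script.)

```python
# Minimal sketch of the install-then-upgrade flow described above.
# Release/chart names and values files are placeholders.
import subprocess

RELEASE = "redis-ha"
CHART = "bitnami/redis"                 # assumption: whichever chart is actually used
VALUES = "values-production-like.yaml"  # configuration matched to production

def helm(*args: str) -> None:
    subprocess.run(["helm", *args], check=True)

# Initial install with production-like configuration.
helm("install", RELEASE, CHART, "-f", VALUES)

# Later: incremental upgrade, e.g. bumping only the Redis image tag.
helm("upgrade", RELEASE, CHART, "-f", VALUES, "--set", "image.tag=6.2.6")
```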
E
E
I've got stern for tailing logs in Kubernetes here, and we can see the pod is starting to come up, but it's waiting for its own IP to appear in the Kubernetes service. So there's some logic there for it to kind of wait for that to propagate. So now it came up.
E
E
E
E
So let's see where we're at. Okay, so we're now up and running, and the scenario that I want to simulate is upgrading the Redis version. So we're gonna simulate an upgrade, and it's gonna set the image.tag to the plus-one Redis version. So we'll do a helm diff first, and indeed we can see it's proposing to update the image, and the other important change that it's doing here is setting the partition on the update strategy on the StatefulSet, and this is the Kubernetes mechanism for isolating a change that you're making to a StatefulSet.
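(The partition field on a StatefulSet's rolling update strategy is the standard Kubernetes mechanism being referred to; a sketch of using it from Python via kubectl, with a placeholder StatefulSet name.)

```python
# Sketch of the StatefulSet partition mechanism mentioned above, driven via
# kubectl from Python. The StatefulSet name is a placeholder.
import json
import subprocess

STS = "redis-node"   # assumption: name of the Redis StatefulSet

def set_partition(partition: int) -> None:
    # With 3 replicas, partition=2 means only the highest-ordinal pod gets the
    # new pod template; pods 0 and 1 keep running the old one.
    patch = {"spec": {"updateStrategy": {"rollingUpdate": {"partition": partition}}}}
    subprocess.run(
        ["kubectl", "patch", "statefulset", STS, "--type", "merge",
         "-p", json.dumps(patch)],
        check=True,
    )

set_partition(2)   # isolate the change to one pod first
# ...verify the upgraded pod, fail over if it was the primary...
set_partition(0)   # then let the rolling update reach all pods
```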
E
A
How would we... can we manually fail over to a different one, make a different one the primary, with this?
E
Yes. There's some caveats with that that I'm still trying to figure out, so I can actually maybe try and demonstrate that.
E
E
E
Yeah, and... yeah, looks like that worked fine in this case. I've had some tests where this puts it in a weird state and it takes a while to recover. Okay, so I can try and do another failover and see the opposite.
E
E
Yeah, we've also got "next failover delay", so it's kind of deciding not to do a failover yet, because it doesn't want to do too many failovers. So node 0 is still trying to connect to itself, but I think we promoted a different one.
E
E
Related to that... but you can see here now, after, you know, like 40... I don't know, like half a minute or so, Sentinel noticed that something is off and it does this fix-slave-config thing and sort of fixes itself, and now we're in a good state again. So at least it recovers eventually, but it's still not really nice to have this weird behavior. So I want to dig into this and...
A
A
The other question I wanted to ask was... first of all, this is really cool, thank you. The other question I wanted to ask was, what was it... oh, resizing. So presumably this makes things like the resizing we had to do a while ago simpler as well, where we needed to make the persistent Redis have more, well, disk. But I guess that's handled by the other things in Kubernetes, and then also memory.
E
E
So the partition... I can maybe show that as well. We're using a StatefulSet, and we're using the partition mechanism on the StatefulSet, and the example that I showed was using that to update the image on only one of the three pods. But we can apply that to any change, so that also applies to updating requests and limits.
F
Yeah, on that point: this specifically allows us to have heterogeneous Sentinel deployments, because that's what you need to do these sorts of upgrades, right? You need, yeah, the...
F
...it needs to understand that it's not just... because that's one of the things I find difficult about understanding our Gitaly server Terraform config: it just says the Gitaly servers look like this, period, and it's not very clear to me how you make changes to that, where you account for the fact that not everything can change at once, or how the change works.
C
We'll see, yeah. But what I was gonna say is: is it possible to run Quang-Minh's load testing during your upgrade, as another pod, as another thing running in the cluster?
E
F
E
Yeah, reads should remain available due to the way that the Ruby client handles failovers. So on failover we'll get some stale reads from the primary before it's stepped down; once it steps down, it closes all connections to all clients, and so the clients will reconnect.
E
In our current configuration, there is a new primary already present before the step-down. So the Sentinels agree on who the new primary is, and then the step-down message goes to the old primary, and so there's a window of stale reads in between, and potentially lost writes as well, because we're writing stuff to the old primary that then gets thrown away. But the new primary is known at that time.
E
So writes remain available.
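(A sketch, using redis-py's Sentinel support rather than the Ruby client discussed, of the client-side behavior just described: when the primary steps down it drops connections, the client re-resolves the primary through Sentinel and retries, with a short window of stale reads and possibly lost writes in between. Host and master names are placeholders.)

```python
# Sketch of failover handling with redis-py Sentinel (names are placeholders).
import time
import redis
from redis.sentinel import Sentinel

sentinel = Sentinel([("sentinel-0", 26379), ("sentinel-1", 26379)], socket_timeout=0.5)
master = sentinel.master_for("mymaster", socket_timeout=0.5)

for _ in range(60):
    try:
        # During a failover the old primary closes connections when it steps
        # down; the client reconnects and Sentinel hands out the new primary,
        # so writes resume after a short window (with possible lost writes
        # and stale reads in between).
        master.set("heartbeat", time.time())
    except (redis.ConnectionError, redis.TimeoutError):
        time.sleep(0.1)   # retry; master_for() re-resolves the primary
    time.sleep(1)
```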
E
Either that, or we could, you know, shut down... like if node two happens to be the primary, we can shut down that node and Sentinel will elect a new one, but...
C
E
F
F
And I think sometimes in discussions about HA, people have expectations that everything has to be available 100 percent of the time, and then you can work yourself into a corner. Whereas if you accept that a controlled amount of lost writes or lost reads or whatever is part of doing business, then you can, yeah, design better systems.
F
C
F
Just the speed improvement was pretty surprising, because it was end-to-end six times faster and it used nine times less resources on the server. And then, after a while, we decided to just take the proof of concept and submit it as proper code, and that met resistance from the Gitaly team, because they felt it was making too big of a change to their architecture, and that resulted in a bit of a stalemate where we didn't know how to move that merge request forwards.
F
But yesterday the Gitaly product manager, Mark Wood, got involved, and I impulsively thought: let's just have a call and see where this goes, and that turned out to be a very helpful conversation. So what I'm going to do now... I've been approaching this as fixing one little thing, but it's really a pattern: there's a lot of RPCs that follow this pattern, and it is inefficient, and I guess that's also where the friction comes from, because I'm breaking that pattern.
F
So what I'm going to do now is write a sort of short document where I explain what I think the pattern should be, and see if we can get buy-in on that, or whether that makes it easier to sell the idea. And Mark offered to help me write that, or give feedback on how I write that and what needs to be in there, so it's not like I have to write a document without knowing what the requirements are. That's... yeah.
F
That's not the approach I chose, obviously, because I thought, just from a technical perspective: let's just go in and do the least amount of work and fix the thing. But I'm happy to paint this picture and see if people like it.
F
F
F
But it seems to be a lot of the time, and that new RPC got enabled like one or two days ago, and I think it's been a bit quieter since. So, if we're lucky, that one particular problem got solved in a different way, and I still think we should do something about FindAllTags, and there's other RPCs that have the same problem.
F
F
F
And just as a quick sketch of the bigger picture that I'm trying to sell: in the current model we have RPCs that have a certain design which, we now know, is inefficient from a computational perspective. It puts a maintenance burden on the Gitaly team, and when teams like Create:Source Code want to build new Git features, they need to build new RPCs all the time, every time something is not part of the big abstract interface.
F
The interface needs to be expanded, and I think a better approach would be to say that most Gitaly RPCs are very thin wrappers around Git commands, and Gitaly streams the output as-is to the client. Because then, if the client wants to parse another piece of the data, they don't have to ask Gitaly to send another piece of the data, because they already get all the data there is. And another thing that is not strictly about performance, but is about lower maintenance...
F
F
F
So: let's update the protocol and update the RPC and ship a new Gitaly version so the client can sort on another field... when instead, if they can just pass the sort argument that Git already supports, then that saves... yeah, that removes an unnecessary friction for the clients, meaning Create:Source Code, to develop new features.
E
It does sound a bit risky, in the sense that we have less control, or... well, it depends how you say "we", but with a broad interface there is less control on the allowed set of combinations.
D
F
I don't think it's necessarily about having less control. It's more about whether you want to define everything that is possible in the protobuf definitions. The protobuf definitions sort of act as a type system, and you can say: I want to lock everything down, and this is everything that's possible. But then what you end up with is a protobuf definition that looks like the Git manual page of a command, where all the possible values of all the possible flags are protobuf fields and constants and...
F
Yeah, I think it makes more sense to say you're allowed to pass flags, and then there's an allow list in Gitaly where we say... it becomes a runtime error, right? If you encode the information in the type system, then you cannot make certain calls, because it's more like a static thing, right. But yeah, you're doing this across repositories, and you need to have merge requests in multiple repositories, and multiple repositories need to wait for a release to be integrated back in, and...
F
F
E
I guess the analogy that I have in mind is kind of like GraphQL, where you really have a lot of flexibility in the types of queries that you send, and maybe that's a bit of an extreme example, but it generally makes the RPC performance much less predictable.
E
D
D
That's what we're doing with GraphQL now as well: measuring performance differently, so it is kind of predictable. Like, we don't measure request duration anymore, because anything can be in there. But if you make a distinction between what's happening, then you can measure it again, because...
F
Yeah, I think GraphQL is a more extreme example, because you can make very wild combinations of: I want to get all the X and all the Y, and I want to have this a hundred deep and a thousand of that, and you can make an arbitrarily complex query. Whereas if I say I want to run git for-each-ref, I can maybe add some sort flags, or I could say I only want the refs that contain this commit.
C
Yeah, another fairly successful API... a service with an API like that is SQL. You know, you can put anything in, it's pretty open-ended, and, you know, in a way it's the same sort of thing that you're talking about, right?
F
Yes and no. I think SQL is, in a way, more like GraphQL, because you can do anything. It's... yeah.
F
And, as we know, that is great, but it's also a problem.
C
F
Yeah, because you can have SQL queries that are absolutely horrible. But this would be more like you have an API that allows you to do a select on a certain table, so it's never going to be worse than whatever you can do with selects on the given table. And it's not a perfect analogy.
F
F
But in practice we have... I haven't counted yet, but we probably have four or five, if not more, different commands that are all variations on git for-each-ref: different RPCs that call git for-each-ref with slightly different flags. And if you just have one RPC that gives you the output of git for-each-ref, then you can deprecate a bunch of old ones and you don't need to keep adding new ones, because the RPC that got added to address this problem with FindAllTags was yet another variation on git for-each-ref.
F
So if we had had a generic git for-each-ref RPC, we wouldn't have needed a new RPC. At worst, we would have had to tweak the allow list of the generic git for-each-ref RPC to say the client is allowed to use this flag. And I think, if that's all that the RPC does, then the code is shorter, the tests are shorter, there's less to test; all you really need to prove is that the flags can be applied to the command.
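(A sketch of that "thin wrapper plus allow list" idea, in Python with invented names rather than Gitaly's actual Go code: the RPC only checks flags against an allow list, runs git for-each-ref, and streams the output untouched.)

```python
# Illustrative sketch (invented names; Gitaly itself is Go) of the
# "thin wrapper + allow list" idea for a generic for-each-ref RPC.
import subprocess

ALLOWED_FLAGS = {"--sort", "--format", "--count", "--contains", "--merged"}

def for_each_ref(repo_path: str, flags: list[str]):
    for flag in flags:
        name = flag.split("=", 1)[0]
        if name not in ALLOWED_FLAGS:
            # The only per-RPC decision is whether a flag is allowed;
            # what the flag does stays git's job.
            raise ValueError(f"flag not allowed: {name}")
    proc = subprocess.Popen(
        ["git", "-C", repo_path, "for-each-ref", *flags],
        stdout=subprocess.PIPE,
    )
    assert proc.stdout is not None
    for line in proc.stdout:          # stream output without parsing it
        yield line
    proc.wait()

# e.g. the client decides how to sort, no new RPC needed:
# for line in for_each_ref("/tmp/repo.git", ["--sort=-creatordate", "--count=10"]):
#     print(line.decode().rstrip())
```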
F
You don't have to prove what the flags do, because that is Git's job, and, yeah, people on the client side can iterate and build features with fewer RPCs in between. I mean, there's also these ridiculous things, like we have these RPCs to find all new LFS pointers. So what we do is we run git rev-list and we look up all the blobs that it enumerates of size less than 200 bytes.
F
So you could just have a thing that says: git rev-list, return blobs up to a size. But instead we have special RPCs that tweak the arguments to git rev-list in particular ways, and Gitaly tries to filter the blobs to see if they look like LFS pointers. But then, on the client side, we parse the blobs again to make sure they're really LFS pointers. Whereas you could just have an RPC that sends all the blobs that are less than 200 bytes.
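(A sketch of that last idea using plain git plumbing, not Gitaly's actual implementation: enumerate blobs, keep the ones under 200 bytes, and let the client itself decide whether they are LFS pointers.)

```python
# Sketch of the "just send the small blobs" idea for LFS pointer discovery,
# using plain git plumbing commands.
import subprocess

def small_blobs(repo: str, max_size: int = 200):
    """Yield (oid, size) for every blob under max_size bytes reachable from any ref."""
    objects = subprocess.run(
        ["git", "-C", repo, "rev-list", "--objects", "--all"],
        capture_output=True, text=True, check=True,
    ).stdout
    oids = [line.split()[0] for line in objects.splitlines() if line]
    info = subprocess.run(
        ["git", "-C", repo, "cat-file", "--batch-check"],
        input="\n".join(oids), capture_output=True, text=True, check=True,
    ).stdout
    for line in info.splitlines():
        parts = line.split()
        if len(parts) == 3 and parts[1] == "blob" and int(parts[2]) < max_size:
            yield parts[0], int(parts[2])

def looks_like_lfs_pointer(repo: str, oid: str) -> bool:
    # The client-side check: a real LFS pointer starts with this version line.
    content = subprocess.run(
        ["git", "-C", repo, "cat-file", "blob", oid],
        capture_output=True, text=True, check=True,
    ).stdout
    return content.startswith("version https://git-lfs.github.com/spec/v1")
```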
F
So there's a bigger picture here, and it's not just about performance and resilience but also about developer convenience. But I think there's also more advantages for us, because if you think about something like git-upload-pack, the reason that doesn't completely explode on us all the time is that the clients exert back pressure on the server processes.
F
But if you do clumsy parsing on the server, if you do extra work, then you can make a cheap Gitaly call and the Gitaly server goes bonkers trying to parse all the tags and fetch them from Git in an inefficient way.
C
F
Well, I still want to do that too, but I think... I want to give them a break, and I don't think it would go down well.
F
I also don't think it's the most important problem to solve right now. I think the badness of gRPC was amplified by the volume of traffic of git fetch, and I think that really is better now, so the impact isn't as high. But if you want, I can tell you how I would rip out gRPC without anybody noticing.