From YouTube: Scalability Team Demo - 2021-04-22 (First call)
A
Yeah, I've got the one and only item on the agenda right now, and I think everybody on the call already knows about it, so we can do it quickly. Let me share my screen to show you the merge requests.
A
So those SLIs already have a feature category that defines them, and that's just in the JSON definition. They have a feature category, and to include those in the error budgets we want to add that label when the metrics get recorded into Prometheus.

The first attempt I did added that label directly into the recordings that are also used for the key metrics, which are used for alerting. Those are the things you look at on the service overview dashboards and so forth.
A
That didn't go so well, because we were getting alerts on traffic being increased: the metrics would get aggregated together, so we would count the metrics without the feature category label and then add the metrics with the feature category label, and that would count double. It would look weird and would tell SREs to look at things that weren't actually true, because we didn't have more traffic. So we reverted that, and the second iteration was to record this separately from the key metrics.
A
So we have a recording that records exactly the same thing as the key metric, but adds a static feature category label, and we use that for the error budgets. This is obviously a bit annoying because we record the same thing twice just to add an extra static label, but that way I was sure it wouldn't get in the way of anything else.
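A minimal sketch of the double-counting problem described above and why the second iteration avoids it. This is not GitLab's actual recording-rule code; the series and values are made up purely to illustrate the aggregation behaviour.

```python
# Pretend these are per-second request rates Prometheus has recorded.
key_metric_series = [
    # original recording used for alerting, no feature_category label
    {"labels": {"type": "web"}, "value": 100.0},
]
labelled_series = [
    # the same traffic recorded again, with a static feature_category label
    {"labels": {"type": "web", "feature_category": "source_code"}, "value": 100.0},
]

def total(series_list):
    """Sum every matching series, like a naive PromQL sum()."""
    return sum(s["value"] for s in series_list)

# First attempt: the labelled series were added to the SAME recording used for
# alerting, so an unscoped sum() sees both copies of the same traffic.
print(total(key_metric_series + labelled_series))  # 200.0 -> traffic "doubled"

# Second iteration: keep them as separate recordings (separate metric names);
# alerting keeps using the key metric, error budgets use the labelled one.
print(total(key_metric_series))  # 100.0 for alerting
print(total(labelled_series))    # 100.0 for error budgets
```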
B
Bob, it's weird: when I reviewed that first merge request, there was this bug in the back of my head, and I was like, I don't think this is going to work. Then I looked at it and thought, no, it'll work. Then, no, there's something wrong. And as soon as it happened, I was like, oh, obviously you've got to be super careful about adding dimensions to those existing aggregations, for that exact reason.
A
Yeah, but it's no big deal. The thing is, I think we all need to do this once. I remember Sean wrote some documentation about changing the key metrics when using different sources and so on, and that was also triggered by him changing sources.
B
And we've actually got a workaround now that's much safer, because before, the apdex was measured over an hour; there was this weird thing that we did, and so when you changed an apdex you would get this sort of decay of the old one dropping off, which was really irritating. I don't know if you remember that.
C
Yeah, so if you had the new one, it wouldn't have started until we started recording the new one, and if you had the old one, it would stop at the point we started recording the new one, and then it would still count after that.
A
Yeah, this is different. The SLI stuff would probably just remain the same, but the problem was in the things that were using the metrics recorded from an aggregation set, which wouldn't take into account the labels in the final aggregation set. So that's the discussion that Sean brought up: are we going to keep two recordings forever?
A
We don't have to. I think we can figure out a way to get around that, but I'm kind of wondering if it's worth it.
A
C
Would it just add more risk to that, or take advantage of everything being off?
B
I have a different opinion, or rather the same opinion but for a different reason, and that's that, from a replaceability point of view, it's better to keep them apart, because we can rip one out and replace it. If we tie the same metrics into lots of different things, all change becomes riskier, right? And that's at the small cost of having a few extra recording rules.
B
C
A
That is a very good point as well, because in the same merge request, both merge requests actually, I've also added validation for the feature categories defined on those SLIs, to check that they're still feature categories that exist in our handbook stages.yaml file. That means that when product decides to rename something, we would otherwise have this exact problem. Well, yes! So I think the conclusion is: let's keep them separate and keep the extra recordings.
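A rough sketch of the validation idea mentioned above: check that every feature category referenced by an SLI definition still exists in the handbook's stages file. The file paths, the exact YAML layout, and the function names here are assumptions for illustration, not the real implementation.

```python
import json
import yaml  # pip install pyyaml

def known_feature_categories(stages_path="stages.yaml"):
    """Collect every category listed under the stages/groups in the handbook file (assumed layout)."""
    with open(stages_path) as f:
        stages = yaml.safe_load(f)["stages"]
    categories = set()
    for stage in stages.values():
        for group in stage.get("groups", {}).values():
            categories.update(group.get("categories", []))
    return categories

def validate_sli_definitions(sli_paths, categories):
    """Return a list of errors for SLIs whose feature_category is unknown."""
    errors = []
    for path in sli_paths:
        with open(path) as f:
            sli = json.load(f)
        category = sli.get("feature_category")
        if category not in categories:
            errors.append(f"{path}: unknown feature_category {category!r}")
    return errors

# Example: fail CI if product renamed a category that SLIs still reference.
# errors = validate_sli_definitions(["slis/web_requests.json"], known_feature_categories())
```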
A
B
So my take on this is: right now I think we have several hundred machines running Git and we have like five machines running Prometheus. That's a pretty good ratio, and that's just Git, that's not even the rest of our fleet. Okay, this actually comes up in the efficiency-of-Kubernetes discussion, but that's for another day.
B
But my point is that the ratio of the observability stack compared to the rest of our stack is really, really small at the moment, and if we get better value from having these extra recording rules, then adding a few more machines is fine. I guess my point is that we shouldn't be skimpy on these things if we get value from them; let's carry on with it as long as it's technically feasible, and I think it is technically feasible.
C
Yeah, like I said, I might have to drop off when someone comes around to give me a quote for removals. So I'm just going to talk through it first of all, because it's going to be more fiddly to stop screen sharing as well.
C
In case I have to jump off quickly: I'm coupling two things together which aren't directly related, but it seems like a good opportunity to do them both at the same time. When we switch to having a single queue for the catch-all shard, we will have a problem which we don't currently have: if you deploy a new worker to canary, obviously canary doesn't run Sidekiq jobs, but it can schedule Sidekiq jobs.
C
So it can schedule that worker. Currently that's fine, with an asterisk, because it will have its own queue, and Sidekiq on the main stage won't start listening to that queue until we also deploy Sidekiq.
C
So basically those jobs just wait for however long it takes between us deploying canary (and scheduling the job) and us deploying main. That's already kind of a gotcha that I think we should be a bit careful about, because if you add a job that does something when someone does a thing in the app, and that's not behind a feature flag and is just available once it's deployed to canary, and then we deploy to production like 12 hours later and the thing in the background runs, that's kind of weird.
C
So I think we need to document that anyway. With a single queue for the catch-all shard, those workers will go into that queue and start being picked up immediately, but they will fail, because the worker class won't exist on the main shard, sorry, on the main stage, because Sidekiq won't have been deployed there yet.
C
So we need to do something. Another thing I was thinking about: we set the Sidekiq default retries to three a while ago, which is very, very low. That means that if your job fails, it will try again three times, and with the exponential backoff that takes a couple of minutes. And if our database goes down for, say, five minutes, like it did...
C
Some
point
in
the
last
few
months:
I
can't
remember
all
those
jobs
in
that
point
will
fail
and
then
we'll
go
to
the
dead
jobs
queue,
but
we
never
ever
look
at
that
and
that's
always
full.
So
that's
basically
saying
they
failed,
and
so
what
I
want
to
do
is
change
the
default
number
of
retries
back
to
the
psychic
recommended
default,
which
is
25,
which
happens
over
a
course
of
like
three
weeks
on
average,
and
that
seems
to
be
the
sort
of
psychic
recommended
way
of
handling
this
case.
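A quick sanity check of the "25 retries over about three weeks" figure. Sidekiq's default backoff is roughly (retry_count ** 4) + 15 seconds plus some random jitter; the jitter is ignored here, so this is an approximation of the schedule rather than the exact Sidekiq implementation.

```python
def approx_retry_window(max_retries):
    """Approximate total seconds covered by max_retries attempts (jitter ignored)."""
    return sum((count ** 4) + 15 for count in range(max_retries))

seconds = approx_retry_window(25)
print(seconds / 86_400)  # ~20.4 days, i.e. roughly three weeks

# With the old default of 3 retries the whole window is tiny:
print(approx_retry_window(3))  # 62 seconds plus jitter, a couple of minutes at most
```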
C
Where you've tried to run a worker that Sidekiq doesn't know about, it'll just retry, get a NameError, retry, get a NameError, and eventually run. So what I'm going to do is explicitly set all existing workers that we have to use three retries, because I don't want to change the behavior of any of those; some of them might want three, some of them...
C
...which is probably a more sensible default. The original reason we changed it to 3 was that a lot of our workers at the time hit external services, and if that external service is down there's not really much point in retrying over the course of three weeks. Also, some workers are time-sensitive, or might fail in expensive ways. But I think it's much better to keep the default at 25 and have people handle those cases on a case-by-case basis, rather than have the default cater to those sort of exceptional cases.
D
C
No, I think we should still account for it the same way we do now. I think if you have a job that takes 19 retries to work, that should count as 18 failures and one success.
C
I think that's reasonable. I think the...
C
The better way to solve that is to not get into that situation in the first place: if you're deploying some code that can schedule a job, put that behind a feature flag or something, so it's not immediately scheduling stuff that, like I said, wouldn't work anyway. If we deploy to canary and then it takes us two days to move that to production, that's already a two-day delay that has nothing to do with the retries.
C
It's just to do with how we deploy things. So I think you'd already be better off using a feature flag to handle that case and thinking about that, and like I said, I need to document it. But I think we should count the failures as failures, because they are.
A
And I think in the end, for new features, where we'd currently hit this single-queue worker thing when deploying to canary: since they are new features, they aren't immediately going to be heavily used and so on, and hopefully they're even behind a feature flag.
C
Exactly. I had a look back over about a month, and we've had a few, but none of them are really a problem. You can just look for a high scheduling latency, and that's sort of a good proxy, because these will end up with a scheduling latency of like eight hours or whatever it is, because they were scheduled while they were on canary and then they would run once they were on main.
C
B
Just as an idea: what about if all new workers went... because I'm kind of worried about the 25 retries thing, because I think it puts extra work on the developers to know about that. If they are using an external service, they have to know that they have to set it to three, and we're kind of putting extra stuff on people.
B
Yeah, what about... I haven't thought this through fully and it probably doesn't work, but what if all new jobs that got introduced went into, I don't know, a "behind the ears" queue, and then that queue... yeah, that doesn't really work. Because then you don't have to worry about them getting retried, because there's nothing... no, no, I haven't thought it through.
C
A
B
A canary stage for Sidekiq, so that is actually really the problem here, right? There's an impedance mismatch between our workflows here, and there's an issue that's as old as the hills, which is canary Sidekiq. It's actually kind of amazing that we are still running Sidekiq without a canary.
C
Yeah, I think sort of a theme of some of the work we're doing now, though, is to try and work with, rather than against, Sidekiq. So yeah: use fewer queues, use the default retries, do what it's...
C
...asking us to do. I do see what you mean about not running a Sidekiq canary. Basically, this is only a problem that affects GitLab.com; it will affect some...
D
C
...downtime customer deploys, yeah, but they're not likely to have separate... yeah, exactly, they won't have a canary. It will just be because they've deployed the part of their fleet that schedules jobs before they deploy the part of their fleet that processes the jobs. So it's not going to be several days; it might be on the order of an hour or two, so we could go lower than 25. I just think the default...
B
...puts us back in line with the way Sidekiq is intended to be used, so I think that's all good. My other question, about the canary, is: I imagine that when we have zonal Sidekiq clusters, it'll...
D
B
...be much easier for us to run a Sidekiq canary cluster, because it's kind of just another zone, in a way.
B
D
B
C
B
I think so. This is something I discovered yesterday: we've got the three zonal clusters, so git and api run in the zones, but then sidekiq... sorry, not sidekiq, the canary of those things runs on the regional cluster, which to me is very surprising. I was like, wait, what?
B
Why is this? So the canary is spread across everywhere, and people were like, oh, but we don't want to have these node pools, because each canary one is going to be too small.
B
But the thing is, I kind of think that if we just... and then the other problem is that Helm doesn't understand stages, right? That's one of the reasons why we can't deploy the canary stage into the existing clusters. So either we need to do a whole bunch of work around that, or another option would be: we've got three zones at the moment, we make it four zones, and the fourth zone is...
B
Because they're different pods and they're different containers, right? But it's all a bit... yeah.
D
C
B
Whatever it is, the most important thing is that people can grok it. From the last day I've been like, wait, what, really? And that's only going to get worse as we move more workloads over there, so keeping it simple is as important as any technical solution, right?
C
Oh yeah, speaking of keeping it simple, one other thing I decided to do was at least figure out what the current catch-all-on-Kubernetes Sidekiq queue selector means, because it's like a 9,000-character line that just picks a bunch of queues by name.
C
B
Yeah, and someone was like, yeah, it's kind of terrifying. I don't know why; there was some reason. There's a reason, Scobeck knows, but...
B
C
B
C
...reasoning to say: well, okay, this could also fix this other thing that we need to fix now, so I'm just going to do both at the same time. I am still going to play around with this a bit, but that's where I am.
B
C
Exactly. And if you have a high retry-to-success ratio, at the moment that could be masked by just three retries and then failure. Again, this wouldn't apply to existing workers initially, because we'd need to go through them one by one and check them, but for new workers it would be nice to be able to be a bit more confident and say this isn't just failing three times and going dead, because we've added metrics for that.
B
C
B
C
Sorry, I'd say we could already change it. We sort of blocked that on the issue to create the dashboard, which I think Jakub was going to take. So we'll create the dashboard, which will also create the recording rules, and then we'll create the alerts after the recording rules have been in place for a while, because we don't want to add the recording rules and immediately alert on them.
C
B
This is just a silly thing that I was working on immediately before this call, and it's a terrible hack, but I'm working on all the Kubernetes monitoring at the moment, and one of the other unusual things is that we have a lot of node pools, far too many. For the git service, we have separate node pools for the shell and the http, ssh, and https components.
B
But the other thing is that all of these different pools are fixed in size, and some of them are probably maxing out, and we don't have any way of monitoring that; Stackdriver doesn't export any information for this. So the first thing I started doing was: okay, we're just going to manually sync it up, so we can have a manual list. Then I realized they're all different and there's no rhyme or reason to the sizing, because there are three pools in Terraform that are kind of copy and pasted, and I think what's happened is that people have gone and updated the pool sizes.
B
But they've only done it on two of the three, or one of the three, so they're all different, and that's only going to carry on, right? So I decided against manually syncing it up, because there are like 25 different numbers that we'd have to keep synced just for production, and then we've also got staging and everything else. So I just decided it was much easier to do an automatic thing.
B
I ask Terraform to give me all the information about the state that it manages, and then I just use a jq script to pull things out of that and turn it into Prometheus metrics, and I push that to the push gateway, which is like the world's biggest hack. But it lets us get on with more important things.
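A rough Python equivalent of the pipeline described above (the version described in the call uses a jq script): read Terraform's view of the node pools and emit Prometheus exposition-format lines that can be pushed to the push gateway. The resource type and `node_count` field are what the GCP provider uses, but treat the exact JSON paths, the metric name, and the push gateway URL as assumptions for illustration.

```python
import json
import subprocess

def node_pool_metrics():
    # `terraform show -json` dumps the current state as JSON.
    state = json.loads(subprocess.check_output(["terraform", "show", "-json"]))
    lines = ["# TYPE terraform_node_pool_size gauge"]
    for resource in state["values"]["root_module"]["resources"]:
        if resource["type"] != "google_container_node_pool":
            continue
        values = resource["values"]
        lines.append(
            'terraform_node_pool_size{pool="%s"} %d'
            % (values["name"], values.get("node_count") or 0)
        )
    return "\n".join(lines) + "\n"

# The output could then be sent to the push gateway, e.g. via
#   curl --data-binary @- http://pushgateway:9091/metrics/job/terraform-state
print(node_pool_metrics())
```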
B
C
G
B
Yeah, push gateway has a slightly surprising approach; I wasn't sure about it. It has state, right, and then you have a thing called the grouping key. It's arbitrary, but normally people use job and instance as the grouping key. For Tamland, the job is tamland and the grouping key is basically the page that you're showing, and then whenever you push to that job and grouping key, all the existing state is flushed out and replaced with whatever metrics you push in.
B
So you push in a group of metrics, and then you can delete them. I guess the thing that I didn't understand about push gateway was that it has state that stays like that until the next time you push, and then it doesn't just overwrite the individual metrics; it basically removes everything under a particular grouping key and job.
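A minimal illustration of that push gateway behaviour using the Python prometheus_client library; the gateway address, job, grouping key, and metric name are made up for the example.

```python
from prometheus_client import CollectorRegistry, Gauge, push_to_gateway

registry = CollectorRegistry()
g = Gauge("tamland_page_rendered_timestamp", "Example gauge", registry=registry)
g.set_to_current_time()

# push_to_gateway does a PUT: everything previously stored under this
# job + grouping key is dropped and replaced by the metrics in this registry.
push_to_gateway(
    "pushgateway.example.com:9091",
    job="tamland",
    grouping_key={"page": "saturation-overview"},
    registry=registry,
)

# pushadd_to_gateway (a POST) would instead only replace metrics with the same
# name and leave the rest of the group's state alone.
```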
B
But yeah, you can see on that merge request, if you scroll, there are lots of different numbers there, and none of it's very... but hopefully once we get monitoring on this, it'll show up where the problems are in the current setup, because we'll start alerting on node pools that are saturated.
C
Yeah, that would be... yeah. That's an item for the infrastructure Christmas quiz: what's the maximum node pool size of shell-1 versus... well, there's...
C
B
D
B
D
B
Yeah, so there's definitely... I don't know, I don't have a full story here, but I think now that we're starting to look at the monitoring of this, it'll become clearer, because at the moment, at least in my head, it's all quite opaque. I don't really know; Kubernetes is just doing things, but we need to understand whether we're running things right, because it seems like the step up has been pretty drastic.
D
It
just
also
I
mean
if,
if
the
sidekick,
if
our
challenges
with
sidekick
and
raiders
have
have
taught
us
anything,
it's
that
we
need
to
be
making
sure
that
we're
using
the
technology
the
way
it's
supposed
to
be
used
and
yeah.
G
B
So if there are other things we want to stick into Prometheus that we need... at the moment I sometimes do these syncs manually, so there's a comment on the runbooks saying "if you change this, also change this", and then on the other repo, "if you change this, also change this". This is a real hacky way around that, so it's something to keep in mind if you find yourself writing those kinds of cross-references.
B
Yes, I think the Terraform will always run on ops because of the permissions, and...
B
Yeah, so I don't know if it's just me, but I kind of wanted to talk about two things. The first is around expectations of what those minutes are: when we say 20 minutes in a month, how we derive that as opposed to 99.95, because maybe this is just me reading...
A
G
A
Yeah, I think it makes it hard, but yeah, it's something that...
D
Hard? What is it you think makes it hard? What are the two...
D
A
Using minutes to communicate this makes the error budget hard to grasp. It's okay if you are a group like source code management or continuous integration and so on, which have plenty of traffic, but there are also groups that have very little, so one bad request has a much bigger impact, and then they see the minutes go down super fast.
A
B
So in my mind, and this might not be true because I haven't spoken to enough people about it, I imagine that when some of the product managers think about minutes, they think that GitLab.com goes down, somebody starts a stopwatch, for three minutes GitLab isn't available, and then they stop the stopwatch and we assign that time to a stage group. Then...
B
D
That's also very important, yeah. Well, that's what I was going to ask next: thinking about what you said there, they start the stopwatch and they end the stopwatch, because it's the difference between on and off, like GitLab.com is...
G
D
B
D
So I think we need to insert that. Regardless of percentage or minutes, the first thing that we need to do in that presentation is make an adjustment to say: your number is about your stuff. Your stage group number is about your specific stage group; we're not taking anything else into account. And that's why the attribution of feature categories on your Sidekiq workers and on your requests and all of the other pieces is important, because this is how we read this stuff out to you.
D
...is what number do we want to show, and the reason that I've been pushing for having minutes is that I feel the difference between 10 minutes and 15 minutes is more pronounced than the difference between 99.95 and 99.98, a 0.03 percent difference.
B
D
B
A
...is, I think, a very good idea. It goes from 100 down to however negative you are, and 100 means you haven't spent any of the 0.05 percent that your feature category, that your features, have. It goes from 100 percent down to however much you've spent.
D
Well, I think she was talking about it more as the minutes value, rather than translating it into another percentage value. So, for example, if we start with 20...
D
F
A
The thing is, we're going to be showing, I think, three numbers now. We're going to be showing availability; that's the thing that you can compare with the GitLab.com availability, and we want it to be above 99.95. The second number we're going to show is 20 minutes counting down; that's the thing that everybody seems to like, the 20 minutes, and if you have slow requests then that number is going to go down. And the third number starts from 100.
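A quick worked example of how a "roughly 20 minutes" budget relates to the 99.95% target mentioned above; the 28-day reporting window is an assumption for illustration, not a confirmed detail from the call.

```python
target = 0.9995
window_minutes = 28 * 24 * 60          # 40,320 minutes in a 28-day window
budget_minutes = (1 - target) * window_minutes
print(budget_minutes)                  # ~20.2 minutes of error budget

# A 30-day month gives ~21.6 minutes, which is why the headline figure is
# usually quoted as roughly 20 minutes per month.
print((1 - target) * 30 * 24 * 60)
```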
A
G
D
B
Because it feels like we're trying to give them too much. If that's a problem, you say you used 25 minutes out of 20 minutes, or you used 15 minutes out of... so you always keep the "out of ...", right? And then people can, if they want to, work that out as a percentage; it's very easy to do in your head.
A
D
We've had Jackie's feedback and I think one other piece of feedback, and I think perhaps we just need more data points. I'm not saying that Jackie's opinion is wrong, it's...
A
B
Yeah, exactly. But also, what we're talking about here is subtraction of numbers below, hopefully, 40, right? So hopefully most people can do some level of this in their head. My concern is that if we have two things that are percentages... percentages are unitless, and so it becomes very confusing whether you're talking about a percentage of requests or a percentage of error budget. That's the only part that I don't really like.
D
I was going to say, just for the first iteration: we're going to be doing the AMA in the first week of May, so for the first iteration let's just have availability and minutes remaining, and we'll leave the budget-remaining percentage for a future round.
C
Yeah, that makes sense to me. I mean, I don't really have a strong opinion on the middle one. I kind of agree with Andrew that it's weird that...
C
B
So the other thing that I was a little bit concerned about, and maybe again it's difficult to read this out of a single presentation, is that apdex and latency are totally included in that, and it feels like maybe Jackie thinks that's a future iteration, whereas I want to be clear that this is in here now and it is counting towards the error budget.
D
A
Yeah, this has always been there. We always had both, but we started with one component, that component being Puma. Now we're adding, and that's the thing I talked about in the first part of the demo, everything else except for Sidekiq, and then we're going to add Sidekiq too.
D
B
The thing that's important for people to understand about latency is that, the way you measure it in this regard, it's just a type of error, and we treat it as an error. That's what apdex is: if a request is above a certain latency it's an error, and if it's below that latency, and the threshold for Rails is one second, then it's a success. In the same way we say: if it comes back with a 500, it's an error.
B
If
it
comes
back
and
the
200
is
a
success,
it's
just
the
same
thing,
because
I
think
people
get
really
kind
of
confused
as,
like
you
know,
if
you're
talking
1995
percentile
latency
of
like
0.9
seconds,
they
don't
that
there's
no
sort
of
trivial
way
to
take
that
and
turn
it
into
an
error
budget,
because
it's
a
it's
a
different
way
of
measurement
right.
It's
a
it's
a
percentile
value,
and
so
that's
why
I
think
a
lot
of
product
managers
are
like.
B
We
don't
have
a
story
for
for
latencies,
because
they're
still
thinking
in
like
95th
percentile
latency.
How
do
you
convert
that
into
an
error
budget
where
we
we're
not
we
even
not
even
using
that
we
thinking
of
latencies
as
a
as
a
boolean,
effectively
on
every
request,
either
passwords
or
failed
its
latency
test.
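A small sketch of the "latency as a boolean" idea just described: each operation either passes or fails its latency threshold, and those failures are spent from the same error budget as real errors. The thresholds, urgency names, and numbers here are illustrative only, not GitLab's actual values.

```python
URGENCY_THRESHOLDS_S = {"high": 0.25, "default": 1.0}  # e.g. 1s for Rails requests

def apdex_success(duration_s, urgency="default"):
    """True if the operation met its latency target, False otherwise."""
    return duration_s <= URGENCY_THRESHOLDS_S[urgency]

def error_budget_spent(operations):
    """operations: list of (duration_s, had_error). A failure is an error OR a slow operation."""
    failures = sum(
        1 for duration, had_error in operations
        if had_error or not apdex_success(duration)
    )
    return failures / len(operations) if operations else 0.0

ops = [(0.2, False), (1.4, False), (0.3, True), (0.5, False)]
print(error_budget_spent(ops))  # 0.5: one slow operation plus one 500 out of four
```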
A
Yeah, and I think for apdex in particular we might also stop calling things requests, because it's not always a request, it's something that can...
D
B
What we're doing is exactly what everyone's doing, don't get me wrong: the SLI with the threshold and the above and below. There are lots and lots of really good talks about this being the way to do it, and we're following that; it's the industry standard. It's just that we call it apdex, where other people just call it an SLI.
C
Yeah, on the naming thing, our metrics did use to call these transactions, but then that also got confusing.
B
I
I'm
talking
with
alessio
at
the
moment
about
an
aptx
for
deployments
where
a
request
is
a
merge
request.
Actually,
I
suppose
it's
a
request
and
it's
how
long
it
takes
to
get
into
production,
but
we,
you
know
we're
using
the
same
approach
there
as
well.
So
you
know.
C
D
A
Could you also make the changes for what we're going to display? Because we've gone like...