From YouTube: Scalability Team Demo 2021-02-25
B
Yeah, totally unprepared, but I thought I should give an update on the pack-file cache project, because that's the main thing I've been working on, together with Sean, who also gets other things done. And it set us back that I was out for a week, so it may be less clear to everybody where we are now. So I thought I'd give an update.
B
So the idea is to make git clone and git fetch faster by caching the large chunk of data that needs to be computed on the server every time somebody clones something. And we now have an RPC in Gitaly that allows us to observe this work happening.
B
So just by looking at those logs, we can see how much data we'd be storing and what the hit rate would be. We can think about what a good retention time is, or what the trade-off is between retention time and storage cost.
B
And where we are now with that is that we have it turned on in production and it's creating log data, and we can then fish that out of BigQuery, download the log data, and do offline analyses, because, I think, in four hours we create about one and a half million records.
B
Well, the request rate is about 100 per second, and we thought that a 24-hour period... like, 24 hours on a weekday is probably going to give us a good insight into what goes on, because weekdays are busier than weekends and there's definitely a 24-hour periodic nature in the traffic. So if you have a 24-hour window, then you should see roughly everything that happens, so we're now busy capturing a full 24-hour period.
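For scale, the figures quoted here are consistent with each other; a quick back-of-envelope check using only numbers mentioned in this meeting:

$$
100\ \tfrac{\text{requests}}{\text{s}} \times 86{,}400\ \tfrac{\text{s}}{\text{day}} \approx 8.6\times10^{6}\ \tfrac{\text{records}}{\text{day}},
\qquad
\frac{1.5\times10^{6}\ \text{records}}{4 \times 3600\ \text{s}} \approx 104\ \tfrac{\text{records}}{\text{s}},
$$

which also lines up with the "about 10 million records" estimate given later for the full 24-hour capture.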
B
Period-wise, our plan is to make it literally Thursday in UTC, so it would end at the end of today, but strictly speaking it started somewhere yesterday. The thing is that...
B
It's just that, yeah, both Sean and I are not very comfortable working with BigQuery, and the best way we understand it right now is that you have to tell BigQuery to create a table per day. I think that might be because we store logs per day or something, so everything is much easier if we constrain ourselves to one calendar day. It's not that bad.
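A minimal sketch of what pulling one day's worth of records out of a day-sharded table could look like, for anybody reproducing this; the project, dataset, table, and column names are invented for illustration, since the real schema isn't shown in the meeting:

```go
package main

import (
	"context"
	"fmt"
	"log"

	"cloud.google.com/go/bigquery"
	"google.golang.org/api/iterator"
)

func main() {
	ctx := context.Background()

	// "example-project" is a placeholder, not the real project name.
	client, err := bigquery.NewClient(ctx, "example-project")
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()

	// Logs land in one table per day, so constraining the analysis to a
	// single calendar day means reading a single day-sharded table.
	q := client.Query("SELECT payload FROM `example-project.logs.pack_cache_20210225`")

	it, err := q.Read(ctx)
	if err != nil {
		log.Fatal(err)
	}
	for {
		var row []bigquery.Value
		err := it.Next(&row)
		if err == iterator.Done {
			break
		}
		if err != nil {
			log.Fatal(err)
		}
		fmt.Println(row) // e.g. dump as NDJSON for the offline analysis script
	}
}
```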
B
Makes sense, yeah. So, and it's fun that we get to use BigQuery now, and, well, Sean gets to use it and I get to ask Sean for help.
B
"How do I do that?", and then Sean figures it out. But yeah, one of the things we also want to have as a result of this: we're doing this analysis on the log data, and we're trying to do it in a way where it's reproducible with a script, so that, assuming this all works out and this becomes a feature, we can put that script somewhere in the documentation, in case somebody else wants to roll out this feature on a self-managed instance.
B
And so that's also why it's good that we're not working in Kibana, because not everybody has Kibana. And the other reason is that the type of analyses we want to do are easy to express if you just have a script with a loop that looks at JSON objects, but they're hard to express as Kibana queries. So it's just easier to do it this way. And, as I was going to say, it was 100 requests per second.
B
So we think that that's going to be about 10 million records or something, and a laptop can handle that. Like, it takes a moment, but it's doable; you don't need Hadoop or MapReduce or whatever in the cloud to analyze that. Yeah, so that's actually...
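A minimal sketch of that kind of script: a single loop over newline-delimited JSON records. The field names (`cache_key`, `hit`, `bytes`) are hypothetical, since the real log schema isn't specified in the meeting:

```go
package main

import (
	"bufio"
	"encoding/json"
	"fmt"
	"log"
	"os"
)

// record mirrors a hypothetical log entry; the real field names may differ.
type record struct {
	CacheKey string `json:"cache_key"`
	Hit      bool   `json:"hit"`
	Bytes    int64  `json:"bytes"`
}

func main() {
	// One pass over newline-delimited JSON on stdin: ~10 million records
	// is well within what a laptop handles, no Hadoop or MapReduce needed.
	scanner := bufio.NewScanner(os.Stdin)
	scanner.Buffer(make([]byte, 0, 1<<20), 1<<20) // tolerate long lines

	var total, hits int
	var bytesServed int64
	for scanner.Scan() {
		var r record
		if err := json.Unmarshal(scanner.Bytes(), &r); err != nil {
			continue // skip malformed lines
		}
		total++
		bytesServed += r.Bytes
		if r.Hit {
			hits++
		}
	}
	if err := scanner.Err(); err != nil {
		log.Fatal(err)
	}
	if total == 0 {
		log.Fatal("no records")
	}
	fmt.Printf("records=%d hit_rate=%.1f%% bytes=%d\n",
		total, 100*float64(hits)/float64(total), bytesServed)
}
```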
B
That's exciting, and I hope today to submit for review sort of the core of the cache mechanism that I've written as library code for Gitaly. So right now it doesn't get called yet, but it has tests that exercise, yeah, the interesting behavior. Web scale?
B
Yes, we try to keep it below web scale. And so, yeah, we're trying to get that into review already, although there's a slight chance that the log analysis teaches us that we need to tweak things a little bit, or that we need to make some optimizations, but I'm hoping we won't have to. We'll see.
B
Those are allocations that happen during the profile, but that doesn't count when they get freed again. And it looks like every byte that Gitaly serves also gets allocated, which is not a great way of serving data, because usually you would have a buffer, copy something into it, copy it into the network socket, and reuse the buffer, so you only allocate that buffer once. And because of the way gRPC works, and the way we structured this, it looks like every time we have a chunk of data...
B
...that we want to send from Gitaly back to Workhorse or GitLab Shell, we allocate a buffer, put it in there, send it out, and garbage collect the buffer. So every byte has to be allocated and garbage collected, and you can clearly see this in the profiler, where the RPCs that do git clone create a ton of allocations. So that's a big chunk of work, and we get a full copy of that chunk of work once this RPC is on, because it's doing the same: it has to send the same amount of data.
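A minimal sketch of the buffer-reuse pattern being described, using a sync.Pool so that streamed chunks share one buffer instead of each allocating their own. The sender interface, chunk size, and read function are stand-ins for illustration, not Gitaly's actual streaming code:

```go
package main

import (
	"bytes"
	"io"
	"log"
	"sync"
)

// chunkSize is a hypothetical response chunk size.
const chunkSize = 32 * 1024

// bufPool hands out reusable byte slices, so each streamed chunk does not
// allocate (and later get garbage-collected as) a fresh buffer.
var bufPool = sync.Pool{
	New: func() interface{} { return make([]byte, chunkSize) },
}

// sender stands in for a generated gRPC server-stream type; real code would
// wrap the bytes in a response message.
type sender interface {
	Send(p []byte) error
}

// copyChunks streams data to s while reusing one pooled buffer, instead of
// allocating every byte that gets served. This is only safe if Send is done
// with the slice by the time it returns.
func copyChunks(s sender, read func([]byte) (int, error)) error {
	buf := bufPool.Get().([]byte)
	defer bufPool.Put(buf)

	for {
		n, err := read(buf)
		if n > 0 {
			if sendErr := s.Send(buf[:n]); sendErr != nil {
				return sendErr
			}
		}
		if err != nil {
			return err // io.EOF marks a clean end of stream
		}
	}
}

type discard struct{}

func (discard) Send(p []byte) error { return nil }

func main() {
	r := bytes.NewReader([]byte("pack data to stream"))
	if err := copyChunks(discard{}, r.Read); err != nil && err != io.EOF {
		log.Fatal(err)
	}
}
```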
B
Actually, there's a picture. Let me just quickly show the picture for that, if I can find it. That's what you get for not preparing this.
B
Now, these graphs don't show the memory increase that clearly yet, because they were taken early. Yeah, this is the picture I was looking for. So these are all the allocations during 30 seconds, and this is an average: it's allocating 1.5 gigabytes in 30 seconds in this particular window, or over these profiles, and the worst one was allocating 14 gigabytes in 30 seconds.
B
It doesn't mean that it was using 14 gigabytes of memory at one time, but it was allocating that much worth of memory. And this part is sort of a given: that's the workload created by Git HTTP and Git SSH. So, on average across all these things, Git HTTP is bigger than Git SSH, but it will vary per server and per when you're looking.
B
But so this is a big chunk of work, and there's an identically sized chunk of work that is all the new RPC. And there's one server... we still don't really understand why, but in the dashboards, in the saturation graphs, there's one graph that goes up for the memory, and, no, I'm not going to... sorry, I need to stop sharing.
B
I don't want to keep showing things and slowing down more and more, so I just need to say it in words: there's an aggregate memory graph that goes up in the saturation, and if you try to see why that is... there are about 60 servers, and it all seems to be because of one, and that's... it's file 29, and I don't really understand what's special about that one.
B
But when you turn the RPC on, memory usage on that one sort of doubles. But we have plenty of headroom on the memory usage, so we can get away with it. And then, looking at the profiler, where there's this one outlier of 14 gigabytes or 15 gigabytes or whatever: you can't see in the profiler which server a profile is from, but I strongly suspect it's that one, because the memory usage also went up.
B
There. But so it's noticeable when you turn it on, but it doesn't seem to be a problem. And the big question that we can't answer yet is how much of a benefit we'll see once the cache is in place, because the RPC that just measures things does not offer any benefit besides measuring things.
B
Yeah, we know the hit rates. What we don't know yet, or what I find hard to predict...
B
We know that creating the pack-file responses uses a lot of CPU, and especially on canary one, where we have lots of CI, CPU usage on the Gitaly server is dominated by the CPU calculating the same thing over and over again. So it's obvious that that chunk of CPU should shrink, but reading from the cache causes I/O, and that means CPU time will be spent on I/O, and...
B
What is the net effect of moving from CPU to I/O? I don't know what the net effect is, for one. And the other question is that I have this ambitious goal of turning off the CI pre-clone cache on GitLab.com. What happens now is that we download, like, a stale clone from object storage, so that doesn't hit the Gitaly server.
B
So once the cache is in place, those small fetches will collapse onto just one or two, and that's already good. But then, if we stop downloading that data from object storage, we are generating it on the Gitaly server, and what I'm hoping is that we can just get away with that and we don't need this cache anymore, because it is fragile and it is not a GitLab feature: it's hard for other people to use this cache. But again, what is the net effect? How does it add up in the end?
B
So that's, yeah, that's where we are with that: we're learning from the analysis, we're gathering data, and we're submitting the next chunk of implementation work for review.
A
What will definitely be worthwhile, once the data collection is finished and once the analysis is done, is publicizing what we think the impact is going to be, because there was a lot of interest from both the Datastores team, and we were working with Gitaly on it as well. And I think once we can tell them what we think the impact will be, we should let them know, because that might help our MRs get through a bit faster as well.
B
Yes, yeah. One thing that I'm particularly curious about... I did some preliminary work, like, we got a couple small chunks of data and I did some analysis on that, but I didn't take it very far yet. One thing I'm very curious about is the difference between Gitaly servers, because if you pull the data out of BigQuery it's just everything, and you can do a global analysis, which is going to give you a ballpark of what to expect.
B
But there's going to be a lot of variation from server to server, and I expect that the cache hit ratio will be much higher on file canary one, because of all the CI, and on other servers it will be kind of meh. And just to see in a picture, like, what that distribution looks like will be interesting.
A
Thanks for showing us that and for the update there. I'm really looking forward to seeing the next step; I can't wait till the data is in.
A
So, is there anything else that anyone wants to go through or wants to show today?
C
I could perhaps show that we now have a stage group and stage mapping in Prometheus. I don't know how interesting that is.
C
I played around with that yesterday to get a number of how many endpoints a group would own. Let me look that up.
C
Muted... so yeah, sorry, I was muted. So before, we didn't have this information inside Prometheus; we only had feature categories. The feature categories are the things that we've defined as features that are owned by groups that are in stages, and we use feature categories to tag things inside GitLab Rails. They don't change that often. So we tag workers and endpoints with that, and this shows how many endpoints a group owns.
C
The most used ones... Package has a lot, we think because they support all the different kinds of package managers and so on, so that's all different kinds of endpoints. And the fun part about this is that this is all coming from Prometheus now. So we don't need to know the list of feature categories that are owned by a group to be able to look for this; we just need to know the name of the stage group, or we can even do this by stage, I think.
B
So the metrics have stage groups on them, so whatever possible values for stage groups exist in the metrics are whatever stage groups there are?
C
So... the metrics themselves, the interesting metrics, have feature categories on them, but then we export this thing. This is called gitlab feature, this one here. If I copy this...
C
And that's just a series... it's a series for every feature category, with the stage and the stage group on it, so we can join that to other things, and its value is just one all the time. That's something where...
C
It comes indirectly from the stages YAML file in the GitLab website repository.
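A sketch of the kind of join this mapping series enables, run through the Prometheus Go client. The metric and label names here (`gitlab_feature`, `feature_category`, `stage_group`, `http_requests_total`) are illustrative guesses, not the production names:

```go
package main

import (
	"context"
	"fmt"
	"log"
	"time"

	"github.com/prometheus/client_golang/api"
	v1 "github.com/prometheus/client_golang/api/prometheus/v1"
)

func main() {
	client, err := api.NewClient(api.Config{Address: "http://localhost:9090"})
	if err != nil {
		log.Fatal(err)
	}
	promAPI := v1.NewAPI(client)

	// The mapping series always has the value 1 and carries stage and
	// stage_group labels, so multiplying by it on feature_category attaches
	// a stage_group label to any metric that has a feature_category label.
	const query = `
	  sum by (stage_group) (
	    rate(http_requests_total[5m])
	    * on (feature_category) group_left (stage, stage_group)
	    gitlab_feature
	  )`

	result, warnings, err := promAPI.Query(context.Background(), query, time.Now())
	if err != nil {
		log.Fatal(err)
	}
	if len(warnings) > 0 {
		log.Println("warnings:", warnings)
	}
	fmt.Println(result)
}
```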
C
Yes, so it ends up in Thanos, which is where we want it. And there's something... because here we can see which Prometheus is exporting it, and, as a side effect of that, it has a lot of labels on it that we're not interested in and that we need to remove in every query. So I might look into that too.
C
Well, those... like, yeah, maybe. I don't know either how exactly.
C
It gets updated when you run a script, and something is going to shout at us by creating an issue if it becomes out of date.
B
Yeah, but so this is the first step, but once you know how to do this... if we're pulling that YAML file from the handbook, we can probably also see what Slack channels belong to these stage groups.
C
There's a discussion that I think Dylan from the search group raised, in the global search group: that suddenly their, like, Elasticsearch wasn't processing any new things anymore, and nobody knew, because we don't have alerts for those queues, because they're throttled; like, we expect there to be a build-up of work to be done, but we don't want to overload the Elasticsearch. But in this case nothing was being processed, and because it's using our general alerting rules, we didn't notice, yeah.
C
So that's probably the first thing where we're going to have, like, a routed alert first, so the global search team can keep an eye on this without us intervening, because they play around with Elasticsearch itself and so on. Yeah.
A
It's been nice to see the engagement from everyone, between the dashboards and this alerting piece. Everyone seems to be interested in what we're doing with this, and it's just so nice to see what we're doing being used. And I was appreciative of Hwang Min putting in that link to an OKR; Source Code now has an OKR linking to how to regularly review the dashboard.
B
And they don't know how to make them, and so then the question is who's best placed to help them. Can we help them?
C
Well, I think the dashboards were a good place to start, and I think that's what triggered it, because now they put lines on a dashboard, and they put red lines on the dashboard, and they want to get an alert when the blue line goes below a red line, for example.
B
I suppose what we're doing here is that we're pulling them into committing to the runbooks repo, and that's also where the alerts are. They're now in the directory where they're defining their own dashboard, and we need to somehow make it easier... or the goal is that they can also do things in the other directories, where the alerts are. And I don't want to make it sound like they should know how to do that, because I don't know how to do that either.
B
I find the runbooks repo scary, and sort of a jungle, myself, but...
C
One thing that I was thinking about, like, yesterday on the bicycle, was... I'm going to run this by Andrew once he gets back, because he has, like, a goal for this and just not enough hands to type it.
C
But then what I was wondering about was: maybe we could start building, like, a group catalog, like we have a service catalog and a metrics catalog, and then have a way for groups to define group-level indicators, like we have service-level indicators. So that would be, like, an automated way for that.
B
Yeah, that makes a lot of sense, because... I'm still not quite used to this, but I think this is what Andrew has been building with all this Jsonnet code: this idea of declaratively saying what should be out there and having as much as possible be auto-generated.
B
So if we can have a declarative syntax where a group could say "I'm interested in this metric doing that" and then the whole thing just rolls out, that would be one way to help them get the information they need.
B
Yeah, but it might be very interesting to work with Source Code, if they're the ones interested in having custom metrics, because working with them would teach us what sorts of things are useful, or what those metrics should be. One thing I'm wondering is: what sort of alerts do people want? Because they probably don't want their developers on PagerDuty. So do they want Slack pings? Do they want issues?
C
I only briefly looked at the code; that's all going through, what is it, the pager service thing? We have ways of routing alerts differently.
A
Slack makes sense, because it's a more immediate form of communication than an issue. Also, if we get into a stage where we're just creating tens and tens of issues for a stage group, we're just, like, throwing it in a hole, whereas I think if you're constantly sending things at Slack, at least you aren't destroying their backlog while we automatically create issues over the weekend.
B
You don't ruin their issues, but what I've seen with these alert channels is that they just got flooded with alerts and I lose track.
A
So we just have to be careful about what we choose to alert on, and also advise the groups themselves that they don't need to see noise, because they might come to us and say "I want to know every single time this line drops below that line." It's like: you don't actually want to know that; you want to know when it's been down for this length of time, or when these other things match up. Like, we need to discourage them from wanting to see noise, but that's...
C
That's the interesting but hard part about that, because some of those metrics... like the thing that Kagosh was working on with the CI trace conflicts, that discovered the, yeah, the problem that I won't mention on the public thing: that was, as soon as this was above a very low number, a number that we in infrastructure wouldn't really look at as something to worry about.
B
But that is not the same thing as having a noisy alert. It could be that, if you're a stage group and you know, like, this feature is being used in this way, then this number should never drop below x for more than y time.
E
When I talk to some stage groups, especially in Code Review and Support, they mention that they don't care about the real-time metrics... not that they actually don't care, but they care more about the metrics over a really long period, like one month or two months, because typically, when they perform some kind of optimization or ship some feature, it takes about one month for the optimization to prove effective. So they really want to compare, like, the month after a deployment against the two or three months before, to compare the performance of the optimization.
E
So every group is a little bit different, and even though we can post a lot of alerts to them, I really highly doubt there's much they can do to resolve the situation, because most of the time they can't do anything on the spot; they have to take some time to investigate, sometimes to implement a long-term solution, and then some really long time to make things go into production.
B
Yeah, the thing is: what if you have good dashboards, and you can fake having an alert by just looking at the dashboard regularly? And maybe for these long-term...
B
...is the right thing to do, because it's hard to define an alerting rule that would spot something that, as a human, you can see when you look at this dashboard. So maybe we shouldn't get too distracted by the alerting part... well, we shouldn't get so distracted by the alerting part that we forget that everybody needs to have a dashboard.
A
Time... like, I think we're gonna have to be flexible with treating different teams in different stages, because that's where they're at. But that's hard, because it's not, like, straightforward, saying "right, we're in this phase of the project, now we're in that phase of the project." I think it's going to be a little bit more messy than that.
B
But overall it's exciting, because, maybe I'm misremembering, but I feel like this has been in our team mission statement for a long time now: that we want to help stage groups understand how their stuff operates in production, and we're now at the point where we're actually engaging and becoming that bridge, or that link, between...
A
This is super... like, this is really fantastic to see, and I think this is a really effective way of almost scaling the infrastructure team, in a sense, because we're outsourcing looking at some of the problems back to the teams who own the code. And, if I might take this on a bit of a diversion: I've been involved in this conversation about infradev issues, and as part of being involved in the conversation, I've been looking at actual data. Like, so: are there more...
A
Are there really more infradev issues than there used to be? Are they being closed at the same rate or not, and why? And what I can clearly see in the data is that, since that massive hiring spurt in 2019, the MR rate has shot up and, at the same time, we've increased...
A
...how fast we get things to production. So naturally there are going to be more incidents that come out of that, because there's more change and we get the change there faster, but we're not doing enough on the other side, dealing with that; like, the infrastructure team is only a slight bit bigger than it used to be. And the most effective way, I think, to encourage people to be aware of what they're doing is to give them the visibility for that.
A
We've added these things to your dashboards, and this is how they work. So the short answer to the question is: it's coming, relatively soon. I need to double-check with Marin for when we need to start pushing this out, because the issue...
A
So that issue came about because we had a conversation about what needs to happen next to get to error budgets, and the statement was "oh, we're actually not that far, we just need to do x." So I said no, we'll write it down, because I'm concerned that it's not just x. And writing it down has produced this list of: this is what needs to happen in order to get to the next step.
C
Yeah, the time-consuming thing... time-consuming as in wall time, not people time... is that the metrics need to be adjusted, and then... I'm muted again. So the thing that's annoying with a lot of that work is that metrics need to change, and those metrics are used for alerts and graphs and all of those things. So we need to do that carefully, and so that takes time.
C
For just recording, basically, yeah: the first thing is recording, but to be able to do that in, like, a proper way, we need to tweak metrics to include feature categories and not explode cardinality. Like, now we have some histograms that have 15 buckets or I don't know what, and we can't add a feature-category label to those, because then... well, Ben's not going to shout at me because he's gone, but yeah.
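To see why a feature-category label on a 15-bucket histogram explodes cardinality, a rough illustration; the category and host counts are assumed for the example, not figures from the meeting:

$$
15\ \text{buckets} \times 100\ \text{feature categories} \times 50\ \text{hosts} = 75{,}000\ \text{series from a single histogram}.
$$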
A
Great, well, thanks so much for the time. Hoping you all have a great Friends and Family day tomorrow, and we'll catch up with you again next week.