From YouTube: Scalability Team Demo - 2021-09-02
A
Okay, so I have the first item, and that is that I want to share a little bit about the rollout of one of the new RPCs we built for the Git fetch efficiency project. And yeah, let me just share my screen. So this is a CPU utilization graph from the process exporter, which shows data summed across all Gitaly servers in production, and it shows two process groups. One is gitaly-hooks, here at the bottom, and one is Gitaly. And gitaly-hooks was created to make very small, relatively boring RPC calls.
A
But when we introduced the cache, it became a conduit through which all pack-file data has to pass. So then it started using a lot more CPU. And the first new RPC we deployed and turned on changes what gitaly-hooks uses, and you can see here the moment when we turned it on, so that is a nice drop in CPU.
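A minimal sketch of the kind of query behind a per-process-group CPU graph like this, assuming the standard process-exporter metric and label names (namedprocess_namegroup_cpu_seconds_total, groupname) and a hypothetical Prometheus endpoint and type label; the real dashboard query may differ:

```python
import requests

# Hypothetical Prometheus/Thanos endpoint; not a real internal URL.
PROMETHEUS_URL = "http://prometheus.example.com/api/v1/query"

# CPU seconds per second, summed across all Gitaly servers, split by process
# group (e.g. "gitaly" vs "gitaly-hooks"). Metric and label names assume the
# standard process-exporter conventions; the type="gitaly" selector is assumed.
QUERY = (
    'sum by (groupname) ('
    'rate(namedprocess_namegroup_cpu_seconds_total{type="gitaly"}[5m])'
    ')'
)

resp = requests.get(PROMETHEUS_URL, params={"query": QUERY}, timeout=30)
resp.raise_for_status()
for result in resp.json()["data"]["result"]:
    print(result["metric"].get("groupname"), result["value"][1])
```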
A
B
It's definitely on an interval. It's not constant, yeah.
A
Yeah, so we're missing short-lived processes, and gitaly-hooks is exactly a short-lived process, so there will be more seconds spent in CPU time for gitaly-hooks than this graph will show. Now, the other nice thing is that if you look at this graph of Gitaly CPU utilization, then you also see a drop, and in absolute terms it's actually a similar drop, because here we're in the, say, mid-90s and here we're in the mid-hundreds, between 100 and 110. So it's a drop of about 10 CPU seconds per second, and here...
A
Well, this is a slightly bigger drop here. It's something like 18, and now the peaks are four. So that's 14 CPU seconds per second, but it's the same order of magnitude drop. And it could be that we also deployed something else to Gitaly at the same time as when I changed the feature flag, but I think it makes sense that it was this.
B
A
Yeah, I thought so too, thanks, but yeah. We can't really be certain, because we just don't have enough insight into, well, the Gitaly process. There's a whole lot of different things, and some of it is spent on these hooks. But we can't go back in time and say, well, at that time the Gitaly process spent x percent of CPU seconds on hook traffic.
A
We can a little bit by looking at the Google Cloud profiler, but it's hard to get useful data out of that, I found. So that's one thing, and then another thing I wanted to show is gRPC message rates, and I have to filter this if I want to show a graph, because there are too many servers and too many RPCs, and we don't have a recording rule for these. So if I try to draw a graph across all servers, it just won't render.
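As a rough illustration of the filtering being described, a per-RPC message-rate query narrowed to a single server might look like the sketch below; the metric name follows the usual go-grpc-prometheus convention (grpc_server_handled_total), and the job and fqdn values are placeholders:

```python
# Without a recording rule, the raw series set (all servers x all RPCs) is too
# large to graph, so the query is narrowed to a single node first.
# The job and fqdn label values below are hypothetical.
QUERY = (
    'sum by (grpc_method) ('
    'rate(grpc_server_handled_total{job="gitaly", fqdn="file-01.example.com"}[5m])'
    ')'
)
print(QUERY)
```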
A
So this is the primary of the Praefect cluster where the handbook lives, and what you can see here is that you have these little bumps, which are PackObjectsHook, which is the RPC we replaced, and here these bumps are gone. And it's interesting that these peaks are up, I don't know what's up with that, but the good thing is that these are PostUploadPack messages, and we're going to replace that RPC too.
B
A
Network transmit bytes, node... this one, yeah. Oh, and then I can select...
A
I don't understand what this thing is doing. FQDN is file-praefect.
A
I did try to get some global numbers, but for that I could only do a table instead of a graph, and I did a one-day rate and a one-day rate offset by one week.
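The week-over-week comparison described here is roughly the following pair of instant queries, rendered as a table rather than a graph; the metric name is assumed:

```python
# Current one-day message rate per RPC, and the same rate one week earlier.
# Evaluated as instant queries, these render as a table rather than a graph.
RATE_NOW = 'sum by (grpc_method) (rate(grpc_server_handled_total[1d]))'
RATE_LAST_WEEK = 'sum by (grpc_method) (rate(grpc_server_handled_total[1d] offset 1w))'

print(RATE_NOW)
print(RATE_LAST_WEEK)
```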
A
So that gives us something to go by, and you see that one week ago we have 300k for PostUploadPack, 200k for SSHUploadPack and 100k for PackObjectsHook, and now we have 300k for PostUploadPack, 200k for SSHUploadPack, and then it drops to 10k for InfoRefsUploadPack, which is an order of magnitude less. So out of the top three it's sort of three, two, one, and we've dropped the one that was one. I don't know if this is a good measure for the improvement, but if this means that we can expect a drop in CPU that is three times bigger once we patch this up, that would be very nice. But I think that's wishful thinking, and we won't know until we get there, but anyway.
A
B
Can you, just for our entertainment purposes only, can you drop the Gitaly side of it?
C
A
Yes, yes, it does, but it's only one small piece of what's going on, and yeah. No, it does look really good, and it's interesting because we have had an ongoing support escalation with a customer where they're concerned about Git fetch performance, and it looks like the overhead of the hook executable is part of their problem, although a lack of concurrency limits is also part of their problem.
B
Do we ship any concurrency limits in Omnibus?
A
No, I think they're all off by default. We should probably put them in, yeah, because, well, right now, the way I see it from where I'm sitting, it looks like there's a difficult conversation going on convincing them to turn those limits on, and if it was the default, we wouldn't even be having the conversation. But yeah, a lot of people, including Matt from our team, are on top of that, and I don't want to get too mixed up in it.
B
A
Yeah, unbounded is never good, but for some reason it's a hard sell in that case. But to connect this to that support situation: it's nice that we can show that the thing that is using a lot of CPU in their case is dropping by 75 percent in the latest release.
B
A
Yeah, but part of the story here is also that when we built the cache, the plumbing of the cache, the fact that we make this extra RPC call back into Gitaly and that the cache is implemented there, that adds overhead. And for people who are not using the cache, they are getting the overhead and they're not getting the benefits of the cache.
A
And when we worked on this on gitlab.com, it appeared that the overhead didn't really make a noticeable impact, and it's more uniform if you just always have the same data path, and you can already log cache keys and stuff like that. So at the time we decided, let's just always go through the hook, but it turns out that in their situation the overhead does matter.
A
But yeah, on top of that, unbounded concurrency is also a problem. But what they're actually going to do now is, I think, backport Gitaly patches to not use the hook when the cache is off, which will help that particular installation.
B
A
Perhaps. One of the concerns we had at the time is that, especially on the handbook repo, we write every byte that gets served into the cache. So if somebody's hosting GitLab on a Raspberry Pi with an SD card as the repository storage, then all their traffic becomes SD card writes. That's a very exaggerated and contrived example, but doing a lot of writes can have a negative impact. A lot of disk writes can have a negative impact on the whole server, so we were concerned about just dropping those disk writes on everybody.
B
D
Well, I hope we see Andrew back in a minute. Yaakov, quick question for you: is this the MR, 3812, that I linked in the demo?
A
No, that's a different one, but this one I think made this bump a little lower. Is there something special about...
D
That MR? No, I'm trying to find the MRs you're talking about, because this type of performance save, even if they're not reliable across every single self-managed customer, would be nice to highlight in the release blog post. And I know you're not necessarily interested to highlight those things, but I think we should, given that, you know, there is a lot of work that goes into this.
A
Yeah, I do want to highlight these things, and I'll add a link to the rollout issue to the agenda. Excellent.
A
In
the
upcoming,
this
changes
feature
flags
and
in
the
upcoming
release,
I
asked
with
the
kisly
team
and
they
want
the
feature
flag
to
be
off
in
the
upcoming
release,
because
we're
calling
a
new
rpc
so
otherwise,
during
a
deploy,
it
looks
we'll
start
calling
an
rpc
that
doesn't
exist,
so
it
would
be
next
release,
plus
one
where
we
can
tell
people
that
it's
on.
D
Okay, great. I think it's more important than ever now to actually show that we are having some actual orchestrated work helping with some resource consumption and performance improvements, and so on. So keep that in mind. So I already see that you are, so...
A
That's good, yeah, yeah. And in the case of this, I don't know when you joined or how much you caught of this story about this support escalation with a self-managed customer.
A
I
I
can
tell
from
the
reactions
that
people
have
been
communicating
this
work
towards
the
customer,
so
it's
we're
already
using
it
there
to
show.
Of
course,
that's
just
one
one
audience
member.
We
want
to
reach
with
this
message,
but
it
it
is
being
communicated
already
is
awesome.
A
Yeah
but
yeah,
so
the
really
big
hope
I
have
is
that
if
we
look
here-
and
we
see
that
before
we
made
this
change,
backobjects
was
a
hundred
thousands
and
that's
the
nature
of
brokeback
is
200
000.
Then
postal
brokeback
is
300
thousands
and
I'm
getting
very.
A
Now,
if
we
have
300
000
messages
per
second
less,
does
that
mean
we
get
a
drop
three
times
bigger
that'd
be
really
nice,
but
I
don't
know,
but
we
we,
we
may
have
a
nice
graph
to
show
when
that
happens,
when
we
get
there.
A
Because
then,
we're
talking
about
because
this
is
a
drop
of
about
10
and
the
whole
graph
is
100
to
110,
so
a
drop
of
of
30
is
near
30.
So
that's
that's
a
lot,
but
the
the
real,
the
real
thing
I'm
hoping
for
is
that
what
I'm
hoping
to
see
is
that
these
bumps
on
these
abdex
graphs
of
gitly
servers
that
we
have
let
fewer
of
these.
That
is
really
the
thing
where
we're
going
for.
E
A
Now, well, I don't think... are we filtering calls that we shouldn't be filtering?
A
Yeah, because some of the ones we're filtering, we should be filtering, because we say that you cannot expect a one-gigabyte clone to happen in one second. So if it goes, one gigabyte...
B
It's
only
unary
calls
and
it's
a
it's
a
it's
a
subset.
It's
a
basket
to
indicate,
and
it's
anything
that
is
like
that
is
generally
taken
out,
like
I
think
the
biggest
unary
one
is
maybe
archive
or
something
like
that,
there's
a
whole
bunch,
but
yeah.
No,
I
mean
that
most
of
the
slow
things
are
out
of
there.
It's
yeah
it's
basically
looking
for
like
get
commits
that
are
slow.
Well,.
A
Yeah, one thing I remember changing is that OperationService, which is part of Gitaly, which makes merge commits and things like that... so I think that got excluded. That's...
B
That's gone now. It became its own service with vastly broader thresholds, and now it's gone, because it was just noise, yeah. We couldn't get it to kind of play nicely. So, you know, we kept adding things to the exclusion list, and then the problem was that the RPS on the service got so low, because we'd excluded so much of it, that it wasn't...
B
It wasn't very good, and then it just became noisy, and then we just said: let's kill the thing. And so OperationService is now totally excluded from all...
B
D
B
F
Yeah, because right now the durations of stuff that are, like, one, two... I don't remember, like, "satisfied" is one second, but, like, a GetCommit that takes one second, or a FindCommit, or whatever they are.
A
F
But that's the reason, what Andrew said: if we pull that apart, then we can say FindCommit, which we sometimes do hundreds of times within a request, needs to be faster than, I don't know, something lower than a second. But then that thing that you just mentioned, to create a merge commit or whatever, can be different.
A
Yeah,
I
think,
what's
also
happening
here-
is
that
with
kittley
we've
been
defining
these
alerts
for
way
longer
than
in
general,
with
error
budgets,
so
they're
more
refined
because
of
that
but
yeah
long
term
it
should
be
easier.
Oh
that's
what
we're
working
working
towards
right!
That's
teams
can
own
these.
F
B
Yeah, I would say, like, Jakob, that they are probably the ones that received the most deltas and changes, because of the amount of alerting, because we have the per-node-level alerting, which is kind of unique.
B
While
it
is
unique
in
our
system
and
we
get
so
many
more
alerts
because
we're
effectively
dividing
the
slis,
you
know
60
ways
and
we
have
60
different
buckets
that
we
putting
those
in
and
therefore
we
get
more
much
higher
volume
of
alerts
because
of
that
people
are
changing
them
much
more
frequently
than
almost
any
other
one.
B
And
so
I
would
say
that
there's
probably
like
a,
we
probably
need
to
go
through
them
at
some
stage
and
kind
of
get
everything
in
order,
because
it's
probably
been
a
thousand
small
changes
that
people
have
done
as
they
are
fed
up
with
getting
an
alert
at.
You
know
three
o'clock
in
the
morning
on
a
saturday
morning
or
whatever,
and
you
know
we
it
it
probably
needs.
Some
consolidation
is
what
I'm
trying
to
say
because
does
that
that
check
files
probably
changed
more
than
any
other.
A
Yeah,
but
I
don't
think
we,
the
kind
of
filtering
we
have,
there
would
be
like
a
stage
group
saying
this
route
should
be
excluded
and
this
route
should
not
be
excluded,
and
this
route
is
this
and
like
we
have
different
threshold
categories
in
the
application
and
and
that
level
of
detail
no
stage
group
can
currently
say
that
about
their
requests.
A
That's
what
I
meant
by
refined.
That's
the
the
way,
the
run
the
the
metrics
catalog
is
organized.
We
can
point
out
individual
individual
rpcs
and
ignore
them
or
not.
F
So wait a second, let me share my screen.
F
So right now, this is running on my local, and it's producing graphs that are not as impressive as the ones that Jakob was showing, because they're fake. But these are the SLI kind of metrics that we want: this is going to be the metric that we will allow stage groups to set thresholds for, and it's going to have two counters, the total counter and the success counter.
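A minimal sketch of the two-counter idea using the Python prometheus_client library; the Rails application uses its own metrics layer, and the metric and label names here are invented for illustration:

```python
from prometheus_client import Counter

# One "total" and one "success" counter per endpoint; the SLI is then
# success / total over some window. Names and labels are hypothetical.
REQUESTS_TOTAL = Counter(
    "application_sli_requests_total",
    "Total requests counted towards the SLI",
    ["endpoint_id", "feature_category"],
)
REQUESTS_SUCCESS = Counter(
    "application_sli_requests_success_total",
    "Requests that met the SLI's success criteria",
    ["endpoint_id", "feature_category"],
)

def record_request(endpoint_id: str, feature_category: str, success: bool) -> None:
    REQUESTS_TOTAL.labels(endpoint_id, feature_category).inc()
    if success:
        REQUESTS_SUCCESS.labels(endpoint_id, feature_category).inc()
```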
F
What I wanted to point out here is this nice period where everything's zero. These are the series that we will have recorded when the Rails application has just started on a new pod, for example. Where before that would be missing, now that will be zero, which makes it easier to calculate with, and it will avoid missing metrics, like we see now for the error budget, when suddenly there's a huge spike or a huge drop when the metric starts to record, when it's coming from nothing to something instead of going past zero.
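A sketch of that zero-initialization idea, continuing the hypothetical counters from the previous sketch: touching every known label combination once at boot makes each series exist at 0, so a freshly started pod reports zeros instead of gaps:

```python
# Assumes REQUESTS_TOTAL / REQUESTS_SUCCESS from the previous sketch.
# Hypothetical list of endpoints known at boot time.
KNOWN_ENDPOINTS = [
    ("ProjectsController#show", "projects"),
    ("API::MergeRequests GET /merge_requests", "code_review"),
]

def initialize_sli_counters() -> None:
    # Calling .labels() creates each child series at 0 without incrementing it,
    # so Prometheus scrapes an explicit zero right after the process starts.
    for endpoint_id, feature_category in KNOWN_ENDPOINTS:
        REQUESTS_TOTAL.labels(endpoint_id, feature_category)
        REQUESTS_SUCCESS.labels(endpoint_id, feature_category)

initialize_sli_counters()
```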
F
This also means that we will not have, like, one side of the graph being there and the other not. This shows the number of metrics: so this is the number of endpoints that we have in total, and we can see that total and success is the same, like, yeah.
F
One idea that I had while working on this merge request was limiting what endpoints we initialize in the beginning, based on the fleet that we'll be emitting them from. So, for example, the API fleet doesn't need to initialize all the controllers or the GraphQL controller, but it does need to initialize all the Grape endpoints.
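That per-fleet idea could look something like this, again continuing the invented names from the sketches above; the point is only that each fleet pre-creates just the series it can actually emit:

```python
# Assumes KNOWN_ENDPOINTS and the counters from the previous sketches.
# Hypothetical mapping of fleet name -> predicate for the endpoints it serves.
FLEET_FILTERS = {
    "api": lambda endpoint_id, _category: endpoint_id.startswith("API::"),
    "web": lambda endpoint_id, _category: not endpoint_id.startswith("API::"),
}

def initialize_sli_counters_for_fleet(fleet: str) -> None:
    keep = FLEET_FILTERS.get(fleet, lambda *_: True)
    for endpoint_id, feature_category in KNOWN_ENDPOINTS:
        if keep(endpoint_id, feature_category):
            REQUESTS_TOTAL.labels(endpoint_id, feature_category)
            REQUESTS_SUCCESS.labels(endpoint_id, feature_category)
```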
F
B
So it's funny, because I came into this call having spent quite a bit of time with Jegos, because he recently added, or someone in one of the teams that he works with added, some metrics, and they were basic... they couldn't even run it in table mode.
B
You know, with the one-minute rate, and it was just crashing, and kind of people are saying, well, now that everything is in pods, it's just the cardinality of everything, and everyone's kind of starting to complain about this; it's becoming a big thing. The very interesting part was that we then went straight to the Prometheus server and we're getting, you know... so we skipped Thanos, and we started getting much better results, instant results.
B
A
B
D
F
A
B
F
B
So what I actually just put at the bottom of the agenda, but what I always think, because, you know, we think in Apdexes and error budgets and that, but a lot of the engineering teams are still looking at, you know, histogram_quantile. You know, they want to know what the p95 is, even though it's terribly inaccurate.
B
That's what they're looking for, and they're always trying to do that on the raw data, and that's basically just failing. And I was wondering whether we could generate recording rules like that, that are useful for those quantile things, for all the things that we've got SLIs for, automatically, and then we can give people the option to run those. But you know what I mean, and then that's what they would use rather than the raw metrics. And then we just have to do some education and tell people about that. I'm...
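For context, the kind of ad-hoc quantile query people run against the raw data looks roughly like this; the histogram name is assumed, and the inner rate() over every pod's bucket series is the part that tends to fall over:

```python
# p95 straight from the raw histogram buckets: every pod contributes its own
# bucket series, so the inner rate() can involve a huge number of series.
RAW_P95 = (
    'histogram_quantile(0.95, '
    'sum by (le) (rate(http_request_duration_seconds_bucket{job="gitlab-rails"}[5m]))'
    ')'
)
print(RAW_P95)
```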
F
And this is the part that I'm not sure of, because in the past I've showed you, like on one of the talks, this weird trick that we did to get the feature category onto the HTTP requests total metric, and that we're only initializing part of it, because not everything is emitted from everywhere. And that was with Ben at the time, and Ben was worried about Prometheus, not Thanos, not querying...
A
Well, I suppose it depends on how many pods we have per Prometheus server.
F
B
I mean, I looked at it for something else the other day, and those Prometheus were, like... they're sitting kind of pretty at the moment. I mean, we should go look through it. You can go look at, like, what the sample rates on each of them are, and you can obviously just query that at the Thanos level, but they were pretty good, and yeah. You know, one of the things we should ask Mikkel to start thinking about is how do we add, like, a second Prometheus?
B
F
I like that; that's good. I want everything initialized, because now, like for the error budget, I get people creating an issue because suddenly something has moved to Kubernetes and then the metric hasn't followed yet, and then, yeah, it does these weird things, because the metrics weren't initialized, and that I want to get rid of. It's just easy if you don't need to think about it. But some of the things, like on the Git fleet, we're never going to have the web IDE render...
F
A
F
A
F
A
Well, one thing we could do is query each of these Prometheus servers individually, take a metric that exists across all pods, and do some sort of estimate of what the cardinality of that is, and see... it has a number of how many metrics it can even track, and then we can say, number of pods times 5,000, does it...
F
B
I think the metric that you can look at for the number in each is called prometheus_tsdb_head_series. Sorry, I was urgently scrambling to try to find that; I'll just stick it in here. I think that's what it is, if I remember correctly; I'll just stick it in there. This isn't really a demo, but it's just a little heads up.
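A rough sketch of the estimate discussed above, asking each Prometheus server directly for its head-series count and comparing it with a per-pod budget; the endpoints, pod count, and 5,000-series figure are placeholders from the conversation:

```python
import requests

# Hypothetical per-shard Prometheus endpoints; not real internal URLs.
PROMETHEUS_SERVERS = [
    "http://prometheus-app-01.example.com:9090",
    "http://prometheus-app-02.example.com:9090",
]
SERIES_PER_POD = 5_000  # placeholder figure from the discussion
POD_COUNT = 400         # hypothetical number of pods

def head_series(base_url: str) -> float:
    resp = requests.get(
        f"{base_url}/api/v1/query",
        params={"query": "prometheus_tsdb_head_series"},
        timeout=30,
    )
    resp.raise_for_status()
    return float(resp.json()["data"]["result"][0]["value"][1])

for server in PROMETHEUS_SERVERS:
    current = head_series(server)
    projected = POD_COUNT * SERIES_PER_POD
    print(f"{server}: {current:.0f} head series now, ~{projected} more if every pod adds {SERIES_PER_POD}")
```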
B
But I think there was, like, some concern around, you know, this whole big thing that's quite complicated, and also the fact that there are a lot of other use cases. It's not just Project Horse; it's actually, you know, lots of self-managed customers who could use this. So it was a bit of backwards and forwards. I'll just share my screen quick, sorry, just take that off there. And after a whole bunch of discussion, we ended up deciding that we're going to stick this into the runbooks project.
B
For
now,
and
maybe
in
future,
we
will
take
this,
but
also
the
other
metrics
catalog
and
move
them
out
of
run
books,
but
it
didn't
make
sense
to
kind
of
have
it
half
in
and
half
out
like.
We
should
definitely
keep
this
alongside
the
metrics
catalog
rather
than
a
part,
and
it's
too
hard
to
take
the
run
books,
metrics
catalog
out
of
the
run
books.
You
know,
with
the
with
the
time
frames
that
we
have
now
and
move
that
out.
B
But
what
I'm
kind
of
imagining
is
that
we'll
have
like
several
different,
like
topologies
like
a
get
hybrid
topology
and
that
will
have
its
own
metrics
catalog.
Now
a
lot
of
this
is
going
to
be.
B
Gets
is
the
gitlab
environment
toolkit,
which
is
effective.
I
mean
this
might
actually
end
up
being
called
the
reference
architecture.
The
hybrid
reference
architecture,
topology,
is
probably
actually
a
better
name
for
it.
Because
that's
what
gets
generates
you
know
you
could
you
could
stand
up
a
similar.
D
E
B
The... yeah, Terraform and Ansible, but yeah. So, you know, with Horse we've actually got this going, and it's deploying, and there's only one service, but it was really just to kind of prove that it worked. And so, if you look at GET, and actually if you look at the Helm charts, there's kind of the single service called webservice, which is a funny name, but anyway. And so, you know, just as an experiment...
B
I
created
a
metrics,
a
metrics
catalog
with
this
web
service,
and
you
know
it's
at
the
moment.
It's
only
got
a
puma
sli
in
it,
which
is
like
simple,
but
it's
kind
of
just
to
prove
it
out,
and
what
was
really
interesting
is
the
same
customer
that
I
suspect
the
same
customer
that
jacob
was
talking
about
earlier.
Somebody
who
was
working
on
that
was
saying
it's
really
awful
working
with
the
omnibus
dashboards
and
we
really
need
to
to
put
the
slo
monitoring
in
there
like.
B
How
can
we
do
this,
and
I
think
that
this
is
like
a
really
you
know
we
can
have
if
we
set
up
a
reference
architecture,
metrics
catalog,
we
can
generate
this
on
the
you
know,
for
that
customer
we
can
either
give
them
the
yaml
and
the
json
or
we
can
you
know
if
they
say.
Oh,
you
know,
we've
got
these
other
labels
or
whatever
we
can
just
put
them.
B
We
can
say
well,
you
know,
go
and
edit
your
your
jsonnet
file,
wherever
the
metrics
config
is
and
add
add
the
labels
that
you
need
or
customize
it.
However,
you
need
for
your
environment
and
then
run
it
again,
but
we
can
also
ship
it
with
like
the
the
recording
rules
and
the
and
the
dashboards
like
in
the
same
way,
we
have
the
gitlab
dashboards
project
at
the
moment
with
a
bunch
of
json
in
it,
and
we
could
do
that
and
one
one
of
the
reasons
why
I
like
this.
B
Is
it
kind
of
takes
there's
a
lot
of
things
that
need
to
happen
for
horse
and
like
just
keeping
this
out
is
like
one
less
kind
of
overhead
for
that,
and
it
also
gives
other
people
like
a
big
advantage
for
for
being
able
to
use
these
dashboards.
You
know
they're
not
specific
to
to
horse.
They
are
specific
to
multi-node
gitlab.
F
B
The thing that I want is, like: I want a way that people can, like, look at dashboards and understand quite quickly, like, the health of a GitLab instance. So it's still very much specific to GitLab instances, and I want to kind of be doing the same thing we do on gitlab.com, and having it kind of tied in with the same sort of rate of iteration that we have on those, you know. When I've looked at the Omnibus charts...
B
We could start off with just the CI runners in there, or a few of the CI services, maybe the Git service and the Gitaly service, and actually ask some of the people that are working with that client to give that a try and see what they're... you know, see, because they've already asked for this. So we could get that going quite quickly and give them the YAML and say, here's some rules, apply these rules, and then let's take a look at the dashboards.
B
D
F
B
It would have to live in a different place, not in... not in, yeah. So, I mean, I'm happy to discuss it more, but my main thing is: I really want to kind of get on with building up some dashboards, and, you know, we can move it out.
A
If
you're
trying
to
it's,
I
would
probably
find
it
easier
to
work
with
to
say
in
one
repo,
yeah
and
and
to
discover
what
the
structure
is
and
where
the.
Where
the
the
lines
are,
where
you're
going
to
cut
out
bits.
When
they're
just
directories
in
in
one
repo,
rather
than.
D
A
I went through a project of putting Workhorse into the main repository. There's a... if you want to move fast and try things, or if you...
A
B
So, I mean, my biggest worry was that people would check changes into the runbooks project that would break the downstream project, and then you, like, find out, you know, the next time you try to run it, and it all becomes... And just having it all together sort of solves that, even though there's a bit of extra complexity.
B
Yeah, yeah, so I think that's a reasonable thing. But also, yeah, I'm looking forward to, like... there have been some people that have reached out to me and said, like, oh, we want these dashboards, and maybe, you know, pinging them back and saying: hey, here's, like, something alpha, if you want to try it, you know, give it a try.
B
I
suspect
that
lots
of
people
use
different
job
names
on
their
on
the
names
for
jobs
in
their
gitlab
instance,
and
that's
going
to
be
kind
of
you
know,
because
I
don't
there's
no
standard
on
on
what
you
call
the
giddily
exporter.
So
we'll
do
things
where
we'll
select
job
equals
gideon,
but
someone
else
like
in
omnibus
has
got
a
different
name.
It's
called
like
giddly
prom,
or
something
like
that,
and
you
know
so.
They're
all
they're
all
different
and
that's
going
to
be
a
bit
of
a
challenge.
A
D
B
That's the work that was done, like, last week. We took as much of that conflict, all the conflict that I've seen so far, and we've put it into a single file, and there's one for the GET instance and then there's another one for gitlab.com, and it's got stuff like this: environment has stages, yes or no, so that you don't have, like, a stage label; and there's, like, you know, an environment label, for example, and that's another one.
B
I
think
type
label
is
going
to
be
there
forever,
because
it's
just
like
kind
of
fundamental
to
the
way
we
do
things
but
the
other
labels.
It's
you
can
configure
those
and
it's
got
a
whole
bunch
like
basically
all
the
differences
I'm
trying
to
put
into
one
file
and
then
and
then
you
know
we
can
we
can
do
it
that
way
and
actually,
interestingly,
if
you
go
look
at
a
lot
of
the
kubernetes
charts,
they're
all
doing
this
as
well.
B
So
you
know
all
the
the
the
kubernetes
monitoring
it's
all
presented
in
json,
primarily
they
they
normally
have
like
the
raw
kind
of
default
version
in
yaml.
But
all
of
the
like,
and
for
lots
of
different
projects,
I'm
seeing
them
presenting
it
as
jsonnet,
and
then
people
saying
you
know
they
say
if
you
want
to
change
this,
you
know
put
this
config
in
here
and
change
this
value
and
then
run
js
on
it
and
you'll
get
a
new
and
you
file
that
c2
environment.
B
So that was kind of tied in with the last conversation, but that was just... I did not mention it. Maybe that's when I ran away from the snake. We have... we haven't stopped, but we don't have, for every single SLI, we don't have a rate. So what I was thinking is, for every SLI that's based on a histogram...
B
...we generate, effectively, a sum by (le, significant labels) of the underlying histogram, and then it's very easy for people to do p90, p50, and...
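Roughly, the idea is a recording rule that pre-aggregates the histogram buckets by le plus the significant labels, so that quantiles become cheap follow-up queries; the metric and rule names below are hypothetical:

```python
# Expression a recording rule could evaluate, keeping only `le` and the SLI's
# significant labels (hypothetical names).
RECORDED_EXPR = (
    'sum by (le, environment, type) ('
    'rate(http_request_duration_seconds_bucket[5m])'
    ')'
)
RECORDED_METRIC = "sli_aggregations:http_request_duration_seconds_bucket:rate5m"  # hypothetical rule name

# With the aggregation recorded, p95/p50 become cheap follow-up queries.
P95 = f"histogram_quantile(0.95, {RECORDED_METRIC})"
P50 = f"histogram_quantile(0.50, {RECORDED_METRIC})"
print(P95)
```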
F
B
This is alongside that, yeah. And it comes from, you know, this call that I had, all this discussion I had with Jegosh, where he's trying to do this and it's just failing. And... but should we be facilitating this at all? Because, like...
B
I tend to agree. Like, I think that's probably better in general, but I don't know, in this case, if it was... I got the impression that these steps, there's multiple... there's, like, up to 20 steps in a single request, and that you don't want to be logging every single step, right? You don't want to have, like, 20 log lines per request.
B
But
this
is
a
variable.
This
is
like
a
ci.
These
are
sort
of
ci
processors,
and
so
they
have.
I
don't
you
know
it's
slightly
different,
so
you
couldn't
have
like
20
different
labels
for
20
variable
steps
in
a
in
a
ci
pipeline,
but
I
I
might
be
wrong
on
that,
but
also
the
other
thing
is
clearly
there's
something
wrong
with
thanos
and
it's
because
if
you
go
to
the
underlying
prometheus
to
promethei,
it's
it's
working,
much
better.
A
I was half joking, but I was getting confused, because I was running queries with lots of results, so basically all Gitaly methods, and all the graphs looked like they were at the bottom, and then there was all this white space above, and I thought, why is it sizing the y-axis to have all this white...
D
A
Okay, yeah, well, in this case, switching to classic health... I don't know if it does any good for you.