Description
With visual examples, in this video we discuss how to use Prometheus to monitor your runners. We go from activating the runner's Prometheus metrics endpoint, to creating queries with PromQL, to how we use Jsonnet at GitLab to build the Grafana dashboards for operating the GitLab SaaS Runner Fleet.
A
Hey everyone, Danielson here, principal product manager for Runner. I'm joined by Tomasz Maczukin, senior engineer for Runner, who's been with Runner for at least six years now — six years and two months, or maybe seven.

Tomasz is one of the original architects of all of the great things in Runner, and today we are talking about metrics, specifically the metrics endpoint in Runner. It's a question I get all the time from customers. They typically say: hey, David, I know you're building a bunch of new things in the GitLab UI related to observability and runners, but what's available today? And typically, when customers ask that, I say: well, if you look at our docs, we've got this Prometheus metrics endpoint that we expose on the runner, and you can do all these great things with it in terms of monitoring — let me just share my screen. That's typically the extent of my answer, because I haven't actually done a whole lot of this work myself. So, very recently, Tomasz added some new metrics.

Hopefully the screen is showing up. He added what he's calling the queue duration histogram metric, and when he shared it in Slack the graphs looked really great, so I thought it would be a great opportunity for the two of us to chat: talk about what this new thing is, what the benefits are, and maybe have Tomasz walk us through, for those of us less familiar with it, how you go from understanding the metrics to using them.
B
Okay, thank you. You're opening up a huge topic here — what can be done and how it can be done. I'll try, because there are things here that are really, really easy and things that are complex, especially if you are new to Prometheus monitoring. Let's maybe start from the basics. Let me share my screen.
I remember... I think this is the one I want to use. Yes. So, can you see the console? I think you can see the console, yeah.
Yeah, so the simplest thing is to enable the metrics endpoint in the runner. When you look at the config.toml file, this is the only thing you need: add a global setting, listen_address, with the address to which you want to bind. So basically this means 0.0.0.0:9402 — port 9402 on any possible interface.
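A minimal sketch of what that global section might look like (the port is up to you — 9402 is what is used in this walkthrough — and the rest of the file keeps its usual runner sections):

```toml
# config.toml — global settings (illustrative values)
concurrent = 4
listen_address = "0.0.0.0:9402"   # expose the Prometheus metrics endpoint on all interfaces

# ... the usual [[runners]] sections follow unchanged ...
```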
This will be on my laptop. The endpoint is, of course, accessible over HTTP, so we can close the configuration now. If I start a runner with that setting, you can see here the information that the metrics server is listening. We can forget about the rest of that output; there will be nothing interesting in it for us for now.
What we are interested in is how to read it. Fortunately, with the way Prometheus originally decided to handle metrics, it's a simple HTTP GET request to a web server that is now started on that listen address we defined. So when I make that request, I get a list of metrics with the values current for the moment I made the call. By the way, I think that output format is now a format of its own, OpenMetrics — it's still used by Prometheus, but I think it has evolved into a CNCF-maintained format for metrics.
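A quick way to look at that output by hand (a sketch, assuming the listen address configured above and the standard /metrics path):

```shell
# Fetch the runner's metrics endpoint and show the first few lines
curl -s http://localhost:9402/metrics | head -n 20
```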
This is what we have in the runner, and when you scroll through it there is a lot of information. Some of the metrics are very general, because we use a few of what are called, in the Prometheus world, collectors.
A collector is something inside the process that collects data which can then be exported through the metrics. You can see here some metrics named process_something_something. This is one of the collectors available in the Prometheus SDK for Go that you can just hook up to get some basic information about the process — for example, how many file descriptors the runner process is using right now. There is another general collector for Go-specific metrics; this again comes from the Prometheus SDK for Go.
It's able to hook into the Go runtime and give you some information about what's happening with your process, like the number of goroutines, things like that. To be honest, I don't remember the last time I had to worry about any of these Go-specific metrics. From these — let's call them default metrics — the one we've been watching a lot is the number of open file descriptors, because this is usually a limited resource on a server, and by default it usually has very small limits.
And with how Runner works, think about how many files it creates: any network connection in Unix is a file descriptor, but also, when Runner handles jobs, it stores a temporary file with a buffer of the log output. So by handling hundreds or thousands of network connections and thousands of jobs at once, we create a huge number of file descriptors. On the SaaS runners, it was a very long time ago that we discovered the default limit usually set for the process is way too small...
...and we had to increase it. So from these many default, general metrics, I think process_open_fds was the one I looked at the most, because we had a few cases when something went wrong and exceeding the open file descriptors limit caused incidents. So yes, that metric was definitely watched by us. The most interesting ones, though — and I think the ones that customers and users are mostly interested in — are the metrics...
...we created ourselves, the ones that describe the internals of how Runner works. These are kind of the business metrics — the business logic — that you, as an owner of the runner, want to look at to understand what's happening with your runner and whether your settings are what you would like them to be. And here at the very top...
...the queue... maybe let's grab it. I think the version I have compiled locally is older and doesn't have this new metric, so I may need to recompile the runner. Anyway, we have a list of metrics, and for now we don't have documentation of exactly which metrics we export, because that changes over time; initially we had it, but that documentation went out of date very quickly.
So what you can do, for example, if you are learning this, if you are new to it, is look at just the HELP and TYPE information. For example, you can see that there is a metric named gitlab_runner_concurrent, which is a gauge, and you can see the description: the current value of the concurrent setting. So in the config.toml file you have the setting named concurrent; it's one of the default settings and one of the few that are required.
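A quick way to browse that built-in self-documentation (a sketch, again assuming the endpoint configured earlier):

```shell
# List the descriptions and types of the runner-specific metrics
curl -s http://localhost:9402/metrics | grep -E '^# (HELP|TYPE) gitlab_runner'
```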
Then there is the limit setting: the concurrent one is for the whole process and is required, but you can have several workers in that config.toml file, and on a worker you may set a limit, so that metric is then provided for every worker. And we have gitlab_runner_jobs, which is also a gauge, and which shows you the current number of running jobs — of running builds.
So one of the simplest and probably most informative things you can do — let me open the Prometheus query explorer — is to find out what your saturation is. You have a runner, you run some jobs, you set some concurrent value.
The thing you would definitely like to know is whether it's enough or not: am I using all of that capacity, or maybe the concurrent value is too high and I can save some money and lower it a little? What you can do is take a simple proportion: we take the gitlab_runner_jobs metric and divide it by gitlab_runner_concurrent.
So: how many jobs are running, over how many there can be at maximum. What we usually also want to do here is aggregate it a little.
What does that mean in layman's terms? If we look at this gitlab_runner_jobs metric alone, let's see what it outputs. You can see here the HELP information, you can see the information about the TYPE, and then you can see there is the name of a metric, some strange things between braces, and a number at the very end. This is the name of the metric; this is the value of the metric.
My runner was started, but it isn't executing any jobs right now, so the value of the jobs metric is zero. But here we have what in the Prometheus world are called labels. Every metric can be labeled with one or more labels, every label can have one or more values, and this is what creates the dimensions. If I now executed, say, hundreds of jobs on that runner, this line here would be repeated multiple times.
It would have multiple entries with different values for the metric. Let's say I have only one worker there, so all of them would have the same runner and system_id labels, because in this specific case those would be static; but then state, stage, and executor_stage — these are labels showing us at what moment of the execution the job is, because a job on the runner can be preparing, can be executing, can be tearing down.
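Put together, a single sample in the exposition format looks roughly like this (a sketch — the HELP text and label values here are illustrative, while the label names are the ones discussed above):

```text
# HELP gitlab_runner_jobs The current number of running builds.
# TYPE gitlab_runner_jobs gauge
gitlab_runner_jobs{runner="example1",system_id="s_example",state="running",stage="build",executor_stage="docker_run"} 0
```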
A
Sounds like the call is getting worse. Sorry.
B
Why don't I have... okay, maybe let's go to Prometheus directly; maybe there is something odd happening with Thanos. So, gitlab_runner_jobs — let's go through that quickly. Yeah, we have a lot of data, so we can work with that. Let's go back to our previous queries. So this is... ah, and now I remember why: it's because of the multiple dimensions and the labels that we need to include, because the metric is at the global level — it's not labeled with a label named runner, because it's not per-runner.
When Prometheus reads that metric, you can see that it already has a few labels, and our Prometheus is reading the metric from, for example, a remote machine named runners-manager-private-blue-1-something-something, port 9402. So it creates an instance label equal to that value, just adds it here at the end, and then stores everything locally in the very magical way Prometheus stores data so that it can be queried.
Usually you don't do this by hand, because when working with this output you more or less know what you want to get. If we are working with Grafana — which will be another topic in a moment — Grafana is able to take these raw outputs; you tell Grafana what type of data it is, and it then adds the specific unit handling by itself, which is brilliant.
So, with this simple query, we can see how many jobs are executing on every runner divided by how many jobs can be executed on that runner. We get the concurrency — the saturation value — in the scope of the concurrent setting: how much of the full possible capacity of that runner is already used.
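A sketch of such a saturation query, aggregating away the per-job labels so both sides match on the instance label (the aggregation by instance is an assumption that fits the single-Prometheus setup described here):

```promql
# Fraction of the configured capacity currently in use, per runner process
sum by (instance) (gitlab_runner_jobs)
  /
sum by (instance) (gitlab_runner_concurrent)
```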
And this allows us to do a lot of magic. Now, what exactly can we do? It all depends on what metrics you are looking at, and you need to know what the metrics are about. This is something that is hard to explain, because many metrics are very low-level things. For me it's obvious what they are about and how to understand them, because I know at the core how Runner works; I understand the internal logic and the mechanisms that we have, so I know exactly what a given metric means. For people who don't do the engineering, who don't contribute to the runner and just use it for their work...
...some things may be a little harder. So this is an opportunity for us to maybe start explaining more about what can be done with the metrics, or to start sharing some Grafana dashboards where you don't need to understand the metrics, but rather have a dashboard that explains what you can read from it. Something to consider for the future. Anyway, we have multiple metrics.
If you go through the Prometheus documentation, you will see that there are three main types. There is a gauge: a value that represents the current value of something, and it goes up and down all the time. There is a counter: a counter is a value that only increases over time. Counters may be reset to zero, for example when the runner process is restarted, or when the maximum number allowed by the type is reached — all metrics in the Prometheus SDK for Go use float64.
So whatever the maximum value for float64 is will also be the maximum value for a counter before it resets to zero, and Prometheus, when dealing with such metrics, is able to handle the resets. A counter you can therefore understand as a metric that always goes up. And what's the difference? A gauge is a good way to show us things when we want to know...
...what the current number is at this moment — for example, this value of jobs: how many jobs we have right now. Now we have, say, five of them; in five minutes maybe it will be 50; in another five minutes it will be one. It will go up and down all the time, and we want to be able to see, at any point in time, how much it is. But we also have a metric named...
...for the jobs counter, which starts counting from the first job that is taken: each time the runner starts a job, it increments a counter of jobs that were ever started. So we can see how many jobs we have had at any moment, but we can also see a rate — how quickly we are adding jobs over time. Counters are very good for tracking events that happen over time.
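For example, a rate query over that counter might look like this (a sketch — the transcript describes a counter of started jobs; in Runner this is exposed as gitlab_runner_jobs_total, but treat the exact name as something to confirm against your own /metrics output):

```promql
# Jobs started per second, averaged over the last five minutes, per runner process
sum by (instance) (rate(gitlab_runner_jobs_total[5m]))
```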
Gauges are good for tracking the current state of something. And there is also one last type, which is a histogram. A histogram allows us to define buckets of numbers and put readings into those buckets, so that we can then, using Prometheus and/or Grafana magic, analyze the distribution.
So, as it seems I compiled Runner locally before that change, let's maybe go quickly to the code and the description of that metric.
Exactly. Okay, so we have a metric named gitlab_runner_job_queue_duration_seconds. This is a new addition, but it provides very important information to the runner user. When you create a job in GitLab CI, it goes through several states. The first state is created: when you create a pipeline, all jobs in that pipeline are created instantaneously, and then, depending on how the pipeline is processed, jobs are gradually transitioned from the created state to a pending state.
The pending state means that this job, given all the circumstances that control it, is now ready to be started. So GitLab transitions it to the pending state, and then that job waits for a runner to pick it up. As I always repeat, it's important to remember that it's the runner that asks for jobs: the runner calls GitLab and says, hey, do you have any job for me? GitLab then does a lot of magic, which we will not cover in this call, and finally comes back with one of the three most common responses.
Anyway, depending on how you configure the job, how your project is configured, and how many runners of different types and configurations you have available, the specific pending job you're looking at may end up being executed by only one of the runners available, or it may be executed by any of them. I can only say that it depends on the very specific configuration of your case. But let's simplify things: let's say I have my own GitLab instance, and on that whole GitLab instance I have one runner, nothing more.
If a job went to pending two minutes ago and it's still in the pending state, what does that mean? It means that the single runner I have didn't ask for that job. And from our experience, and from talking with many customers and community members, the speed of picking jobs up from the pending queue is probably one of the most important factors when it comes to runner user experience.
On gitlab.com, on our SaaS runners, people expect that when a job targets one of our SaaS runners it will be handled and taken from the pending queue in a matter of seconds, maybe a few minutes. When that goes beyond 10 minutes, 20 minutes, an hour, for most users this is something they can't accept. And now, from my point of view, as a person who manages these runners...
...this means that tracking that timing is important for me. Until now I was able to track multiple different metrics: I was able to know how the autoscaling of the runner is behaving, what the usage of the runner host's resources is — the file descriptors we talked about, all of that. I was able to observe, fine-tune, and notice that we are maybe reaching some capacity limits...
...and that maybe we need to reconfigure something. But knowing whether the runners pick up jobs in an acceptable time was hard, because we didn't have that information on the runner side. We have had that information for a few years on the GitLab side, but on the GitLab side it has two problems. First, we don't expose these metrics to our customers on gitlab.com, and if you are a self-hosted GitLab administrator, you may be providing that instance to users to whom you don't necessarily want to give the system's metrics either.
So access to that metric is already limited — if, for example, someone self-hosts runners for gitlab.com projects, there is no way for them to find out whether the runner picks up jobs as fast as they would like. The second problem is that on the GitLab side we label the metric only with whether the duration was counted for an instance runner or not — nothing else. We can't track it per project, and we can't track it per group.
Prometheus uses a lot of compute magic for the aggregations that let you access these metrics quickly. And this is one of the things we've often been asked, and unfortunately always have to decline: hey, there is this metric on GitLab, could you label it with the job ID, because I would like to track something per job? No, we can't, because per job that would be too big a cardinality.
We would quickly get hundreds, thousands, maybe even millions of jobs — on gitlab.com we handle a few million jobs every week — and each of those jobs would force Prometheus to create a separate time series in its storage, and then every query would need to gather all of these millions of time series to compute the data together.
If you work with such a big cardinality of label values — and we learned this the hard way — it kills Prometheus very quickly. This is why in GitLab we in fact have policies and checks about which metrics we add, what labels they have, and how many labels they have. Basically, I think at this moment we decline to add a metric that would have more than 10 labels by itself, because there are also a few more that Prometheus will add, especially if that metric is going to be tracked by gitlab.com's monitoring.
So this is a problem: on the gitlab.com GitLab instance we have a very limited understanding of where the jobs are. We only know the histogram of queue duration for instance runners versus non-instance runners, we can't give that information to the users, and even for us this was a big limitation. Now, since GitLab 16.4 — this is a feature released in GitLab and GitLab Runner 16.4 — that metric is also given to the runner.
So when the job is scheduled to the runner — when the runner asks for a job, GitLab finds a job, assigns it to this runner, and sends back the job payload — one of the new pieces of information in the job payload is the value of the queueing duration. So we send the runner all the details of how the job should be executed, plus: hey, your job was in the pending queue for five seconds, or 10 seconds, or 12 hours.
Yes, that happens. And now, speaking as a runner owner for the SaaS runners, I'm able to track that queuing time for every single runner that we have.
Let's get back — I'll copy this information — let's get back to our Prometheus, and for now let's not do any magic here. sum by (shard): shard is the way we label things on our infrastructure; it is not a label that is added automatically, so for anyone else this query will probably not work. You could use instance here — let's use instance for a moment. And why doesn't it work?
Oh yeah, yeah — because this is the metric name, but a histogram creates more than that: when you define a histogram metric like here, it's named like this, but because it's a histogram it will actually create three different series.
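For reference, a Prometheus histogram is exposed as three families of series; for the queue duration metric discussed here, that looks roughly like this (the bucket boundaries and numbers are illustrative):

```text
gitlab_runner_job_queue_duration_seconds_bucket{le="1"}     42
gitlab_runner_job_queue_duration_seconds_bucket{le="60"}    57
gitlab_runner_job_queue_duration_seconds_bucket{le="+Inf"}  58
gitlab_runner_job_queue_duration_seconds_sum                392.5
gitlab_runner_job_queue_duration_seconds_count              58
```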
A
The bucket one and the others, right. And Tomasz, before you jump off that screen — going back to it for a quick second for folks, some quick context; I don't think we mentioned this, or maybe Tomasz did: the screen Tomasz is looking at right now is, I believe, the Runner code base, and he's looking at the actual metric definition. Yes?
B
Yes, it's in the GitLab Runner code, in commands/builds_helper.go, if anyone would like to look at it. So if we execute here, it will give us some information.
Okay, yeah. So now what I can do here is see information about every separate shard. Previously, taking the metrics from GitLab, I could only know how well all the instance runners that we provide behave as a whole. And for the shards that are generally available, anything going wrong gets alerted on very quickly; on the other hand, Windows and macOS are still in beta — we are still experimenting with them, learning how to tune them to best fit...
...what we want. And while we are interested in measuring them too, and in providing as good performance as we can, we expect that something may go wrong there, so we would like, for example, to have lower alerting thresholds — or higher ones, depending on how you look at it. Previously I couldn't do that: it was all hidden in one bucket, which was just instance or non-instance runners.
A
Hey, a quick note — let me jump in here for customers that are watching this video. It's a quick note, and I'm sure it's pretty obvious, but I'm just going to call it out: on gitlab.com, these are all instance-level, or shared, runners. I know some of my customers have a mixed environment: you offer some instance runners, you offer runners at a group level, and in some cases you allow the group owners to bring their own runners.
So again, if you have a large runner fleet like GitLab's and you're doing tens of thousands of jobs per month across a mixed fleet, and you're thinking about how to do something like this, you have to think: okay, maybe start with my instance runners first, and think methodically about how you might want to implement monitoring, especially if your fleet is configured disparately. Thanks, Tomasz.
B
This is one funny PromQL query. So what do we do? We take this bucket metric for every single entry — and an entry here is this metric with every permutation of label-value pairs. We calculate a rate over the past five minutes, because this bucket metric is a counter.
Then we sum that by shard, because I want to know about shards, and we also sum by le, the bucket boundary. And then there is a Prometheus function named histogram_quantile, and this now tells me that on our private runners, ninety percent of jobs are handled from the pending queue in less than 7.8 or 8 seconds. On our most popular SaaS shard, Linux small — where is Linux small... oh, saas-linux-small was not yet updated to that version, so we will not see it here.
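The query being described has roughly this shape (a sketch; the shard label is specific to GitLab's own infrastructure, as noted above — substitute instance or your own grouping label):

```promql
# 90th percentile of job queue duration over the last 5 minutes, per shard
histogram_quantile(
  0.9,
  sum by (shard, le) (
    rate(gitlab_runner_job_queue_duration_seconds_bucket[5m])
  )
)
```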
Because it's the 90th percentile: 90 percent of jobs were taken from the pending queue in under 0.98 seconds.
So in less than a second. And now we are able to analyze that per shard, per instance, per whatever other grouping definition we define for ourselves, and we can say: hey, this is a virtual group that we want to have a specific SLI and SLO. For example, the small runners that we have here — our current...
...our current objective is set to also account for the fair scheduling algorithm — again, not something I would like to go deep into here, I agree.
But for the specific classes of jobs that finally land on the small runners, we expect them to be queued for no more than a minute. If these specific types of jobs are queued for longer than a minute, we are reducing our apdex score, and when the apdex score drops below some threshold, the EOC — the engineer on call — is alerted.
An incident is declared and we are forced to find out what's happening and how to fix the problem, so that our customers don't even feel that the problem is there. And that metric is based on how long their jobs sit in the queue, which now allows us to adjust this alerting for every single shard, deciding where we accept bigger or smaller delays.
A
A quick recap of the terms — sorry, Tomasz. A quick recap: Tomasz mentioned SLI which, just for clarity, is service level indicator, and I believe he mentioned SLO, which is service level objective. So Tomasz's team has service level objectives to meet for how quickly those jobs get picked up. And apdex — we'll hold that discussion for another day — is basically how Tomasz's team does alerting based on whether or not we're picking up those jobs within the SLO targets, the service level objective targets.
B
Okay. So we talked a little about what Prometheus metrics are, how we can learn something about them from the output, and how you can query them from Prometheus. Now there are two more things. The first is that all of this magic works only once you get the metrics into Prometheus, and for that there is no other way: you need to go through the documentation, learn how to set up Prometheus, and think about how to set it up in your case, because some people are using Kubernetes, some people are using bare metal...
...some people are using cloud instances with infrastructure automation, some cloud instances with everything done manually. I can't tell you what's best, because you probably know yourself how you would like to have it. Prometheus is by now a mature project, part of the CNCF, widely used across the cloud computing industry.
So at this moment it shouldn't be hard to find tutorials and good documentation on how to set it up for whatever specific case you are in. This is a part I really can't help much with here, beyond showing how we set it up for the gitlab.com infrastructure.
The point is that you need to get the metrics into your Prometheus server, and in that Prometheus server you need to configure scraping of the endpoint that we configured here. So: first, enable metrics on the runner and start the process. This is important — if you update the config.toml file while the runner process is already running, the metrics server will not be started; you need to restart the runner process. So start the runner process with the listen address in place, and this will create the listener.
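On the Prometheus side, a minimal scrape configuration for that endpoint could look like this (a sketch — the job name and target address are placeholders for your own runner host):

```yaml
# prometheus.yml — scrape the runner's metrics endpoint
scrape_configs:
  - job_name: gitlab-runner
    static_configs:
      - targets: ["runner-host.example.com:9402"]
```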
This one is maybe not as interesting, but let's look, for example, at the Runner manager details dashboard. All of the graphs we have here come from Prometheus metrics. Most of them are taken from the runner, but we also have some metrics from GitLab and some from GitLab Workhorse. There are a few parts of the GitLab application stack that are more or less related to GitLab CI/CD and that more or less indicate what's happening with the runners, especially in the case of problems and incidents.
We also track GitLab and Workhorse and a few other places to know whether an issue is caused by Runner, or whether Runner is just suffering because, say, we have a database incident right now — because all of the job queuing and assignment to runners happens through SQL queries to the database. If the database is struggling, it affects all of GitLab, including the runners. So we may, for example, be alerted here that the apdex dropped, but then, going through the incident...
...we may find out that, okay, it dropped because we have a database incident right now, and that's not something I can fix, because I'm not a database specialist. We need to get the SREs and our database team involved so that they can fix the database, and that will fix many things, including the runners.
Yeah, this is a live view of the metric we've been playing with here. This takes the metric and uses a Grafana heatmap panel type to show the timings. Below you can see the legend: the darker the color, the smaller the number of events in a bucket; the lighter the color, the bigger the number. And looking at it, you can see that the majority of jobs land here, below one second.
There were six jobs queued for more than an hour, and occasionally you see one to 15 minutes, but the majority of jobs are scheduled in below 10 seconds — that is what I can read from that graph. And given that our objective is that the specific classes of jobs should be scheduled in below one minute, and I can see here that most of the jobs are scheduled in below one minute, I know that everything is going fine, as I wanted.
A
And then, Tomasz, last question on this one before you move off of it. Everything is going fine — I don't disagree — but for those sections of the histogram where there's a little bit of banding, say up to the five-minute or maybe the 15-minute mark: as you look at this graph, should you be concerned? Should I usually be concerned about where those bands sit, at five or 15 minutes, or, because the frequency or the count is so low, is it not something to worry about?
B
Yeah, this all depends on what your objectives are, and this is the class of question where I can't give you a straight answer, because it depends: it depends on what your goal is, what your objective is, what your configuration is. For example, we want to bring the best user experience. We learned that for the users, one of the most — I won't say the most important, but...
...one of the most important indicators of their happiness with our platform is that their jobs don't sit long in the pending queue. The jobs themselves may take a longer or shorter time — it all depends on what you are doing there; for some people a job executing for more than five minutes is too long and they need to look for optimizations.
Basically, we sometimes need to throw money at it, over-provisioning our runner fleet to give it a lot of headroom to pick jobs up quickly. But you may be a user who is fine with these timings being much longer, waiting even an hour for a job to be taken from the pending state, because you produce maybe 10 jobs a week, you have one runner configured to handle only one job at a time, and you don't want to invest more money in that CI capacity.
You just don't require that much capacity and speed. And then, seeing in that histogram the jumps over an hour, towards infinity — for your specific case that may be totally okay, and you may not even care about that graph at all; you may be interested in other metrics about the runners to decide whether the runner is working as you want or not. So it's not something where I can give you a magical answer that fits all cases, because most cases will be different.
But from our point of view, as the SaaS runners' maintainers, I know that our users are interested in the quickest possible scheduling. So what I'm looking for here is having the lowest readings, and these occasional spikes are not bad — they're definitely below the one-minute mark, within the service level objective that we have. And like I said, it's not for all jobs.
It's for specific jobs, given how the fair scheduling algorithm works, and we would need to go a little deeper to understand why there are these few occasional readings of up to even five minutes. It's not a problem for me. But if we jump to — I think it was Thursday last week when we had an incident, Thursday or Wednesday, I think it was Wednesday.
Or not — anyway, we had an incident when... oh, I can see it, because it was on the small runners: we had an incident on the small Linux runners shard.
So we have Prometheus alerts that ping us, and we have Grafana to get an overview of what happened. And the big thing — the big problem for customers and users — is how to get to this kind of Grafana dashboard. And, yeah, my Firefox has been failing recently, so that's not a problem with Grafana.
B
This
is,
this
is
again
a
problematic
thing,
because
we
have
our
dashboards
created
programmatically
from
from
code
we
have,
the
project
is
public,
so
I
can
I
can
share
it
here:
dashboard,
EI
runners,
so
looking
at
looking
on
at
this
dashboard
that
we've
been
looking
here,
CA
Runners
incident
support
Runner
manager,
so
the
dashboard
that
you
can
see
here
with
all
of
the
panels
and
all
of
this
nice
color
photographs
is
defined
in
this
file.
To know what's happening here, you need to go through all of these includes, and here we go very, very deep, because if you would like to look at this project of ours, analyze it, and build on top of it, you first need to learn Jsonnet, and you need to learn the Jsonnet library for Grafana. So, for example, the queue duration metric that we've been looking at is defined in the job queue duration histogram graph, and here you can see where we had it...
Is it this one? Yeah, it's this one, I think — or maybe in the job histogram graphs. Anyway, as you can see, there is a lot of strange-looking code here. You would need to learn Jsonnet, you would need to learn how Prometheus works, the different panels of Grafana, etc., etc. If someone wants to do this at scale, going through that hard lesson will definitely pay off, as it did for us.
It was a huge change when we switched away from handcrafted dashboards in Grafana. In Grafana you can go through the UI: you choose the visualization — the heatmap; let's go to our job queue histogram — and you add the query. Of course, you first need to set up Grafana and point it at your Prometheus; there is documentation for that, and I definitely won't explain it here.
B
We
would
need
to
throw
away
this
because
this
is
part
of
jsonnet
templating
and
something
will
probably
not
work
here,
yeah,
because
we
use
variables
that
are
not
defined,
so
maybe
let's
get
rid
of
the
variables,
and
because
this
is
a
heat
map,
you
need
to
know
that
the
Heat
match
requires
that
you
will
use
a
heat
map
format
of
the
queue
and
you
have
more
or
less.
What
we
had.
We
can
then
go
to
cell
display
unit
is
a
time
represented
in
seconds.
B
And
probably
have
to
refresh-
or
maybe
it's
not
in
the
cell
display-
maybe
it's
in
the.
Why
yeah
it's
in
the
y-axis
so
time
seconds,
and
here
you
have-
you
have
basically
the
same
output
that
we've
got
in
the
dashboard
we've
been
looking
on.
All
of
that
clicking
can
be
eventually
turned
into
a
Json
file
if
we
go
back
to
our
dashboard,
I'm,
not
sure
if
I
have
here
an
option
to
look
to
the
code
of
the
full
dashboard
yeah
Json
model,
so
that
dashboard
that
you
can
see
here.
B
Which
takes
a
while
we'll
see
if
it
will
load
first,
two
rows
of
that
dashboard
are
exactly
the
same,
yeah
included.
So
here
you
have
exactly
the
same,
exactly
the
same
panels
that
were
on
the
previous
dashboard
because
for
the
runner
we
have
what
we
can
see
here
incident
support
dashboards.
We
recognize
that
there
are
some
repeating
patterns
of
running
instance
and
they
require
looking
on
specific
metrics.
So
we
group
that
in
in
few
in
few
groups,
but
for
each
of
them,
we
want
to
see
saturation.
B
We
want
to
see
the
updex
value
a
general
view
of
how
many
jobs
we
are
executing.
What's
the
queue
timing,
what's
the
queue
size?
All
of
that
is
repeated
on
all
of
these
five
dashboards
here,
but
below
we
have
things
specific
for
in
this
case,
is
the
database
incident
support
using
jsonet
using
this
Pro.
This
project
allows
us
to
define
a
reusable
component,
like
this
service
objects
panel,
that
we
can
then
pull
in
dozens
of
dashboards
and
don't
need
to
repeat
the
same
Json
definition
over
and
over
again.
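Stripped of the actual grafonnet library calls (which have their own API to learn), the idea is plain Jsonnet composition. A hypothetical sketch — queuePanel is an illustrative helper, not a function from GitLab's project:

```jsonnet
// A reusable panel definition: one function, many dashboards.
local queuePanel(title, selector) = {
  title: title,
  type: 'heatmap',
  targets: [{
    expr: 'sum by (le) (rate(gitlab_runner_job_queue_duration_seconds_bucket{' + selector + '}[5m]))',
    format: 'heatmap',
  }],
};

// Each dashboard pulls the component in instead of copy-pasting JSON.
{
  panels: [
    queuePanel('Queue duration: private shard', 'shard="private"'),
    queuePanel('Queue duration: saas-linux-small', 'shard="saas-linux-small"'),
  ],
}
```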
So if someone wants to work at a bigger scale, going through the hard time of learning Jsonnet and grafonnet, learning how to put all of that together and feed Grafana with these dashboard definitions, will definitely pay off over time.
Dashboard exports, yeah — it's on gitlab.com, the gitlab-org grafana-dashboards project, and that project, from what I remember, has a daily or hourly... yeah, I think it's a daily dump of the dashboards.
One other thing is that there are some annotations. An annotation is something you need to configure additionally in Grafana — that's a totally different story. I don't remember where on this list it is; I think it would be in the inputs or something like that, because you may have multiple metric sources for Grafana.
Basically, if you want to use our dashboards as they are, you need to feed them with the metrics they require, or in some cases you will just have empty panels. But if it's just the runner part you care about — if you don't care about many of the things here, but you want this graph, and this graph, and, for example, these graphs, because that's all you're interested in — you could copy that JSON file.
Here I can't go into edit because, like I said, these dashboards are created from code and they are marked in Grafana as not editable. But if you create the dashboard by hand, even from an import, you can go to each panel, edit it, and see exactly how it's created, what query it uses, what settings it uses. So you can learn, that way, which metrics we present and in what way, to give useful information to someone who manages the runner and wants to know what's happening with it.
A
Wow, Tomasz, this was eye-opening. I feel like I've learned a whole year's worth of stuff in the last hour. For folks watching this, please give a shout out to Tomasz Maczukin for us. I'm going to upload the video to GitLab Unfiltered — it'll be public — so customers, if you're watching this video and you find it helpful and you'd like us to cover additional topics in the future, please comment on the video.
A
Let
us
know,
I
might
also
maybe
create
an
issue
that
we're
linked
to
as
well,
but
we
definitely
want
to
get
your
feedback,
because
the
the
goal
for
this
is
to
give
you
the
information
that
you
need
to
manage
your
free
Tomas.
This
was
brilliant
I,
but
this
was
beyond
my
expectations.
Thank.
B
I'm happy I could give you useful information. Like I said, all of this is hard when you look at it for the first time, especially when you need to learn Prometheus. Prometheus has a specific mindset, let's call it, behind how it handles metrics, and it's not easy for everyone to understand at the beginning. When you start working with it and get used to it, it becomes very, very clear; at least for me it was very easy to work with...
...after a few weeks, once I had learned how to use it. The same goes for Grafana: when you first start using it, everything — the panels, the boxes, the settings — is very cryptic. There is documentation for most of it, but, as always with documentation, you read it and it's like: yeah, okay, I read that, and now what do I do? So the hard part is to start. I went through all of these stages.
B
I
couldn't
understand
why
why
we
are
doing
that
in
in
this
way,
because
I
was
used
to
the
monitoring
systems
like
zabix,
like
check
MK,
where
I'm,
sorry,
where
you
have
a
different
way
of
of
tracking
metrics,
or
there
were
some
systems
where
you
were
like
pushing
specific
metric
to
to
the
to
the
monitoring
system
and
feeding
it
with
with
some
specific
values.
And
here
we
have
this,
this
metric,
exporting
and
collection,
and
why
this
is
happening,
sometimes
asynchronously
and
then
the
specific
way
of
naming
metrics
I,
remember.
...I struggled for at least two years with what to put into a metric name and what to put into a label — what makes sense as part of the metric name, what makes sense as a label name and value. But once you work with it more and more, it becomes natural. Then, once we had these metrics, we had to start looking at them. We started by defining alerting, because Prometheus is not only a monitoring system, it's also a large alerting mechanism.
B
There
is
a
way
to
define,
alerting
rules
and
all
everything,
methods
and
routing
between
them
and
it's
a
huge,
huge
thing
by
itself,
but
we've
been
defining
that
by
hand.
Then
we've
got
grafana
how
to
connect
rafana
with
Prometheus
how
to
shape
the
dashboard
I.
Remember
the
CI
dashboard
that
we
used
before
we
migrated
to
this
Json
net
produced
One
had
about
100
different
panels.
We kept the rows collapsed because one day, when I tried to load that dashboard, I wasn't even able to enter the edit page — it kept failing to load, because there were so many things it was trying to pull from the Grafana API. We wanted to decompose it, but decomposing it into multiple dashboards meant I would need to copy and paste panel definitions, and once I update...
B
Something
I
need
to
remember
where
to
update
that,
and
this
was
like
stopping
us
from
decomposing
that
huge
dashboard
and
month
within
month
it
was
less
and
less
useful.
Until
someone
showed
me
that
hey
we
are
using
jsonnet
from
a
few
months.
Maybe
you
want
to
like
experiment
with
that
and
migrate.
The
runner
monitoring
to
that
so
again,
starting
from
scratch.
B
Learning
how
jsonet
Works
learning
works
in
the
graphonet
library,
so
jsonnet
library
for
grafana
learning
was
in
a
wrapper
to
that
library,
because
we
already
created
some
wrappers
for
the
most
most
commonly
used
panels
that
we
have.
How
is
how
how
how
we
Define
dashboards
so
that
they
look
like
they
look
and-
and
they
are
not
editable
and
have
some
marking
and
linking
to
code,
etc,
etc.
So in the first days it was like black magic; today — when we got this new metric, it took me maybe 20 or 30 minutes to update the definitions and start using the new metric in the way that we want. So it's very hard when you start, especially if you don't have experience with Prometheus and Grafana.
B
If
you
know
how
Prometheus
and
your
funnel
works,
then
you
know
everything
you
need,
because
the
runner
part
is
just
knowing
what
metrics
there
are
and
sometimes
to
understand
what
the
metric
represents.
You
just
need
to
go
into
the
code
to
sell
to
see
how
it's
gather,
how
it's
collected,
what
information
it
presents,
because
sometimes
it's
hard
to
like
put
into
the
description.
B
What
exactly
it
is,
and
you
like
need
to
feel
feel
what
part
of
the
code
it's
in
and
what
exactly
it
shows,
but
things
like
concurrent
limit
jobs
version
info,
which
is
a
static,
constant
value,
just
showing
what
version
information
about
the
runner
is.
These
are
these
are
things
that
you
just
need
to
to
understand
how
they
work
and
and
fit
into
your
parameters
and
and
gravana.
A
Awesome,
hello,
Thomas,
Thanks,
again
I'll
be
I.
Guess
we'll
be
seeing
you
next
time
in
one
of
these
hour
long
sessions
talk
to
you
soon,
bye-bye,
okay,
bye.