From YouTube: Dive into Grafana and Prometheus at GitLab
Description
GitLab Infrastructure SRE and Director of Engineering discuss internal usage of Grafana, Prometheus, Thanos, runbooks, and production observability.
B: This is the monitoring system and time-series data store. With Prometheus we're not just collecting data for visualization; we're also collecting data for alerting, so all of our production alerting and monitoring is driven by Prometheus data. That also means we can be very data-driven in our approach to finding the actual root cause of a problem: we can get that from our data.
B: The Prometheus monitoring system has a number of components. At the core is the Prometheus server, which is basically a data collector, a time-series database, and an API, and the API provides a query language to get at the data you've collected. So it collects all the data, stores it, and then you can query it from that single source of data, and each Prometheus server runs independently.
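To make the "data collector, time-series database, and query API" description concrete, here is a minimal Python sketch of querying a Prometheus server's HTTP API with a PromQL expression. The server URL and the metric queried are assumptions for illustration, not GitLab's configuration.

```python
# Minimal sketch: run a PromQL instant query against the Prometheus HTTP API.
# The URL and metric name are placeholder assumptions.
import requests

PROMETHEUS_URL = "http://localhost:9090"  # assumed local Prometheus server

def instant_query(expr: str) -> list:
    """Run a PromQL instant query and return the result vector."""
    resp = requests.get(
        f"{PROMETHEUS_URL}/api/v1/query",
        params={"query": expr},
        timeout=10,
    )
    resp.raise_for_status()
    payload = resp.json()
    if payload["status"] != "success":
        raise RuntimeError(f"query failed: {payload}")
    return payload["data"]["result"]

# Example: per-instance CPU usage rate over the last 5 minutes.
for series in instant_query("rate(process_cpu_seconds_total[5m])"):
    print(series["metric"].get("instance"), series["value"][1])
```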
B: It's designed to be modularized down to having one Prometheus per app: one Prometheus and one app, one Prometheus and one app. So it's designed to be hyper-sharded, essentially one-to-one with an application, and you'd have to have a single application so big that it requires horizontal sharding before you even need to start thinking about that.

A: Gotcha.
B: The configuration of Prometheus for alerting is all done through what are called rules, and rules are basically continuously executed queries. So, say you've got a bunch of data on memory usage: you can ask the cluster, what's the average memory usage for this process across my fleet?
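Conceptually, an alerting rule behaves like the loop below: a query that is evaluated over and over, firing when a condition holds. This is only an analogy, since real rules are declared in Prometheus rule files and evaluated by the server itself; the metric, job label, and threshold here are made up for illustration.

```python
# Analogy for a Prometheus alerting rule: a continuously executed query plus a
# condition. The URL, metric, job label, and threshold are illustrative only.
import time
import requests

PROMETHEUS_URL = "http://localhost:9090"  # assumed
EXPR = 'avg(process_resident_memory_bytes{job="my-app"})'  # fleet-wide average
THRESHOLD_BYTES = 2 * 1024 ** 3  # arbitrary example threshold (~2 GiB)

while True:
    resp = requests.get(f"{PROMETHEUS_URL}/api/v1/query", params={"query": EXPR}, timeout=10)
    result = resp.json()["data"]["result"]
    if result:
        avg_bytes = float(result[0]["value"][1])
        if avg_bytes > THRESHOLD_BYTES:
            print(f"ALERT: average resident memory {avg_bytes / 1e9:.2f} GB exceeds threshold")
    time.sleep(60)  # Prometheus evaluates rule groups on a fixed interval like this
```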
B: Sure. In general, the GitLab SRE team follows what are called the USE method and the RED method, which means we typically don't look at things like the memory utilization of our servers. We look at the latency of our application, we look at the error rate of our application, and we look at the saturation of our application; we're looking at these golden signals, as they're called.
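As a hypothetical illustration of those golden signals expressed as PromQL, the snippet below sketches one query each for error rate, latency, and saturation. The metric names follow common exporter conventions and are not GitLab's actual series names.

```python
# Illustrative PromQL for the golden signals; metric names are assumptions.
import requests

PROMETHEUS_URL = "http://localhost:9090"  # assumed

GOLDEN_SIGNALS = {
    # Requests per second that returned a 5xx status.
    "error_rate": 'sum(rate(http_requests_total{status=~"5.."}[5m]))',
    # 95th percentile request latency, computed from a histogram.
    "latency_p95": 'histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))',
    # Saturation example: how full a worker pool is.
    "saturation": 'sum(worker_pool_active) / sum(worker_pool_size)',
}

for name, expr in GOLDEN_SIGNALS.items():
    data = requests.get(f"{PROMETHEUS_URL}/api/v1/query", params={"query": expr}, timeout=10).json()
    print(name, data["data"]["result"])
```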
B: We have these signals standardized across all of our services, and that's all checked in. From a simple logistical perspective, all of our rules are checked in to a git repo, and those rules are automatically synchronized up to our Prometheus servers, so Prometheus automatically picks them up and acts on them.

A: Gotcha.
B: We have this idea that everything that has to do with paging and alerting and on-call should all live in the same place. Say we have a new service that we're turning up: we add some new alerting rules for that service, but we also want to add the documentation for what to do for that service, and that can all be done in one merge request. Or say you're going to change a threshold: we change the threshold, we document why, and we document any changes to the runbooks, to the troubleshooting playbooks for our production, and we update any dashboards that need to be updated when we add that new service. So the dashboards are also in the same repo; the runbooks, the troubleshooting guides, and the "how to fix production" docs all live in this one big runbooks repo.
B: So this is a data templating language. It was inspired by the need to configure jobs running in Borg, which is basically the same thing as Kubernetes, so there are some people using this same templating language to configure their Kubernetes jobs instead of using something like Helm, because Helm is not very powerful in terms of language expressions. This has a much more powerful expression language that lets you do functional programming and other kinds of input and output.
B: This is inheriting from multiple template layers, like a general graph panel: this is a new graph panel, it contains the data source, and all of these things are templated. It just pulls in one template after another, and so this whole, entire, complicated dashboard with all of these graphs, a huge thing, is all coming from that.
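The templating language being demonstrated here is most likely jsonnet, which GitLab's runbooks repository uses to generate Grafana dashboards, though the name is not spoken in this part of the transcript. As a rough Python analogy of template-driven dashboard generation, the sketch below composes a dashboard out of reusable panel templates; the helper functions and field names only loosely mirror Grafana's dashboard JSON model and are illustrative, not a real API.

```python
# Rough analogy for template-driven dashboard generation. Real GitLab
# dashboards are generated with jsonnet; these helpers are simplified stand-ins.
import json

def graph_panel(title: str, expr: str, datasource: str = "prometheus") -> dict:
    """A reusable 'graph panel' template: only the parts that vary are passed in."""
    return {
        "type": "graph",
        "title": title,
        "datasource": datasource,
        "targets": [{"expr": expr}],
    }

def service_dashboard(service: str) -> dict:
    """Compose a whole dashboard for a service out of panel templates."""
    return {
        "title": f"{service} overview",
        "panels": [
            graph_panel(f"{service} request rate",
                        f'sum(rate(http_requests_total{{job="{service}"}}[5m]))'),
            graph_panel(f"{service} error rate",
                        f'sum(rate(http_requests_total{{job="{service}", status=~"5.."}}[5m]))'),
            graph_panel(f"{service} p95 latency",
                        f'histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket{{job="{service}"}}[5m])) by (le))'),
        ],
    }

print(json.dumps(service_dashboard("web"), indent=2))
```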
B: There are a couple of blog posts and some other documentation on what's called practical anomaly detection with Prometheus; if you search for those keywords you'll find this stuff. This is all generated through these service recording rules. So if I look I can find... let me see if I can grep for this in my source code right here. Looks like I'm in the right directory.
B: "Service operation"... so what we're doing is, for each service, we have these recording rules, and we create these component operation rates. We get this from, say, Redis commands processed; that's a recording rule defined elsewhere in our rules, and we take that and we aggregate it by environment, tier, and type.
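A hypothetical version of one of those component operation rate rules, expressed as an ad hoc PromQL query, might look like the following; the metric and label names (redis_commands_processed_total, environment, tier, type) are assumptions standing in for GitLab's actual recording rule definitions.

```python
# Illustrative 'component operations rate': take a raw counter and aggregate
# its per-second rate by environment, tier, and type. Names are assumptions.
import requests

PROMETHEUS_URL = "http://localhost:9090"  # assumed

EXPR = (
    "sum by (environment, tier, type) ("
    "  rate(redis_commands_processed_total[1m])"
    ")"
)

result = requests.get(
    f"{PROMETHEUS_URL}/api/v1/query", params={"query": EXPR}, timeout=10
).json()["data"]["result"]

for series in result:
    labels = series["metric"]
    print(labels.get("environment"), labels.get("tier"), labels.get("type"), series["value"][1])
```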
B: And if we scroll down here, you can see this component operations per second, averaged over one week. What we do is generate an average over one week and a standard deviation over one week. So we take a whole week's worth of data in Prometheus, figure out what the standard deviation is over that, and record it as a separate piece of data.

A: Mm-hmm.
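A sketch of that weekly average and standard deviation signal, and the z-score that can be built from them, is shown below. In production these values are precomputed by recording rules; here the same expressions are simply run ad hoc, which, as discussed later, would be far too slow to do live on a dashboard. The series name is a stand-in.

```python
# Weekly average / standard deviation anomaly signal, run ad hoc for
# illustration. In production these are recording rules; names are assumptions.
import requests

PROMETHEUS_URL = "http://localhost:9090"  # assumed

RATE = "sum(rate(redis_commands_processed_total[5m]))"
AVG_1W = f"avg_over_time(({RATE})[1w:5m])"        # one-week rolling average (PromQL subquery)
STDDEV_1W = f"stddev_over_time(({RATE})[1w:5m])"  # one-week rolling standard deviation

# z-score: how many standard deviations the current rate is from its weekly norm.
Z_SCORE = f"(({RATE}) - ({AVG_1W})) / ({STDDEV_1W})"

result = requests.get(
    f"{PROMETHEUS_URL}/api/v1/query", params={"query": Z_SCORE}, timeout=30
).json()["data"]["result"]
print(result)
```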
A: Gotcha, interesting. Well, you mentioned performance in relation to anomaly detection. Have you hit issues with that, where there were queries that had performance issues, or maybe weren't advisable to run because of that kind of concern?
B: Well, yeah. We have a lot of different dashboards, and so we've had to go and create these recording rules to summarize out the one-week values, because I can ask Prometheus for this right now and it will go and calculate it, but if you want to graph a whole one-week standard deviation, that's going to be stupidly slow, because it's going to be pulling millions and millions of samples into memory and trying to do a live calculation on them.
B: For every component, the standard thing in Prometheus is this thing called a metrics endpoint. The metrics endpoint is just a simple HTTP interface: when you do a GET on /metrics, it dumps the current state of all the data, and this is a standard interface that every Prometheus target uses. So if we look at targets in our production environment, you can see that... well, those are probes, and we don't care about probes. Why are we still looking at probes?
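For reference, exposing a /metrics endpoint like the one described takes only a few lines with the official prometheus_client library; the port and the metrics themselves are arbitrary choices for the example.

```python
# Minimal /metrics endpoint using the official prometheus_client library.
# The port and metric names are arbitrary example choices.
import random
import time

from prometheus_client import Counter, Gauge, start_http_server

REQUESTS = Counter("demo_requests_total", "Total demo requests handled")
QUEUE_DEPTH = Gauge("demo_queue_depth", "Current demo queue depth")

if __name__ == "__main__":
    start_http_server(8000)  # serves GET /metrics on port 8000
    while True:
        REQUESTS.inc()
        QUEUE_DEPTH.set(random.randint(0, 100))
        time.sleep(1)
```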
B: Let's look at something more interesting. So here we go: here is 9101, that's cAdvisor, so here's data coming from cAdvisor. We can't actually click this because it's on our internal network and we can't get to it, but you can see that we'll reach in and get this data out of our internal 10.x network, and it will print out the metrics. And if we go to something like prometheus.demo.do.prometheus.io...
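Scraping one of those /metrics endpoints by hand, which is essentially what Prometheus does on every scrape, can be sketched like this; the target URL is an assumption, and any exporter's endpoint would work the same way.

```python
# Fetch and parse a /metrics endpoint in the Prometheus text exposition format.
# The target URL is an assumed placeholder.
import requests
from prometheus_client.parser import text_string_to_metric_families

TARGET = "http://localhost:8000/metrics"  # assumed exporter endpoint

body = requests.get(TARGET, timeout=10).text
for family in text_string_to_metric_families(body):
    for sample in family.samples:
        print(sample.name, sample.labels, sample.value)
```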
B: Let's do that. It returned 780 metrics, which is a little noisy; we wanted to know about this one in particular. So we have this rule group, and if we go up here and filter for this rule group string, and say let's only look at one Prometheus server, since we have... wait.
B: One of the things we were talking about is the fact that you might have hundreds of Prometheus servers. GitLab has not hundreds but dozens; we've got about 20 or 30 or so, I think, and that number is growing as we expand. And in order to view all the data all at once... so we were looking at this one.
B: ...one Prometheus server. In front of Prometheus we've actually got a proxy system called Thanos, and if you go to Thanos and look at the stores, you can see all the backend Prometheus servers, and also what are called... there are different backend types. There's the sidecar, and those are attached to the individual Prometheus servers, so we've got, say, a Prometheus server running in GKE on Kubernetes.
B: We've got these other ones that are just running under regular Chef management, and then we've got these rule evaluation servers, which we can get to later. But then we also have these things called store servers. One of the things we do is we have this thing in Thanos that takes... so Prometheus is constantly scraping data and storing it in its time-series database, and then what happens is...
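Because Thanos Query exposes the same HTTP query API as Prometheus, the same kind of client code shown earlier works against it and fans out across all the backend Prometheus servers it proxies. A sketch, with an assumed endpoint:

```python
# Query Thanos Query, which proxies the backend Prometheus servers and can
# deduplicate results from HA pairs. URL is an assumed placeholder.
import requests

THANOS_QUERY_URL = "http://localhost:10902"  # assumed Thanos Query endpoint

resp = requests.get(
    f"{THANOS_QUERY_URL}/api/v1/query",
    params={"query": "up", "dedup": "true"},
    timeout=30,
)
for series in resp.json()["data"]["result"]:
    print(series["metric"], series["value"][1])
```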
B: We have a subgroup of infrastructure that is focused on observability, and so we're working on expanding and improving Thanos performance, we're working on expanding our Elasticsearch infrastructure, and we're trying to come back and look at observability from a high-level perspective.
B: Okay, let me give you some fun numbers. From the ops environment we monitor all of our Prometheus servers in production; the number is 22 right now. You can see how many of what are called head series each one has, which is how many current metrics we're tracking. If we just take a sum of the head series, we get a high-level number of 29 million: 29 million individual metrics.

A: Mm-hmm.
B: But that's deduplicated across all of our high-availability instances. In Prometheus we have high availability where there are two Prometheus servers both doing the same thing: they're configured identically, they have slightly different data in the time-series database, but they're operating independently. And if we execute that without the deduplication, we have 58 million metrics.
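The head series numbers being summed here come from Prometheus's own self-monitoring metric, prometheus_tsdb_head_series. A sketch of reproducing the two figures, with the Thanos endpoint and the dedup toggle as assumptions about the setup being demonstrated:

```python
# Sum the current head series (active metrics) across all monitored Prometheus
# servers. prometheus_tsdb_head_series is Prometheus's self-monitoring metric;
# the URL and dedup toggle are assumptions about this particular setup.
import requests

THANOS_QUERY_URL = "http://localhost:10902"  # assumed

def total_head_series(dedup: bool) -> float:
    resp = requests.get(
        f"{THANOS_QUERY_URL}/api/v1/query",
        params={"query": "sum(prometheus_tsdb_head_series)", "dedup": str(dedup).lower()},
        timeout=30,
    )
    return float(resp.json()["data"]["result"][0]["value"][1])

print("deduplicated:", total_head_series(dedup=True))                  # roughly the 29 million figure
print("raw, HA pairs counted twice:", total_head_series(dedup=False))  # roughly 58 million
```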
B: Bytes over time... I was just doing some auditing of our production database for all of this. All the production-side GCS data, which covers about the last year of raw metric data plus downsampled data for the last five years, comes to about 26 terabytes of metric data, and that's in a heavily compressed format. Yeah, it's quite a bit.