From YouTube: Delivery: Intro to Monitoring at GitLab.com
Description
Follow along as members of the Delivery team discuss the various components involved in monitoring GitLab.com.

This is a re-recording. So, monitoring: there are quite a lot of components to it, so here I list all the components that we actually install somewhere, plus a bunch of stuff that we use externally. Some of these are part of the Prometheus suite: Alertmanager, Prometheus, and the Pushgateway are all components of Prometheus itself.

Grafana is our fancy dashboard that we all know and love. mtail is a component that we use to monitor our logging; it generates metrics based on very fancy regexes, which is amazing, and it's produced by Google, I think. We've got a component called Thanos, which I'm not 100% familiar with and which is what drove the creation of this issue, and Trickster, which I learned about and is new to me.

There are many, many exporters for various services, some of which we create ourselves; our applications export metrics, and Prometheus is geared towards scraping all of them. And then we still have InfluxDB and a relay sitting in our environment. So I'm just going to kind of roll through what I've discovered, and if you have questions, feel free to stop me and try to figure out what in the world I'm talking about.

As far as Alertmanager goes, this is responsible for sending alerts of all shapes, sizes, and forms everywhere.

I decided not to dig into precisely how rules are set up, because we have so many of them, but Alertmanager is configured to send alerts to various places. PagerDuty and Slack are the primary destinations, and the type of rule determines the severity that PagerDuty or Slack reacts upon.

Alertmanager is set up in a redundant fashion: we've got two nodes sitting in our production environment, two nodes sitting in ops, and, because we still have some infrastructure in Azure, we have an Alertmanager sitting in the Azure environment as well.

What I thought was kind of interesting is how Prometheus actually operates with Alertmanager. It's the job of the Prometheus service to reach out to an Alertmanager; Prometheus simply sends the alert data to the Alertmanager, and it's up to the Alertmanager to send the alert out. So in the production environment, for example, Prometheus knows of four total Alertmanager servers; it picks one, and the alert goes out in some way based on how the Alertmanager service is configured. Staging, for whatever reason, will only use the production Alertmanagers; the disaster recovery environment uses the production Alertmanagers; the pre environment uses both the production and ops Alertmanagers; testbed uses both as well; and ops will only use the production Alertmanagers.

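To make that handoff concrete: Prometheus just delivers alert payloads to one of its configured Alertmanagers over HTTP, and Alertmanager does the routing. The sketch below pushes a test alert the same way, using Alertmanager's v2 API; the hostname, labels, and annotation are made up for illustration.

```python
import requests
from datetime import datetime, timezone

# Hypothetical Alertmanager address; the real nodes live in the production
# and ops environments and each Prometheus server picks one of them.
ALERTMANAGER = "http://alertmanager.example.internal:9093"

alerts = [{
    "labels": {
        "alertname": "ExampleServiceDown",   # made-up alert
        "severity": "s1",                    # severity drives PagerDuty vs. Slack routing
        "environment": "gprd",
    },
    "annotations": {"summary": "Test alert sent from a script"},
    "startsAt": datetime.now(timezone.utc).isoformat(),
}]

# Alertmanager receives alert payloads on this API and routes them according
# to its own configuration, exactly as described above.
resp = requests.post(f"{ALERTMANAGER}/api/v2/alerts", json=alerts, timeout=5)
resp.raise_for_status()
```
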
A little odd, I know. At one point in time we had a desire to move as much monitoring as possible into the ops infrastructure, and I wonder if we're just in the middle of that, but I never saw an issue discussing it, so I can't validate whether that's true or not, or whether it's something I made up while I was dreaming. Welcome, jarv. So yeah, that's Alertmanager.

So, Grafana: we've got two dashboard instances. You probably know this, but they are dashboards.gitlab.com and dashboards.gitlab.net. Dashboards.gitlab.com is the public-facing one: anyone can see any metrics that that specific instance has access to, but people supposedly cannot modify or create dashboards there, or rather, their dashboards would not be saved if they tried to create them. Dashboards.gitlab.net is our internal one: all the developers can log into it and create dashboards.

Kind of nifty: we create dashboards automagically from the runbooks. We have a job inside of our GitLab runbooks project, a small shell script that looks through a directory we've got sitting inside of there. Excuse me, where is that directory? Ah, dashboards. It's just a bunch of libsonnet files that generate a lot of dashboards.

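The mechanics of a job like that are roughly as follows. This is a hypothetical sketch, not the actual runbooks script (which is a shell script driving jsonnet): it assumes the dashboards have already been rendered to JSON and pushes them to Grafana's dashboard API.

```python
import json
import os
import requests

GRAFANA_URL = "https://dashboards.example.com"   # placeholder Grafana instance
GRAFANA_TOKEN = os.environ["GRAFANA_API_TOKEN"]  # API token with editor rights

def upload_dashboard(path):
    """Read one rendered dashboard JSON file and push it to Grafana."""
    with open(path) as f:
        dashboard = json.load(f)
    resp = requests.post(
        f"{GRAFANA_URL}/api/dashboards/db",
        headers={"Authorization": f"Bearer {GRAFANA_TOKEN}"},
        json={"dashboard": dashboard, "overwrite": True},
        timeout=10,
    )
    resp.raise_for_status()

# Assume the .libsonnet sources were rendered to JSON earlier in the pipeline.
for name in os.listdir("dashboards/rendered"):
    if name.endswith(".json"):
        upload_dashboard(os.path.join("dashboards/rendered", name))
```
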
Dashboards.gitlab.com only has access to application metrics for staging and production; infrastructure-wise, it has metrics for staging, production, and the ops environments. All of the data sources are created manually. This is a known thing, and I'm pretty sure Ben has created an issue to address it in the future.

For the internal instance I'm pretty sure the same is true, but we have metrics coming from many, many places. Thanos Query is one location; I'll get to Thanos Query and what that is later. We've got an Elasticsearch data source for our production environment as well as some of our abuse cluster stuff; I don't entirely know what all that entails. We've got InfluxDBs for staging and production.

There's something for the webpage speed tests or whatever. We have infrastructure data sources for the staging, production, disaster recovery, ops, pre, and testbed environments, and we have application metrics for all of those except testbed, which I thought was kind of interesting, but that's probably just a small oversight. I'm pretty sure all these data sources are entered into that instance manually, so any time the instance needs to get rebuilt, someone needs to go in there and recreate them, which I think kind of sucks, because that happened recently to the dashboards.gitlab.com instance.

InfluxDB is another component: we've got four nodes, and they all serve as database servers, two of them in staging and two of them sitting in production. As far as I know, just the web, git, API, and Sidekiq fleets actually talk to the InfluxDB relay, which is installed on all those servers, and the relay pushes the metrics to InfluxDB.

mtail is another component I mentioned earlier. This looks at our log data, and there are, I used the word amazing, regexes. There's a cookbook literally called gitlab-mtail, and this is where we store all of our configurations for the amazing regexes that generate our metric data. These are pretty complicated; I don't know how we test them, but somehow we do, and the result makes it out there and becomes a scrapeable endpoint for Prometheus.

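Conceptually, mtail tails a log file, applies regexes to each line, and exposes counters that Prometheus then scrapes. The sketch below shows that idea in Python purely for illustration; it is not mtail's own program syntax, and the log path, regex, and metric name are invented.

```python
import re
import time
from prometheus_client import Counter, start_http_server

# Invented log format and metric; the real programs live in the gitlab-mtail
# cookbook and are considerably more involved.
STATUS_RE = re.compile(r'HTTP/\d\.\d" (?P<status>\d{3}) ')
http_responses = Counter("example_http_responses_total",
                         "Responses seen in the log, by status code", ["status"])

def process(line):
    match = STATUS_RE.search(line)
    if match:
        http_responses.labels(status=match.group("status")).inc()

if __name__ == "__main__":
    start_http_server(9999)          # /metrics becomes a Prometheus scrape target
    with open("/var/log/nginx/access.log") as log:
        log.seek(0, 2)               # start at the end of the file, like `tail -f`
        while True:
            line = log.readline()
            if line:
                process(line)
            else:
                time.sleep(0.5)
```
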
Prometheus is responsible for scraping metrics and for sending alerts to Alertmanager, so it knows how to alert on things. We have many, many jobs across a few different types of Prometheus instances; we have three Prometheus types: one simply called prometheus, plus prometheus-app and prometheus-db.

All three exist in production and staging; the other environments, like ops, disaster recovery, and pre, only have the prometheus and prometheus-app instances. From what I could tell, all of our jobs are in Chef, so there are roles that define all these scraping jobs. I did find some inconsistent configurations: for example, we have prometheus-db, but we have all this stuff related to monitoring our database inside of the prometheus-app instance, which I thought was kind of awkward.

But Prometheus does a lot. Its configuration is pretty convoluted, but it's easy to determine what it does, because everything is in Chef, and from what I could tell it's very well configured; there are just some inconsistent configurations in there. I have an open issue to figure out how we got to where we are, with the app instance pulling some database information.

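One quick way to see which scrape jobs a given Prometheus instance actually ended up with, and to spot inconsistencies like database jobs living on the app instance, is to ask its HTTP API for its active targets. A minimal sketch, with a placeholder hostname:

```python
from collections import Counter
import requests

# Placeholder host; point this at the prometheus, prometheus-app, or
# prometheus-db instance you are curious about.
PROMETHEUS = "http://prometheus.example.internal:9090"

resp = requests.get(f"{PROMETHEUS}/api/v1/targets", timeout=10)
resp.raise_for_status()
targets = resp.json()["data"]["activeTargets"]

# Count targets per scrape job so the instance's responsibilities are visible.
jobs = Counter(t["labels"]["job"] for t in targets)
for job, count in jobs.most_common():
    print(f"{job}: {count} targets")
```
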
So, exporters: we have a ton of them. Alertmanager has its own endpoint for scraping metrics, so Prometheus has a way to figure out whether or not we are actually able to alert. We have an alert to tell us whether or not we're able to alert, which is pretty funny.

Blackbox exporter is a Prometheus component. It's primarily used for probing things that we need to check from an external endpoint, kind of like what we do with Pingdom, but with the blackbox exporter. It's just a service that we run.

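For reference, the blackbox exporter exposes a /probe endpoint that Prometheus scrapes, with the target passed as a query parameter, and you can exercise it by hand the same way. In the sketch below the hostname and module name are assumptions, not our actual configuration.

```python
import requests

# Placeholder blackbox exporter host; 9115 is the exporter's default port.
BLACKBOX = "http://blackbox.example.internal:9115"

resp = requests.get(
    f"{BLACKBOX}/probe",
    params={"module": "http_2xx", "target": "https://gitlab.com"},
    timeout=15,
)
resp.raise_for_status()

# The response is plain Prometheus exposition text, e.g. `probe_success 1`.
for line in resp.text.splitlines():
    if line.startswith(("probe_success", "probe_duration_seconds")):
        print(line)
```
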
Fluentd is kind of an interesting one. Instead of running a dedicated service as an exporter, it's actually a plugin for fluentd written in Ruby, and Prometheus will scrape any data that the fluentd service provides us. cAdvisor, for some odd reason, we have only on our Gitaly nodes, so I don't exactly know what metrics we're getting out of cAdvisor that we don't get out of, say, the node exporter.

We also have something called gitlab-gitaly; I don't know what it actually is, but it's installed alongside cAdvisor in the same Chef recipe. I forgot to do further investigation on this, so I feel uneducated about it, my apologies, but it's something else that gets installed on all the Gitaly nodes. I completely forgot to circle back around to that, which sucks; I need to do that at some point.

We have one exporter dedicated to PgBouncer; again, we're using a community-provided one. And then we have a project called gitlab-monitor. I learned about this, it's new to me, but it monitors a few items for Prometheus, database stuff that I guess neither the Postgres nor the PgBouncer exporters provide: database bloat, database mirroring, row counts. And I guess we have something specific to CI builds and remote mirrors.

All of that is built into our gitlab-monitor project, and apparently gitlab-monitor is something that we also ship to our customers in the omnibus package, so that was kind of interesting to learn about. gitlab-monitor also provides us with some process metrics, but I didn't look into what those process metrics are, or why we don't have that covered by something like the node exporter or the process exporter.

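Whatever the exporter, the contract is the same: it serves Prometheus text exposition format over HTTP and Prometheus scrapes it. A small generic sketch of fetching and parsing such an endpoint; the URL below is a placeholder, not gitlab-monitor's actual address.

```python
import requests
from prometheus_client.parser import text_string_to_metric_families

# Placeholder endpoint; every exporter discussed here serves this same format.
EXPORTER_URL = "http://exporter.example.internal:9168/metrics"

resp = requests.get(EXPORTER_URL, timeout=10)
resp.raise_for_status()

# Walk the metric families and print each sample, just to show the structure.
for family in text_string_to_metric_families(resp.text):
    for sample in family.samples:
        print(sample.name, sample.labels, sample.value)
```
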
Sorry, what I think it does is look at processes that you configure and gather various metrics about them. HAProxy, for instance: we want to make sure that we're only running a specific number of those instances, and we want to monitor the CPU usage of the HAProxy processes.

This came about because of the problems with HAProxy being single-threaded. We were running the single-threaded version, and it would just max out the CPU on a single core and then we would have performance problems, so we added the process monitoring so we could watch it a bit more closely. We have the multi-threaded version now.

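To give a feel for what per-process metrics like that look like, here is a toy sketch using psutil. It is illustrative only: the real collection is done by the exporter just described, and the process name is only an example.

```python
import psutil

def haproxy_process_stats():
    """Count haproxy processes and sum their CPU usage, roughly what a
    process exporter gathers before exposing the numbers as metrics."""
    count = 0
    cpu_total = 0.0
    for proc in psutil.process_iter(["name", "cpu_percent"]):
        if proc.info["name"] == "haproxy":
            count += 1
            cpu_total += proc.info["cpu_percent"] or 0.0
    return {"haproxy_process_count": count, "haproxy_cpu_percent_total": cpu_total}

if __name__ == "__main__":
    print(haproxy_process_stats())
```
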
That's just because we didn't want to create a separate node for something that's not going to be used very heavily. I can't recall off the top of my head how many services actually use this; it's not many. I'm pretty sure Andrew set up a few services that send metrics to it, but it's not heavily utilized.

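For context on services that send metrics rather than being scraped: the push-based component in the Prometheus suite is the Pushgateway mentioned at the top, and the client side of a push looks roughly like this. The gateway address and metric name below are placeholders.

```python
from prometheus_client import CollectorRegistry, Gauge, push_to_gateway

# Placeholder gateway address and metric name.
registry = CollectorRegistry()
last_success = Gauge(
    "example_job_last_success_unixtime",
    "Last time the example batch job finished successfully",
    registry=registry,
)
last_success.set_to_current_time()

# One HTTP push groups these metrics under job="example-batch" on the gateway,
# where Prometheus then scrapes them like any other target.
push_to_gateway("pushgateway.example.internal:9091", job="example-batch",
                registry=registry)
```
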
The Stackdriver exporter: we want to make sure we capture all the metrics that GCP is providing us, so we've got a dedicated node set up to monitor all that stuff. GitLab itself, as we know, has its own ability to generate metrics, so there's a list of all the port numbers and services that generate metrics for us.

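As a reminder of what it means for the application to export its own metrics, here is a minimal, generic sketch using the Python Prometheus client. GitLab's actual instrumentation is not this code; the port, metric names, and endpoint label are invented.

```python
import random
import time
from prometheus_client import Counter, Histogram, start_http_server

# Invented example instruments; a real service registers many of these.
REQUESTS = Counter("app_requests_total", "Requests handled", ["endpoint"])
LATENCY = Histogram("app_request_duration_seconds", "Request latency in seconds")

def handle_request(endpoint):
    with LATENCY.time():                        # observe how long the work took
        time.sleep(random.uniform(0.01, 0.1))   # stand-in for real work
    REQUESTS.labels(endpoint=endpoint).inc()

if __name__ == "__main__":
    start_http_server(8000)  # /metrics on this port becomes a scrape target
    while True:
        handle_request("/api/v4/projects")
```
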
The one that I want to highlight out of here is the Unicorn one. When I was doing my investigation of this, as well as when I was hopping on one of our recent incidents, Unicorn would constantly time this endpoint out; it would constantly take over ten seconds, and it makes me think we're losing some metric data because of that. But Andrew made it seem like this is a very well-known problem and that there's an issue somewhere to address it. So I thought that was interesting.

Okay, the HAProxy exporter: this is dedicated to gathering HAProxy metrics out of the socket that the HAProxy admin interface provides. I already talked about mtail. Prometheus exports metrics itself, so it scrapes itself to determine whether Prometheus is running or not, which is pretty fun. For the Redis service we use the community-provided Redis exporter, and the registry has its own metrics interface when you enable it, on the debug port 5001.

Thanos sends our data to the cloud for long-term storage and allows us to query that long-term storage quickly without needing to store or cache the data locally for long periods of time, so we keep our disk space usage under control. It also allows us to actually modify what's in long-term storage, so that we are not blowing up our cloud storage bills. It has a few components, and Thanos Query is one of them; this is what allows us to connect to the various Thanos Store instances.

B
The
query
is
the
end
point
where
we
would
send
queries
to
so
Gravano
would
use
Thanos
query
as
the
data
source
for
itself.
We
only
have
one
of
these.
It
sits
in
the
ops
instance
and
it
is
configured
to
look
at
everything
icebox
that
we
have
in
our
environment
and
we've
got
the
network
peering
in
GCP,
set
up
to
cross
various
projects.
To
allow
this
to
work,
the
Thanos
query
is
going
to
connect
to
the
Thanos
store
instance.
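Thanos Query speaks the same HTTP query API as Prometheus, which is why Grafana can point at it unchanged. A minimal sketch of issuing a PromQL query against it, with a placeholder hostname and an invented job label:

```python
import requests

# Placeholder Thanos Query endpoint; 10902 is its default HTTP port.
THANOS_QUERY = "http://thanos-query.example.internal:10902"

resp = requests.get(
    f"{THANOS_QUERY}/api/v1/query",
    params={"query": 'up{job="example-job"}'},  # invented PromQL selector
    timeout=30,
)
resp.raise_for_status()

# Print each matching series and its current value.
for result in resp.json()["data"]["result"]:
    print(result["metric"], result["value"])
```
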
When the Thanos Store receives a query, I guess it's somehow got magic programmed in, so it knows how to interpret that query and where to find the data. If you send it a query for old data, it somehow knows to look at object storage; if it knows the data is supposed to be from five seconds ago, it knows to go look at the Thanos Sidecar for the data. I'm going to skip the compactor for a moment. The sidecar is what runs next to Prometheus.

So for every Prometheus box we have a Thanos Sidecar service that runs. It looks at the data that Prometheus has dropped on disk, reads that data and serves it up as a way of querying it, and it also has a secondary process for shipping that data to object storage for long-term storage.

The last one I wanted to mention was Thanos Compact. This is primarily there to save us on billing costs, so it maintains a set of lifecycle rules. I don't think I linked directly to that, so maybe I won't be able to find it quickly, but I do think it's important for us to know that information.

Perfect, okay. So yes, Thanos Compact will actually reach out to long-term storage, so object storage, and it modifies the resolution of our data based on these rules. For the past one year we should be keeping every single metric available to us that we ever captured. If it's older than a year, we change the resolution of those metrics to five minutes; I'd be curious what kind of algorithm it uses so that we're not severely hampering ourselves, but I don't know. That covers up to five years' worth of data; anything older than five years we keep indefinitely, but the resolution changes to an hour.

Thanos Compact also has an awkwardness in that it needs to be the only thing running against a given bucket, so it's a singleton on purpose, and it needs to remain that way, because otherwise two compact instances might run, bump into each other, and corrupt our data, which is really fun to hear about.

We have a helper file simply called generate-inventory-file. It uses a Chef search to figure out all the Thanos instances and determine which port they're running on, and we feed some extra data into it to determine whether or not that host is public. I don't know why we need this, and I don't know why we need that third item, but that's our Chef search in particular, just to look at it really quickly.

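The shape of that helper, as described, is roughly the following. This is a hypothetical sketch rather than the real script: the node attributes, the stand-in for the Chef search, and the output format are all assumptions.

```python
import json

def fake_chef_search():
    """Stand-in for the Chef search that finds the Thanos-running nodes;
    the attribute names here are invented."""
    return [
        {"fqdn": "prometheus-01.example.internal", "thanos_port": 10901, "public": False},
        {"fqdn": "prometheus-app-01.example.internal", "thanos_port": 10901, "public": False},
    ]

def generate_inventory(path):
    """Write a simple inventory of Thanos endpoints for Thanos Query to consume."""
    endpoints = [
        {"address": "{}:{}".format(node["fqdn"], node["thanos_port"]),
         "public": node["public"]}
        for node in fake_chef_search()
    ]
    with open(path, "w") as f:
        json.dump(endpoints, f, indent=2)

generate_inventory("thanos-inventory.json")
```
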
All right, so Trickster. This one is new to me. Trickster is a reverse proxy for Prometheus, so our dashboards.gitlab.com instance uses Trickster to communicate with Prometheus; it doesn't talk to Prometheus directly. We get kind of a secondary win with this: because Trickster is only configured to look at the specific application or infrastructure Prometheus instances for those environments, dashboards.gitlab.com will only have access to those instances.

Any questions about Trickster? Okay, and the last thing I want to talk about is the third-party services. We use Pingdom to provide us external monitoring via a service that we pay for. We have a mixture of automated configurations that are generated and stored inside of our runbooks repository, and we also have a couple of checks that are created by hand.

We have a CI job that publishes any checks we've got automated. Status.io is primarily just for advertising our state over the course of time; our CMOCs are responsible for setting that status, and there's a link about that here. PagerDuty is simply responsible for paging us, so Alertmanager will send alerts there if they're of a high enough priority.

We have the ability to send pages to people if we need to gather their attention, but only, you know, if they've got an account in PagerDuty. And then Slack is where a bunch of alerts get sent to. Again, this is all configured inside of our runbooks repository, and we have a lot of channels, probably, I think, 20 or so alert channels. I highlighted the ones that I feel are the most important. Alerts is the bucket where, like, all the alerts just go.

The general channel appears to be specific to a lot of the operational things that the SREs are starting to pay attention to; a lot of this work was done by Andrew. It also uses some sort of plugin, so we get fancy charts and little buttons that allow us to do fancy things inside of it.

I think in general that's what we want to migrate to in the future, but until then, because it's such a large thing to accomplish, the alerts channel still exists. There's an alerts channel specific to staging; CI/CD have their own channel, which I'm actually not a member of; and then our abuse team have their own Slack channel as well.

This is the major comment you made, but I know that you made a couple of other comments around this, so please organize it in such a way, per topic might be okay, basically, where you put this again in the epic description and then start collecting the issues that we have, linking them to this epic. You also have a couple of open questions there; I want the open questions to also link to issues.

Runbooks: collect the runbooks that we have, collect the handbook items that we have related to this, and put it all in one place, and create issues that are going to cover groups of things. You don't have to go very granular on this, but, for example, I don't know, describe Trickster, why, what, how; describe Thanos, what, why, and how; and so on. This is specifically for documentation, so for documentation you're going to apply a documentation label on each issue, and you can apply, I don't know, find the label.

If we don't have one already, create a new one related to monitoring and alerting, maybe the two together, or yeah, two labels, monitoring and alerting, and put them on those issues. And this is where we are going to stop, we meaning Delivery: uploading the recording, giving the epic to whoever is going to be looking into this. If that is us at some point, then okay, but for now just have it written down somewhere, and then we continue with the work you started.

Since you originally started looking into this, it should take you about four hours total to complete, in my humble opinion: linking all these things together, uploading to YouTube, and all of those things, so basically the rest of today. You should take that to clean this up, and then you're going to communicate this to the rest of the organization.

For now, what you can also do, for curious people, is cross-link it to development as well, because I know that we have some people wanting to know more about this. And finally, we do have our monitoring group, so they need to know about this as well, so go to wherever the monitoring group is in Slack and also cross-link it to them.

Then again, thinking about it, I think that's not necessarily our task, so yeah. Those are the couple of things I want to see as follow-ups, and then obviously, once you complete all of those, in this issue link to the recording, link the epic, link to, well, you don't have to link to it, just say that you placed a certain message: copy/paste the message inside the issue, and then comment and close, and that's it.