From YouTube: Monitoring the monitor - David Leadbeater, G-Research
Description
Don’t miss out! Join us at our upcoming event: KubeCon + CloudNativeCon North America 2021 in Los Angeles, CA from October 12-15. Learn more at https://kubecon.io The conference features presentations from developers and end users of Kubernetes, Prometheus, Envoy, and all of the other CNCF-hosted projects.
Monitoring the monitor - David Leadbeater, G-Research
There are various ways to make sure Prometheus is working. We’ll cover these from cloud based services to Prometheus instances monitoring each other. Then we’ll explain why we developed a component to help with this.
So that's quite simple, but that's not enough to actually monitor Prometheus itself, because, well, it's not a cartoon: it's not that simple, and you can't monitor yourself with yourself. So, going back to basics again, the architecture of a normal Prometheus setup is something like this: we have Prometheus talking to an Alertmanager, which sends alerts to some kind of alert receiver.
Obviously, it's a bit difficult to connect a phone to a server in the cloud, so the way people often deal with this is, rather than alerting when something is down, to have a particular alert that exists as a heartbeat and is always expected to fire. That alert is always sent to the receiver, and the receiver, somewhere on the internet, knows that it should expect to receive it; if it doesn't, then it raises an alert. It essentially inverts alerting: if the alert isn't there, start raising the alarm.
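In Prometheus terms, the heartbeat is usually just an alert rule whose expression is always true. A minimal sketch of such a rule follows; the rule name and annotation text are illustrative, not taken from the talk:

```yaml
# "Dead man's switch" style heartbeat rule (illustrative sketch).
# vector(1) is always true, so this alert fires forever; silence from it
# means Prometheus, or the path from Prometheus to the receiver, is broken.
groups:
  - name: heartbeat
    rules:
      - alert: Heartbeat
        expr: vector(1)
        labels:
          severity: heartbeat
        annotations:
          summary: "Always-firing heartbeat used to monitor the monitoring"
```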
So there are many ways of doing this. Healthchecks.io provides a service that does this; it's written in Python, so you can run it yourself, or there's a cloud-hosted version of it. Dead Man's Snitch integrates with PagerDuty and is cloud hosted. Karma, which is a web-based UI for Alertmanager, can also display an alert when a particular alert isn't present; that obviously doesn't page anyone, but it can show on a screen or somewhere that there's a problem, which, if you have a NOC or something, could potentially be useful.
So let's look at how we actually set up Alertmanager to talk to a heartbeat receiver. In the Alertmanager config we have a route that matches a label of severity: heartbeat and then sends that to a particular heartbeat receiver. You'll see in this example that the URL has an ID in it, which would be team specific, or specific to each Prometheus instance that is monitored by the receiver at the other end. Unfortunately, that means this Alertmanager file needs to contain an ID for every single thing that is monitored.
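A minimal sketch of what such an Alertmanager configuration could look like; the receiver name, URL and ID placeholder are illustrative assumptions, not the exact example from the talk:

```yaml
route:
  receiver: default
  routes:
    # Heartbeat alerts are split off from normal routing and pushed to the
    # external heartbeat service.
    - receiver: heartbeat
      matchers:
        - severity = heartbeat

receivers:
  - name: default
    # ... the normal paging integrations live here ...
  - name: heartbeat
    webhook_configs:
      # The ID in the URL identifies which Prometheus instance this
      # heartbeat belongs to, so one entry per monitored instance has to
      # be kept in sync here.
      - url: https://heartbeat.example.com/ping/<team-or-instance-id>
```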
Obviously, that's not too difficult: it can be templated, or handled with various other approaches, but it still means this is yet another thing to configure, and that configuration needs to be managed and so on. It's yet another moving part, essentially. So instead, with prommsd, we have the same alert that we had before.
The activation is the activation time, there are some labels to override, and then the Alertmanagers to send the alert to. That last part is, unfortunately, the one thing that prommsd compromises on: it can't support dynamic Alertmanager discovery, because the Alertmanagers have to be actually specified in the alert itself, although potentially we could fix that with some changes elsewhere in the future.
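Putting that together, the heartbeat rule carries its own prommsd settings as annotations. This is only a sketch: the annotation keys and example values below are assumptions for illustration, and the real keys come from the prommsd example configs.

```yaml
groups:
  - name: heartbeat
    rules:
      - alert: ExpectedAlertHeartbeat
        expr: vector(1)
        for: 30s               # don't count a flapping Prometheus as healthy
        labels:
          severity: heartbeat
        annotations:
          # Illustrative annotation keys only; the idea is that everything
          # prommsd needs travels inside the alert itself:
          activation: 1m                    # when prommsd should expect the next heartbeat
          labels: 'severity="critical"'     # labels to override on the alert prommsd raises
          alertmanagers: http://alertmanager.example.com:9093   # where prommsd sends that alert
```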
But this does mean that all the configuration for a team's alert is actually contained in the alert itself, and nothing special is needed for heartbeat events. Obviously, teams would probably still have team-specific routing in the central Alertmanager, but they don't need separate configuration for heartbeats, which might otherwise get forgotten, because it's not something that's touched all the time.
So what then happens is that this raises two alerts, in this case one from each of our availability pair, and those go to prommsd.

So let's actually see how this works. Over here I have some of the example configs that come with prommsd; it's just a configs directory, and I'm running four terminals here. First of all, I'm just running netcat listening on a random port: this is going to be the normal alert receiver, so we'll just see the HTTP requests that are sent to it. Inside prommsd, this configs directory has an Alertmanager config, an alert and a Prometheus config. So what I'm going to do is run Alertmanager using that pre-provided config, and also run Prometheus, so Prometheus will then be running. You'll notice I haven't yet started prommsd, so I also just need to do that.
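For reference, a demo Prometheus config of this shape would look roughly like the following; the file names and ports are assumptions, not the exact files shipped with prommsd:

```yaml
global:
  scrape_interval: 15s
  evaluation_interval: 15s      # how often the heartbeat rule is evaluated

rule_files:
  - alerts.yml                  # the heartbeat rule shown earlier

alerting:
  alertmanagers:
    - static_configs:
        - targets: ["localhost:9093"]   # the locally started Alertmanager

scrape_configs:
  - job_name: prometheus
    static_configs:
      - targets: ["localhost:9090"]     # Prometheus scraping itself
```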
So we now have Prometheus, Alertmanager and prommsd all running. Let's first of all go to Prometheus here: if we look at the alerts UI, you can now see that this expected-alert heartbeat is active, and we can see all the activation settings I discussed. You'll notice in this case, though, that I've put the activation at one minute. You'll also notice that the alert isn't actually firing yet, because there's a 'for' threshold of 30 seconds, just to make sure the Prometheus instance isn't flapping, so the alert is still pending.
Hopefully I've spoken for long enough... I have, and that alert is now firing. That now means we have an expectation of that heartbeat, and it is firing, so what's happening to it? Well, it's going to Alertmanager, which conveniently I have running here, and we can now see there's an alert for the expected-alert heartbeat over here, with the relevant annotations on it. If we check where that's going, it's going to prommsd, and nothing has gone to our pager. Okay.
So then, if we go over to prommsd over here, we'll see we just have one Prometheus. In this case it's not running in Kubernetes, so there's no namespace or anything; it's just 'prometheus'. Obviously, in a real setup you would have a few more labels there, but for a demo this works. You'll see that it is saying it will activate in a few seconds, and I've actually got this set, I think, to repeat every five seconds.
So if I just sit here reloading this page, you'll see it never actually gets below about 55 seconds. So now let's go to where we're running Prometheus and I'll just kill it. Okay, that was pretty fierce: yes, I've now stopped Prometheus.
So actually that's interesting, because you'll see this alert is still active, and that's because Alertmanager over here still knows about the alert for now. I've actually set the evaluation interval in the Prometheus config to 15 seconds, so if I carry on talking for about four times 15 seconds, we should eventually find that the alert stops being sent. Luckily, this isn't live, so if this fails I'll just edit it.
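The timing here comes from the rule evaluation settings: Prometheus re-sends its firing alerts on each evaluation and stamps them with an expiry a few intervals ahead, so once it dies the already-sent heartbeat lingers in Alertmanager for roughly that long. A sketch of the relevant line of the demo config; the exact expiry behaviour depends on the Prometheus version and resend settings:

```yaml
global:
  # Rules are evaluated, and firing alerts re-sent to Alertmanager, every
  # 15 seconds; after Prometheus is killed, the last alert it sent expires
  # a few intervals later -- hence "about four times 15 seconds" before
  # the heartbeat disappears.
  evaluation_interval: 15s
```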
Okay, so obviously that's a very simple setup, and in reality you'd have a few more components involved, so a full architecture for deploying it might look something like this.
You have three teams running Prometheus instances for their applications, which talk to an Alertmanager cluster. The Alertmanager cluster routes to prommsd, as well as to things in the cloud for other alerting. There's also an infrastructure Prometheus, for example, that, rather than using the prommsd running locally in the cluster, uses something in the cloud, which could be another instance of prommsd running elsewhere, or it could be one of the cloud monitoring services mentioned earlier.
You'll also notice that prommsd doesn't need to depend on anything other than a webhook receiver, which could run on the same machine or, you know, even in the same pod as prommsd in Kubernetes; a sketch of that is below.
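One way to realise that co-location is a two-container pod. Everything in this sketch, including the image names and ports, is an assumption for illustration only:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: prommsd
spec:
  containers:
    # prommsd itself -- image name, port and flags are placeholders.
    - name: prommsd
      image: registry.example.com/prommsd:latest
      ports:
        - containerPort: 8080
    # The webhook/alert receiver that prommsd notifies when a heartbeat
    # goes missing, co-located so prommsd has no external runtime
    # dependency beyond this pod.
    - name: alert-receiver
      image: registry.example.com/alert-receiver:latest
      ports:
        - containerPort: 9000
```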
And, as mentioned, the infrastructure Prometheus has separate monitoring that is potentially in a different cluster or elsewhere, so the infrastructure team can be notified if everything is broken.
Application teams can be notified by an explicit alert if their Prometheus is broken, but if they're actually running in multiple clusters, maybe they don't need to be told that their application is down when it's really an infrastructure problem, because the probes don't fail; it means you don't get a critical "everything is broken" alert when actually it's not all broken. So there's flexibility in how you set this up, which means you can make sure the alerts are actually actionable, and so on.