Description
A whistle-stop tour of the GitLab monitoring capabilities, specifically around Kubernetes environments and metrics (Prometheus).
Incubation: APM SEG - https://about.gitlab.com/handbook/engineering/incubation/monitor-apm/
Kubernetes Observability Stack project - https://gitlab.com/gitlab-org/incubation-engineering/apm/k8s-o11y-demo
This leads on to the sort of work we'll be doing over the next few months to create a SaaS-first, agent-based solution to this problem.
So, as you can see here, I've set up a Kubernetes cluster with minikube, and I've called it "minikube". This is linked to a test server, as you can see here.
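For anyone following along, a multi-node cluster like this can be created with a single minikube command; the profile name "minikube" is the default, and the node count matches what's shown here:

```shell
# Start a three-node cluster under the default "minikube" profile
minikube start --nodes 3 -p minikube

# Confirm all three nodes are up
kubectl get nodes
```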
If you look at my k9s instance, this is the little minikube cluster I've got on my machine: three nodes, ready to go. After a bit of configuration, which I'll show you, it's connected to this local GitLab instance that I'm running here.
So you can see some information about the cluster. I've managed to get it to connect to a local address, changing some of the background settings to do that, and the CA certificate and things like that. It's all local to my machine.
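For reference, the connection details GitLab asks for (API URL and CA certificate) can be pulled out of the cluster roughly like this; the secret name below is illustrative, so list yours with `kubectl get secrets` first:

```shell
# API server URL for the cluster
kubectl cluster-info

# Decode the CA certificate from a service-account token secret
# (replace the secret name with one that exists in your cluster)
kubectl get secret default-token-abcde \
  -o jsonpath='{.data.ca\.crt}' | base64 --decode
```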
So there's no risk in me showing you this, right? So there's the minikube instance. What I've done is use an observability repository that I've set up locally, and I will add links to that repository in the YouTube video description so you can see how it works. I've deployed it into my minikube instance.
As well as that, I've installed Fluent Bit and Elasticsearch to perform the logging operations there, and Jaeger as well, sat on top of Elasticsearch, to do trace monitoring. This is all running in the gitlab-managed-apps namespace, which is a requirement to get this working with GitLab itself, and there are specific requirements around service names and things like that.
So I'll switch back to GitLab here, and we can click on the Health component here, and it loads. It's making a request to my cluster there. There was an error, "getting dashboard validation warnings"; I've no idea what that means, not a clue. But we can see some basic information about the health of the cluster: overall CPU usage and memory usage. We can narrow that down a little bit and look at that, and that's just a top-level bit of information.
If you look at the integrations here, I've just turned on Prometheus for the moment; the Elastic Stack and logging I'll cover in another video. You can see more information about the way you can actually get Prometheus integrated there. Like I say, you have to have it in a particular namespace, with a particular service name, running on port 80, which isn't in the docs. I must create an issue for that.
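To make that namespace/service-name/port requirement concrete, this is a sketch of the kind of Service GitLab appears to expect; the exact service name and labels here are assumptions based on what worked for me, not documented behaviour:

```yaml
apiVersion: v1
kind: Service
metadata:
  # Service name GitLab appears to look for (assumption)
  name: prometheus-prometheus-server
  namespace: gitlab-managed-apps   # required namespace
  labels:
    app: prometheus
spec:
  selector:
    app: prometheus
  ports:
    - port: 80          # GitLab queries on port 80
      targetPort: 9090  # default Prometheus container port
```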
What I can do now is just show you the monitoring space that's now set up.
Now, on this page, I've noticed that there are obviously a lot of integrations here that are irrelevant to my cluster, but they're looking for specific things in the Prometheus instance. This GitLab instance is connecting through via the API server of this minikube cluster that I've got.
A
I
can't
seem
to
get
system
metrics
to
display
here.
I'm
not
sure
why.
But
if
I
change
this
to
k,
it's
pod
health
at
the
top.
A
When
it
gets
to
it-
and
I
can
pick
a
pod
out
here-
so
it
will
actually
query
into
the
cluster
and
get
me
the
pods.
So
if
we
have
a
look
at
say,
the
elastic
search
master
that
I've
got
running
in
that
cluster,
give
it
a
moment
to
catch
up,
and
there
we
go.
We've
got
some
cpu
usage.
There
see
we've
had
a
little
spike
there.
This
is
all
sort
of
caused
by
millicourse,
but
container
memory
metrics.
A
As
expected,
fairly
high
memory
usage
for
elastic
there,
800
meg
and
you've
got
some
network
spikes
there.
You
can
see,
allegedly
not
using
any
disk,
which
is
highly
unlikely.
So
you
can
see
there.
We
can
get
some
basic
metrics
here
and
we
can
get
anything
from
the
you
know
get
these
from
the
pods
there
and
it
is
possible
to
customize
this.
So
you
can
create
new
dashboards
and
those
use.
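Panels like these come down to PromQL queries over the cAdvisor metrics that Prometheus scrapes from the kubelets; a container memory panel for that pod could be built from something along these lines (the pod name pattern is just illustrative):

```
sum by (container) (
  container_memory_working_set_bytes{
    namespace="gitlab-managed-apps",
    pod=~"elasticsearch-master.*"
  }
)
```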
So there we go; you can explore around and see what's going on there. So it's fairly limited. I have seen that you should also be able to view the logs here.
I know there is some implication that you need Elasticsearch, but I think it should be able to go in there and pull the logs out, and it doesn't seem to be able to at the moment. I'm not sure why that is. It doesn't seem to be able to pull the pod information, even though we saw that working in the previous step, so I'm not quite sure what's going on there.
The other aspects of Monitor in the sidebar, you can see, are tracing, and all that allows you to do is essentially add a Jaeger URL here to link to, and other areas like error tracking, alerts and incidents, which I'm not going to go into, but they are areas this would look to feed into in the future to complete that DevOps cycle.
A
If
we
look
at
the
monitor
settings
generally
here,
you
can
see
that
you
can
actually
configure
dashboards,
so
you
can
link
to
an
external
dashboard.
Likewise
tracing
error,
tracking
setting
up
alerts
a
grafana
instance
so
that
you
can
embed
who
could
find
the
dashboards
in
that
metrics
area
there.
So
so
there
is
a
certain
amount
of
flexibility
built
in
there.
You
don't
have
to
use
this
built-in
solution,
but
it
does
fit
in
with
the
environments
quite
well.
What I will do now is look at the cluster that we've set up and what we've got in it. As I said, this minikube cluster has this chart, and again, I'll put a link to this repository, this project in GitLab, under the video. This sets up the Prometheus Operator, Elasticsearch, Kibana, Fluent Bit to feed into Elasticsearch, and Jaeger for the tracing, and I'll just give you a quick demonstration of that.
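In case it helps, installing those components by hand would look roughly like this; the chart repositories below are the common upstream ones, and the release names are my assumptions rather than what the demo repository's setup necessarily uses:

```shell
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo add elastic https://helm.elastic.co
helm repo add fluent https://fluent.github.io/helm-charts
helm repo add jaegertracing https://jaegertracing.github.io/helm-charts
helm repo update

# Everything goes into the namespace GitLab expects
helm install prometheus prometheus-community/kube-prometheus-stack \
  -n gitlab-managed-apps --create-namespace
helm install elasticsearch elastic/elasticsearch -n gitlab-managed-apps
helm install kibana elastic/kibana -n gitlab-managed-apps
helm install fluent-bit fluent/fluent-bit -n gitlab-managed-apps
helm install jaeger jaegertracing/jaeger -n gitlab-managed-apps
```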
A
So
this
is
just
setting
up
all
these
sort
of
common
observability
tools
for
kubernetes,
specifically,
although
you
can
use
these
for
any
environment
really
and
I've
set
these
up
in
the
cluster
under
node
ports
here,
so
we
can
easily
look
at
them,
so
we
have
a
grafana
instance.
Here
we
have
a
jaeger
instance
and
we
have
a
cabana
instance
that
we
can
look
at.
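With everything exposed as NodePorts, minikube can list the URLs for you (assuming the services live in the gitlab-managed-apps namespace as described):

```shell
# Show the NodePort URLs assigned to each service
minikube service list -n gitlab-managed-apps
```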
So if we quickly look at the Grafana instance, you can see that by default with the Prometheus Operator, the kube-prometheus-stack I think it's called, you get a lot of dashboards and functionality just out of the box, straight away, and really some quite decent stuff in here.
Of course, you can customise all this and write your own queries, but you can see things like the USE method (utilisation, saturation, errors) performance-monitoring dashboards for this cluster here, looking at memory saturation, CPU usage and things like that.
A
You
can
look
at,
for
example,
slos
for
the
actual
api
server
for
the
cluster
itself.
You
can
see
this
sort
of
basic
error,
budget
implementation
up.
There
read
availability,
right,
availability,
that's
very
useful
stuff
and
of
course
you
can
it
being
funny.
You
can
dig
down
into
any
of
these,
it's
very
responsive
and,
of
course
we
can
pick
up
any
part
here.
So
we
can
look
in
that
get
lab
manager,
that's
name
space
that
we
set
up
there.
A
Let's
again,
elastic
is
always
a
good
example
of
these.
You
can
see
a
bit
of
cpu
usage
there
you
can
see
the
memory
usage
is
relatively
high.
You
know
sat
between
a
sort
of
min
and
max
that
we've
set
up
there.
You
can
see
the
the
limits,
requests
and
limits
for
those
pods
that
have
been
set
up,
which
is
really
useful.
So we can look at, you know, that tracing interface. The demo doesn't give you particularly interesting traces, but you can see there we hit that data-source reverse proxy, and we can look at the tags and information there.
So it's quite a powerful observability stack that you can set up really easily from this repository, essentially just by doing a helm install, and there's a Makefile in there that you can look at to easily set that up out of the box with minimal effort. Of course, it's not set up to scale.
A
It's
not
going
to
handle
a
large
number
of
requests.
It's
just
set
up
for
experimentation
anyway.
I
thought
that
might
be
interesting
and
that
will
do
for
the
first
video
and
I'll
see
you
next.