GitLab Scalability Team, 18 Aug 2020

From YouTube: Demo of new improved deep-linked Grafana dashboards for GitLab.com monitoring

Description

A quick demo of https://gitlab.com/gitlab-com/runbooks/-/merge_requests/2684, which allows SREs to quickly navigate between different observability systems, such as Kibana, BigQuery, Stackdriver and Sentry. The aim is to reduce the MTTD for incidents, helping to drive up the availability of GitLab.com.

A

Hey, it's Andrew. I just want to give you a quick demo of a new feature that we've added to our Grafana dashboards, specifically to the service overview dashboards. I'll try to make this as quick as I can, but I think this is going to be really useful to a lot of people.

A

So I'm going to choose the web service overview dashboard. As you know, the top line is our key metrics for the service, aggregated up to the service level. Below that, each service is broken down into components. In the case of the web service, the components are the Puma component and the Workhorse component, and these two work together to serve requests for the web service.

A

And for these we again have Apdex, error rates and request rates, so we have component-level metrics for each component. And this is the new panel that we've added: each component gets one of these panels. This first row is all about the Puma component, and the second row is all about the Workhorse component, and you can see that, because they're different components, they each have their own links.

A

What this panel gives us is a list of additional resources that we can use in diagnosing a problem. Hopefully this will mean that the mean time to diagnose problems will be reduced, because we don't have to spend five minutes configuring Kibana to give us a specific graph.
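
As a rough illustration of the idea described above, a per-component list of links could be modelled like this. This is only a sketch: the `ToolLink` class, the `LINK_TEMPLATES` table, the `links_for_component` function and every URL in it are hypothetical placeholders, not the runbooks implementation shown in the demo.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolLink:
    """One entry in a component's 'additional resources' panel."""
    title: str
    url: str


# Hypothetical URL templates per component; the hostnames are placeholders.
LINK_TEMPLATES = {
    "puma": [
        ("Sentry: exceptions", "https://sentry.example.invalid/{service}"),
        ("Kibana: Rails log", "https://kibana.example.invalid/rails?type={service}&stage={stage}"),
    ],
    "workhorse": [
        ("Kibana: Workhorse log", "https://kibana.example.invalid/workhorse?type={service}&stage={stage}"),
        ("Stackdriver: continuous profiling", "https://profiler.example.invalid/{service}-workhorse"),
    ],
}


def links_for_component(service, component, stage="main"):
    """Render the link templates of one component for a service and stage."""
    templates = LINK_TEMPLATES.get(component, [])
    return [
        ToolLink(title, template.format(service=service, stage=stage))
        for title, template in templates
    ]


if __name__ == "__main__":
    for link in links_for_component("web", "puma"):
        print(f"{link.title}: {link.url}")
```

Keeping the templates in one table means each component row on a dashboard can get its own set of links without any per-dashboard configuration.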

A

So let me give you a demo. Most services have got a Sentry link associated with them, so if I clicked on that, it would take me to the right place in Sentry to show me the exceptions for the service. But probably a more interesting one is Kibana. We have various logs: we have slow logs, we have failed requests and we have just the normal logs. Those aren't as exciting as the visualizations. Obviously, during an incident, one of the things that takes up a lot of time is trying to build a visualization.

A

So we can dig into some high-cardinality data like IP address or username. What this does is automatically generate pre-canned visualizations in Kibana: when you click on a link, it takes you there. So the first one, remember, is for the web service that we're looking at, and we're looking at the Puma component on the main stage. Let me click on it now.

A

Hopefully this will give me a graph of all the requests coming to the web service on the main stage. It's the Rails log, which is the log associated with the Puma component. So that's already saved us a whole bunch of time. And then, if we want to dig into further detail, we can break this down very quickly by splitting the series, say, by status code.

A

Is that going to work? There we go. So it's much quicker to get those visualizations out of Kibana.
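
To make the deep-linking idea concrete, here is a minimal sketch of building a pre-filtered log link from a set of filters, with an optional field to split the series by. It assumes a simplified query-string scheme: the `build_log_link` function, its parameter names and the placeholder host are all hypothetical, and Kibana's real URL state encoding is more involved and is not reproduced here.

```python
from urllib.parse import urlencode


def build_log_link(base_url, filters, split_by=None, time_range="now-1h"):
    """Build a deep link that pre-applies filters to a log view.

    The parameter names ("from", "split_by") are illustrative assumptions,
    not a real Kibana URL format.
    """
    # Sort the filters so the same inputs always produce the same URL.
    params = dict(sorted(filters.items()))
    params["from"] = time_range
    if split_by is not None:
        params["split_by"] = split_by
    return f"{base_url}?{urlencode(params)}"


if __name__ == "__main__":
    # Requests to the web service on the main stage, split by status code.
    print(build_log_link(
        "https://logs.example.invalid/rails",
        {"type": "web", "stage": "main"},
        split_by="status",
    ))
```

The point of the sketch is that the service, stage and log type are already known from the dashboard, so the link can carry them and nobody has to rebuild the filter by hand during an incident.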

A

We also have other tools like Google Stackdriver, and this is probably more useful for development teams. But if you come in here, you can see Workhorse has got continuous profiling, so all you need to do is click on this link and it will take you to the right place to go and investigate the continuous profiles for that service.

A

Let's just wait for this to load, and then we can move on from here and continue our investigations.

A

We also have this for other services. If you go to, whoops, the front-end service, which is where we run HAProxy. Now, HAProxy is unusual in that the volume is too high to import into Elasticsearch, so instead of sending it there, we send it directly to BigQuery.

A

But that means that not everyone in the team might necessarily know how to query that information in BigQuery. So for the front-end service, we've broken it down into canary HTTP services, main stage HTTP services and SSH services.

A

So if we want to do some investigation into what's going on somewhere here, all I have to do again is click on that link, and this opens up BigQuery. What's great is it's got a pre-canned query, and if I ran that I would be able to get a whole bunch of information about what HAProxy was doing at this moment in time. I'm not going to run it because there will be personally identifiable information, and I don't want to include that in the video.
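
A pre-canned query of the kind described above could be generated along these lines. This is a sketch under stated assumptions: the `haproxy_query` function, the table name and the `backend` and `timestamp` columns are hypothetical, since the real schema is not shown in the demo.

```python
from datetime import datetime, timedelta, timezone


def haproxy_query(backend, centre, window_minutes=10,
                  table="example-project.haproxy_logs.requests", limit=1000):
    """Return SQL text for log rows around a moment in time.

    The table and column names are illustrative assumptions. The values are
    interpolated directly, which is only acceptable here because they come
    from trusted dashboard configuration rather than from user input.
    """
    centre_utc = centre.astimezone(timezone.utc)
    fmt = "%Y-%m-%d %H:%M:%S+00:00"
    start = (centre_utc - timedelta(minutes=window_minutes)).strftime(fmt)
    end = (centre_utc + timedelta(minutes=window_minutes)).strftime(fmt)
    return (
        "SELECT *\n"
        f"FROM `{table}`\n"
        f"WHERE backend = '{backend}'\n"
        f"  AND timestamp BETWEEN TIMESTAMP('{start}') AND TIMESTAMP('{end}')\n"
        f"LIMIT {limit}"
    )


if __name__ == "__main__":
    print(haproxy_query("https", datetime(2020, 8, 18, 12, 0, tzinfo=timezone.utc)))
```

The `LIMIT` clause keeps an exploratory run small, which matters for a log source described as too high-volume for Elasticsearch.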

A

So hopefully people find this really useful. There are some other tools that we link to, and over time I'd really like to expand this out to include other useful things. If you have any feedback, please give it to me. Of course, merge requests are always welcome as well. Thanks a lot.