Description
A presentation I prepared at the request of the Self-managed Scalability Working Group.
https://docs.google.com/presentation/d/1xx8sOoWsRvw8_wHqBKbujrWQwmK-meDQrSs4zGNY53I/edit?usp=sharing
Issue: https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/7217
I'm going to start off by explaining why I think this is useful with a true story: June was a bad month for GitLab.com. We were hit by a string of separate but related issues which had a severe impact on the performance of the site. During the month of June we only managed to meet our service level objectives about 89% of the time. Some of the issues were related to Postgres, some to our Redis cache instance, and some to our Redis persistence instance, and I've included links to some of those incidents here.
Broadly speaking, most of the incidents we experience on GitLab.com can be broken down into one of three categories. From most prevalent to least, they are: application changes. This is when a change is deployed to the application that works well on a developer's machine and passes the QA process, but that we are unable to scale sufficiently to handle GitLab.com traffic. This is by far the most prevalent class of degradation that we see on GitLab.com.
The third class is infrastructure changes. This is when a change is made to the underlying infrastructure and it doesn't scale. So we might make a change to Postgres or Redis and then find out that, due to some unforeseen circumstance, it doesn't work as we expected. And, of course, there's a long tail of other issues, including cloud-provider-related issues and so on, but these are the three main categories of issue that we've had to deal with up to now on GitLab.com.
In order to monitor incidents, the two key metrics that we observe for each service are, firstly, error ratios, which indicate the percentage of requests that result in an error. So an error ratio of 1% indicates that one request out of every 100 made to the service will fail with some sort of server-side error. The second key metric is the apdex score. This is a measure of the percentage of requests that complete within a satisfactory amount of time.
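As a rough illustration of these two metrics, here is a minimal Python sketch, with invented function names and an assumed one-second "satisfactory" latency threshold (this is not the actual recording-rule implementation), that computes an error ratio and an apdex-style score from a list of requests:

```python
from dataclasses import dataclass

@dataclass
class Request:
    duration_seconds: float  # how long the request took
    failed: bool             # True if it ended in a server-side error

def error_ratio(requests):
    """Fraction of requests that resulted in an error."""
    return sum(r.failed for r in requests) / len(requests)

def apdex(requests, satisfactory_seconds=1.0):
    """Fraction of requests that completed within a satisfactory time."""
    return sum(r.duration_seconds <= satisfactory_seconds for r in requests) / len(requests)

# An error ratio of 0.01 would mean 1 request in every 100 fails.
sample = [Request(0.2, False), Request(3.5, False), Request(0.4, True)]
print(error_ratio(sample))  # ~0.33
print(apdex(sample))        # ~0.67
```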
The June incidents I mentioned previously didn't fit into these categories. They didn't appear to be related to a single change, or to a small set of changes, in the application. No infrastructure changes had taken place and, for the most part, user action did not appear to be part of the problem in each of these instances.
What had happened was that we had reached the capacity limit on a particular resource. Once these capacity limits were reached, the system started performing badly. These resources included PgBouncer connection pools, Redis CPU and Unicorn workers. When we're utilizing a resource at or above capacity, we call this saturation. There are many ways to reach saturation, but you only need to reach saturation in one area of the application in order to see degradation.
Now the problem is that the two key metrics that we rely on for determining the health of our services, the apdex and error ratios, didn't gradually get worse as we approached saturation. This is the apdex measurement for GitLab's web service in the second week of June; the dotted red line is our threshold SLO.
So how do we avoid this happening again in future? By introducing a third key metric for each service, which we call saturation. Saturation is pretty simple to understand: for any given resource, what percentage of the maximum utilization for that resource are we currently consuming? For each service in the application we can start tracking a set of appropriate saturation metrics. Some of the saturation metrics we track include workers, disk utilization, several different CPU metrics, database connection pools and others, and many more are to follow.
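The calculation behind each saturation metric is just current utilization divided by the resource's maximum capacity. A minimal sketch of that idea, using made-up numbers rather than our real limits:

```python
def saturation(current_utilization, maximum_capacity):
    """What fraction of the resource's maximum are we currently consuming?"""
    return current_utilization / maximum_capacity

# Hypothetical examples: a database connection pool and a worker fleet.
print(saturation(current_utilization=95, maximum_capacity=100))   # 0.95 -> 95% saturated
print(saturation(current_utilization=240, maximum_capacity=600))  # 0.40 -> plenty of headroom
```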
Now that we have these metrics, we can start alerting on any resource exceeding a value of, say, 90% for a period of, say, 10 minutes. This is already a big improvement over the previous alerting rules that we were using, which tended to be quite piecemeal. So while we had some CPU alerts, we didn't, for example, have any alerts covering single-threaded CPU utilization on Redis, or single-node CPU utilization on Sidekiq.
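Whatever form the rule takes in the monitoring stack, the logic amounts to something like the sketch below: fire only if every sample over the last ten minutes is above the 90% threshold (the names and the one-minute sample interval are assumptions for illustration).

```python
def should_alert(samples, threshold=0.90, duration_minutes=10, interval_minutes=1):
    """samples: saturation values ordered oldest to newest, one per interval.

    Returns True if the resource has stayed above `threshold` for the
    whole of the last `duration_minutes`.
    """
    needed = duration_minutes // interval_minutes
    if len(samples) < needed:
        return False
    return all(s > threshold for s in samples[-needed:])

# A brief spike does not page anyone; sustained saturation does.
print(should_alert([0.50, 0.95, 0.60] + [0.70] * 10))  # False
print(should_alert([0.92] * 12))                       # True
```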
With these alerts, we now cover those and many other cases that weren't previously covered. So once we've got alerting sorted, so that we know about immediate problems, the next step is to build out a long-term forecasting model. What we realized during the troubles in June was that once we hit capacity, it was already too late. Many of the fixes for these resource utilization problems required application changes to inefficient, unscalable features, and those changes might take, at a minimum, several days to isolate, fix and test. So it's crucial that we're able to forecast saturation problems well in advance of the saturation event, so that we have time to improve the application or scale our infrastructure. So how do we do this? A naive approach would be to linearly interpolate our resource utilization to predict when it's going to reach saturation.
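The naive forecast is just a straight-line fit through recent utilization samples, extrapolated forward. A rough sketch using a least-squares fit (the sample data is invented):

```python
import numpy as np

def forecast(times_days, values, days_ahead=30):
    """Fit a straight line to (time, value) samples and read it off in the future."""
    slope, intercept = np.polyfit(times_days, values, deg=1)
    return slope * (times_days[-1] + days_ahead) + intercept

# A hypothetical week of daily average CPU saturation readings.
days = np.arange(7)
cpu = np.array([0.60, 0.61, 0.63, 0.62, 0.64, 0.65, 0.66])
print(forecast(days, cpu))  # predicted saturation one month out
```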
Unfortunately, most resources are too spiky for that. Take, for example, the single-core utilization on our Redis fleet. This chart is taken from a period of the June troubles, so CPU was routinely spiking up to 100%, which is a very bad thing and was one of the reasons we were seeing dismal web performance. The dotted line, however, shows the rolling weekly average for this metric. This graph is almost flat: if we were to linearly interpolate on it, it would indicate that in one month's time average CPU would be at about 76%, which doesn't sound that bad.
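To see why interpolating on an average is misleading here, consider a toy series (invented numbers, not the real Redis data) that regularly spikes to 100% while its rolling weekly mean barely moves:

```python
import numpy as np

# Fourteen days of daily peak CPU saturation: frequent spikes to 100%
# over a baseline of roughly 60% (illustrative numbers only).
daily = np.array([0.60, 1.00, 0.55, 0.62, 1.00, 0.58, 0.61,
                  0.63, 1.00, 0.60, 0.64, 1.00, 0.59, 0.62])

# The rolling weekly average stays nearly flat even though the service
# is hitting its ceiling every few days.
weekly_means = [daily[i:i + 7].mean() for i in range(len(daily) - 6)]
print([round(m, 2) for m in weekly_means])
```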
So instead of interpolating directly on the value of the metric, a better approach might be to divide the resource utilization up into three zones: green, warning and saturated. This is based on the concept of apdex scores, but applied to saturation, which is why I sometimes refer to it as saturation apdex, or a satdex score. When a saturation metric is below, say, 80%, we say it's in the green zone; above 80% it enters the warning zone; and when it exceeds 90% it enters the saturation zone.
We then score the metric for the amount of time it spends in each zone: in the saturation zone it scores zero, in the warning zone 50%, and in the green zone it gets a score of 100%. Using these scores we can calculate the average score for the resource over time. This is an advantage over direct linear interpolation, in that the saturation score will highlight dangerous spikes in resource utilization even if the average value is barely affected.
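A minimal sketch of that scoring scheme, using the 80% and 90% boundaries described above (everything else here is illustrative):

```python
def satdex_sample_score(saturation, warning=0.80, saturated=0.90):
    """Score one saturation sample: green = 1.0, warning = 0.5, saturated = 0.0."""
    if saturation >= saturated:
        return 0.0
    if saturation >= warning:
        return 0.5
    return 1.0

def satdex(samples):
    """Average the per-sample scores over a window of time."""
    return sum(satdex_sample_score(s) for s in samples) / len(samples)

# A resource that is mostly healthy but spikes into saturation:
# the mean utilization looks moderate, but the satdex score drops sharply.
spiky = [0.55, 0.58, 1.00, 0.57, 1.00, 0.56]
print(satdex(spiky))             # ~0.67 -- the spikes pull the score down
print(sum(spiky) / len(spiky))   # ~0.71 -- the average alone looks unremarkable
```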
So going back to the previous metric for Redis CPU, let's look at how its satdex score performs over the same period during the troubles in June. The top graph shows the single-core CPU utilization, and the bottom graph shows the satdex score for Redis single-core CPU. As you can see, the satdex metric is trending down as CPU spikes keep occurring, and by using linear interpolation on the satdex value, interpolating one month into the future, we can forecast potential resource saturation issues.
So this is the platform triage page. It shows the health of each service in the application: the key metrics for each service, as well as the saturation. At the bottom there is a report which gives us the different services and their current state, and the first graph gives us the resources that are currently at risk of saturation.
So here you can see it's telling us that the disk space resource on the Patroni service is nearing its capacity limits, and likewise on the Postgres archive and Postgres delayed services the same disk space resource is nearing its limits. Interpolating out one month into the future, we get the same report on the same resources. This is actually something that is currently being addressed: the disks on these machines will be increased soon, so it is being worked on.
Here we see the same report again but, instead of seeing it for all services together, we only see it for this individual service and, as you can see here, saturation is at 90%. If we open up the component metrics, we get to see why the saturation metric is at 90%. What we do with saturation is aggregate it on the maximum, because if we used, say, the average, you might have three resources at very low utilization and one that's near the top, but you only need one resource to be saturated for the service to be saturated.
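A quick sketch of that aggregation choice, comparing max against mean for a hypothetical set of per-resource saturation values:

```python
def service_saturation(resource_saturation):
    """A service is only as healthy as its most saturated resource,
    so the per-service value is aggregated with max rather than mean."""
    return max(resource_saturation.values())

# Hypothetical readings: three resources nearly idle, one nearly full.
resources = {"cpu": 0.10, "memory": 0.20, "connection_pool": 0.15, "disk_space": 0.92}
print(service_saturation(resources))              # 0.92 -> clearly at risk
print(sum(resources.values()) / len(resources))   # ~0.34 -> averaging would hide the problem
```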
So what this gives us is the current saturation score for each resource in the Patroni service. Here you can see this service has a disk space resource, a single-node CPU resource, memory and overall CPU, and what it's telling us is that in one month's time the disk space will still be near saturation, but everything else will be fine. Then, down below that, we get a long-term saturation trend, which shows us what each resource on the Patroni service has been doing over the last three weeks. We can see that CPU is gradually going up but is still very low, and that disk space has exceeded 90% but is growing quite slowly. Then we get the long-term satdex for this particular service, which isn't very exciting: most of the resources are at 100% on the satdex score, except for the disk utilization, which is at 50%. But if we switch over to the Redis service, we get a more interesting graph, so let's go there.
What we have here, which is quite interesting, is that this graph shows us the satdex score for the Redis single-threaded CPU. You can see that when we started recording this metric it was very low, near 50%, which we would consider to be a bad score. Then, as we managed to make application changes that took some of the load off Redis, we saw the score trend up, and recently we actually split our Redis cluster into two, and you can see that the score is now trending up towards a hundred percent, which is telling us that Redis is actually doing very well now.
Last week, when I was looking at this, what it told us was that the single-threaded CPU was at capacity, and the one-month saturation forecast also told us that Redis CPU was going to remain at capacity. Because of the changes that we made this weekend, this has now improved. So hopefully, next time, these graphs will tell us about the problem before we see a saturation event.