Description
Andrew & Marin pair on adding a new saturation metric to the GitLab.com monitoring suite. This video resulted in this merge request: https://gitlab.com/gitlab-com/runbooks/merge_requests/1679.
This change was a corrective action following Production Incident https://gitlab.com/gitlab-com/gl-infra/production/issues/1419, which recurred in https://gitlab.com/gitlab-com/gl-infra/production/issues/1437 exactly one week later.
A: Those jobs already take about 60 to 70 percent of all compute CPU in our API fleet, so when they each started taking five times longer, what ended up happening was that the API fleet got very saturated, and that started causing lots of latency and queuing. That saturation cascaded throughout the application, so we got a bunch of alerts. The first we got was for API latency issues. What would have been really good is if we could have had some forewarning of the NFS latency issues.
A: We've got a capacity planning framework that we use, and it tells us how things grow over time.
B: [inaudible]
A: So yes, we have a resource monitoring framework, and it gives us long-term trends in saturation. What we've done is abstract the idea of saturation, so it could be CPU utilization, it could be the number of Sidekiq workers that we have. It doesn't really matter: it's just a saturation metric. Then we look at it over time, we see how much variance there is in the metric and what direction it's growing, and we use that to predict when we'll hit problems. Obviously the other thing we get out of it is that when any saturation metric reaches a threshold where it's too high, we can alert on it.
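The abstraction being described, where every resource is normalized to a 0-1 ratio, could be sketched as a Prometheus recording rule. This is a minimal illustration, not the actual rules in the runbooks repository; the metric and label names are assumptions:

```yaml
# Hypothetical recording rule: express any resource as a 0-1 saturation
# ratio, where 1.0 means fully saturated. CPU is used here as one example
# of a resource; the same shape works for any other saturation metric.
groups:
  - name: saturation.rules
    rules:
      - record: gitlab_component_saturation:ratio
        labels:
          component: cpu
        expr: >
          avg by (environment, type) (
            1 - rate(node_cpu_seconds_total{mode="idle"}[5m])
          )
```

Because every resource reports through the same normalized series, a single set of threshold alerts and trend queries can cover all of them.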
A: We didn't get any alerts for any of the NFS servers. So the first thing is, we don't really know what it was on the NFS servers that was slowing down. Obviously something was saturated, but it could have been the disks on those machines, it could have been the CPU, and it might even have been the network, although GCP claim that their network is effectively unlimited.
A: So I guess the first thing I'm going to look at is host stats, which is a dashboard that we've got. It's a parameterized dashboard, so you can give it any machine in the fleet and it will give you a bunch of statistics about that machine.
A: So looking at CPU, it never really got beyond 20 to 25 percent, and mostly in user mode; there was nothing extra there. Interestingly, the incident did seem to noticeably affect one particular CPU, and I don't really understand why. Someone could probably give me some interesting details as to why that is, but I'm not sure. The second thing we have is writes.
A: Like this, you can see it's spiking, but it's not tipping over, so reads are probably not saturated. That's kind of interesting, and writes could be the thing that we need to put saturation monitoring on. Then on network we see the same pattern, but obviously, once you hit a bottleneck in one thing, you're going to see that pattern reflected in all the downstream elements, and so on receive we could only receive at a certain rate.
A: Well, you see, this is the thing that I really love about building out the saturation framework: we don't actually know where we are at the moment. We could be skirting along just a few IOPS below the threshold, and this just gently nudged us over to the other side. With saturation limits it's always chaos when you exceed them: we had CI logs failing, we had the API failing, all because of a single project.
B: [inaudible]
A: But here we can see this gauge is at about three hundred and twenty megabytes per second. If we look here, it says 400, but I think what this shows us is that actually the limit is a little bit below that; maybe there's overhead. Obviously, the other thing we should take a look at is adding another metric.
C: [inaudible]
A: It's critical that we don't saturate write throughput, and if we do, we have to move projects off the machine. What I ended up doing for Gitaly, and this is why I know those values off by heart, is that we hard-coded those values into our saturation metrics. For all of our saturation metrics, we say the value is between 0 and 1, where 1 is saturated. So for write throughput, we know what 1, completely full, means: we use the stated number that GCP publishes, which is 400 megabytes per second. Then we'll put a saturation threshold at, I don't know, 70% or so; we'll work out what the right number is. So when we get to where we were today, we get alerts. Also, not that I expect this to happen, because I don't expect us to be using NFS for much longer, but if this number grows over time, we'll get alerting on that trend.
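That approach, hard-coding the documented 400 MB/s per-disk limit as the saturation point and alerting at a percentage of it, could look something like this. A sketch only; the rule and alert names are hypothetical, and the 70% threshold is the provisional value discussed above:

```yaml
# Hypothetical: sustained disk write throughput as a fraction of GCP's
# stated 400 MB/s limit (so 1.0 = fully saturated), with an alert at
# a 70% saturation threshold.
groups:
  - name: disk-write-saturation.rules
    rules:
      - record: disk_sustained_write_throughput:saturation_ratio
        expr: >
          rate(node_disk_written_bytes_total[5m]) / (400 * 1024 * 1024)
      - alert: DiskWriteThroughputSaturated
        expr: disk_sustained_write_throughput:saturation_ratio > 0.7
        for: 5m
```

Expressing the limit as a ratio keeps the alert expression identical across resources; only the denominator encodes the resource-specific capacity.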
C: [inaudible]
A: And this is what I was doing yesterday before I decided to join the call, when the meeting interrupted my work. So there's a file that defines saturation per service, and, just because I'm lazy, I started from the existing values, which were 60,000 for Gitaly.
A: So, let's start off with these two. Interestingly, with Gitaly we never come close to these 60,000 and 30,000 numbers, but again on writes, in the same way that we saw here, if somebody's doing abusive things on a Gitaly server, we actually do see write-side saturation on single machines every so often. So that's interesting. Actually, I was wrong: it is 400 megabytes per second for Gitaly as well.
C: [inaudible]
A: So, and I was already wondering about this, we have alerting on CPU load, but we only do that for machines that have a type. We have a whole bunch of alerting, but there's a bunch of machines that don't have types, and we should fix them all. Basically, every server in our fleet should fall into a specific service and not be like this, which is just some random machine.
A: So what I'm going to do is kind of hack this, because I need a way of recognizing these machines, and I'm probably going to use the fully qualified domain name to select them. But then, once we add types onto these machines, we can fix it in a better way.
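The FQDN-based selection being described might look like the following PromQL selector. The hostname pattern and domain are hypothetical stand-ins, not the real fleet names:

```promql
# Hypothetical: select untyped storage machines by fully qualified domain
# name until Chef applies a proper `type` label to them.
rate(node_disk_written_bytes_total{fqdn=~"share-.*\\.example\\.gitlab\\.net"}[5m])
```

Once a `type` label exists, the regex on `fqdn` can be replaced by an exact `type="..."` matcher, which is both faster and less fragile.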
A: So my Chef knowledge is not amazing, but to me it looks as though the share box doesn't have a role in Chef. Oh, I don't know, it's fine: its role is like a base storage machine, and the base storage machine role is also shared by other hosts. So I think I can do this. As I said, I will create a merge request for them, and then we need a name.
A: What is the difference between that and that? Maybe I take that back. OK, so I was thrown; this is an Azure box. I'm pretty sure that represents the old naming schema that we used in Azure. But this one looks much better, because there we have this label.
A: So the only box that's got it is share. Although, you know what, if I'm thinking about a service name, what is the name of the service? It's super obvious: it's the NFS service for GitLab. So yeah, I'm going to just make a call, and that is "store". So thanks for pushing me to do that, because people ask why you can do this so quickly when it normally takes weeks; it's so easy to do.
B: [inaudible]
A: There's actually another thing that I was working on with Henry where, unfortunately, at the moment these values are hard-coded, and then we need to replicate that hard-coded value in other places, which is horrible, and that's because this file is YAML. One of the other things I'm doing is moving all of this across to jsonnet, so at least for all of the others they live in one place in jsonnet, and eventually this file will be generated from jsonnet as well. But for the moment, you've got to do this.
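The single-source-of-truth idea being described could be sketched in jsonnet like this. The field names are illustrative assumptions, not the actual schema of the runbooks repository:

```jsonnet
// Hypothetical: define the disk limit once and derive every consumer from it,
// instead of repeating the hard-coded 400 MB/s value across YAML files.
local diskLimits = {
  maxWriteThroughputBytesPerSecond: 400 * 1024 * 1024,  // GCP's stated limit
};

{
  // A saturation threshold derived from the single definition above.
  writeThroughputAlertThreshold:
    0.7 * diskLimits.maxWriteThroughputBytesPerSecond,
}
```

Generating the Prometheus rule YAML from a definition like this means a limit change is made in exactly one place.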
A: One of the things is that we've got the sustained disk write throughput and the sustained disk read throughput, and if you have a saturation metric at the moment, the way we do it is there's only one set of thresholds. Well, there are two SLOs, but whatever reports those metrics gets the same thresholds. It's probably easier to explain this with an example.
A: So at the moment we define this as 80% and 90% as the two values, and because we're reporting the same sustained disk write throughput from both the Gitaly and NFS services, both of them will get the 80% and 90% thresholds. What we noticed, though, was that if you take 80% of the 400, well, we know that is 320.
A: What I suspect is that we actually want to know about both of these, because there's no proof that Gitaly will be any different. So what we can do is, for Gitaly, go and look at the graphs and see what we see, because I don't want anyone getting alerts when I don't want them. So let's look at the dashboards.
A: Interesting. What's that? No, that's just disk usage. We can go down and use this new panel called saturation details, which is super useful. It breaks down the saturation and generates charts for each of the resources that you're monitoring in the service. So this is the one, and looking at write throughput over the last two days, you can see that we do occasionally spike up close to the limit.
A: The reason why I'm doing that is that these spikes will often be very, very steep, and if you've got a one-in-three resolution, basically the way you can imagine it is that each data point represents three pixels. When the resolution is one, each data point represents one pixel, so effectively you're loading a data point per pixel. Sometimes, if you've got very spiky data, the lower resolution will smooth it out enough that you actually lose the spikes. But you can see here that this hasn't made a difference: no spikes are lost.
A: So that's fine! So let's bring this down to 70% on read throughput, and then I'm going to do the same on write throughput as well. Let's say that this will be 70 and this will be 80, because we might be being a little bit optimistic about the throughput we actually get on GCP.
C: [inaudible]
A: So at the moment, that's telling us that it's at 8%, so let's go take a look over two days. That's like a really good signal, right? That would have alerted us to this problem; we would have got a saturation alert on that disk. You can see that there are these spikes, but I suspect that if you zoom in, those are brief. Well, maybe we should just zoom in, because I don't want to get a bunch of angry messages about noisy alerts.
A: Yeah, I mean, this is really good. So we know that we wouldn't have got a spurious alert, but I'm actually curious what happened on Tuesday. I wouldn't be surprised if what happened then was actually a precursor to what we saw yesterday.
B: [inaudible]
A: I'd really like to be able to get this directly from an exporter. At the moment, if we could get this directly from node_exporter or some sort of TCP exporter, we could do it literally across entire fleets, so we wouldn't have to have these specific cases. We could just say: when any disk on GitLab.com is saturated, give us an alert. But because at the moment there's this expense of maintaining all of these things, we've just put them in the most critical places.
A: All right, OK, so here's the saturation alert, and basically this is saying a single node's CPU has exceeded its capacity. What that means is that there's a single machine, out of the forty-odd Gitaly machines, that's basically running at like 95% CPU, and we should figure out why, because it's not good. So when I click on this... sorry, for how long? In that case, the trigger is five minutes.
B: [inaudible]
A: It's always useful to look at, because you can kind of see how long it lasted. In this case, when you click through from here, it gives you this graph. It's very important to notice that the time frame runs from six hours before the alert until when it notified, so this isn't now. Some people look at this and think it's now; it's not. So let's just change this to a custom time frame.
A: There you can see when it was firing. But now the problem is that this is what we use for alerting, and it's just saying that something in the Gitaly fleet is not happy; that doesn't really help an operator that much. At the bottom here you can see it says: for further details, select the saturation detail dashboard from the links menu at the top of the dashboard, then select the relevant resource.
A: This graph will give us the machine that's misbehaving, so this instantly cuts down the amount of work that the operator needs to do, because instead of going and looking through the fleet trying to figure it out, they just go to this graph. And here it is, big surprise.
B: [inaudible]
A: The load that we saw on that machine: that dotted line is the number of cores. Obviously, if the load exceeds the number of cores, that pages me. But what I'm looking for is gRPC method invocations, so there it is, and it's ListCommitsByOid. This is something we've looked into before: ListCommitsByOid is a problem.
A: I think it's the merge request controller; I'm pretty certain that's what's doing it, and I suspect that it's the merge request widget. The worst part is that we actually lose some of the observability, because if it's polling and it's cached, we're not seeing those log messages at the moment. So this is probably a lot of people in the company looking at the same merge request: maybe somebody pushed something, all of the widgets started updating, and because of race conditions...
A: ...they all kind of start fetching past the cache again. I would guess something like that; there's a whole area there that we need to look into. The other thing is that the GitLab repository itself gets like ten times the traffic of any other repository for that ListCommitsByOid call, so there's something about how we use it. It's not a general thing, but it's kind of interesting anyway.
A: So anyway, going back to here: we have the detail metrics, and then we have the definitions that we used to create the saturation detail panels here. This is the general Gitaly dashboard: we've got the key metrics for the service up at the top here, and we also have those in more detail.
A: But then the saturation detail really often helps in figuring out what's going on, and so one of the things we need to check is that this graph is going to be OK. Actually, that's going to be fine, because we're using the same underlying measure. What I do want to do there is take a look at where this is being used, to make sure.
A: I'm going to have to figure this out. The problem is the saturation detail for a node, because we have to recast the metrics that we're using with different aggregations, so that we can see individual nodes without aggregating them up, and we divide by the magic number. The problem is the previously hard-coded value. All right, I know how to fix this. It's a bit of a horrible hack, but I can fix it. So I can say:
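A sketch of the per-node recast being described, where each node's throughput is divided by the limit for its service type rather than one global magic number. The metric and recording-rule names are assumptions for illustration:

```promql
# Hypothetical: per-node write saturation. Each node's write rate is divided
# by the limit recorded for its service type (gitaly and nfs shown here),
# so individual misbehaving nodes stay visible instead of being averaged away.
max by (fqdn, type) (
  rate(node_disk_written_bytes_total{type=~"gitaly|nfs"}[5m])
)
/ on (type) group_left
  disk_write_throughput:limit_bytes_per_second
```

The `on (type) group_left` matching pairs many nodes with the single per-type limit series, which is what lets one expression cover both services while only ever showing the nodes matched by the dashboard's type selector.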
A: Yes, so this will give us, for the Gitaly nodes, a division by the Gitaly number, and likewise for the NFS nodes, but it'll only ever show one of them, because this selector over here will be where type equals gitaly. So this should work. Let's go through all of these and do the same, then.
C: [inaudible]
A: Chef hasn't kicked in and actually changed the boxes yet, but it'll happen. Once we have that, I will build a graph on the NFS dashboard after this, but it's kind of difficult to do if you don't have any data, so I'll wait for that to roll out.
A: I'm going to use my Git GUI, because I happen to like GUIs. Oh, it's so much better: I can see everything and I can click around, and as a review tool it's just fantastic for me. I don't know, in my mind that's how I like to see a change, so that there are no surprises in it.
C: [inaudible]
A: Once we get this coming through here, we will get these metrics showing up in capacity planning as well, because they're part of the framework now. So when a resource is looking like it's going to be saturated in the next 14 days, even if it isn't yet, we'll get an alert.
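That kind of 14-day forecast is commonly expressed with PromQL's `predict_linear`. A sketch, assuming a normalized saturation series like the hypothetical `gitlab_component_saturation:ratio` named earlier:

```promql
# Hypothetical: fire when a linear extrapolation of the last week's trend
# predicts the saturation ratio will exceed 1.0 (fully saturated)
# within the next 14 days.
predict_linear(gitlab_component_saturation:ratio[1w], 14 * 24 * 3600) > 1.0
```

Because every resource in the framework reports the same 0-1 ratio, one forecast expression covers all of them.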
C: [inaudible]
A: We'll have all of this stuff come through, and it all fits together quite nicely. Look at that! Wow, do you see this? It works. I think we should get Ingrid to see this. I'll just switch this over too: from the saturation view I'll go to the disk space saturation view, and then from this graph we can see exactly where it is, where on the other graph there's...