GitLab Monitor Stage, 16 Jun 2020

From YouTube: Miguel talks about his SRE shadow experience

Description

The Monitor team joins the SRE team about once per milestone to learn more about their work. Miguel from APM spent a week shadowing during the 13.1 milestone.

Learn more about the SRE shadow program:
https://about.gitlab.com/handbook/engineering/development/ops/monitor/#sre-shadow-program

A

Yes, so the entry point for an SRE when they are on call is always an alert. This alert is created by the Prometheus Alertmanager. That's normally how it's set up, and the alert fires when a rule is broken.

A

It creates an email, or some kind of SMS notification, or a call, which should be answered almost immediately. So during my SRE shadow I received quite a few alerts.

A

As you can see, I tried not to delete them, though I deleted a few. In one day, I can receive around 10 to 20 alerts that are low priority; these seem to come from a kind of flaky metric. As for high-priority alerts, I receive around one per day. Usually, when a system is failing, you don't get just one alert. You get several alerts, for several things that are failing, but they are mostly all symptoms of the same underlying issue. So, how an alert looks in PagerDuty:

A

It looks a little bit like this.

B

A question: you are not getting duplicated alerts? Like, each line is a single...

A

B

Happening you won't receive you.

A

You might, because there are different rules. Not for the same rule, but let's say there's a slowdown in the general service and also a slowdown in one part of the system: you would get two alerts, and they normally come in one right after the other.

A

Maybe I can show you.

A

There are a few that come together. Here there are three triggered incidents, so these are three different, probably related incidents that happened at the same time. Often, it's part of the SRE's job to know whether things could be related or not. These two are related: there's a Pingdom check for this one, and this other Pingdom check is basically the same incident. But there's another one that is something else, and it might be related.

A

It might be unrelated; maybe it just happened at the same time. So you kind of have to know a little bit about the architecture of the system to know whether you're dealing with two different alerts or with one.

B

But the fact that they come in, like, a single group, so three alerts in a single group: this is something that is configured, right?

A

This is something that I received automatically from PagerDuty. I think they simply do it so that it doesn't trigger too many emails, but I don't think there is something super clever grouping different alerts. They just happened at the same time, so they get mixed into the same email.
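
The time-based batching described here (alerts that fire close together ending up in one notification, with no cause-and-symptom analysis) can be sketched as follows. This is a minimal illustration only, not PagerDuty's or Alertmanager's actual logic; the `Alert` class, the `batch_by_time` function, and the 60-second window are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Alert:
    # Hypothetical alert record: a rule name and the time it fired (seconds).
    name: str
    fired_at: float


def batch_by_time(alerts, window_seconds=60.0):
    """Group alerts that fire within window_seconds of the first alert in a batch.

    No attempt is made to relate alerts by cause; proximity in time is the
    only criterion, which mirrors the behaviour described in the talk.
    """
    batches = []
    for alert in sorted(alerts, key=lambda a: a.fired_at):
        if batches and alert.fired_at - batches[-1][0].fired_at <= window_seconds:
            batches[-1].append(alert)
        else:
            batches.append([alert])
    return batches
```

Each inner list would then correspond to one email or notification. Deciding which alert in a batch is the cause and which are symptoms is still left to the person on call.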

B

I know that there are some solutions on the market today for when you're getting some sort of alert storm. They can identify a cause and its symptoms, so they are grouping all of the single incidents: okay, this is your cause, and the rest

A

B

is just noise. Those are just symptoms.

A

Yeah, that's right. I guess they could be grouped, but as far as I saw there was no specific grouping. They are just set up in a way that...

A

If it happens for more than a few minutes, then they will notify you. Otherwise, I don't think it is at the level where they can group alerts into symptoms and causes, because this is still evolving. Also, many alerts are already being ignored by the system: there are many more incidents that are not reported here, such as these no-urgency alerts, so you don't get a call all the time. It's already an improvement from

A

let's say, the flow that was there before. So I think it could be an evolution, but it's already there. Okay.
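
The "only notify if it happens for more than a few minutes" behaviour resembles the `for` clause of a Prometheus alerting rule. A minimal sketch of that idea is shown below; the function name `should_notify`, the sample format, and the numbers in the usage note are assumptions for illustration, not the team's actual rule configuration.

```python
def should_notify(samples, threshold, hold_seconds):
    """Return True if the value has stayed above threshold for hold_seconds.

    samples is a list of (timestamp, value) pairs sorted by timestamp.
    A single sample back under the threshold resets the clock, so short
    spikes never page anyone.
    """
    breach_start = None
    for timestamp, value in samples:
        if value > threshold:
            if breach_start is None:
                breach_start = timestamp
        else:
            breach_start = None
    if breach_start is None:
        return False
    return samples[-1][0] - breach_start >= hold_seconds
```

For example, with one sample per minute and `hold_seconds=180`, a metric that has been above the threshold for the last three minutes triggers a notification, while a one-minute spike does not.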

B

Can you open one incident, so we can see what's in there?

A

Yeah. If the incident is triggered by the Prometheus Alertmanager, which is the main way in which we receive alerts, then the Alertmanager will send you some information about which metric was acting up. What I saw being used most was the Prometheus graph. This alert is not firing anymore, so we might not see data, but let's try.

A

To execute it. Yeah, so there are no data points found, but I can try to increase the time window back to when this was triggered. Oh.

A

Yeah, it's slow.

A

This might take a while; I might want to pick another alert, another incident. Oh yeah, here you go. So, the chart: we don't see the lines. We should see some lines, which are typically the data that is surpassing the alert level. This one is interesting.

A

A

Typically, during an incident, the data from the last hour is already useful. Oh.

A

That's not good.

A

Yeah, this is still evolving very fast. So that's fine, I'll see all the latest data. Okay, that was probably the trigger of the alert: this line, as you can see, suddenly jumped during the incident. It was probably a smaller time window in which you can see data here.

A

I think this is what you can use to try to identify the issue. Then, after you see this, you will probably jump either to Grafana or to the related service that is broken, and try to find a solution or mitigate the issue. You can also report the issue in Slack.

A

Yeah, so here, okay, I'm starting to find the peak. During the incident we will probably see something like this: suddenly this thing is firing, we see it at the end of the chart, and then we have to find the resolution. This one happened to be that the database was very busy. So what the team does is go and check what's going on with the database. They might use Grafana, though not all of the team members do; I worked with three of them.

A

One of them was depending more on Grafana, and another one was going more often to Elasticsearch. So once they see this, they probably go to Elasticsearch, try to find exactly which metric is failing, and try to find logs that are related to it.

B

A

So the feedback was a little bit that Grafana is changing so often, and it seems to be getting too complicated for usage during an incident. It's not really Grafana's fault; it's just the way it is being used right now. There are too many dashboards to make things easy to find and navigate to during an incident: you would have to scroll through several dashboards trying to find your way.

A

Even if you keep a bookmark of a dashboard that you depend on, the dashboard may have disappeared because somebody changed it, removed it, or did something else. That made Grafana a little bit less reliable during an incident. It's probably useful for other things, but during an incident it's hard to navigate to where you want to go. So doing the drill-down, or going from here to the corresponding Grafana chart,

A

was something that was difficult. I think some alerting rules have more information; for example, this one is without much information. Some of them have more Grafana-related information, depending on how the Alertmanager is set up. I suppose I could try to find one, but it's not something that I saw people using. These are the dashboards.

A

This is something that I didn't see them using very often. More often they would go into Elasticsearch to find

A

A little bit more.

A

Creating a chart during.

A

B

Interesting so.

B

A

So, the logging: if you are really good with Elasticsearch, you can pinpoint from here which kind of events are being triggered. There's a charting section, I think it's called Chart or something similar, that lets you visualize the entries according to some filters and some dimensions, so you can chart exactly what's going on. This will help you drill down to the specific kind of event, like which IPs are creating these kinds of jobs. Then you're able to say: ah, maybe it's time to block these IPs, because they could be malicious or they could be overusing our resources. These things happened a lot during the incidents: finding the IPs of the attacker, or of the person that was overreaching.

A

A

Yeah, so let me see. I created four issues related to things I found; maybe you have already seen them. The first one is that

A

I started understanding a lot more about what the query results look like. So I started reading a little bit more of the Prometheus documentation to check what kind of information it is giving us back, and I found out that in some charts we are ignoring part of it. In every chart we can have any number of metrics, and in any chart we can have any number of rows of data, because any of them could return a matrix result.

A

For this reason, in any chart we could have multiple series accompanied by multiple metrics, so we should consider this kind of expanding of the data in our charts. I checked the code, and I saw that on many occasions we do something very lazy: just fetch the first thing we find and display that. I think that's not enough if we want to create a more robust solution. So that's one finding.
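
The point about matrix results can be illustrated with the response shape of the Prometheus HTTP API, where a range query returns `data.resultType == "matrix"` and a `result` list with one entry per series, each holding a `metric` label set and a list of `[timestamp, "value"]` pairs. The sketch below expands every series of every query instead of taking only the first one; the function name `expand_matrix_results` and the output dictionary keys are assumptions for illustration, not the GitLab implementation.

```python
def expand_matrix_results(responses):
    """Flatten several Prometheus range-query responses into a list of series.

    responses is a list of already-decoded JSON responses, one per metric
    query configured on the chart. Every row of every matrix result becomes
    its own series, so nothing is silently dropped.
    """
    series = []
    for query_index, response in enumerate(responses):
        data = response.get("data", {})
        if data.get("resultType") != "matrix":
            continue  # scalars, vectors and strings are out of scope here
        for row in data.get("result", []):
            points = [(float(ts), float(value)) for ts, value in row.get("values", [])]
            series.append(
                {
                    "query_index": query_index,
                    "labels": row.get("metric", {}),
                    "points": points,
                }
            )
    return series
```

A chart fed with two queries that return two and three series respectively would then draw five lines, rather than the first row of the first query only.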

A

So that's why I posted this. Compared with, like, the heat map, the best one is the time series, because I think the time series shows more stuff, but I still want to check whether we are showing both multiple metrics and multiple rows of series, and the same for the other charts. I think we're already creating issues about that, which is quite cool, and I hope we can start work.

A

Oh yeah, another aspect of this: when you are dealing with an alert or with a specific metric, you might have a hole in the data. This is important to know, because you could face some kind of outage of Prometheus scraping or of the source of the metrics, and then you might be missing data points. This is important to see, and I think right now what we do when some data is missing is this:

A

We simply complete the line, so we go from here to here. If we have this kind of data, what we would do with our charts today is simply connect the points until here, and assume that it doesn't matter because it looks better. We shouldn't do that.

A

We should always show the holes in the data. There are other, trickier visualizations, like here, where the data is full of holes, and we kind of have to show that in the chart.

B

I guess we also need to separate between a hole in the data, like a missing check, where a check or a test was missed, versus we got a response, but the response was, like, zero. So we should distinguish between a

A

B

A

If the value in the response was 0, we would see something like this: we would simply see the metric drop to zero, so that is not a problem right now. But it might be that, for a series of values, we have long intervals between two values that have no data, and we should be able to detect that so that we can display a hole.

B

I think that one, yeah. How do we know that? I mean, I guess we need to put a test in place that you're

A

Not missing data.

B

It's just that we are sampling it at a rate.

A

B

A

Right, so we have to consider the rate, the sample rate. Understanding the sample rate, if the interval is bigger than a threshold, we should start adding maybe some empty values or something to convey that information is missing. So that's something else that we should try to address. Initially I suspected that we were removing these values, so I was complaining about all these lines where we kind of remove undefined values. But thinking about it now, that could be something where we have to actually fill in

A

the missing information. So it could be more difficult than I initially thought: not just removing code, but also adding something else to put the holes in there. Oh.
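
The idea discussed above (compare the interval between consecutive samples with the expected sample rate and, past a threshold, insert an empty value so the chart shows a hole) can be sketched like this. It is only an illustration under stated assumptions: the function name `insert_gaps`, the `tolerance` factor of 1.5, and the use of `None` as the "no data" marker are hypothetical choices, not the existing GitLab code.

```python
def insert_gaps(points, expected_step, tolerance=1.5):
    """Insert a (timestamp, None) marker wherever samples are too far apart.

    points is a list of (timestamp, value) pairs sorted by timestamp.
    A real value of 0 is kept as-is; only a missing interval produces None,
    so "the response was zero" and "there was no response" stay distinct.
    """
    result = []
    previous_ts = None
    for ts, value in points:
        if previous_ts is not None and ts - previous_ts > expected_step * tolerance:
            # One marker is enough for most charting libraries to break the line.
            result.append((previous_ts + expected_step, None))
        result.append((ts, value))
        previous_ts = ts
    return result
```

With a 15-second step, samples at 0, 15, 90 and 105 seconds would get a `None` marker at 30 seconds, so the line is broken between 15 and 90 instead of being drawn straight across the outage.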

A

Yeah, so this is something that I think we're both.

A

Possibly Mary as well. This is something that was super useful during incidents: being able to create a quick preview of a metric. We could send users from the Alertmanager, from these incidents, directly to our dashboards, but without the need to create a new dashboard or to immediately create and save a new visualization. So it could be something like a metrics preview, similar to what Prometheus is doing: you can throw in any query you want, yeah.

A

Maybe we even help the person type the right query, or the query comes from the alert, and then this immediately displays a chart. We could do that inside GitLab; I think it's not that much effort. For the dashboard panel that we already have, we could add a text field or something of the sort and let the user type the queries, and then we would be able to display something and maybe let them select: okay, do you want a time series?

A

Do you want a line chart, or do you want a bar chart, or whatever you want, and then let them see live what they want. I think this could be useful. Hopefully, if it's fast, it could be useful during an incident, if it lets you check quickly what's going on.
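
A metrics preview of the kind proposed here needs little more than a query string and a time window. As a rough sketch, the request behind such a preview could target the Prometheus HTTP API range-query endpoint (`/api/v1/query_range`, with `query`, `start`, `end` and `step` parameters). The helper name `build_query_range_url`, the example host, and the idea of deriving `start` from an end time and a window are assumptions for illustration.

```python
from urllib.parse import urlencode


def build_query_range_url(base_url, query, end, window_seconds, step_seconds):
    """Build a Prometheus range-query URL for a free-form query.

    end is a Unix timestamp in seconds; the start of the range is derived
    from the window, so the caller only chooses "how far back" to look.
    """
    params = {
        "query": query,
        "start": end - window_seconds,
        "end": end,
        "step": step_seconds,
    }
    return f"{base_url.rstrip('/')}/api/v1/query_range?{urlencode(params)}"
```

Because the whole state lives in the URL, a link built this way can be bookmarked or pasted into an incident channel, which is the property the SREs valued in the Prometheus UI.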

B

Yeah I think it's very.

B

The fact that they are using it... I think in Grafana they even have, like, a hotkey, I think it's an X, or you click on it and it's like, boom, it starts. It's like a playground for queries.

B

A

One of the team members showed me how to add a new chart and, in his opinion and from what I saw, it was quite difficult to create a new chart during an incident or when you're in a rush. So Prometheus, with this simpler UI, does a better job at that. This is something where we could be a little bit more agile.

A

B

B

A

If you look at the Grafana UI, you can add, like, another axis, you can select the unit formatting, and you can select

A

if you want to use the logarithmic scale or the regular scale. There are very many knobs, and they are probably very useful for the Grafana use case, because not all the data is coming from Prometheus. But here, with Prometheus, it's something much simpler: just show the data, show the time, and show the series. I think that becomes more useful for quickly creating a visualization and exploring

A

What's the status of the system.

B

That's okay. I guess it depends on the use case, and I think it's important to emphasize that in the issue.

B

Because, you know, you look at the Prometheus UI and it's very dull, just, like, an API, and Grafana has, like,

A

B

All the bells and whistles, you know it's.

A

B

But I guess when you are troubleshooting an incident and you just want to quickly change the query and see what that looks like, you don't care. I think it's important to emphasize the use case, because some would expect: yes, I have to have a preview, but I'm also going to select the chart and select the lines and colors.

A

I think I mentioned this, yeah: there is a very similar issue for a chart builder, and I think this is already going to happen soon. So the features are very similar, and I think there is a lot of overlap, but there is a line where it stops being about exploring the data of the last hour and becomes: okay, let's make a really complete solution to build new charts and see how it goes.

A

I'm talking a lot from the perspective of the SREs; that's what I was doing.

B

A

There's another use case where you really have to create a complex chart, and then of course you need more knobs or more settings. Yeah.

B

I think it's important, I mean the difference between a preview and a metrics explorer. This is like a metrics explorer: you just go and explore a metric. There is a lot of overlap, but this is where, as I mentioned, the use case is important to better understand. I'm pretty sure that if we build one, we can leverage the work.

B

The UI needs to be different. The way I see it, the preview for the chart builder will help our users create a chart for their dashboard, while this issue, which is more of a metrics explorer kind of thing, is not for preview; it is to explore metrics. So maybe it's important. You probably

A

B

the terminology, and understand the user flow and why it is different.

A

Definitely. Some other details: once they create a chart here, it's very easy to go and bookmark it and share it with a colleague, and the chart will be repeatable. So that's something that perhaps

A

the metric builder would not do. I think, as you are adding more and more things, it's not like we are keeping the state in the URL, but in this use case it becomes very useful to share it as part of the incident. So I think we are going in the same direction, we're converging on that, but it's a little bit different. That's why my proposal is here.

A

Let me see if there's anything else that's interesting. Yeah, it's more about the past few hours or days; you're probably not going to explore very old data. I would like us to be able to switch quickly between different charts.

A

So we can see whether a bar chart may be more useful, or a heat map, or something else. After that, once you have explored, maybe the next step is to say: okay, add this metric to that dashboard, and then you start that more complex flow where you edit things. Another thing they mentioned to me was that seeing the raw Prometheus errors was important, because during an incident you probably want to see exactly what the API is throwing back at

A

you. I think that's something that we're not doing right now. We just say: hey, there was a problem fetching the metrics, sorry. But sometimes they want to see more details.

B

Maybe when, when we are adding like.

A

B

A

B

As an MVC, we can

A

B

Copy that yeah.

A

B

I think what one another thing is we need to make sure.

A

And yeah, another idea, or misconception, that I had: I assumed that all Grafana charts, or all charts, always correspond to alert rules, and I think that's the assumption that we have made in our UI. But that's not necessarily true, as you can see in this case. Here the Alertmanager is sending us a chart that most of the time does not generate any data points.

A

Unless there is an issue, right? Like when I showed you initially, let's say we are here: it doesn't even load anything. So this chart only exists during the incident. So not for every alert do we have a chart.

B

A

But it's only because.

B

In Prometheus, right.

A

So that's where this kind of thing comes in handy, because you can create a visualization really quickly with a URL and then just go and navigate, and you don't have to put it in a dashboard and all the other things we do.

B

Interesting. I think that for us, if you're using, like, Prometheus as a managed app, you will not be able to

B

A

Right conception.

B

That we have built in into the into the product what I mean from doing that. But it's definitely you should explore yeah.

A

B

Configure alerts: how can you configure them? I mean, with an external Prometheus it's easy, but with the internal one...

A

B

Use visualization.

B

We probably need to think about a way to set up alerts on the Alertmanager without the need to create charts, exactly.

A

Exactly. And there could be a dashboard that is mostly empty, with data points only when there's a problem. So we should allow for having something like this, yeah.

B

I think we have an issue for making sure it's possible to create the alerts even when there is no... Okay, I think our intention in doing that was something different, but yeah.

A

B

The same point, yeah.

A

Good. Anything else? Yeah, I really love the Prometheus approach to dates.

A

Yeah, so this is simple, it's really nice. Elastic fails a little bit at this, because the Elastic date picker is super complex.

A

This gets the job done really nicely. If you want to increase the time window, you can do so. If you want to go back and forward, you can do so, and then you can change the resolution. So with these three fields, Prometheus does a lot of things that are very useful for SREs, and that they would not want to trade for a more complex UI.

B

When you click on the arrows, so it moves by the time window that you set?

A

Let's see if I can manage to... yeah, so here we are moving back and forth in

A

A

Little bit more yeah.

B

A

B

Like, if you want to troubleshoot the latest data it's fine, because if you want to go back in time, like week over week or month over month, then it's less cool, yeah.

A

But a week, I mean, if you want a week, you can always increase it just a bit, yeah.

B

You make increase, increase increase like you need to do, is I. Can.

A

Yeah, I can get to a week fairly quickly. And I noticed that you don't even modify the start time; you just set the end time and the time window.

A

So it's something simpler than what we're doing, and I think it gets the job done.
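
The three controls described above (a time window that can be widened, back and forward arrows, and an end time instead of a separate start time) can be modelled with very little state. The sketch below is an assumption-laden illustration, not the Prometheus UI's code: the `TimeRange` class, its method names, and the choice to shift by one full window and to widen by doubling are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TimeRange:
    # The only stored state: where the range ends and how long it is (seconds).
    end: float
    window: float

    @property
    def start(self):
        # The start is always derived, never edited directly.
        return self.end - self.window

    def shift(self, direction):
        """Move back (direction=-1) or forward (direction=+1) by one window."""
        return TimeRange(self.end + direction * self.window, self.window)

    def widen(self, factor=2.0):
        """Grow the window while keeping the same end time."""
        return TimeRange(self.end, self.window * factor)
```

Starting from a one-hour window, doubling eight times already covers more than a week, which matches the remark that a week is reachable fairly quickly.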

B

Most important most windows yeah. We should definitely explore this I think a lot of maybe.

B

A

I understand why Grafana would do something like that, because often our dashboards serve so many use cases. But if we want to narrow our use case to an SRE, then something like this felt a bit easier for me.

B

Only exactly so, we can definitely think about that cloning. This support and I know that we have any issue on how to import this thing. This is definitely something that we designed.

A

It was created to allow rewind and forward. So this is like an iterative approach, but it still adds something. If you see that it is better to reduce it, I think we could open another issue and say: yeah, make it a little bit simpler. Without.

B

A

Why not write turned into an epic which is the one about.

A

Keeping all the data.

B

Like a master issues form the.

A

B

And group them, and ideally other engineers that will also they will use this epic Lulu.

A

B

Add those issues to the existing epic and put them in the backlog, because I really think those are really valuable, and it's important that we do them.

A

So thanks a lot. I'll stop the recording.

A

I can stop the share.