GitLab Monitor Stage, 5 Aug 2019

From YouTube: Discuss Dogfooding of Monitor Features

Description

Discussion between Monitor Stage teams and Infra

A

Okay, recording. Cool, and then I'll show my screen. So the intent of this conversation: Devin, it's my understanding that you are part of the infrastructure team and that you're, I guess, maybe personally in charge of the project to have other non-GitLab.com properties utilize more Auto DevOps and other functionality. So I am particularly interested. This team is a group of folks from the Monitor stage, and we're interested in how you're doing that.

A

If we can help you figure out other parts of the product that you aren't using for monitoring the various properties. And maybe also I'd love to get your roadmap for which properties are going to be brought into this process and how you manage them. So it would be good to start with an intro to that effort on your part and what your goals for it are.

B

Yeah. So the way we're thinking about it is: we've got our core GitLab.com site, and then we've got non-core properties. The non-core stuff is mostly support tools, things that are part of the whole offering but not actually part of GitLab the product. So it's things like the licensing tool, the customers tool.

B

We've got the version tool, which collects all of the phone-home information from all of our self-managed instances.

B

We've also got things like the forum, you know, the user forum, and the design tools, Pajamas. So those things are non-core. They're not really part of the GitLab product, but we still run them as production instances, and so those are prime for deploying with Auto DevOps rather than trying to do it the system administrator way that we've been doing it in the past. So what I'm trying to do is set up all of the infrastructure parts, the Terraform,

B

the infrastructure as code, to run the Kubernetes clusters that this all attaches to as a production,

B

you know, production type of setup, but not as part of our GitLab.com toolset, because right now it's kind of cluttering our existing repos. We've got a lot of stuff in there that is really a one-off thing. You know, we don't spin up all of those tools when we spin up a staging instance or a test instance, so we're trying to separate all of that.

B

So we've got all the tooling that we use to run a GitLab.com instance, and then all of the peripheral extra stuff, so we're standing those up as Kubernetes clusters with whatever they need to do the Auto DevOps thing. We've got some Cloud SQL stuff that we're putting in there for persistent databases, stuff like that. The plan is to use as much of the internal tooling as we can, even though we understand that at the moment it's probably not sufficient for our needs as far as monitoring, alerting,

B

redundancy, all of the things that we need for production. But if we stand it up and start finding those issues, then we can go back to the product teams and say: look, we really need this fixed. And presumably everything that we need is gonna be stuff that our customers will also need.

A

Perfect. Awesome, that was super helpful. Thank you. Yeah, I would love to just, maybe from there, have you tell us where you are in that process with the first one, which I believe is design.

A

Is it that you have the clusters and you are deploying to them, but you're not using the monitoring tools? Or you haven't really even started with the clusters and deploying to them?

B

So for design, which is the first attempt: we picked design kind of arbitrarily, because it doesn't have a database attached, which makes it easy to get it up and running. The harder part of the other ones is going to be figuring out how we're gonna do the database and load the data and stuff like that. So design was fairly easy. We set up the clusters. We've got two clusters: one for production and one for staging and review apps. That staging one's kind of shared.

B

We could split it into three, but we don't really see any need to at the moment. And pretty much what I've done as far as the monitoring part is just switch on everything I could see that by default would give monitoring.

B

That's probably not what we ultimately want, but my focus was really on getting it up and running so that the DevOps pipelines would succeed and they could deploy by merging, and that was kind of the goal for round one. We're gonna have to be a little bit more complete with some of the other stuff, because with design it doesn't matter so much if it goes down for a few minutes, whereas if it goes down with some of the other stuff, it's gonna be a little more impactful.

B

So I would like to come up with a plan for monitoring. I just haven't really thought about it yet. You know, we just got to the point where they can merge without having to talk to the infrastructure team and have it actually deploy to production.

A

Cool. Yeah, that's an awesome first step, by the way, and I love the planned process for that. I guess one of the things that we had in the discussion in this issue was: can we help you gain familiarity with the monitoring tools and also what's coming in the roadmap? And I would also love to see any...

A

It might also help to give you, I guess, scoping for what we're attempting to do, so that when you encounter, like, "hey, I can't get it to do this", you would know to create an issue, versus being like, "that's not anywhere near where we were headed". Would that be helpful?

B

It would, yeah. I mean, really, I haven't thought about monitoring at all. I've been concentrating all my efforts on getting this stuff running, and you know, my next step is getting the databases running so we can start putting some of these other apps in there.

B

I'm also on call this week, so I'm pretty much putting the whole effort on hold until early next week. But if there's some easy way to digest it, if there's some screencast or anything that you guys have that wouldn't require sitting down and reading pages and pages of documentation...

A

Yeah, awesome. Amelia, Seth, Jose, I don't know of any, like, "here's an overview of the Monitor stage features". We have those in some other stages, but not in Monitor. Anybody know of any place that we could point Devin?

C

We had one with Josh, but it's pretty old. I don't think it covers the new features. We should do a new one.

A

Okay, so what I thought, Devin, is I could just briefly show you. In the settings for design, I got myself admin permissions here. I could show you what monitoring you have today, and this is all out of the box: because you installed NGINX, you get these three sets of data. I don't think there are any default alerts configured for any of them, but you could easily define a query and an alert threshold.

A

This looks like it spans the various environments. There's also system metrics, that is, the cluster metrics, and again you could define alerts on these. And if you were to have some standard alerts, then you could set it up in this Operations tab so that those alerts don't just send you email, but create an issue for you.

A

Actually, it looks like that's already been set here, but with no templates, so these would just create dummy ones. But you could also set it up so that it is creating issues for you based on a template. That is the thing that, as a product organization, we're most interested in getting feedback on from dogfooding: that entire loop. Like, I've got an Auto DevOps project, I set up some alerts, and then I am having it create incidents for me. And are those incidents...

A

Can I configure those incidents efficiently so that I could actually triage and resolve that incident? We don't have anyone, frankly, using that in any sort of production way today. So that is the thing that was most interesting to me. I don't know if that's interesting to you, or if that breaks how your process works today.

B

It could be. There's a couple of questions that pop up right away. How do we get that to go to something like PagerDuty? Is it just doing that, or can we set up integrations, like a Slack integration where it goes into Slack? Or if it goes to PagerDuty and calls one of their webhooks, then we can get it into our workflow.

A

Those would be great issues for us to add. Seth, I don't know, like...

D

Oh yeah, we do have a Slack MVC one that would hook up incident creation with posting something in Slack and then being able to use, like, ChatOps to interact with that. We don't really have any PagerDuty integration right now, but that is something we want to do. I don't think we've really designed it out; I'm not sure where those connections would be. But it definitely feels like something that we need to do to make this very useful.

B

I don't know if you'd want to make it just PagerDuty. It would probably be more useful to have a generic webhook where you could put whatever it is that you want to do, you know, whether it's PagerDuty or some other tool. Most of them...

D

Yeah, so tell me: when you're talking about hooking it up with PagerDuty, what is the workflow that you are envisioning?

B

So right now, we use Alertmanager to

B

alert when certain conditions exceed certain thresholds, and that sends an alert to PagerDuty. PagerDuty then has its escalation path: you know, it first sends a push message to the on-duty person's phone. If they don't answer, then it'll text them, then it'll call them, then it'll escalate up to the next person. So I don't know that we want to try and replicate any of that functionality. It's pretty involved, and it's easy enough just to call a webhook.

B

The webhook would also let us post to Slack just using the webhook functionality.
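
The generic webhook idea described here can be sketched roughly as follows. This is only an illustration under stated assumptions: the `Alert` fields, the payload shape, and the URL are hypothetical and are not taken from GitLab, Alertmanager, or PagerDuty. The request is built but never sent.

```python
import json
import urllib.request
from dataclasses import asdict, dataclass


@dataclass
class Alert:
    """Hypothetical alert shape; the field names are illustrative only."""
    title: str
    environment: str
    threshold: float
    value: float


def build_webhook_request(alert: Alert, webhook_url: str) -> urllib.request.Request:
    """Build (but do not send) a JSON POST carrying the alert to any receiver."""
    body = json.dumps(asdict(alert)).encode("utf-8")
    return urllib.request.Request(
        webhook_url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


alert = Alert(title="High 5xx rate", environment="production",
              threshold=0.05, value=0.12)
# example.invalid is a reserved placeholder host, not a real endpoint.
request = build_webhook_request(alert, "https://example.invalid/hooks/alerts")
# Actually sending it would be: urllib.request.urlopen(request, timeout=5)
```

Because the receiver is just a URL, the same call could in principle point at a paging tool, a chat tool, or anything else that accepts an HTTP POST, which is the appeal of the generic approach.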

B

The other thing is that it looks like, from this screen, the issue is only opened in the project. That's correct, yeah. And we don't monitor that project. So if we do get an alert on this, nobody on the infra team is gonna see it. So if this could open an issue in the infrastructure queue, that would be a whole different story. Then we could set up a template that has all the right labels and tags, and the right people would see it when it happened.

A

Cool. Sorry, I'm split-screening and also trying to write stuff like that down. Those are all great issues that we can create. Let me be sure I'm capturing this. So the other thing Seth probed about is what interaction you would want if there was the ability to check in your Alertmanager config and have that deployed. Is that a shortcut around using this UI-based configuration?

B

Yeah, that might work. Yeah, because if we could just say "just use this whole Alertmanager config", then we'd have all the queries and everything. Yeah, that could be a shortcut.

A

Yeah, I mean, I would love for you to just be using generic Auto DevOps, but I understand that it is not sophisticated enough at this point. But if we could have some way that Auto DevOps saw your Alertmanager config and just deployed it to the Prometheus cluster, that is an issue that I think we have in the upcoming roadmap.

A

And then you had suggested creating incidents in projects other than the self one, yeah.

A

Are there... like, you mentioned you have your own set of metrics. The other thing that we would love feedback on is: are these the right metrics? Are there other metrics that you would want to immediately have in these dashboards that you consider more critical, especially for a project without a database, like design?

A

These are what rolled out of the box, but are there other ones that you would rather have seen? You don't have to answer that now, but those are also the kinds of questions where we can make sure we're instrumenting, as part of the default setup, that you get different views or different metrics.

B

Have you guys looked at our standard dashboards?

A

Not for Pajamas or for design, no.

B

Not that we have any specifically for that, but usually it's the 5xx errors. Those are the main ones that we use for non-database stuff.

B

I'll just give you our triage dashboard link.

A

See, so the other thing that you can do as part of our product is, if I go to these operations settings, I can add your dashboard as this external dashboard link.

A

What did I do? They came here.

C

That needs to be configured instance-wide. I don't think you want to do that on GitLab.com.

A

This is instance-wide configuration? It's just scoped to the project at that...

C

Yeah, but the external dashboard, well, external URLs, need to be configured instance-wide. It's good for, you know, development purposes, but not for production.

A

Gotcha. Okay, so that might be another one.

A

And then do you primarily alert... Let me just pull up that link. You can be walking through these; that'll be illuminating for us.

A

So these aren't going to be the Kubernetes cluster health.

B

That's not in there, but this should just give you an idea of what we look at on the GitLab.com site.

A

Okay, but you don't have something similar? Does that mean there's no general monitoring for design today? It...

B

A

Be for a moment, okay,.

B

No, this is pretty detailed, you know, and a lot of this isn't really going to be relevant, because we've got things like, you know, Redis and the Gitaly nodes and stuff like that, file space.

B

But you can see, by the way we have it laid out, what's important. So there's the errors coming from the backends, and then HAProxy would be kind of the equivalent of the ingress responses.

A

And then, yeah, it would be interesting to see if you, like, wrote your own Alertmanager config for design.gitlab.com, if we could then look at that and be like, "oh, here's all the places where those metrics are missing and we should have included them", or the types of alerts and where you're sending the alerts. We could use that to target UI improvements.

A

How do you personally feel about configuring alarm thresholds in a UI versus in a file, like Alertmanager?

B

Oh, I like it if it works. The thresholds can't be just, like, little, you know, radio boxes or whatever, because a lot of the things that we're configuring are using Prometheus queries, and some of those are fairly involved.

B

So, you know, just having a threshold is not as useful as being able to actually write the query. But writing the queries requires the user to know how to write those and know what all the fields are, and it takes a lot of documentation.

B

You know, stuff like this is useful for just a quick and dirty dashboard, but when we get into a little more heavy monitoring, like what we're gonna be doing on GitLab.com, we're gonna have to be able to write custom queries. So it might be nice to have something like that for people to just get something up and running quick, but then also have the option of adding your own. Yeah, maybe like a custom option there.
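
To make the "threshold plus custom query" distinction concrete, here is a small sketch, assuming a rule is just a free-form query string paired with a comparison and a threshold. The `AlertRule` type is hypothetical, and the metric name `http_requests_total` in the query is an assumption for illustration, not something named in this discussion. Nothing here talks to Prometheus; the query is carried as text and only the threshold check is evaluated.

```python
import operator
from dataclasses import dataclass

# Comparisons a simple UI could offer next to a threshold field.
OPERATORS = {">": operator.gt, "<": operator.lt, "==": operator.eq}


@dataclass
class AlertRule:
    """Hypothetical rule: a query written by the user plus a threshold."""
    name: str
    query: str        # free-form query text, e.g. a Prometheus expression
    op: str
    threshold: float

    def fires(self, value: float) -> bool:
        """Return True when the evaluated query value crosses the threshold."""
        return OPERATORS[self.op](value, self.threshold)


# A custom query for the share of responses that are 5xx (metric name assumed).
rule = AlertRule(
    name="High 5xx ratio",
    query='sum(rate(http_requests_total{status=~"5.."}[5m]))'
          " / sum(rate(http_requests_total[5m]))",
    op=">",
    threshold=0.05,
)

is_firing = rule.fires(0.12)   # 12% of responses are errors
is_quiet = rule.fires(0.01)    # 1% of responses are errors
```

A preset in a UI would fix the `query` and expose only `op` and `threshold`; the custom option leaves the query itself editable.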

A

Yeah, we have the ability to add custom metrics and alerts on those custom metrics, I think, today. But I do think there's, like, the Datadog way, which is really cool, where you can pull up the metric and then draw where you want the threshold to be, both in terms of X and, like, the width and height of where that threshold sits.

A

I don't know if you're familiar with that view in Datadog, but there is kind of a question of how much investment we do in that versus allowing you to define it in a common Alertmanager config and then deploying that more easily.

B

Have you talked to Andrew about any of this?

A

I have, yes, significantly. I don't know if others on the team have.

B

Yeah, he's working on some newer stuff that we're gonna be using on GitLab.com eventually that's a little more predictive and uses algorithms to really tell if there is a problem based on the historical data. It would be really cool to have some of that stuff in here, but I know he's not ready enough with it for us to even use it on GitLab.com.

D

Yeah, is that the stuff he was talking about at Monitorama, the seasonality and...

A

Well, he's also been defining SLOs, so there's this interesting collision, because part of the Monitor team, one group of it, the APM group, is also in charge of GitLab self-monitoring. We're moving some of the monitoring, the tooling we give our customers to monitor their own GitLab instance, which is very analogous to the monitors that Andrew and others have for GitLab.com, to use these features, so that when a user is monitoring their GitLab instance,

A

they have a GitLab instance self-monitoring project that has dashboards created with some of those metrics. So that is kind of underway, but that feels like much longer term, whereas you working on some of these smaller non-core ones feels like we can at least get to where you're actually doing incident management in that full lifecycle with it sooner rather than later. And that, as I said before,

A

that's my primary interest: to have a real-world internal dogfooder of that complete loop. Make sense? If I may go back to the PagerDuty thing, because I'm wondering: what would it take for you to not use PagerDuty, especially for a smaller project like this?

B

So PagerDuty gives us escalations, and that's the hardest thing to do: schedules and escalations. So we have a rotation, or we have two rotations. We have the rotation for Europe and the rotation for not-Europe, I guess, and

B

every five weeks we have a different person, or rather, it's a five-week cycle. So, you know, I go on call every five weeks for the North American/APAC rotation. Getting that kind of scheduling into the product would be very cool, but I think it would also be really a lot of effort, you know, because PagerDuty gives us things like overrides. So you set a schedule

B

that's like what your plan is, and then somebody says, "well, I've got this thing, I need to go to a conference that week", and they trade with somebody, and so you go in and put in overrides and switch places. All of that stuff is pretty heavy, and PagerDuty's very specialized in that. Yeah.
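
A five-week rotation with overrides, as described above, can be sketched in a few lines. This is a minimal illustration, not how PagerDuty or GitLab implements scheduling; the names, the start date, and the week-indexed override mapping are all assumptions.

```python
from datetime import date

# Hypothetical five-person rotation; each person covers one week in turn.
ROTATION = ["ana", "ben", "chen", "dee", "eli"]
# Monday on which week 0 of the schedule begins (assumed).
ROTATION_START = date(2019, 8, 5)


def on_call(day, rotation=ROTATION, start=ROTATION_START, overrides=None):
    """Return who is on call for the week containing `day`.

    `overrides` maps a week index (0 = the week of `start`) to the person
    covering that week instead of the planned one, e.g. after a trade.
    """
    week = (day - start).days // 7
    if overrides and week in overrides:
        return overrides[week]
    return rotation[week % len(rotation)]


planned = on_call(date(2019, 8, 14))                      # week 1, as planned
traded = on_call(date(2019, 8, 14), overrides={1: "dee"})  # week 1 after a trade
wrapped = on_call(date(2019, 9, 9))                       # week 5 wraps to week 0
```

The planned schedule is pure arithmetic; it is the overrides, and the bookkeeping around who traded with whom, that add most of the weight B is describing.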

A

So you're gonna hate me, but just bear with me as I describe this. What if you had issue templates, like one for each individual, and you just flipped that drop-down on the integration (where is it? right here) that says which template you're using for a given week? So you're like, "well, this is who's on call, so I chose their template". And as part of that template, you can have quick actions that assign it or ping or at-mention somebody when the issue is created.

A

You could also have, like, a little bot that runs around that says, "oh, this has been created and Devin was assigned to it and there hasn't been a response in five minutes", so it escalates to their manager. Just like we have internal triage bots for doing stuff like that. I'm trying to figure out ways that we could get you using this, so you start putting feedback into it, for something like design, without... yeah.
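
The bot's decision rule described here (escalate to the manager if the assignee has not responded within five minutes) might look like the sketch below. It only models the decision; how a real bot would read issue state or notify people is left out, and the function and parameter names are hypothetical.

```python
from datetime import datetime, timedelta

# Delay before escalating, taken from the five minutes mentioned above.
ESCALATION_DELAY = timedelta(minutes=5)


def next_notification(created_at, now, has_response, assignee, manager):
    """Decide who the bot should notify for an incident issue.

    Returns None once someone has responded, the assignee while the
    escalation delay has not yet elapsed, and the manager afterwards.
    """
    if has_response:
        return None
    if now - created_at >= ESCALATION_DELAY:
        return manager
    return assignee


created = datetime(2019, 8, 5, 12, 0)
early = next_notification(created, created + timedelta(minutes=2),
                          False, "assignee", "manager")
late = next_notification(created, created + timedelta(minutes=6),
                         False, "assignee", "manager")
handled = next_notification(created, created + timedelta(minutes=6),
                            True, "assignee", "manager")
```

Note that this only changes who is mentioned on the issue; it does not page anyone's phone, which is the limitation raised in the reply that follows.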

B

That could work. I mean, there's the manual step, but

B

I can kind of envision it. And we're not gonna be able to do anything like service-level-agreement type of stuff with it, because it doesn't actually page somebody's phone. It just does an at-mention.

A

Yeah, the most you could do is any notification from...

A

Let's see, I don't know what our notification settings are, but I'm wondering if you could manipulate it, because it would be in this project and you would get mentioned on it. You could have separate notification settings for this project, but yeah, you're right, at most it would still send you an email.

B

I mean, the easiest way to integrate is to just have it do a webhook: when an alert gets triggered, just run a webhook, and we can put in the PagerDuty webhook for that. That would be the quickest way to get it into our workflow.

A

Yeah, what I'm trying to simulate is a customer who doesn't have PagerDuty today.

D

Or something similar.

A

It's hard because you guys, obviously the whole team, are integrated with PagerDuty. Because this is just one minor property, I was trying to think of a way that we could get by without doing it.

B

Well, if we're not concerned about actual emergency downtime incident kind of stuff, having it be able to file this issue in the infrastructure queue with an on-call label would work. The on-call person will see that, because we do monitor that queue with that label.
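
Filing the incident in a different project with an on-call label could be approximated today with the GitLab issues REST API (`POST /projects/:id/issues`). The sketch below only builds the request and does not send it. The project path `example-group/infrastructure`, the label name `on-call`, and the token value are placeholders, not values confirmed in this discussion.

```python
import json
import urllib.parse
import urllib.request


def build_issue_request(project_path, title, description, labels, token,
                        base_url="https://gitlab.com/api/v4"):
    """Build (but do not send) a request that opens an issue in `project_path`."""
    # The API accepts a URL-encoded "group/project" path in place of a numeric ID.
    project_id = urllib.parse.quote(project_path, safe="")
    body = json.dumps({
        "title": title,
        "description": description,
        "labels": ",".join(labels),   # labels are sent as a comma-separated string
    }).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/projects/{project_id}/issues",
        data=body,
        headers={"Content-Type": "application/json", "PRIVATE-TOKEN": token},
        method="POST",
    )


issue_request = build_issue_request(
    project_path="example-group/infrastructure",   # placeholder queue project
    title="Alert: high 5xx rate on design (production)",
    description="Opened automatically from a firing alert.",
    labels=["on-call"],                            # placeholder label name
    token="<your-access-token>",
)
# Sending it would be: urllib.request.urlopen(issue_request, timeout=5)
```

This is a workaround sketch rather than the product feature being requested, which is for the alert integration itself to open the issue in the other project.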

A

So that would be prioritizing having it send to a separate project. Is that within the org or group?

B

No, I think it's gitlab-com/gl-infra/infrastructure.

A

Okay, so it would need to be an instance-wide incident redirect.

B

Yeah, because that would put it right in our workflow, with nobody needing to adjust or know anything.

A

We can prioritize that one.

B

Yeah, that seems like the easiest way.

A

Cool, we're almost at time. Amelia, Jose, Seth, any other questions for Devin? You just said you've got to drop off. Thank you. Anybody else?

B

I just wanted to listen in to see where this process was at, because I'm looking forward to getting feedback, so it was a great introduction to what's going on. So thanks. Yeah, I'm glad we can start integrating this stuff, because a couple of weeks ago we wouldn't have been able to, and I hadn't really thought about monitoring, other than I had a conversation with my manager, saying, you know, are we gonna set up monitoring for this? He was sharing my opinion that we should not.

B

We should use what's in the product, even though it's insufficient at the moment, because that'll help us push, you know, getting this stuff prioritized.

A

Yeah, and I should have started with this, but I consider this a high priority. You know, the Monitor tools are great, but we're gonna be able to move so much faster if we have some internal team dogfooding and pointing us to it. I would appreciate it, Devin, if you made a little bookmark in your toolbar that said "create a new issue", and just anytime you found anything, you just open an issue and ping me on it. That is gonna be the best way, and I'm very responsive to those.

B

Yeah, okay. And can you just send me a quick note with all the things that you want in that, like the tag, who you want me to ping (if it's just you or more people), what label to use, and where exactly to open it, if there's a specific template you want me to use?

A

Yeah, I will do that. I have a little quick link in my menu bar, and I'll tell you exactly the labels to throw on, and to ping myself and Sara, who is the other product manager.

B

Yeah, I've been doing the same thing with the Configure team, and they're not consistent about what labels they tell me to use.

A

Yeah, okay, and I'll just give you one, and if you ping us, we can add the rest of the labels.

B

Yeah, I like how this is gonna shape up, and this is a good time to do it, because I haven't even started thinking about it yet. So knowing what your vision is, when I do start doing it, it'll be easier. So as far as the integration, what I did on the design project: is that sufficient? Do I just, you know, add the Prometheus pods to the integration and it should be good, or is there more stuff that I need to do to configure it?

A

No, I mean, by installing Prometheus on the cluster, it should auto-populate those dashboards. It doesn't set up any alerts, but if you did set one up, it would also create an incident in that project, and so that all happens by default. But the next step would be adding alerts and creating the incident template you would want to use.

B

Okay, so once you guys finish with the instance-wide issue opening, then I would just go in, add the Prometheus pod, and go in and set up whatever alerts to send to our infrastructure queue, and that should be it.

A

Yeah, I mean, you already have the Prometheus pod, at least in...

B

Yeah, in that one, right.

A

But the other ones, yeah, that's what you'd do.

B

Yeah, it was being a little finicky in one of the other ones that I was working on, but it seems like hitting the upgrade button fixed it.

A

Okay, yeah, and not just features, also bugs. That's another big point of this, is that

B

Yeah, probably.

A

you'll encounter some bugs. Cool, okay, thanks, Devin.

B

It's great to chat. Thanks for the meeting, and I'm glad we're on the same page now.

A

Awesome, thanks, everybody. All right, thank you.