From YouTube: Infrastructure Group Conversation (Public Livestream)
A
Okay, hello everyone, welcome. My name is Gerry Lopez and I'm the director of infrastructure. We are doing a live group conversation for infrastructure, and we already have... we had an offline question about continuous delivery, but I'm going to ask Marianne to take it on, since he's the one that provided the detailed answer.
C
CD, CI... if I had missed him in the list of participants, I would have been able to skip this. Okay, the question: are there plans to increase deployments to production from weekly to daily, or even to switch to continuous deployment for GitLab.com? And what are the blockers right now to implementing this change?
A
Good, thanks.
C
So yeah, I wrote a couple of things in there that are interesting. Basically, I would say we can tweak the frequency of deployments right now if we want to; the problem is how this will affect the platform, and how it will affect all the teams involved. I wrote up some examples there and gave you some links.
C
Basically, we need to depend more on automated systems and metrics than on our gut feeling, which is partially what we are doing right now when we are promoting to production. What release managers do when promoting to production is talk with SREs, see what the state of the platform is, get the approval, and then click the button for the deployment to go through. We are working to make sure that we remove as much human interaction as possible, and for that we need multiple stakeholders to contribute: from quality,
C
The
rest
of
the
infrastructure
or
end
development
is
always
so.
There
is
a
lot
of
responsibility
in
development
to
know
exactly
what
they're
merging
and
ensuring
that
when
a
change
is
deployed
is
where
the
work
actually
is
completed
rather
than
when
you
click
that
merge
button.
So
I'll
ask
you
to
ask
any
additional
questions
from
the
comments.
I
wrote
up,
so
I
don't
go
over
them
again.
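To make that concrete, here is a minimal sketch, not GitLab's actual tooling, of the kind of automated gate the answer describes: a pipeline step that consults a platform metric instead of a human before letting a production deployment proceed. The Prometheus URL, query, and threshold are invented for illustration.

```python
# Illustrative sketch only: an automated promotion gate that checks a
# platform health metric in Prometheus before allowing a deployment,
# replacing the "ask an SRE and click the button" step.
# The server URL, query, and threshold are placeholders, not real values.
import sys
import requests

PROM = "http://prometheus.example.com"   # hypothetical metrics server
ERROR_RATE_QUERY = 'sum(rate(http_requests_total{code=~"5.."}[5m]))'
MAX_ERROR_RATE = 0.5                     # requests/sec, made-up threshold

def current_error_rate() -> float:
    # Query the Prometheus HTTP API for the instantaneous error rate.
    r = requests.get(f"{PROM}/api/v1/query", params={"query": ERROR_RATE_QUERY})
    r.raise_for_status()
    result = r.json()["data"]["result"]
    return float(result[0]["value"][1]) if result else 0.0

if __name__ == "__main__":
    rate = current_error_rate()
    if rate > MAX_ERROR_RATE:
        print(f"error rate {rate:.2f}/s above threshold; blocking promotion")
        sys.exit(1)   # non-zero exit fails the pipeline job
    print(f"error rate {rate:.2f}/s OK; promotion may proceed")
```

Run as a pipeline job, a non-zero exit would block the promotion automatically instead of relying on gut feeling.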
D
Yeah, we want to kind of dogfood metrics in GitLab. GitLab has the ability to display Prometheus metrics, and we want to use that for GitLab.com. How are we thinking of getting there? I see a few roadblocks: we use multiple Prometheus servers, we use Thanos for the long-term retention of metrics, and we have some extra infrastructure, Alertmanager and Grafana, that we might have alerting set up through. So we would then have to have some alerting set up in GitLab. What is a simple way to start?
A
So I specifically don't have the details; if others from the Monitor team are on the call, they can talk about some of these things. But one simple start that we decided to begin with is essentially displaying our uptime on GitLab.com itself, as part of the application. I think the application should have an understanding of whether it's functioning or not, and then be able to display that data.
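As a rough sketch of that idea, assuming nothing about GitLab's actual implementation: an application that tracks whether it is functioning and exposes that as a Prometheus metric it can then display. The metric name, port, and health check below are placeholders.

```python
# Minimal sketch (not GitLab's implementation): an application that
# tracks and exposes its own availability as a Prometheus gauge.
import time
from prometheus_client import Gauge, start_http_server

# 1.0 while the app considers itself healthy, 0.0 otherwise.
UP = Gauge("app_self_reported_up", "Self-reported application health")

def health_check() -> bool:
    # Placeholder: a real check would probe the database, queues, etc.
    return True

if __name__ == "__main__":
    start_http_server(9100)          # metrics served on :9100/metrics
    while True:
        UP.set(1.0 if health_check() else 0.0)
        time.sleep(15)               # refresh on a typical scrape cadence
```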
D
The uptime, that seems super useful, and that's a thing we still struggle to have a single source of truth for. So that seems super useful. But my idea with dogfooding: obviously it's great to use GitLab for that, but it's a whole new project, as we've been trying to get a better understanding of our uptime for over a year now. So it's probably a complex new thing. Is there also some existing metric that we can move, something we already know, so that it's not figuring out the metric that's hard?
F
And I'll add one of the considerations here when we're talking about dogfooding Monitor with respect to metrics and graphs. Our workflow isn't that SREs are sitting with dashboards open, or have, you know, Chrome extensions that rotate from one to the next so they can visually detect an anomaly. But the fact that a lot of our reactive response to alerts has SREs going to dashboards has led us to say, well, we need to keep these in an external, non-GitLab.com location, because GitLab.com the product monitoring GitLab.com is a cyclical, chicken-and-egg dependency. So in the conversations with Monitor, I do think we have consensus generally (I know we're not consensus-driven) that the ops server might be the right place to do this, because a lot of the metrics tend to be a byproduct of the incident management workflow: we have an incident, we have something that we're trying to discern about the status of a system that's not working properly.
D
I think we should be very clear: the goal is not to have some nice metrics on .com that nobody uses in incidents. The goal is to have our core workflow live within GitLab, which is something we're expecting our customers to do with their applications. So having a metric on .com, I don't think, should be in scope for this at all.
D
I
think
it
should
be
very
much
focused
on
the
observer
and
it
should
be
a
replacement
for
the
graph
on
a
metric,
because
if
you
do
things
on
comm-
and
you
say-
hey
okay,
now
we're
going
to
deprecated
this
graph
on
our
dashboard.
It's
like
well,
if
comm
is
down,
where
do
I
look
so
I
think
we
should
be
much
more
crisp
about
this.
F
Okay, one additional consideration that we have is to continue to embody our value of transparency. A lot of our incident communications that are public-facing generally depend on updates and comments to what we call our production issue, or an incident issue. If that workflow moves to ops, then we would have to rely on a different medium to communicate the status of the ongoing incident out to the public.
F
Because the metrics would find their way into the issue: in addition to the static endpoints, there are a lot of integrations where the graphs can be embedded into the issue, and while you're working on the incident it's, here's the snapshot from Prometheus from this time, and then you can pop out and go look at a different dashboard elsewhere. But that would all be done with the issue as the primary record of truth and live state.
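A hedged illustration of that pattern, with an invented Prometheus server, GitLab instance, token, project ID, and issue IID: capture a metric value at incident time and attach the snapshot as a comment on the incident issue.

```python
# Sketch of the embedding workflow described above: pull a Prometheus
# snapshot at incident time and record it as a comment on the incident
# issue. Hostnames, token, IDs, and the query are all placeholders.
import requests

PROM = "http://prometheus.example.com"          # hypothetical server
GITLAB = "https://gitlab.example.com/api/v4"    # hypothetical instance
TOKEN = "glpat-..."                             # placeholder API token

def prometheus_snapshot(query: str, at: str) -> str:
    # Instant query against the Prometheus HTTP API at a fixed time.
    r = requests.get(f"{PROM}/api/v1/query", params={"query": query, "time": at})
    r.raise_for_status()
    results = r.json()["data"]["result"]
    return "\n".join(f"{m['metric']} = {m['value'][1]}" for m in results)

def comment_on_issue(project_id: int, issue_iid: int, body: str) -> None:
    # Post the snapshot as a note on the incident issue.
    r = requests.post(
        f"{GITLAB}/projects/{project_id}/issues/{issue_iid}/notes",
        headers={"PRIVATE-TOKEN": TOKEN},
        data={"body": body},
    )
    r.raise_for_status()

snapshot = prometheus_snapshot('sum(rate(http_requests_total{code="500"}[5m]))',
                               "2020-01-01T00:00:00Z")
comment_on_issue(42, 7, f"Prometheus snapshot at incident time:\n{snapshot}")
```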
D
And that's an interesting question: I don't think it gets any worse. Currently we have private dashboards, because we're using Grafana, and our dashboards on gitlab.net get used a lot more than the dashboards on GitLab.com, so this is not a change. I do think for incidents we have a communications officer, or a person who's designated to do the communication. Is that correct?
A
That's correct! Thank you.
So, okay, since we're having this conversation: we've waded into a number of different aspects about new metrics. One comment that I understood you to be making was that if GitLab.com goes down, then having this uptime on GitLab.com makes no sense. And I think Anthony's point is that we do use production issues as one of the ways in which we communicate. So if the logic for the first one applies, then the same concern is raised there as well.
D
I think it doesn't make sense: if there's a serious incident, and that's one that people are going to hold us accountable for, you can't be using an issue on .com. So it makes sense to move them to the ops server, and instead of linking to the production issue via Twitter, we should link to a live stream of the communication person.
G
That would create extra overhead in the process to communicate publicly. It seems a little bit inefficient: if we already have everything going into issues, we're communicating there, and that's actually the primary place, then having the additional responsibility of somebody reading that out and showing it seems inefficient to me.
C
It's not an extra overhead, because the manager on call responsible for that communication already has to do something like that. They need to ensure that the discussion in the incident is flowing while the engineers are working; sometimes you go down different alleys instead of keeping focused on the challenge at hand. So they already have to do those things: they need to communicate over Twitter, they need to update the public on what's happening. So that overhead should be minimal; it's mostly just that you need to start up a YouTube livestream for the incident.
H
Communication is the most important thing in an incident; I don't know how to stress it enough. Fixing the issue is super important, and figuring out a workaround or how to quickly address the issue is also important, but actually communicating to customers what's going on, and that you're aware of it, is in many instances more important than actually resolving the issue, assuming that you're going to resolve it quickly thereafter.
H
I understand what you're saying, but we're splitting hairs. Generally, what happens in those situations is that the person who's trying to solve the problem spends more time trying to solve the problem than actually communicating, and then people are like, okay, are they really working on it? Or is this not a concern?
F
Sure, and that's why we have defined the roles of the CMOC and the IMOC. For those that aren't aware: we have an incident manager on call, which is the IMOC acronym, and then there's a communications manager on call, which is a specific role that's less involved with actually discovering the underlying cause of, or the solution to, the incident, and only concerned with the communication to internal and external stakeholders, in this case the business and our customers.
F
So I think it's worth splitting hairs in saying that for that second role, the communications manager on call, what they're communicating is simply current status and expectations. Provided those two needs are met for the stakeholders, I think the way that it's communicated out is something that we can iterate on, because, frankly, that person needs an inlet of communication as well, and whether they can feasibly screen-share while also being in another Zoom where they're consuming communication may not be realistic. But I think there's definitely an opportunity to explore there.
D
Let's change that. So, first of all, very importantly: the manager of the incident is not the communication person. This is full-time, as in completely focused on communicating. And it shouldn't be a separate Zoom call; there should not be an incident Zoom and a communication call, there's one Zoom call.
A
So let me try and restate that a little bit. That means having people join, you know, live-streaming the incident call, which we cannot do, because sometimes the information that flows across people's screens is, you know, sensitive. That makes sense? So we stopped doing that a long time ago, and I am NOT going back there. Second, sort of parroting an incident over a call makes no sense; incidents tend to be very chaotic events by their own nature.
A
So
you
would
have
someone
who's
trying
to
read
an
issue
much
like
Dylan
point
it
out
and
then
there's
gonna
be
times
where
we
were
going
down
a
path.
Then
we
realize
that
some
of
the
paths,
then
we
go
back
that
gets
communicated
on
the
incident
call.
So
this
individual
we're
actually
have
to
be
on
two
calls
at
the
same
time,
trying
to
make
sense
of
what's
happening
in
one
filter
in
real-time
passing
things
to
the
other
I
mean
it
would
be
messy,
I
think
having
an
issue
that
people
can
trail
they
control.
the comments, is the right way. Everybody does it that way, and we've been doing it that way for a long time. We tried the whole live Google Docs thing; that crashed in a fantastic way, because lots and lots of people would get onto that document. So I think the issue, plus Twitter, has worked well to keep end users well informed. We fail sometimes at keeping the clock and saying, okay, every 15 minutes we're going to give an update.
D
The concern is that we're going to reveal confidential information. Okay, so is there another way? I'm going to ask an obvious question that I think I know the answer to, but is there a way to open up ops.gitlab.net in such a way that we make a public project for the incidents, where we open up the issue tracker only to the public? Maybe not, maybe that whole server is cordoned off, also to prevent DDoS or something like that. But is there a way to have public issues on that server?
A
Let me ask one quick question before that. If the objective is to have something that is out of band from GitLab.com to provide updates, so people can see the issue and not necessarily participate: what about building something that builds sort of static versions of that issue on a regular basis, and simply, you know, sits in front of it? We could have a static page that points to the right issue at the root, and every minute it just scrapes whatever comments there are and posts them.
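A minimal sketch of that static, out-of-band mirror, with an invented instance URL, project ID, and issue IID: poll the issue's comments through the GitLab notes API once a minute and regenerate a static page that can be served from anywhere.

```python
# Rough sketch of the static mirror floated above: every minute, scrape
# the incident issue's comments from the GitLab API and render them as
# a static page. Instance URL, project ID, and issue IID are placeholders.
import time
import requests

GITLAB = "https://ops.example.com/api/v4"   # hypothetical ops instance
PROJECT_ID = 42                             # placeholder incident project
ISSUE_IID = 7                               # placeholder incident issue

def fetch_comments() -> list:
    # List the issue's notes, oldest first.
    url = f"{GITLAB}/projects/{PROJECT_ID}/issues/{ISSUE_IID}/notes"
    r = requests.get(url, params={"sort": "asc"})
    r.raise_for_status()
    return r.json()

def render(notes: list) -> str:
    # Flatten the comments into a single static HTML page.
    rows = "\n".join(
        f"<p><b>{n['author']['username']}</b> ({n['created_at']}): {n['body']}</p>"
        for n in notes
    )
    return f"<html><body><h1>Incident updates</h1>{rows}</body></html>"

while True:
    with open("index.html", "w") as f:
        f.write(render(fetch_comments()))
    time.sleep(60)   # refresh once a minute
```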