GitLab Group Conversation, 17 Jul 2019

From YouTube: Infrastructure Group Conversation

A

Okay, so welcome to the Infrastructure group conversation. It's a short one, because the topic is actually significant and there isn't really a whole lot else to talk about other than availability in the past few weeks. I'm keeping an eye on the document as people post questions or comments, and if there are none I'll dive a little bit into the slides.

A

So I'll go ahead and jump into the slides real quick to get the conversation started. Obviously the biggest event for us over the last few weeks has been the instability on GitLab.com, which has had very significant ramifications.

A

We've done a fair amount of analysis on why we are where we are. Essentially we hit a ceiling on Redis, and there was a ton of work that happened in quick fashion to address that. We also had some blind spots on observability, and there's work going on to address that as well. One of the things that we noticed is that there was a cadence mismatch between development and infrastructure.

A

Essentially, we introduced continuous delivery, and that changed the cadence to be more of a daily or weekly thing versus the traditional monthly cadence, where development was in feature development for part of the month and then very focused on fixes. With the introduction of continuous delivery that cadence had to change, and we were a little late to align with development to make that change. So that's been addressed, and we in infrastructure need to go back to focusing relentlessly on observable availability.

A

To do that, we've kicked off an initiative to maximize the observability of the infrastructure, and we've built some channels with development to be able to expose issues to the development organization as soon as possible, so they have time to maneuver and address them. I put a little bit of detail on slide number three about the dashboards that we've created and the boards that we're using to manage these two initiatives.

A

Let me go back to the doc.

B

Real quick, I've got the first one, Gerir. It might be a very quiet one. So first thing, kudos to the infrastructure team for all the hard work in the last month, with all the outages and challenges. I know it's been a tough month, and I really appreciate the hard work of everybody on that team.

B

It wasn't obvious to me, and I think I've seen the list somewhere, but could you articulate the areas where we were short from an observability perspective?

A

So I'll speak real quick, and then I can let Andrew speak in a more authoritative way about this. Essentially, we've known for a while that Redis was a vertically scaled component that was very close to its capacity. We've seen it sort of hit that ceiling.

A

But we didn't react to it, and it was hard to pinpoint when this was going to become an issue. Andrew has done some work on creating two new metrics, a saturation metric and a saturation Apdex, which give us better visibility into where those points are, and there are links to those dashboards. We're also shifting a little bit more into capacity planning.

A

So it's not just looking at the metrics and seeing how things are going, but trying to be more proactive about where we're going to hit a wall, and making sure that we have awareness of the walls, not just in an instinctive way or by looking at very specific graphs, but in a more, and I'll use a buzzword here, holistic way. So one of the dashboards that Andrew created is really for capacity planning. It looks at all the services and tries to evaluate their capacity and their saturation points.

C

Yeah, if I can just go into a little bit more detail on that. I think that several of the problems that we saw over the last month were totally different problems in different parts of the code base, or different parts of the system at least, but they all shared a similar problem.

C

We had been really close to capacity in different parts of the system, and everything was working just fine because we were under capacity. Then we deployed a new piece of code into production, and we went from being at, like, 98% of capacity to over that capacity. Instead of things degrading gracefully, things started falling over in quite a bad way, and we were caught by surprise because, you know, the day before everything had been perfect. So in the beginning we were like, is this a single piece of code that's changed?

C

And then we realized, multiple times, that we were just really, really at capacity. In some places, like with Redis, the immediate instinct was to just throw more CPU at the problem. But then we realized that we are pretty much on the biggest machine that you can get in GCP, and because Redis is single-threaded, we pretty much can't scale up any more vertically. So what Gerir was talking about with saturation is something we do on lots of different metrics.

C

We try to measure how close to the top of that metric we are and how quickly we're growing, and with that we can predict what happens over the medium term. So we can say, well, the number of Redis clients we have is at 50%, but at peak every day we get to 90%, and we're not too comfortable with that.

C

And then with the saturation Apdex, we basically say how much time we're spending each day in that uncomfortable zone, and is it growing, and if so, how much longer do we have? That's where the capacity planning comes from. We're doing this on lots of different metrics: on memory, on CPU, on disk utilization, on disk I/O.

C

So hopefully, next time we can feed back to the development teams in advance, and early enough, that, hey, there's this code that's not very efficient. Can you optimize it before we reach that peak?
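
The ideas described above (how close a metric is to its ceiling, how much of the day is spent in the "uncomfortable zone", and how long until the ceiling is reached) can be illustrated with a minimal Python sketch. This is not GitLab's actual implementation: the function names, the 0.9 threshold, and the use of a simple least-squares linear trend are all assumptions made for illustration.

```python
from typing import Optional, Sequence


def saturation(used: float, limit: float) -> float:
    """Fraction of a resource's hard limit currently in use (hypothetical helper)."""
    if limit <= 0:
        raise ValueError("limit must be positive")
    return used / limit


def time_in_uncomfortable_zone(samples: Sequence[float],
                               threshold: float = 0.9) -> float:
    """Fraction of equally spaced saturation samples at or above the threshold.

    The 0.9 default is an assumed "uncomfortable" level, not a documented value.
    """
    if not samples:
        raise ValueError("need at least one sample")
    return sum(1 for s in samples if s >= threshold) / len(samples)


def days_until_limit(daily_peaks: Sequence[float],
                     limit: float = 1.0) -> Optional[float]:
    """Days after the last sample until a linear trend through the daily
    peak saturations reaches the limit, or None if the peaks are not growing.

    A straight-line fit is an assumption; real growth may not be linear.
    """
    n = len(daily_peaks)
    if n < 2:
        raise ValueError("need at least two daily peaks")
    mean_x = (n - 1) / 2
    mean_y = sum(daily_peaks) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(daily_peaks))
    var = sum((x - mean_x) ** 2 for x in range(n))
    slope = cov / var
    if slope <= 0:
        return None
    intercept = mean_y - slope * mean_x
    day_at_limit = (limit - intercept) / slope
    return max(0.0, day_at_limit - (n - 1))


if __name__ == "__main__":
    # Made-up numbers echoing the Redis client example: 50% typical, 90% at peak.
    day = [saturation(c, 10_000) for c in (5_000, 5_200, 9_000, 9_100, 5_100)]
    print("share of day in uncomfortable zone:", time_in_uncomfortable_zone(day))
    print("days until limit:", days_until_limit([0.80, 0.82, 0.84, 0.86]))
```

Tracking the share of time above the threshold, rather than only the current value, is what lets a slowly worsening daily peak show up before anything actually falls over.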

A

Thank you, Andrew.

A

Seeing there are no more questions or comments, I can actually just share my screen and show the board we use with development and discuss it a little bit. Let me open it real quick.

A

So this is a dashboard that we use with development to drive issues that we see in GitLab.com that are likely to become problems. We try to keep this dashboard really clean and very sparse, so we only highlight very important things that we know will eventually affect GitLab.com.

A

Obviously, right now it looks kind of busy, because we just started using this last week. It was in response to the incidents over the past couple of weeks. We have a weekly meeting where we discuss these items, and we make sure that they get attention and that they're prioritized based on severity as well, so that we can make sure that these things get addressed.

A

So, as you were saying, Christopher, I'm super thankful for the work that development has been doing in helping us deal with these issues, and I think this board is probably one of the most important things we're gonna do this year to make sure that GitLab.com stays safe and running.

A

Chattering, no, no.

A

Any more questions?

A

If there are no questions or comments, are we okay to finish this early?

C

Absolutely. We'll see everyone at the company call.

A

All right. Thank you so much.