From YouTube: AMA with SREs (Public Livestream)
A
The first question is a carry-over from the week before, or really six weeks before, from Mek, who I do not see on the call, so I will verbalize it: we're getting more questions from customers and prospects about using Kubernetes to deploy GitLab. For context, these are questions from a prospect in this Google Doc, plus follow-up questions: how far are we from testing Helm chart deployment at scale on GitLab.com? I see some SREs from the Delivery team here, so I will hand that over to, maybe, Jarek, if you're here.
B
Oh sure. So we are using the GitLab Helm chart for GitLab.com, but we're only using it for a couple of services right now. We started using it for the registry service, and that worked out pretty well, because registry is a standalone, stateless service; it didn't have any other dependencies. The next service we used it for was Mailroom, which is also a pretty standalone service, though it does connect to Redis. But currently we have a single production Kubernetes cluster.

B
We also have a staging cluster, and we have those two services running in both. We are planning to move some of the Sidekiq queues next; we're already in the planning phase of this, and that's probably going to happen early next year. Were there any other details you'd like to know about the Kubernetes migration?
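(For illustration, a minimal sketch of the per-service rollout pattern described above: deploy one standalone service at a time via Helm, staging first, then the single production cluster. The chart paths, release names, and values-file layout are assumptions for the example, not GitLab's actual configuration.)

    # Sketch of a piecemeal, per-service Helm rollout (paths and names are
    # illustrative assumptions, not GitLab's actual chart layout).
    import subprocess

    SERVICES = ["registry", "mailroom"]  # standalone, mostly stateless services first

    def deploy(service: str, env: str) -> None:
        """Install or upgrade one service's Helm release in the given environment."""
        subprocess.run(
            [
                "helm", "upgrade", "--install",
                f"{service}-{env}",                    # release name
                f"charts/{service}",                   # per-service chart path (assumed)
                "--namespace", env,
                "--values", f"values/{env}/{service}.yaml",
                "--atomic",                            # roll back automatically on failure
            ],
            check=True,
        )

    for svc in SERVICES:
        deploy(svc, "staging")     # validate in the staging cluster first
        deploy(svc, "production")  # then the single production cluster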
A
To follow on: we do have a slate of services, or components of our stack, that we are considering. So it's going to be piecemeal; we'll keep moving things over, and eventually we'll have a considerable portion of the application infrastructure on Kubernetes. I will never say never, but it's likely to never be everything, especially taking into account some of our workloads, like the Git file nodes.
A
Perhaps not the same Michael; I see a couple of Michaels, so I don't know who this was. All right: "A new support agent started Monday", well, six weeks ago now, welcome! "The first six weeks went well, but I have yet to explore much of the currently developed Kubernetes features. Has container-attached storage been explored, or is it being used currently?" I don't know the answer to that question.
D
I was recently at a conference and got to talk to a lot of our customers in the field. A lot of them were very, very happy. Some of them were not as happy about a recent outage, and I just wanted to kind of track down the issue, and here's some of the state there. It looks like Dave posted the incident issue, which is 8528 in the infrastructure tracker, but I was wondering: could you, just on this call, give a little bit of a summary of that retro?
A
Sure. We had a synchronous retro; I believe the meeting was recorded, and I can try to dig up the link to that afterwards as well. Without summarizing the incident itself (I think that's done quite well in the issue, so I'd encourage folks to go there and take a look), the major takeaway from this was a slate of 13 corrective actions, a number of which are currently in progress and three of which have been closed.
D
I can share. I had a lengthy conversation; this was an individual developer whose own projects were on GitLab.com, and they basically came by the booth to ask for advice on what they could do: "Is there something I can do to avoid this downtime? When the service is down and I can't access it, I can't get work done. Basically, what can I do about that?" They were asking about potentially self-hosting, so my advice was that I highly recommend against self-hosting.

D
I basically said: I'm sorry. You know, I didn't know all the details, but I know we have been staffing up greatly, and I know our team, all of you, have been tremendously improving things incrementally as we go. So my basic advice was: I'm sorry for this outage; things are getting better, and I have personally seen that over time, so my personal trust is in GitLab.com, and I'd highly recommend not self-hosting. But then that was kind of the ask on their part, and honestly I'm not sure.

D
If there's anything, you know, maybe not storing large files, or some other nuanced thing that relates to some types of outages; is there some action a user could perform to be more resilient against outages? I don't know what that would be, but maybe that's a fair question for the team, I think.
A
This issue wasn't caused by any customer behavior. This was an error that we imposed on ourselves: the unintended consequences of a configuration change. As for users themselves: we're looking to protect users from inadvertently impacting themselves, and every other user on .com, because we don't have a lot of isolation in place. So we're continuing to work with development to make certain that application limits are treated as a priority. Oftentimes it's API rate limiting, or parallelism in builds that runs off; a combination of not having one and having too much of the other wreaks havoc.

A
You know: fifty thousand concurrent CI jobs all running, all trying to do exports of project repositories, things like that. So I suppose, to answer that question directly, we'd say: be hyper-vigilant about what actions you're taking against the API. We currently give users a lot of power there without a lot of, we'll say... well, we give them unchecked power, perhaps in ways that we shouldn't. So that's one area. Do any other SREs have ideas? Lots of things, I'm sure, we've talked about. Sure.
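(As a rough illustration of the application limits being discussed, here is a minimal token-bucket rate limiter; the bucket size and refill rate are made-up numbers, not GitLab's actual limits.)

    # Illustrative token-bucket rate limit of the kind described above; the
    # numbers are assumptions for the example, not GitLab's real limits.
    import time

    class TokenBucket:
        def __init__(self, rate_per_sec: float, burst: int):
            self.rate = rate_per_sec       # tokens added per second
            self.capacity = burst          # maximum burst size
            self.tokens = float(burst)
            self.last = time.monotonic()

        def allow(self) -> bool:
            """Return True if the request may proceed, False if it should be throttled."""
            now = time.monotonic()
            self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
            self.last = now
            if self.tokens >= 1:
                self.tokens -= 1
                return True
            return False

    # e.g. cap project-export requests at 10 per minute with a burst of 5:
    export_limit = TokenBucket(rate_per_sec=10 / 60, burst=5)
    if not export_limit.allow():
        print("429 Too Many Requests")  # reject early instead of wreaking havoc downstream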
E
We don't want to worry about it either, and that's why we had so many corrective actions come out of this particular event: so that we can all, you know, sleep easy knowing that we're not going to make that same mistake, that we're not going to, you know, have a similar outage with a similar profile. There will probably be others in the future, just due to the fact that GitLab.com is not a static piece of software.

E
But even when that experience is bad, getting that feedback from people who use GitLab.com, and knowing what was painful, even if it's just "this is slow" or "I'm not getting the response I feel I should", that's all good stuff, so that, you know, the product gets better, and that includes the product of the service.
F
Hi, this is Matt Smiley; I'm another one of the SREs. I would add to what Anthony and Cameron have just said that, as a developer, you can obviously (just to state the obvious) still work with your local clone of any Git repository. If the inability to get productive work done was based on lack of access to CI pipelines, for example, we don't really have a good answer to that one: when you can't access the service, you can't push your commits.
F
The .com service that we operate has a great deal of internal redundancy that is really cumbersome to operate in a self-hosted environment. There are absolutely reasons to run large GitLab installations in a self-hosted manner, but that's almost certainly not what this developer is talking about, and we can talk more about that offline if that's of any interest to you.

F
For example, we're actively working on understanding how this particular failure cascaded to a broader scope than we expected; that's one of the corrective actions. And in other aspects of resiliency, we're currently actively working on addressing several single points of failure in the GitLab architecture: things that are going to be feasible to address in a .com-scale deployment, but really aren't practical in self-managed.
D
Yeah, this is all tremendously helpful, and that's a great summary of my advice to this particular person, based on my conversations with our large customers that self-host. Essentially, yeah: this is a massive undertaking. You know, as a small individual developer, you really do not want to take on all of this effort and all of this risk; the value of having this team run it is great.

D
I think that everyone is always, you know, frustrated when they can't do something that they want to do, and I think that's a very reasonable response. So I was super happy because, you know, I said to them: well, we're very transparent; we post the retro online, and the issue will be online. And they said: oh yeah, I've already been in that, I've seen all of that, and that's a reason I love GitLab.

D
So that was great feedback, to the company as a whole and to this team in particular, that we make those issues public. It was nice for me, as a booth worker, to be able to just say: yes, that's there; and they already knew about it. But yeah, that was my advice as well.
F
I totally agree. And plus, as other folks have mentioned, GitLab is a really fast-moving target; staying on top of upgrades is, you know, a meaningful investment of time if you're doing self-managed. So yeah, if it's a single person or a small team, I mean, if they're interested in running it themselves, that's perfectly fine; there's nothing wrong with that. But you do get a lot of benefit from running on a managed service. Cool. Thank you for bringing this up.
G
Happy to. I was just wondering, sort of: what, if anything, can other departments do to make your jobs easier? And, if you wanted to interpret it in the negative kind of way: what frustrates you when you interact with other departments?
A
I will pass and let other people answer. I think I'm saying that because I think I have better points of communication to voice those frustrations, as well as to continue to improve on a lot of the communication channels that Dave and I have developed for the team. So I'd like to hear from other SREs in the group.
F
This is Matt again. I love it; people are super supportive, and it's awesome. Anytime I have questions, really my main problem is finding out where to ask that question to get, you know, in front of the right people. The longer I'm here, the easier it is to find those places. But yeah, people are super supportive and very helpful; I love it.
H
To clarify some of my statements: joining the Gitaly team would help us greatly, because right now one of our biggest operational pains is the fact that Gitaly is singly homed on 45 different servers, which means we don't have a single point of failure: we have 45 single points of failure. We need to get Gitaly HA built out as quickly as possible and continue to iterate on Gitaly availability, so that we can continue serving users and do maintenance at the same time.
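(A quick back-of-the-envelope sketch of why 45 singly homed shards hurt: if each storage server is independently up with probability p, the chance that every shard is available is p raised to the 45th power. The numbers below are assumptions for illustration.)

    # With N singly homed Gitaly shards, any one server going down takes out the
    # projects on that shard, so P(all shards up) decays as p**N.
    # (p and N below are illustrative assumptions.)
    p = 0.999   # assumed availability of one storage server ("three nines")
    N = 45      # number of singly homed shards

    print(f"P(all {N} shards up) = {p ** N:.4f}")  # ~0.956, noticeably worse than p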
H
Right now we have basically no way to do maintenance on the Git storage servers without taking down users for extended periods of time, and that's difficult to stomach as an SRE. As far as making Rails better: basically, anything we can do to improve the performance of the Rails stack helps, like caching improvements and memory-efficiency improvements, better squeezing performance out of the existing hardware we have and making it more efficient.
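(As a toy example of the caching improvements mentioned, memoizing a hot, expensive lookup keeps repeated requests off the slow path; the function below is hypothetical, not an actual GitLab code path.)

    # Hypothetical example of a caching win: memoize a hot, expensive lookup so
    # repeated requests stop paying the slow-path cost.
    from functools import lru_cache
    import time

    @lru_cache(maxsize=10_000)
    def project_visibility(project_id: int) -> str:
        """Stand-in for an expensive database or RPC call on a hot code path."""
        time.sleep(0.05)  # simulate the slow lookup
        return "public" if project_id % 2 == 0 else "private"

    start = time.perf_counter()
    for _ in range(100):
        project_visibility(42)          # only the first call pays the 50 ms cost
    print(f"100 lookups in {time.perf_counter() - start:.3f}s")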
B
I put in there rate limiting, and application limits in general. I know that there was a recent case where someone was spamming us with spam issues at a rate of, I think, 600 a minute or so, and, you know, these sorts of things can cause us a lot of pain. So I would say: just helping us to get, you know, application limits implemented, and also just having a little bit more awareness of those types of prominent problems, as well as the infradev issues.

B
I also was going to write this, and it's more for, you know, back-end development: just being aware of all of the different errors that we see on GitLab.com. Please let us know, in the production channel or in the SRE lounge. I also keep tabs on other channels as well, just to kind of monitor what's going on and what people are complaining about, if there are any complaints. But definitely don't feel shy; definitely let us know.
C
I want to add on to that: it's okay to tell us something that you think we might already know, because in the midst of an incident, more data points can really be very helpful. So, you know, like Jarek said, there'll be...
I
This is Dave; I'll actually speak up too. I think the other thing that would help us, to steer a little bit off of Lyle's comment, is this: just dropping something in Slack in the incident management channel is not going to start the production team, the infrastructure team, looking into an incident. We actually have ways to work with PagerDuty, and we're working on improving those too, but PagerDuty is our escalation path. Slack is our asynchronous communication channel.
I
Slack will not generate an immediate response, and using threads in Slack is important, particularly in incident management. So, a little bit of Slack etiquette: it actually does help us if there is a thread going on, because things scroll by very quickly there. I don't think we're doing anything wrong there; I would just ask that we be mindful of how that communication happens; it's useful. And then, backing up a bit, I actually noted something.
I
We have a merge request in flight for the "it's slow" scenario. If it's slow and things are going on, we're looking at getting a little bit more content into that merge request of advice in the handbook: if you've got an "it's slow" report, coming either from you noticing something or from a customer, talk about using the performance bar and grabbing HAR files from the particular browser. Information from Chrome's developer tools will also help us in looking into performance problems.
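(Since a HAR file is just JSON, a quick pass to surface the slowest requests from a capture might look like the sketch below; the file name is a placeholder.)

    # Sketch: pull the slowest requests out of a HAR capture. "slow-page.har" is
    # a hypothetical file exported from the browser's developer tools.
    import json

    with open("slow-page.har") as f:
        har = json.load(f)

    entries = har["log"]["entries"]
    slowest = sorted(entries, key=lambda e: e["time"], reverse=True)[:10]
    for e in slowest:
        print(f'{e["time"]:8.0f} ms  {e["request"]["method"]}  {e["request"]["url"]}')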
A
Yes, more information is better, and we'll do a good job of making that frictionless for you. In Slack right now, I know there's been some confusion: do I page with the /pd command? Some of those commands are gone; PagerDuty upgraded their API to v2, and we tried to integrate some things, and it hasn't worked so well. There's a start-incident command that does work; it's in the handbook somewhere. It has some failures here and there, but it still opens issues.

A
An issue will actually get the attention of an SRE, that is, an issue with an incident label on it; and, as Dave said, the pager escalation is the right way to go. There's an epic I linked in the doc, epic 100, which aims to set up a set of integration points with all of our escalation tools (Slack, Zoom, PagerDuty, GitLab), so that we'll have a single point for kickoff. So take a look at epic 100.
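(A minimal sketch of the pattern being described, a command that opens a labeled incident issue through GitLab's standard create-issue endpoint; the instance URL, project ID, and token below are placeholders.)

    # Sketch of the "open an incident issue" step behind a chat command. The
    # instance URL, project ID, and token are placeholders; the API call is
    # GitLab's standard create-issue endpoint (POST /projects/:id/issues).
    import requests

    GITLAB_API = "https://gitlab.example.com/api/v4"   # placeholder instance URL
    PROJECT_ID = 1234                                   # placeholder tracker project
    TOKEN = "glpat-..."                                 # placeholder access token

    def open_incident(summary: str) -> str:
        resp = requests.post(
            f"{GITLAB_API}/projects/{PROJECT_ID}/issues",
            headers={"PRIVATE-TOKEN": TOKEN},
            data={"title": f"Incident: {summary}", "labels": "incident"},
        )
        resp.raise_for_status()
        return resp.json()["web_url"]   # link to share back in the Slack channel

    # e.g. open_incident("GitLab.com is slow for git pushes")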
A
Lately, Slack has something called Workflow Builder, which is just a JSON definition of a GUI, so you can type a slash command and a form pops up; we're all familiar with that from PTO Ninja and Google Calendar and things like that, where you can use buttons and dropdowns and menus and the like. That will hopefully be pre-populating the incident issues and hitting the correct escalation policies, etc. So we are working on that, and on fleshing out fields for more information for you to include in there.