From YouTube: 2023-06-05 Delivery team weekly EMEA/AMER
D
So welcome everybody. This is the 5th of June EMEA/AMER call. We have a small announcement like last time, just a reminder, since these are the last hours for the engagement survey. And we have a discussion topic that I didn't finish writing up, so maybe I'll verbalize it now and add my notes afterwards: during the 16.0 release we had a problem with release.gitlab.net and with the QA jobs. This is probably something you already know about, especially since it was a major release.

What do you think about that? And in particular, what do you think needs to be done so that we don't run into this problem in any release, not even a major one? Maybe we can define the main work items for that, and then start to prioritize some of this work and solve this problem once and for all.
E
So I would like to add something to this, because I've actually been working on this for the past few hours with SREs and QA. On this specific issue, it seems like something changed in 16.0 in how outgoing emails are processed when they fail. Three years ago we misconfigured release, so it never sent outgoing email because of a wrong port in the SMTP settings, and with 16.0 it started failing badly, because we get a lot of errors.

What we think is happening is that every QA job that generates an outgoing email is failing, and keeps failing, and is slowing down Sidekiq on that single node. So all the QA failures happened when things ran in parallel, because Sidekiq can't keep up with the amount of SMTP failures; we're talking about a 30-second timeout for each one. Very likely those jobs are sitting there for 30 seconds doing nothing and then retrying. So this is what is happening.
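As a concrete illustration of the failure mode described above, here is a minimal sketch (not the team's actual tooling) of how one could probe the SMTP endpoint the instance is configured to use and time the connection attempt. The host, port, and timeout below are placeholders; with a wrong port, the attempt typically burns the full timeout, which is what stalls the mail workers when many QA emails fail in parallel.

    import smtplib
    import socket
    import time

    SMTP_HOST = "smtp.example.com"   # placeholder, not the real relay
    SMTP_PORT = 2525                 # placeholder; a wrong port reproduces the hang
    TIMEOUT_S = 30                   # mirrors the ~30-second timeout mentioned above

    start = time.monotonic()
    try:
        with smtplib.SMTP(SMTP_HOST, SMTP_PORT, timeout=TIMEOUT_S) as smtp:
            code, banner = smtp.noop()   # cheap round trip to confirm the relay answers
            print(f"SMTP reachable in {time.monotonic() - start:.1f}s: {code} {banner!r}")
    except (socket.timeout, OSError) as exc:
        # With a wrong port, this branch is often reached only after the full
        # timeout has elapsed, so each failed email costs roughly TIMEOUT_S seconds.
        print(f"SMTP unreachable after {time.monotonic() - start:.1f}s: {exc}")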
E
What I realized today while doing this is that no one has a really good understanding of how release is configured or who owns it. We own it, but for questions like "we are slow, what slowed it down, where should we look?" we've been left asking the on-call engineer, the one on the Australian shift, to just SSH into the box and try some manual checks, because we had no dashboards, no anything.

Since this is such a vital component in the release cycle, we should probably improve the monitoring in general, and maybe also our level of troubleshooting ability as release managers, so that everyone on the team has some basic understanding of it. It's a single-instance machine, so it should be easier to debug than a larger system, but I had no idea what to do.
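As a minimal first step toward the monitoring mentioned here, and assuming the release instance exposes GitLab's standard health-check endpoints to the caller's IP (they are allowlist-restricted by default), one could start with an external poll like the sketch below; the hostname is a placeholder, and this is only a stopgap next to proper dashboards.

    import requests

    BASE_URL = "https://release.example.net"   # placeholder for the release instance

    # /-/liveness and /-/readiness are GitLab's built-in health-check endpoints;
    # ?all=1 asks readiness to report on each backing service individually.
    for path in ("/-/liveness", "/-/readiness?all=1"):
        try:
            resp = requests.get(BASE_URL + path, timeout=10)
            print(path, resp.status_code, resp.text[:200])
        except requests.RequestException as exc:
            print(path, "unreachable:", exc)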
D
Okay, so having better observability around this is probably the first requirement, something we need to improve, as far as I understood. And release is a single node, right? Even once we have more observability: does this have to stay a VM, or could it be on Kubernetes? I mean, the question is: do we want to keep testing Omnibus on this?
C
Yeah, that ties into the work Graham was doing with the release environments. Well, the work Graham was doing with the release environments before we pulled him off kind of has the assumption that we'll still have release at the end of that process to run the Omnibus tests. So yeah, this needs to stay a single-node VM. Sorry, something that runs Omnibus.
D
Would there be any need to have this not as a single node but as a multi-node setup, even if it stays on Omnibus?
E
But the downside of that is that it would take more time to deploy. And more in general, when we talked with the engineers from Quality, they basically told us it doesn't really count, since those instances tend to be downscaled compared to our reference architectures.
E
But the QA test is not that big of a load, right? The load that QA puts on the system is not even comparable to the smallest reference architecture, which...
E
Which is fine with us, because there are many shops out there, maybe three, four, five engineers at a software company, and they have GitLab running on a small machine, and it should work. So I mean, I think it is important that this fails; it is important that it failed. The thing that is not good is that when it fails, it's too late for us, right? It's on the critical path.
E
Say it fails. We had our release blocker issue, and Reuben asked: hasn't this already happened, so do we know what has happened? No, no, we know what is happening; it already happened. So we said, okay, we should bump the priority here, because we found this problem two or three weeks ago, and before this it wasn't on any board and no one knew about it, and that was why this should be a P1.
E
In the meantime, we were actually working on this. Initially it was framed as "we need to upgrade the instance" or I don't know what, but it looks like it's just a misconfiguration, plus a behavior change in the application with the major release.
E
So basically, what we're trying to do right now is to see whether this misconfiguration really was the cause of this failure, knowing that the thing has always been misconfigured.
A
That would be awesome: deploy to the release instance more often, so that we hit problems like this before they become critical.
E
Yeah, we also had a big moment of uncertainty when no one knew what to do. Should we raise an incident for release? Who owns release? Will the EOC be able to do anything if release isn't working? And I mean, we were already trying to get the release out, and we had other problems, and then we had this layer of complexity on top of not even knowing what to do. Quality had no idea; the engineer on call had no idea either.
E
I was like, let me create an issue so we can start working on that, but I had no idea either.
C
The point of an incident is that it's a way for us to escalate. So, you know, even if it's something that we fully own: if you need help and you need to pull people in, then raise an incident and we can coordinate there if we need to.
C
Would it be worth it? Is there someone who could actually take on the action of creating some issues for improving observability on release? I'm assuming those are relatively standard sorts of tasks; they don't particularly depend on the outcome of this release investigation, right? That's just environment observability.
D
Okay, thank you for this. Alessia, please keep me posted if you have any discoveries, so I can look at it immediately.
G
Let me just open things up. First of all, deployment frequency, and I think that's about it. Okay, so let's take a look. First of all, sharing my screen; I'm guessing everyone can see my Chrome tabs. Okay, so this was last week: we didn't have that many blockers compared to the previous weeks.
G
Let me know if I'm missing anything worth noting about this graph other than those. And deployment frequency over the last month: we peaked at five deploys on June 1st, that's Thursday. I'm not sure what happened in that dip on May 31st.
G
There seems to have been something on May 31st, but I can't really tell what that was right now. Does anyone know?
G
Yeah, we should definitely capture that. I totally forgot about that when I was filling out this graph, so I'll just redo it after this call. Cool, nice catch. Lead time last month: nothing too abnormal, what we'd expect given what we just saw. What else? We went over the deployment blockers, and yeah, that's about it.
E
I have a consideration, or maybe more of an ask, for those who have been on this shift recently. Today, Dev is taking five hours to process each tag request, that is, to build the packages.
E
I do remember that in old issues we had something like "now you can take some time off, because it takes 80 minutes to build the package." 80 minutes is not five hours. And I remember I was complaining about this probably last week, because we did a lot of releases last week as well, and I think Steve told me that it now usually takes around three hours. Again, three hours is not five hours. So does anyone have any idea why tagging a security release is taking so long for each tag?
F
I noticed that it takes a while for jobs to be picked up by a runner; they sit in these pending states. When I was watching the security release, or a past release, the other day, I noticed that it took at least 20 minutes to pick up a job. So if you add that up across the security packages, that time could pile up and could explain why it's taking longer.
E
That's the runner, so the job stays in a pending state until a valid runner designated for that pipeline can actually pick it up. So this very likely means that we are at capacity in terms of the ability to build packages, and this reflects on auto-deploy as well, right? Because I noticed this again today: I had three packages for the security release being built, and I tried to stagger them a bit.
E
It gave me that error about the previous one still running, and when I checked the status, the generation of the build was still pending. Usually this is because the runner doesn't have capacity to pick up another job. So one option is to split those runners, so that we have a dedicated runner that picks up all the auto-deploy package generation, and then another one for the releases.
E
Yeah, that's the thing, right? So one question: if it's taking five hours because we're generally spending an hour just to get a job picked up, then yeah, maybe just splitting them is enough, because the amount of packages we build for a security release is huge; there are really a lot of packages there. So maybe that's enough, but I don't know, it really depends on how many jobs those runners are running. Maybe they just run every type of build that happens on dev, and it's the same set of runners.
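To put numbers on the pickup delay discussed here, one could compare each job's created_at and started_at timestamps via the GitLab Jobs API, as in the sketch below. The host, token, and project ID are placeholders, not the real values.

    from datetime import datetime

    import requests

    GITLAB_URL = "https://dev.example.com"   # placeholder for the dev instance
    PROJECT_ID = 123                         # placeholder package-building project ID
    TOKEN = "glpat-..."                      # placeholder access token

    resp = requests.get(
        f"{GITLAB_URL}/api/v4/projects/{PROJECT_ID}/jobs",
        headers={"PRIVATE-TOKEN": TOKEN},
        params={"per_page": 50},
        timeout=30,
    )
    resp.raise_for_status()

    def parse(ts):
        # GitLab timestamps look like 2023-06-05T12:34:56.789Z
        return datetime.fromisoformat(ts.replace("Z", "+00:00"))

    # How long did each recent job sit in pending before a runner picked it up?
    for job in resp.json():
        if job.get("created_at") and job.get("started_at"):
            waited = (parse(job["started_at"]) - parse(job["created_at"])).total_seconds()
            print(f"{job['id']} {job['name']}: waited {waited / 60:.1f} min before starting")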
C
Yeah, when you're through this release, Alessia, would you mind just sort of seeing if you can break that down a little bit into an issue, even if it's just an investigation issue, and then we can start trying to prioritize that? It's technically owned by Reliability. Technically, the runners are not... no, actually, maybe they are. Reliability have stable counterparts that...
D
Hey, anything else while we're on the recording?