From YouTube: Discuss HAProxy downtime requirement
Description
A short discussion about the alternatives we have to taking site downtime for the HAProxy upgrade.
Notes (internal): https://docs.google.com/document/d/1v6kheVYq1ap4Xq6Kmnj0uKRelGTq5s-mRHZHGos7KLQ/edit#heading=h.dhp4sslmd35r
A
I have the first item, which is just to try to get a sense of why we need a downtime event for the HAProxy upgrade. I read through a little bit of the background here, and it seems to me that we need to change the node pool for the new nodes, and doing that is a destructive event. Does that sound right? Anything I'm missing there?
B
Yeah. So in the epic update that you linked, the assumption was that we just needed to replace the node pools. However, in the MR that you linked, I came to the realization that that also forces the replacement of the internal LB, the general internal LB, which means we cannot guarantee that the IPs stay the same. So we're not talking about something that might be disrupted for a fraction of a second while the node pool is switched over, but something potentially longer-lived, with DNS caching playing into the whole situation, which is why it's probably a more delicate change than I first anticipated.
A
Why... I don't recall, but is it that we don't set a static IP for the internal LBs?
B
It would be possible to rework it so that you can specify an IP object. However, because we're not already using a reserved IP, we would need to do the IP switch regardless.
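(For reference: a minimal sketch of what specifying a reserved internal IP could look like in GCP Terraform, assuming the module fronts an internal forwarding rule. All resource names and the subnetwork/backend references are hypothetical, not the actual module.)

```
# Hypothetical sketch only: reserve a static internal address so the ILB's
# IP survives a replacement, then hand it to the forwarding rule.
resource "google_compute_address" "haproxy_ilb" {
  name         = "haproxy-internal-lb"               # placeholder name
  address_type = "INTERNAL"
  subnetwork   = google_compute_subnetwork.main.id   # placeholder reference
  region       = "us-east1"
}

resource "google_compute_forwarding_rule" "haproxy_ilb" {
  name                  = "haproxy-internal-lb"
  region                = "us-east1"
  load_balancing_scheme = "INTERNAL"
  # Pinning ip_address means a recreated rule reuses the same address.
  ip_address            = google_compute_address.haproxy_ilb.address
  backend_service       = google_compute_region_backend_service.haproxy.id  # placeholder
  all_ports             = true
  subnetwork            = google_compute_subnetwork.main.id
}
```

As noted above, since the current setup has no reserved IP, adopting something like this would still force one last IP switch.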
A
Okay,
so
we're
not
using
a
reserved
IP
right
now,
no
I
see
that's
what
I
can
tell
so
we're
not
using
a
static,
IP,
I
wasn't
I,
wasn't
sure
and
honestly
like
this
was
Point
number
three
like
I
I,
don't
think,
there's
anything
that
depends
on
the
IP
for
the
internal
bouncer.
Are
you
aware
of
anything
yeah.
B
It's
just
the
DNS,
so
the
DNS
record
will
be
automatically
updated
by
by
terraform,
however
stuff
that
uses
the
internal
Loop,
also
such
as
Pages
Gallup,
shell.
You
know
Etc
if
that
caches
DNS
that
can
be
problematic,
yeah
yeah.
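(A sketch of the kind of Cloud DNS record Terraform would keep in sync; the zone and record names are placeholders. Lowering the TTL ahead of the cutover bounds how long well-behaved resolvers can serve stale answers, though, as discussed next, the GCP metadata resolver has been seen caching more aggressively.)

```
# Hypothetical sketch: the internal record that Terraform updates when the
# ILB's IP changes. A short TTL narrows the stale-cache window for resolvers
# that honor it.
resource "google_dns_record_set" "haproxy_internal" {
  name         = "internal-lb.example.internal."         # placeholder name
  managed_zone = google_dns_managed_zone.internal.name   # placeholder zone
  type         = "A"
  ttl          = 60
  rrdatas      = [google_compute_forwarding_rule.haproxy_ilb.ip_address]
}
```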
B
We've seen that the Google metadata service, which does the DNS in GCP, also really likes to cache DNS records aggressively. So the blast radius of how this is going to blow up is basically unknown.
A
And bouncing back up to point number two: are we only talking about the internal LBs as the issue? If we took away the internal LBs, we wouldn't have any downtime, right?
B
Yeah. So right now the change is split into two merge requests: the one that you've linked is 4951, and 4950 is the one for the other ones. Because those aren't using the internal load balancer module, we can just replace the node pool there.
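(For the fleets outside the internal LB module, a rough sketch of what a create-before-destroy node pool swap might look like; the resource choice and naming scheme are assumptions, since the actual module isn't shown.)

```
variable "node_pool_version" {
  type    = string
  default = "v2"
}

# Hypothetical sketch: a new name forces a replacement pool, and
# create_before_destroy brings it up before the old one is torn down.
resource "google_compute_instance_group_manager" "haproxy" {
  name               = "haproxy-${var.node_pool_version}"
  base_instance_name = "haproxy"
  zone               = "us-east1-b"
  target_size        = 3

  version {
    instance_template = google_compute_instance_template.haproxy.id  # placeholder
  }

  lifecycle {
    create_before_destroy = true
  }
}
```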
A
Okay, so moving on to point number four. Naively, based on my very surface-level understanding: can we just create a new DNS entry and then move everything over?
B
That part is probably easy. We'd then also need to change the configuration everywhere it's being used, and it's been kind of a pain to update as it is right now, because there are a few consumers.
A
I'm,
just
wondering
like
I'm,
just
wondering
whether
There's
an
opportunity
here
to
one
is
like.
If
we
don't
set
a
static
IP
for
the
internal
IPS
we
could
we
could
do
that.
We
could
fix
the
module
or
even
like
I
I,
don't
know
how
much
Croft
is
in
that
module
that
we
use
for
the
internal
LPS,
but
maybe
there's
an
opportunity.
We
can
clean
that
up
a
bit
just
provision
new
internal
lbs
with
a
new
DNS
and
then
and
now
we
have
them
both
running
in
parallel
right.
A
Internal
lbs,
new
internal
lbs
are
pointing
to
the
new
AJ
proxy
Fleet
old
internal
LPS
are
put
into
the
old
age
of
proxy
Fleet
and
then
and
then
we
just
move
the
DNS
over
I
mean
you're
right.
We
need
to
identify
everywhere.
We
connect
to
the
internal
lbs
and
change
that
configuration,
but
I
think
it's
not
going
to
be
too
too
bad,
like
I
I.
Think
it's
only
in
three
or
four
places
for
the
CI
stuff.
There
is
the
option.
The
CI
stuff
was
only
done
to
save
money.
A
We
could
even
disable
that
temporarily,
but
I
would
say.
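(A sketch of the parallel arrangement being proposed, assuming the internal LB module can be instantiated twice; the module path, its inputs, and the DNS names are all hypothetical.)

```
# Hypothetical sketch: old and new ILBs side by side, each with its own DNS
# name, so consumers can be cut over one at a time and rolled back cheaply.
module "haproxy_ilb_old" {
  source         = "./modules/internal-lb"                       # placeholder path
  name           = "haproxy-internal"
  instance_group = google_compute_instance_group.haproxy_old.id  # old fleet
  dns_name       = "internal-lb.example.internal."
}

module "haproxy_ilb_new" {
  source         = "./modules/internal-lb"
  name           = "haproxy-internal-v2"
  instance_group = google_compute_instance_group.haproxy_new.id  # new fleet
  dns_name       = "internal-lb-v2.example.internal."
}
```

Once every consumer points at the new name (or the old name is re-pointed at the new fleet), the old module instance can be removed.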
B
Like, after we've moved the CNAME over to the new one, and be like: okay, if something breaks, we...
B
I think, because when I looked into the Terraform stuff, there were a lot of, you know, policies, and also configuration for the CI environments and that kind of thing. So if we can just turn off CI, you know, the CI proxy usage, while we're doing the migration, and then turn it back on again afterwards, that's probably the easiest way to deal with CI.
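(One hedged way to express the temporary CI switch-off, assuming the CI-facing resources can be gated behind a boolean; the variable and record are illustrative.)

```
variable "enable_ci_proxy" {
  description = "Set to false during the HAProxy/ILB migration window."
  type        = bool
  default     = true
}

# Hypothetical sketch: CI-facing resources carry a count gate, so flipping
# the flag removes them for the migration and restores them afterwards.
resource "google_dns_record_set" "ci_proxy" {
  count        = var.enable_ci_proxy ? 1 : 0
  name         = "ci-proxy.example.internal."           # placeholder
  managed_zone = google_dns_managed_zone.internal.name  # placeholder zone
  type         = "CNAME"
  ttl          = 60
  rrdatas      = ["internal-lb.example.internal."]
}
```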
A
Obviously he knows everything, but yeah. If you don't get far with that, just let me know; I was also involved in it, not heavily involved, but involved enough that I can probably help. It would also be better... I mean, more people should know about the CI stuff.
A
This configuration is pretty opaque to SREs, I think, in terms of how it's set up. I'm not sure I even remember all the details, so maybe we could check the runbooks and get that stuff updated. And I was kind of serious, half serious at least, about this: if this internal LB Terraform module is really bad, we could maybe fix it in a way that's nicer, I don't know.
B
...there's a lifecycle prevent-destroy rule in place, which also makes the CI job fail for the merge request, because you need to manually hack in local copies of the Terraform files to make that work.
A
Cool.
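(The guard being referred to is presumably Terraform's prevent_destroy lifecycle rule; a minimal illustration, reusing the hypothetical forwarding rule from earlier, of why any replacing plan fails in CI.)

```
resource "google_compute_forwarding_rule" "haproxy_ilb" {
  name                  = "haproxy-internal-lb"   # placeholder, as above
  region                = "us-east1"
  load_balancing_scheme = "INTERNAL"
  backend_service       = google_compute_region_backend_service.haproxy.id
  all_ports             = true
  subnetwork            = google_compute_subnetwork.main.id

  lifecycle {
    # Any plan that would destroy (and therefore replace) this resource
    # errors out, which is why the MR's CI job fails until the files are
    # edited locally to drop the rule.
    prevent_destroy = true
  }
}
```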
A
Okay,
that's
pretty
much
all
I
had
so
yeah.
Let's,
let's
try
to
sketch
this
out,
see
if
it'll
work
and
if
it
does,
then
we
can
avoid
the
downtime
yeah.
B
I mean, we can probably sketch this out and also do it on staging first, because that's the proper way to do it. And if that works, then we can do it in production.
A
Oh man, that's all I have.