Description
Working on understanding why there was downtime in a recent upgrade and figuring out steps to mitigate it. https://gitlab.com/gitlab-org/gitlab/-/issues/225684#note_375956757
B: We're on 12.10, they go here, you just look at the admin panel, yeah, and Postgres 11.7. We're going to update to the most recent 13.0, which is 13.0.10. We'll be monitoring the HAProxy dashboards and some production logs, and running some tests here to see, yeah, when we see any potential issues. We'll be monitoring and logging all the details in this document here, and with that I think we can get started. So here, the instructions... no, at the very beginning.
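Note: for reference, the starting versions can be confirmed from a shell on any Omnibus node; gitlab-rake gitlab:env:info and gitlab-psql are the standard bundled tools, though the exact output varies by version.

    # Show the GitLab version and bundled component versions
    sudo gitlab-rake gitlab:env:info

    # Confirm the bundled PostgreSQL server version
    sudo gitlab-psql -c "SELECT version();"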
B: Yeah, it's like, so many things on my screen. Okay, so first we're going to choose a deployment node and check if it's running. So we're choosing a Sidekiq node, okay. So each deployment, primary site and secondary site, has two Rails nodes, two Sidekiq nodes, a Gitaly, Postgres, and Redis. Did I get everything? Yeah. So we're going to use Sidekiq 2 as the deploy node, and I've already run the status here and can see that it's currently not running, and there are links in the document.
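Note: the status check mentioned here is presumably the standard Omnibus one; a minimal sketch:

    # On the chosen deploy node: list every service Omnibus manages
    # and whether it is up (run/down, uptime, PID)
    sudo gitlab-ctl status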
B: ...a good way of doing this. Well, anyway, after step one, which is stopping the deploy, step two: we make sure that this file exists, skip-auto-reconfigure, and we also want to do that on all nodes, including the primary deploy node. So, all of the nodes here, yeah. I did that before we started recording, just to make sure that it's there on each one.
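Note: creating that marker file is a one-liner per node; this is the documented Omnibus mechanism that stops a package install from running reconfigure automatically.

    # On every node, including the deploy node: an empty file here tells
    # the package's post-install step to skip the automatic reconfigure
    sudo touch /etc/gitlab/skip-auto-reconfigure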
A: Yeah, are we only... so, like, I don't know that much about the downtime, like, the no-downtime deployment strategy, because we could only do downtime deployments, basically, because we had to recompile everything. They took an hour at a minimum, pretty much every time. And yeah, people always got really mad at me because I didn't want to do them after business hours, and they were like, oh, but I think it worked out, and I'm like, well, me too, and I also want to not do this... not at work, so, have fun. Yeah.
A: So we upgrade the other nodes first, before running the database migrations. That's the strategy, right? Or is it generally the deploy node first? I can't...
B: No problem, let's go ahead and update the deploy node, which is up here.
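Note: updating the deploy node presumably means installing the target package version; a sketch assuming a Debian/Ubuntu node running the EE package (the package name and version pin are assumptions based on the versions discussed above).

    # With skip-auto-reconfigure in place, installing the package
    # does not trigger a reconfigure on its own
    sudo apt-get update
    sudo apt-get install gitlab-ee=13.0.10-ee.0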
B: Did we... when it finished... so I think it's still running. Okay, so that normally issues a request. So now we can restart Sidekiq.
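Note: restarting only Sidekiq on that node is a single gitlab-ctl call:

    # Restart just the Sidekiq service; other services keep running
    sudo gitlab-ctl restart sidekiq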
B: So we took a break after updating the first Rails node. We started seeing some pipeline failures, which we've noted in the document. After that I started a new pipeline with the same tests. We're not running anything right now in terms of upgrade commands, and we're seeing failures which look like 500 errors as well, looking at the output here.
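Note: one way to confirm those 500s server-side is to grep the Rails structured log (a sketch; the path is the Omnibus default, and the exact field pattern depends on the log format in this version).

    # Count 500 responses recorded by Rails on an affected node
    sudo grep -c '"status":500' /var/log/gitlab/gitlab-rails/production_json.log

    # Watch for new ones while the test pipeline runs
    sudo tail -f /var/log/gitlab/gitlab-rails/production_json.log | grep '"status":500'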
B: Okay, okay, this one failed again. Okay, well, let's look at the...
B: ...would mean that we still have the locks and everything. It's just, yeah, I don't see anything happening right here, so...
B: Yeah, we could. I just, I don't have any, really; I just have the readiness checks running right now. There's no... that's true. I'm still waiting for these to start, and they take about three minutes.
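Note: the readiness checks here are presumably polling GitLab's built-in health endpoints, which look like this when hit locally on a node:

    # Health probes exposed by GitLab (allowed locally or from the
    # monitoring IP whitelist)
    curl "http://127.0.0.1/-/readiness"
    curl "http://127.0.0.1/-/liveness"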
A: Yeah, I mean, we definitely saw downtime in that reconfigure, so I don't know that this will necessarily make a difference. I'm also not sure what we count, like, what our SLA counts as downtime, and whether or not Geo sync... whether a Geo site not syncing is considered downtime or not. I have no idea; I don't know how these things work, yeah.
B: Okay, good, so that's that step, we've got that. Okay, so we configured Sidekiq. So let's go back to our... or, not back; let's go to Rails 1.
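Note: "configured" here means running reconfigure by hand, since the skip file suppresses the automatic one after the package install:

    # Apply the new package's configuration; with skip-auto-reconfigure
    # present this must be run explicitly on each upgraded node
    sudo gitlab-ctl reconfigure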
A: I see, they seem to... it seems to have stopped getting them on the script, so the script got them for about 26 seconds this time. Okay.
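Note: the script itself isn't shown, but a probe loop along these lines (purely a hypothetical sketch; the URL and one-second interval are assumptions) would surface an error window like that 26-second one.

    # Hypothetical availability probe: print a timestamped line whenever
    # the endpoint returns a non-200, so an outage shows up as a burst
    while true; do
      code=$(curl -s -o /dev/null -w '%{http_code}' "https://gitlab.example.com/-/readiness")
      [ "$code" != "200" ] && echo "$(date -u +%T) HTTP $code"
      sleep 1
    done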
A: I was just going to say, I know for me, historically, when I build zero-downtime upgrade strategies, I almost always do Sidekiq first, just because then you don't have to worry about Rails enqueueing jobs that Sidekiq doesn't understand, and it makes your upgrade strategy a little easier. But that doesn't mean that, if we're not telling people to do that, then... I don't know that we shouldn't... we shouldn't expect that.
B: Okay, so we're over here. Yeah, there are some checks I think I've already done. Let's go to that panel.
B: Okay, so that was reconfigure on the two Rails nodes. Let's see... yeah, it's always a lot to do. Next are the Postgres and Gitaly and Redis nodes, and we do Postgres first. Oh...
B: Well, we have to run the refresh... oh right, so wait for... yeah, I was just going to this step to look at the database replication lag, and it's at zero. So now I'm going to run this on the deploy node on the secondary.
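Note: the replication lag figure can be read straight from the replica's database; this is the standard PostgreSQL streaming-replication query, run on the secondary's Postgres node.

    # Time since the last replayed transaction; close to zero when the
    # replica is caught up with the primary
    sudo gitlab-psql -c "SELECT now() - pg_last_xact_replay_timestamp() AS replication_lag;"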