From YouTube: GitLab Support Gitaly Cluster Deep Dive - AMER
Description
Recording of GitLab Support AMER deep dive on Gitaly Cluster
A: Okay, so welcome to the Gitaly Cluster deep dive for AMER. Gitaly Cluster, to give a real high-level overview, is our high-availability solution for Gitaly. A little bit of background: initially we had Rugged, that is, libgit2, and we accessed Git data directly from the application server; then we split that out into Gitaly. Gitaly in its original form didn't really support high availability at all. So what you had to do was shard your data, so that, say, projects one, two, and three are on shard one and projects four, five, and six are on shard two. That gives you some mitigation of data loss, because you only lose half of your projects if one of those nodes goes down, but you still don't have any redundancy there.
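(For illustration, a sketch of that pre-Cluster sharded layout on the Rails side; this config is not shown in the talk, and the storage names and addresses are hypothetical.)

```ruby
# /etc/gitlab/gitlab.rb on the Rails node: two standalone Gitaly shards.
# Each project lives on exactly one shard, so losing the node behind
# "storage2" takes down every project assigned to it.
git_data_dirs({
  "default"  => { "gitaly_address" => "tcp://gitaly-shard1.internal:8075" },
  "storage2" => { "gitaly_address" => "tcp://gitaly-shard2.internal:8075" },
})
```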
A: So again, what Gitaly Cluster does is it lets you run multiple Gitaly servers that are kept in sync and presented to Rails as a single Gitaly backend. Rails doesn't even know that this is Praefect, or that there are multiple Gitalys behind it. That has the obvious advantage that now you can lose one of these nodes, or even two, without losing all your data.

The way we mediate that is with a program called Praefect. You need at least three Praefect nodes; three is what we recommend. I don't think I've seen anyone using more. You could in theory, but you need these three so they can agree on who the primary is among the Gitaly nodes. In terms of the Gitaly servers themselves, you typically see three as well. Again, you could have more, but three is most common.

Anyway, the Praefects don't directly talk with each other. What they do is share a Postgres database, which in theory you could put on the Rails database, but in practice we recommend putting it on its own separate database. Most customers are doing that in the cloud, using RDS or an equivalent from other cloud providers; for customers who are self-managed, I think there are some limitations right now. I can't remember off the top of my head what they are, but essentially we'd recommend a separate Postgres for Praefect, because it can put a lot of load on that server. Particularly if you have distributed reads enabled, it can generate a fair amount of traffic. It used to be to the point where we actually disabled that feature, because on large instances it would just tank PostgreSQL. That's been resolved now by adding some caching, so that we're not going to the database as often, but that also required... okay, here it is. That means we also require a direct connection to that Postgres from each Praefect.

Actually, let me back up for a second. Most connections will go through PgBouncer, or at least we recommend PgBouncer, but we also need the more efficient connection for this distributed-read cache, so you need both in your Praefect config. Let me just pull up the gitlab.rb here. You have your database host, which in this particular demo is a direct connection, but for a customer this would be PgBouncer, and then you also have this no-proxy host, where you're connecting directly to, presumably, your primary Postgres server.
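(In a 13.x-era `/etc/gitlab/gitlab.rb`, the two connections described here look roughly like the following sketch. Hostnames and credentials are placeholders, and the exact setting names may vary by version.)

```ruby
# /etc/gitlab/gitlab.rb on a Praefect node. Regular traffic goes
# through PgBouncer like any other database client:
praefect['database_host'] = 'pgbouncer.internal'
praefect['database_port'] = 6432
praefect['database_user'] = 'praefect'
praefect['database_password'] = 'PRAEFECT_SQL_PASSWORD'
praefect['database_dbname'] = 'praefect_production'

# The read-distribution cache relies on LISTEN/NOTIFY, which PgBouncer's
# transaction pooling can't carry, hence the direct "no proxy" connection:
praefect['database_host_no_proxy'] = 'postgres-primary.internal'
praefect['database_port_no_proxy'] = 5432
```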
A: Yeah, actually, this is the downside for self-managed. If you're not using a highly available Postgres like RDS, then you have the downside that if your primary Postgres server goes down and PgBouncer fails over to a secondary, this no-proxy connection no longer works. Gitaly and Praefect still work, but now we're much less efficient, because we no longer have this cache. I don't actually know... this hasn't happened in anyone's production yet, but I'd guess we'd probably see a lot more load on Postgres when it does. It should be interesting to see, whenever that eventually happens, what the consequences are.
A: Yeah. So that's why we have this separate no-proxy connection here. As for the distributed-reads flag: the advantage of distributed reads is that previously, by default, you had your primary Gitaly server and we would do all reads and all writes there. That means that even though you have three servers, you functionally have one handling all the traffic. Distributed reads lets you, well, distribute the reads, like you might expect, so your read traffic can be spread across the secondaries instead of all landing on the primary. So that's the reason for the feature, and some of the complications with it.

Okay, where was I? All right: you've got the Praefects talking to Postgres via two different connections, and what they're basically doing is saying... well, let's actually pull up Postgres here.
A: Sure, yeah. So you can see the generation here, which is basically how many replications have happened. Praefect does not directly track the state of the repository; it doesn't say "HEAD is SHA one-two-three-four-five." It's saying "I replicated this many times for this project," and given that, it knows what the state of the repo is.
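(Those generation counters can be inspected directly in the Praefect database. A sketch using the 13.x-era schema; verify the table and column names against your version.)

```sql
-- In the praefect_production database: per-storage generation counters.
-- A storage whose generation trails the others is behind on replication.
SELECT virtual_storage, relative_path, storage, generation
FROM storage_repositories
ORDER BY relative_path, storage;
```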
A: What that kind of complicates is something a lot of customers have asked about: can I hot-swap in a new server? That's something we don't support at all right now. Hey Caleb, welcome! There is an epic for that, but it hasn't gotten any traction yet, so eventually we'll support something like that, but right now we don't. And this here is at least part of the reason why: we don't really know in Praefect what the state is directly; we just know we replicated this many times. I know that Gitaly two has not run that replication yet, so I know it's behind. We're not going out and directly checking, you know, the HEAD of the repo; we're just saying "this is the replication count, and based on that, I know what the state is." Okay, so the repository... the repository assignments.
A: All right, that's actually empty. Okay, the replication queue should be empty, because nothing is actually happening right now. Yeah, okay. And this is kind of tracking when the primary changes.
A: So actually, let me back up: there are kind of two ways that we do this. Initially we had eventual consistency. When a change was made, we'd update the primary, and then we'd schedule a replication job to say "all right, now Gitaly two and three, go fetch the changes from Gitaly one, which is the primary." So we'd add an entry here, and then Praefect would fire off jobs to the two other Gitaly servers to have them fetch the changes.

Then there's another feature, strong consistency, which uses what are called reference transactions. Basically, when a change comes in, the primary will call back out to Praefect and say: "hey, I'm going to make a change, and I want to synchronize with the other two Gitaly servers." Then they'll establish a quorum. The details of that I'm a little bit fuzzy on; maybe Catalan knows. But anyway, they establish a transaction that says "we're going to update all three of these servers in lockstep," and that way you don't have this replication lag. When someone pushes, Gitaly one, two, and three are all, through this mechanism, updated simultaneously. That obviously makes it more reliable, so that if the primary goes down, or even one of the secondaries, the other two are much more likely to be up to date and in a good state than if we were replicating as changes came in.
A
We
had
some
issues
in
the
past,
so
one
thing
I
should
note
is
that
we
really
don't
want
customers
to
be
using
this
with
an
old
version
of
gitlab.
It's
like
you
really
want
them
to
be
on
the
latest
version.
I
think
up
to
13.5,
there
is
a
bug.
Replication
would
just
freeze
you
know
every
hour
or
two,
and
so
you'd
have
this
huge
backlog
of
jobs
that
would
build
up
until
you
restarted
prefect
and
that
would
clear
out
the
race
condition
that
was
breaking
it
until
it
happened
again.
A
So
if
you
see
someone
using
an
older
version
of
getlab
with
the
github
cluster
strongly
recommend
that
they
move
up
to
the
latest,
because
there
are
more
features,
there
have
been
a
lot
of
bug
fixes.
You
know
it's
so
much
very
much
a
feature,
that's
in
the
process
of
being
polished
and
improved.
So
this
is
not
something
you
want
to
like
hop
up
to
thirteen
point
one
and
say
all
right:
I'm
gonna
use
good
catalytic
cluster.
Don't
do
that!
That's
a
really
bad
idea,
make
sure
you're
using
the
latest
version.
A: Okay, right, so that's strong consistency. We'll actually still see some replication jobs. I think... Catalan, I want to ask you to jump in here real quick: you had a customer who was using Geo with Gitaly Cluster, and they were seeing a backlog of replication jobs building up over time.
D: Yeah, in our case it was performance related: Geo was syncing more repositories than Praefect could clear from the backlog.
A
Were
they
using
strong
consistency?
No
okay.
That
makes
sense
all
right.
I
was
wondering
how
their
how
this
backlog
is
building
out
when
they're
using
charcoal
synthesis,
okay,
yeah
one,
and
so
the
the
other
major
pain
point
with
strong
consistency,
or
maybe
the
the
major
strength
pain
point
is
that
so,
like
I
mentioned,
italy
is
going
to
create
a
new
connection
from
itself
to
the
prefix
host
that
called
into
it.
A
Oh,
that's,
marina,
and
so
the
way
it
does
that
is,
you
know
if
you
look
at
the
gateway
config,
let's
just
pull
up.
A: It has no knowledge of Praefect at all. As far as Gitaly is concerned, it's just a standalone Gitaly. It's not aware that it's in a cluster, so it doesn't have any way of connecting to Praefect, and it doesn't know there are other Gitalys; it's just hanging out. So how do we handle that? How does it call back into Praefect? Gitaly will read the incoming address of the connection request from Praefect, and we assume that that is an address we can connect to in the other direction. But that's not always the case. If you were NATed, so that it wasn't actually the correct address, that's not going to work. The thing that we've seen multiple times now is in a containerized environment. I don't actually know how this works with the chart.
A: I don't know if it does; I haven't tested the chart, so the chart may not work either. But certainly in Docker, which is the case we saw the most, what will happen is Gitaly will see the connection coming in (if you're using bridge networking, the default) from something like the bottom address of the bridge network's address space. It tries to call back out to that, and it doesn't work, because that's not actually an address that routes back to Praefect.
A: So if you're using VMs, this should be fine. In Docker you can turn on host networking, but that might cause other problems: if you have multiple containers on the same host, they'll try to bind to the same ports. So my recommendation would be to just turn off reference transactions, just Feature.disable the Gitaly reference transactions flag, if they're in Docker. I don't know what's going to happen with the charts, but it would be the same there; you'd just disable the feature.
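(Concretely, that would be done from the Rails console. The flag name below is the 13.x-era one, so treat it as an assumption and check the flag list for your version.)

```ruby
# In `gitlab-rails console` on a Rails node: turn reference transactions
# off instance-wide (assumed 13.x flag name).
Feature.disable(:gitaly_reference_transactions)

# Confirm it took effect:
Feature.enabled?(:gitaly_reference_transactions)  # => false
```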
A: If it even is an issue there; I'm not sure, probably. There is an MR to fix this, and I'll put that in the doc after the call. It's been worked on, but Paul, the developer who was working on it, just left the company like two days ago, so someone else is taking it over. It was supposed to be for 13.9; now I know it's supposed to be 13.10.
A: I don't think it's going to make 13.10. So eventually this feature will work for everyone; right now it is a source of some issues. Okay, all right. So that's strong consistency and distributed reads.

Okay, other things that can go wrong. In terms of networking connections, the standard way it's set up is you have your Rails server... let me actually just pull this up. This is using a Terraform script that's in the Gitaly repo itself, so anyone can use it. If you have GCP access, you just go into the gitaly repo under _support/terraform, and you can run that and spin up a demo cluster. It's really easy; there are just two scripts you run, create-demo-cluster and configure-demo-cluster, and you're up.
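(If you want to reproduce the demo environment, the flow looks roughly like this sketch, based on the scripts named in the talk; paths and prompts may have changed since.)

```shell
# From a checkout of the gitaly repository, with GCP credentials set up:
git clone https://gitlab.com/gitlab-org/gitaly.git
cd gitaly/_support/terraform

# Provision the VMs (Rails node, Praefects, Gitalys, Postgres, load
# balancer), then configure GitLab across them:
./create-demo-cluster
./configure-demo-cluster
```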
A: So, what that does, and what you typically see, is that you have your Rails nodes; in this case there's only one database (this is GCP), and then there's a load balancer, and behind the load balancer you have your three Praefects, and then you have three Gitaly hosts as well. So in terms of network connections, you're going to see: Rails to the load balancer, the load balancer into the Praefects, and the Praefects into the Gitalys. The Gitalys will talk amongst each other; then, if you have reference transactions, they connect directly back to Praefect, and I believe they'll also call back out to the load balancer in some cases, and then also, using gitlab-shell, out to the Rails node.
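(Summarizing those paths; my sketch of the topology as described, not an official diagram.)

```
Rails -> load balancer -> Praefect (x3) -> Gitaly (x3)
Gitaly -> Gitaly              replication fetches between nodes
Gitaly -> Praefect            reference-transaction votes (direct)
Gitaly -> load balancer       some callbacks re-enter via the LB
Gitaly -> Rails internal API  gitlab-shell hook callbacks
Praefect -> Postgres          PgBouncer plus the direct no-proxy link
```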
A: So let's actually look at that real quick. Oh, hey Ben! So let's see if we can get that to happen; I'll just edit this.
A: Okay, yeah. So here we're connecting on port 8075; that's probably the replication connection, to gitaly-ruby locally. All right, and here we're calling out to .53. I think that's Rails, right? Yeah, okay, so that's calling the internal API. And where are we calling Praefect? We should be calling it... I don't see it. I did this earlier and it worked.
B

A: Interrupt me at any point with questions. Okay, actually, it's interesting: I don't see that we called Praefect here. I had that feature turned on, and distributed reads is true. Okay, weird. Sorry, reference transactions, that's the one I care about. Yeah. So in terms of observability, we have logs, but those can be verbose.
A: You've got at least six servers that you need to check, because you probably need to check all three Praefects and all three Gitalys, and that can just be a lot of data to have to deal with. For a support engineer that's not as big a deal, but it's still a pain to have to go through all of it. We have a number of useful Prometheus metrics, but on self-managed that's often not an option.
A: So, let's see. You can see here replication latency, so how long it takes to keep the other two nodes in sync. I'm surprised there's any at all, since I have strong consistency on, but maybe I'm just not understanding that correctly. And here, this is interesting: with read distribution it should be (apologies for the dog; I guess she feels strongly about this) sending most reads to the secondaries. So right now, Gitaly three is our primary, and let me look at the IPs here. Okay, so Gitaly three is .54, but we actually see that .54, and what is that, .27, which is two up here, are getting most of the traffic, and Gitaly one is getting less. That isn't what I'd expect, because it should be taking the primary and sending most requests to the secondaries with read distribution on.
A: I know that's what happened on GitLab.com, but okay. So, let's get into a little bit of... actually, let me stop first. Particularly for anyone who just joined, any questions you have right now?

B

A: Okay, so let's get into breaking things. So let's see, we have Gitaly three as our primary, so let's turn off the secondaries first.
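(On an Omnibus install, this kind of failure injection is just stopping the service on the relevant nodes; hostnames here are the demo's.)

```shell
# On each secondary Gitaly node in the demo cluster:
sudo gitlab-ctl stop gitaly

# On a Praefect node, follow the health checks while the node is down:
sudo gitlab-ctl tail praefect

# Bring the node back when done:
sudo gitlab-ctl start gitaly
```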
A: Yeah, okay. And you can see here, this is the replication queue of jobs that need to be run. You can see we want to replicate to Gitaly two, and we have these pending jobs, but they can't run. You can actually see what's in there: we have a cleanup, we have an update, we have an incremental repack, and all of these were triggered by that one push.
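(A sketch of how to see that queue in the Praefect database, using the 13.x-era schema; verify against your version.)

```sql
-- Outstanding replication jobs, oldest first. The job payload records
-- the change type (update, cleanup, repack, ...) and the target storage.
SELECT id, state, job, created_at
FROM replication_queue
WHERE state NOT IN ('completed', 'dead')
ORDER BY created_at;
```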
A: And one thing that has come up is that Praefect won't immediately see a Gitaly that you bring back online as healthy. I think it needs three successful health checks before it'll say "okay, this one's definitely up," so there will be about a five-to-ten-second lag between Gitaly becoming healthy and it actually starting to take traffic again. That can be a little bit confusing.
A: And if we run the dataloss command... where is it? We see that we have our pending jobs; Gitaly two is behind now. To go to Praefect, let's try that again. Do we need to pass the partially-replicated flag? That's it.
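(The command in question, with Omnibus default paths; the flag shown matched the 13.x era and has been renamed in later versions, so verify.)

```shell
# On a Praefect node: list repositories whose replicas are out of date.
sudo /opt/gitlab/embedded/bin/praefect \
  -config /var/opt/gitlab/praefect/config.toml \
  dataloss -partially-replicated
```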
A: I think that's because here it's just validating that... yeah, here we go. Okay, yeah: Gitaly two is recognized as being behind by two changes or less. That's because we're just using the replication count here, so this can be a little bit vague. We don't tell you "you are behind by exactly n changes"; we're just saying "you're roughly two changes behind, probably." So it can be a little bit frustrating, because you don't really know how many changes have been missed.
C

A: There we go, okay. And now let's run that again, and now everything's great. Okay, all right! So now let's take down the primary, which should still be Gitaly three. Yeah, okay. So really, there is no convenient way other than Grafana to find out which node the primary is. I don't believe we say it in the logs anywhere; it's kind of a pain. We just assumed Grafana usage, so we should definitely recommend customers set it up. You know, it's built in, so it's not exactly a big deal.
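(One fallback if Grafana isn't available is the election state in the Praefect database. My sketch; the per-shard election schema below is the 13.x-era one and changed in later releases, so check the schema first.)

```sql
-- Praefect database: which Gitaly storage each Praefect has elected
-- as primary for each shard.
SELECT shard_name, node_name, elected_by_praefect, elected_at
FROM shard_primaries;
```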
A: All right, let's just watch Grafana here for a minute, because we should see a change here in the flapping graph. Basically, what this graph is saying is: are we switching between primaries rapidly, do we see any kind of instability there? Okay, so now Gitaly two is primary there. Let's just refresh this. Yeah, okay: two out of three Praefects recognized that Gitaly three is down. Okay, now all three Praefects have switched over to recognizing Gitaly two as the new primary.
A: So now let's just navigate around and make sure we can view things. Okay, so we've successfully handled the old primary going down. Dataloss shows Gitaly three is now behind, so we've seamlessly handled that, and then, if we restart...
A: Yeah, here it is, data recovery; 13.4, I guess. I've modified this config slightly. Yeah, okay, so we've fixed that; Gitaly one is back up.

C

A: Okay, yeah, let me clear that so it's a little clearer. All right, so you can see that all three repos have the same HEAD for the master branch. Okay, right. Now, I was talking about the replication and reconciliation scheduling interval. That's configurable: I set it to 30 seconds just so it would be a little bit faster in the demo, but I think by default it's five minutes.
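(That interval is set in the Praefect section of `gitlab.rb`; a sketch using the 13.x-era setting name, so verify for your version.)

```ruby
# /etc/gitlab/gitlab.rb on each Praefect node: run the reconciler every
# 30 seconds, as in this demo, instead of the roughly five-minute default.
praefect['reconciliation_scheduling_interval'] = '30s'
```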
A: So just one thing to be aware of: if your server goes down and then you check it immediately, it may still not be up to date, simply because Praefect probably has not actually told it yet to go and make sure that's been fixed. So okay, there's that, and replication jobs...
A: Now, that's probably failing over to the primary right now, yeah. So you will see some transient failures like this as things are getting back up to date. Hopefully... actually, yeah: we can't even come to a quorum right now, because we're down two servers. Okay, so we have no primary, all right. So let's just... yeah, all right. So that's not a great error.
B

A: Yeah, we'll test that next. I'm just surprised that we're even trying to reach out here, because I'm not actually sure why we're doing that.
A: Well, so Gitaly doesn't force an election; that's Praefect. That's right, yeah. So, you know, Gitalys do talk to each other, like if we need to replicate something. Let's check PostgreSQL. Yeah, so we don't have any pending replication jobs, so I'm not actually sure why Gitaly is trying to go out to its colleagues.
C

A: And that's .22, which I believe is... oh, that's right, but .60 is Praefect one. Okay, yeah, so we're getting something in from Praefect one: health checks, right, that makes sense. Must have typed that wrong. This is an interesting error. I don't think it did, since it works. That's weird; let's just look at the config.
A: Okay, so the primary is up and things are still viewable. It's a little surprising to me that it broke completely, because the secondary that we had was still in sync with the primary. All right, and things are still writable with the primary up. Interesting. Let's do...
B

A: Yeah, yeah. So I think that kind of makes sense, right? On the one hand, you can't establish a quorum about which ones are healthy, because there's only one server. On the other hand, Praefect knows that all three of these are up to date, so if it only has one healthy server, you'd think it would be smart enough to just say, "well, okay, I can just take this one that's left and use it." But it's probably more complicated than I'm making it sound.
A: Okay, all right. So the primary goes down; we can still make changes. Let's bring them back. I have a replication job, strong consistency, distributed reads, yeah. All right, so let me ask: what's unclear? Anything that seems vague or not obvious to anyone?
A: No, they're really vague. This is something I need to add; I've been meaning to. Let me just throw this in the doc, I guess. Where'd it go... here it is.

C
A: Okay, maybe I can write now? Okay, now I can write. Anyway, this MR I just pulled up (actually, let me share my screen): it's been in the works for a long time, like I said, but my assumption was that by the time I went in and made this change, it would already be in, and so there'd be no reason to update the docs to show how the current networking works. But yeah.
A: Yeah, I mean, it's in text; we talk about it, but we don't diagram it. It's definitely something we should rectify, yeah. So, you know, if you're using a cloud service it's not really necessary, right, because you only have one Postgres you're going to hit. But if you're using, say, an Omnibus Postgres, then yeah, it would make sense to put PgBouncer in there, for sure. All right, yeah, good question, Ethan. Any other questions?
C

A: So far, so good. Yeah, all right, that's errored. So this time we elected Gitaly three as the primary; last time this kind of just dropped out entirely.

B

A: We didn't have a primary; weird. Yeah, all right, so this is more like what I would expect.
A: Neat. Yeah, we could try to really mess with it now. So let's stop three.

C

A: Okay, although I think that is the correct data.
A: And I think it'll stop us from writing here. Maybe I'm wrong, though; let's find out. Yeah, all right, let's give it a minute to recognize that Gitaly one is the new primary. So, like I said earlier, and like you've seen, this transition between nodes is not particularly smooth, and part of that is just Praefect waiting to get multiple health checks before it says, "yeah, definitely go to this one."
C

A: Oh, I guess my Do Not Disturb just turned off. All right, let's also turn off Gitaly one again and turn on Gitaly two, the previous primary, and see if that helps.
A: What's that address? I don't actually see it in my list of addresses; I don't know what that is. Oh, it's the load balancer! Okay, yeah, there we go. All right, that makes sense. So what we're calling here is the Praefect load balancer, and that's what Rails sees. Let me pull up the config real quick.
A: In terms of what you see from Rails, in git_data_dirs we just have a Gitaly address, right? Rails just thinks this is another Gitaly server. It doesn't know that it's Praefect; all of that is abstracted away. And this part here is just for Prometheus; it's not something the Rails application itself is aware of.
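(On the Rails side, that is just an ordinary storage entry pointing at the load balancer; a sketch, with the address and token as placeholders.)

```ruby
# /etc/gitlab/gitlab.rb on the Rails node. "default" looks like one
# ordinary Gitaly storage, but the address is really the load balancer
# sitting in front of the three Praefects.
git_data_dirs({
  "default" => {
    "gitaly_address" => "tcp://praefect-lb.internal:2305",
    "gitaly_token"   => "PRAEFECT_EXTERNAL_TOKEN",
  },
})
```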
A: As expected, two is the primary and we are in read-only mode. Okay, good, that's what I expect, because this one is not up to date with the primary. Let's actually check the...

C
A: All right, let's just build it... that still fails. Okay, cool. Okay, let's bring them back up and make sure it actually recovers.

C
Yeah
yeah,
yeah,
and
so
now
now
we're
recognizing
that
q3
is
primary,
because
that
was
the
most
recent
change.
Okay,
all
right
all
right!
So
we're
pretty
much
in
time
all
right,
any
any.
Last
questions
for
anyone,
yep
all
right
cool!
Well,
thanks
for
joining
everyone!
I
hope
this
was
useful.