From YouTube: 2020-04-01: High availability Gitaly demo
Description
C
All right, I'm gonna go ahead and — hi James, welcome aboard — I'm gonna get started here. Today's demo is going to involve multiple Praefects, something I don't think anybody else has done yet. So if you look at this picture, we've claimed to have multiple Praefects, but we actually haven't done this before, so this will be the new wrinkle, and then I think we've also added the replication queue to this whole system. So we'll be demoing some new stuff here. So now, I've gone ahead and I've done most of the basic prep work.
C
All this stuff — getting Postgres up and running — is done. Getting Praefect, too. Like I said, the difference is what I actually have now: on this you can see at the bottom these are three Praefect nodes running in, you know, the cloud, and then the top three are Gitaly nodes. But I did want to leave some things to configure, so we could all sort of get on the same page. So this is all done on — on three of the Praefect nodes.
C
That's all the same on the Gitaly nodes, and likewise I configured all these nodes to be up with these configuration settings — didn't run into anything strange there. The only thing I left was just configuring GitLab itself here. So here we say configure and add the Praefect cluster, and this is a little confusing because it's called "praefect" and then I've got, like, multiple Praefects. But anyway, that's — I don't know, it's intentional.
C
Something like making it obvious that it's a virtual storage, or something like that, because I was almost tempted to put, like, Gitaly stuff and Praefect stuff in here, but then I saw that. Anyway, if we look at this config here, I went ahead — we're just using the same external token. But this, I guess, confused me a little, because this is supposed to be Praefect — only one Praefect — like we're assuming there's gonna be a load balancer in front of this Praefect IP now.
B
I think it would — it would be a load balancer, that's my guess, yeah. But the other — the other question is, like, what load balancer do we recommend? There are probably, like — sorry, AWS and Google load balancer products that you could use, or maybe you could run your own load balancer. Like, what load balancer would that be? And does Omnibus need to package a load balancer that can be used for this purpose? Yeah.
C
These are all good questions. I think for GitLab.com the simplest thing we would probably do — we're already using the Google load balancer — is just basically plug in that IP address. But then it does bring up the question: how do we decide a Praefect node is unhealthy? Is it just a TCP check? So anyway, all these questions we'll have to answer. I'm not going to answer them today, but I'm just raising them because, as I was going through this documentation, I was a little confused. Okay.
C
And while that's going, I'm just gonna go ahead and — well, let's see. Okay, so everything's good there. I'm gonna go to the UI, and I think there was a bug before that you guys were worried about: if you click this button, it doesn't actually take effect right away. It should now, so I'm gonna toggle that off. Is that what we tell people in the documentation?
C
I'm trying to see and make sure — so Prometheus should be bringing in the latest scrapes. Okay — oh, I know, okay. So I wanted to show, now that we've got three Praefects, they are each kind of pinging all the different Gitaly nodes, and I think before they were just doing this in memory — I think, John, your implementation was just, you know, on one of the Praefects. And if you see here — it's kind of hard to see — you can see there's two right now.
C
There's two Praefect nodes both saying gitaly-1 is the primary. And so how do they know? How do they agree on it, right? Well, they use this database now. So if you look at the database, there are these two new tables: the Praefect node status and the Praefect shard primaries. So the node status table is — it's just really a table of —
C
It's just showing I've got a bunch of Gitaly nodes, and so you see Praefect 1, 2 and 3 on the left side, and then you see which Gitaly node — gitaly-1, 2 or 3 — and then each row represents, like: hey, I last tried to talk to this node, and this is when I last saw this node active.
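A minimal sketch of the kind of health-tracking table being described — the table and column names here are assumptions for illustration, not the actual Praefect schema:

```sql
-- Hypothetical sketch: one row per (Praefect, Gitaly node) pair, recording
-- when that Praefect last tried to reach the node and last saw it healthy.
CREATE TABLE node_status (
    id                       BIGSERIAL PRIMARY KEY,
    praefect_name            TEXT NOT NULL,   -- which Praefect ran the health check
    shard_name               TEXT NOT NULL,   -- the virtual storage, e.g. 'praefect'
    node_name                TEXT NOT NULL,   -- the Gitaly node, e.g. 'gitaly-1'
    last_contact_attempt_at  TIMESTAMPTZ,     -- last time this Praefect tried to reach it
    last_seen_active_at      TIMESTAMPTZ,     -- last time the node answered a health check
    UNIQUE (praefect_name, shard_name, node_name)
);
```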
C
So you can see we're missing one here, and that means it wasn't able to talk to gitaly-3. Well, actually, the reason gitaly-3 isn't being configured yet is that I purposely left it off. If you actually look at the Praefect — this is Praefect — if you look at the Praefect logs, you'll see that the health checks are failing for that node; see these failures here. So I'm gonna go ahead and — so, if you look at the gitlab.rb here, it's pretty much empty.
C
It's basically the default config, so I'm gonna put a working version of it here, with all the right configuration — Gitaly enabled set to true, internal IP set — and I'm just gonna go ahead and reconfigure that. And then, when we do that, we should see this table being updated once it's up, with no more gaps there. I will point out that in Praefect there's new configuration — this hasn't been merged and isn't in Omnibus yet — but for the purposes of this demo I'm showing you which configuration is being used now.
C
So this is going back to the whole SQL election process here. Now, I mentioned there were two tables in this database. The other table I'll show is the elected primary. So if I look at what's there, it's a single row right now, because it's saying: hey, for the shard — what we called "praefect" — the current primary is gitaly-1. And this just tells you that, say, gitaly-1 — Praefect just decided this is going to be the leader, and then everybody else is going to just look at this database.
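A similarly hedged sketch of the elected-primary table and the lookup every Praefect does against it — again, illustrative names only, not the real schema:

```sql
-- Hypothetical sketch: a single row per shard recording the elected primary.
CREATE TABLE shard_primaries (
    shard_name  TEXT PRIMARY KEY,               -- the virtual storage, e.g. 'praefect'
    node_name   TEXT NOT NULL,                  -- the Gitaly node currently elected primary
    elected_at  TIMESTAMPTZ NOT NULL DEFAULT now()
);

-- Every Praefect answers "who is the primary?" from the same place:
SELECT node_name FROM shard_primaries WHERE shard_name = 'praefect';
```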
C
You can see, just periodically, there are two different queries. This one is just, like, telling me: hey, who's the primary, right. Obviously we did that select statement, and all the nodes are saying: hey, this is the primary, I'll update my internal state to reflect that, periodically. And then the other one is: I'm just gonna update the state of what I see when I ping. All these Praefect nodes are doing ping-based health checks against all the different Gitaly nodes, and then each one is just updating its database.
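Roughly, the two periodic queries being described might look like this, continuing the hypothetical schema sketched above:

```sql
-- 1. Read back the agreed-upon primary and refresh the locally cached copy.
SELECT node_name
FROM shard_primaries
WHERE shard_name = 'praefect';

-- 2. Record the outcome of this Praefect's latest ping of a Gitaly node.
INSERT INTO node_status (praefect_name, shard_name, node_name,
                         last_contact_attempt_at, last_seen_active_at)
VALUES ('praefect-2', 'praefect', 'gitaly-1', now(), now())
ON CONFLICT (praefect_name, shard_name, node_name)
DO UPDATE SET last_contact_attempt_at = EXCLUDED.last_contact_attempt_at,
              last_seen_active_at     = EXCLUDED.last_seen_active_at;
```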
B
I mean, it would only impact, I guess, like, a transient failure where, like, some other node has decided to elect a different Gitaly — it selected a new Gitaly, but then immediately after that Gitaly is elected, gitaly-1 becomes accessible again, and so a Praefect that doesn't yet know a new Gitaly was elected continues writing to gitaly-1, and, you know, it's like a split situation.
C
Yeah, so that's a split brain. That's why I'm thinking — right now the current merge request does just update this internal state, but I'm thinking of just removing that and going directly to the database every time we need to know who the actual primary is. But yeah, that's a question.
B
Sorry, go ahead. I guess my question is: do you think we should just sort of, like, start with what we think is a good default, which gives us more flexibility to completely change and rewrite that whole mechanic? Like, this is offering some sort of, like, variable configuration which allows us to make it more or less aggressive, yeah, right now.
C
But yeah, the thresholds and all that — what is a failure right now? So this table is being used as kind of, like, an election process, because it will only consider healthy nodes — that is, ones that have two or more Praefects saying they're up, right? So if you actually look at the query that it's running, this query here is basically saying: give me the list of active nodes that two Praefect nodes said were up, right. This is kind of like your quorum, right. So if —
B
Cool. I guess we could also start with a relatively conservative approach anyway — like having an automatic failover after a node has been inaccessible to a majority of nodes for, like, 30 seconds, and then automatic recovery. In most situations that is, like, ideal — it's a good outcome compared to, like, a complete outage. So if we're comparing to the current situation, being conservative so that we avoid, like, a node bouncing every couple of seconds — maybe it's better to start there and then dial it down as we become more confident in the methodology, right?
C
I think having it in Prometheus is gonna be, like, the number one priority; having it in the UI is kind of nice, but I know from, like, Geo experience the UI was handy, definitely, for getting kind of high-level summaries, but for knowing when failovers happen, these graphs are much more important, right — having, like, timelines and knowing which nodes are where. Especially if we have, like, 40 shards, that's going to be a lot of data to have to show on our admin panel.
C
I don't really have a demo to bring down Praefect right now, 'cause that's — that's the next step, and that's something, you know, we're gonna have to figure out as well, right. Like, if we use, like, a load balancer and somehow Praefect encounters an error, how do we bring down that Praefect gracefully so that it doesn't, like, screw everybody else? Oh yeah.
B
— over to using the SQL one. So just to clarify: the current configuration you've got is actually pointing the GitLab application at just Praefect 1? So if Praefect 1 goes down — right, exactly — and so that also means that the replication queue would be entirely stored in memory on Praefect 1. That's —
D
I was just wondering about the — the consensus. I haven't visited this in a while, since I think the original discussions came up on using Consul and all that. The — the main thing that we're concerned about here is just the split brain, where we're — we're sending queries to a primary and it's not getting replicated because there's, like, some kind of network isolation, right — where we've got some kind of — we've got, like, a network partition. That's — that's the primary issue that we're concerned about when we're looking through the demos and evaluating the leader election for a shard.
C
There's a couple things, right. First of all, you've got to nominate a leader, right, and so we — we need HA in Praefect, we need HA in the Gitaly nodes, right. So in order to have — let's say, you know — having multiple Praefects means they all have to agree on who is going to be the primary for the Gitaly node, right. So we're kind of tackling a bunch of problems here; we're tackling: okay, they all have to agree.
C
So you can't have it in memory, because they all have to share this kind of — the consensus that gitaly-1 is the primary, right. And then the second thing is: okay, how do we get them to agree on which one is up, right? So my first iteration was: let's just, you know — basically, whatever Gitaly node is healthy, the first one, will try to be elected as, like, the primary, and just that will be — that will be the primary, and that — that works.
C
That works okay, right, because, you know, if it's just, like, a greedy thing where everybody just tries to nominate somebody, and whoever, you know — whoever happens to insert it into the table wins — like, that — that can work. But then what happens if you get to a state where only one of these Praefect nodes can actually talk to that Gitaly node, right? Now you have a problem where suddenly, like, that node is clearly broken, but the — two out of the three nodes are saying: hey, somebody else should be the leader here now.
C
What do you do, right? So this is why we made this a little bit more complex, where I have the separate table and then I — I do a query based on the activity of who reports in here to actually do the election, right. So everybody is doing this similar query — I mean, if you look at this log, everyone is basically running the same query, like: these are the healthy nodes. This is a healthy-node query, so they should, in theory, get consistent results. So something like this — you know — actually, this is not, yeah.
C
This is the active one. So assuming that, you know, there aren't any — there isn't too much craziness with, you know, ping times, everyone should be reporting in within 10 seconds, so they should all come up with a consistent list of who's healthy, right, and then, based on who's healthy, they should nominate almost the same node each time, right. So even if they — right now, the thing is, if this — if this entry is healthy, it doesn't do anything, right. Okay, I agree that this node is fine.
C
I'm not gonna touch it, right. As soon as we get to the point where this node ends up being deemed unhealthy, then somebody's gonna say: hey, that guy is no longer good, we need to do something. And then each — each Praefect node basically does the same query, and they should come up with the same result and say, you know: gitaly-2 goes down, gitaly-3 is now the new guy.
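When the current primary drops out of that healthy set, each Praefect can run the same deterministic re-election and land on the same answer. A sketch, reusing the hypothetical tables above; because every Praefect runs the same ordered query, they should all converge on the same new primary:

```sql
-- Promote the healthiest remaining candidate (assumes at least one candidate exists).
UPDATE shard_primaries
SET node_name = (
        SELECT node_name
        FROM node_status
        WHERE shard_name = 'praefect'
          AND last_seen_active_at >= now() - INTERVAL '10 seconds'
        GROUP BY node_name
        HAVING COUNT(DISTINCT praefect_name) >= 2
        ORDER BY COUNT(DISTINCT praefect_name) DESC, node_name
        LIMIT 1
    ),
    elected_at = now()
WHERE shard_name = 'praefect';
```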
C
You know, that's the — that's the next step, but this is really our, you know, our bridge — our gap — because, you know, Consul is gonna be another component; we're gonna need to get that up and running, there are gonna have to be nodes for that. And so this is sort of the interim step of helping us get to Gitaly HA without having to introduce a whole new component.
B
It also provides a bridge towards a cloud-native approach, because it's — it would be quite possible to run the Praefect nodes in Kubernetes in some kind of configuration, and we probably don't want to be running Consul in Kubernetes when Kubernetes provides, like, alternative primitives that may be able to solve some of this. So, yeah.
C
Anyway, note that the code — you merged the code refactoring that allows us to kind of plug in different strategies. So hopefully we can, you know, eventually settle on, like, the right strategy, the one we think is going to be, sort of, what the majority — what everybody wants to do. So this is kind of a stopgap, I'd say.
C
That's actually — that's a good question. I mean, this — I mean, this is what this table is for. So if we want to use this as a health check for secondaries, we can do that too, right — like, we can — we can start to mark unhealthy secondaries as well. But that's, like — yeah, the main focus is getting the primary right, and then obviously we need to figure out, like, okay, what happens when a secondary does go down? It's probably less of an issue, because the primary is the most important thing right now, right.
D
Cool, thank you. So, this is kind of a step back: I'm wondering, do we actually need to — do we actually have to worry about split brain if we are introducing transactions, where we have, like, a majority quorum that is getting written to? If we know that we always have a quorum every time that we're, you know, making a change and then seeing it propagate to the secondaries, do we still need to worry about split brain?
B
I think so, because, like, the quorum doesn't mean every node agrees. So we should consider the situation where you've got, like, some repository that isn't in sync, because it might not be a whole-node-level failure, and you could end up failing over to a different Gitaly node, but for some reason the most recent transaction failed on that node for one of the repos.
B
That's where the replication queue can help, because that can essentially become a repair queue. So the replication queue essentially looks at nodes that are considered good, and so, if we're tracking with fine-grained resolution in the tracking database — which I think Pablo's work would lead to — we can know, like: oh, this repo on this shard is stale. So, like, basically, not only should it be recovered, but also excluded from transactions for the time being, because we shouldn't be considering bad nodes in future transactions, probably.
D
Yeah, good point. I'll create an issue, so we can chat more about that async.
C
Yeah, that's an interesting question, though. I think it'd be nice to design the system where, like, if we did have a split brain, then it may be mitigated by that whole consensus — like, three-phase commit, right. Like, okay — maybe I got the wrong — let's say I have A, B and C, and somebody thinks A is the master, but somebody else thinks B. Well, if they happen to go to those nodes, they all say: okay, well, you know, we agree on something, and then eventually it gets persisted anyway.
B
Yeah, I was gonna say, like, depending on the situation, like, it might be possible to, like, try and recover in — in the middle of the transaction — like, Praefect might decide we've reached quorum and then, like, force the non-agreeing nodes to, like, come into sync in the transaction, because then you're reducing the situations where — and if, for some reason, that node refuses and is stubborn, like, doesn't come into consistency, you just mark the whole node as, like, bad, because, like, that would be treated as, like —
C
I'll have to look at what other people are doing about debouncing, right, because, like, you know, this is a concern that we have. In this case, you know, the bouncing would happen if, like — let's say, you know, different nodes went up and down, like, let's say, like, the network — and, maybe, you know... so in this case, like, two out of the — two out of the three nodes have to say: hey, that guy is not healthy anymore, right. So again, that helps mitigate it, because it's not just one Praefect node saying that something is wrong.
C
Then at least there's a little more stability there. You know, you could add debouncing by, you know, backing — backing off your failover thresholds a bit — like, if it happened in the last, you know, minute or so, you start to increase the threshold at which you actually do fail over. But I don't know — like, we — I think we're gonna have to look at what other products are doing, because I don't think — I'm guessing Patroni doesn't do anything that sophisticated either, like, I think you can have —
C
Yeah, I'm wondering if — I'm wondering if, you know, we could get to the point where bouncing isn't as big of a deal as we think it is, because, let's say we just keep rotating primaries — you know, that may be okay; hopefully it isn't such a — such a pain to recover from. But again, like I said, I don't — I haven't really thought through all the kinds of things that we're worried about here.
B
Yeah, I mean, it's a good point — like, maybe it's not a problem to be solved urgently if we're not observing it. And maybe all we need is, like, some kill switch, so that it's, like, really easy for an admin to be like: just stop doing automatic failover — like, if we observe we're in a situation where some really bad bouncing is happening, just stop. Maybe there's an easy way to turn failover off across the whole fleet, or, like, a specific shard, rather than having to run reconfigure on all three nodes.
C
I'm not sure we want that in configuration, 'cause it could change from time to time. But maybe what we need to do is keep track of how many failovers have happened in the last, like, minute or so, and have a threshold, right — like, if — if we exceed that threshold by some amount, we either stop doing failover, or we, you know, send an alert and we do something.
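One way to express that rate limit, assuming primary changes were also written to a (purely hypothetical) audit table with a timestamp:

```sql
-- Count failovers for this shard in the last minute; if this exceeds a
-- threshold, skip automatic failover and raise an alert instead.
SELECT COUNT(*) AS recent_failovers
FROM shard_primary_changes          -- hypothetical audit table of primary changes
WHERE shard_name = 'praefect'
  AND changed_at >= now() - INTERVAL '1 minute';
```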
C
And then the second question is: do you raise your failure thresholds? Like — like, right now, a node is deemed inactive if it has not reported in once in the last 10 seconds, right. That's — you know, I think, John, your in-memory one does a little bit more, like: hey, if it's — if it hasn't passed three health checks in a row, then it's down, right. Like, we can — we can tune our thresholds.
C
We can either, like, you know — if we start to see it flapping, we start either increasing the threshold by which you decide a node is up or down, right. You can tweak those knobs to — to say: hey, look, we're not gonna fail over, because, you know, somebody's got a, you know, a choppy network, and so we'll, you know, hold off and not worry about that until they really are getting higher failure rates. So —
B
I'm just thinking about, like, GitLab.com — essentially, what, like, kill switches do we need? To be more, like — I think we've got the playbooks where we can just turn it off by, like, running reconfigure, which is a little slow, though, right? So, like, what would that mean? I was also asking the other day, like: oh, how do you feel about, like, running automatic failover in production once we have that running, in the next, like, two weeks? And it was like: oh yeah, feeling pretty good about it. And I was like: okay, that's surprising.
B
But I'm also glad you have such confidence in the team, because they are all smart people. As a product manager, like, if there was, like, some sort of really fast kill switch where we could, like, see a problem and just be like — Slack, stop — and then, like, prevent that being a problem — and...
B
The kill switch is not a bad idea — like, just having some sort of thing where, if there's a breach of some level of flapping, then just, like, disable automatic failover for, like, some amount of time. It's, like, just a database row that basically says, like, failover is enabled — or it was disabled for 24 hours, or, like, 6 hours — until someone could, like, reevaluate the situation.
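The "database row as kill switch" idea might look something like this, purely as an illustration (hypothetical table and columns):

```sql
-- Hypothetical per-shard failover toggle.
CREATE TABLE failover_settings (
    shard_name        TEXT PRIMARY KEY,
    failover_enabled  BOOLEAN NOT NULL DEFAULT true,
    disabled_until    TIMESTAMPTZ            -- e.g. now() + INTERVAL '24 hours'
);

-- Before electing a new primary, a Praefect would check whether failover is allowed:
SELECT failover_enabled
       AND (disabled_until IS NULL OR disabled_until < now()) AS failover_allowed
FROM failover_settings
WHERE shard_name = 'praefect';
```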
D
That's what I'm thinking — I mean, maybe the bouncing — maybe bouncing around is really what we want, because if we can't communicate with something, why would we stay elected to it? I — I would think, if we saw a lot of bouncing, we would just want to kind of increase the amount of time it takes to make a decision before the next bounce.
C
Right now it just — so, it's actually, like — the way this thing works right now is it will sort by the name — right now it's ascending — so it'll sort by the number of Praefect nodes that say that thing is up, and then tie breaks are made by just the name right now. It's, you know — obviously we can make that random, if we care, if we wanted to.
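For example, keeping the "how many Praefects see it up" ordering but swapping the name tie-break for a random one (same hypothetical schema as the sketches above):

```sql
SELECT node_name
FROM node_status
WHERE shard_name = 'praefect'
  AND last_seen_active_at >= now() - INTERVAL '10 seconds'
GROUP BY node_name
ORDER BY COUNT(DISTINCT praefect_name) DESC, random()  -- random tie-break instead of name
LIMIT 1;
```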
B
I'm thinking more about, like, in that admin settings area, it would be interesting to have, like, some failover configuration settings — like, a toggle to turn failover on and off — because then you can expose it through, like, the standard GitLab APIs, and then you don't have to have, like, some special tooling that's, like, kind of unique. That's more what I'm thinking, right.