► From YouTube: 20200415: High availability Gitaly demo
Description
A: Cool, it's 4:30 my time, so I'm going to get started here. I've done a little bit of prep work so we don't have to go through everything here. I've set up the Praefect database already, so I'm going to skip that. I haven't actually run reconfigure, but I did want to show one difference here. What I'm doing here is I added a load balancer in for Praefect, so just keep in mind that in this demo we're going to demo multiple Praefects in front of a load balancer. I've actually tested this, so this will be exciting.
Okay, so I'm just going to broadcast to make sure the configurations are all the same; this is basically just the same thing. There are two new configuration options I'm going to flip on now. This is for using multiple Praefects, so I'm going to change that to true, and I'm going to change that to 'sql'.
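(The two options being toggled aren't read out on screen; as a sketch, the Praefect failover settings from the documentation of that era would look like this in /etc/gitlab/gitlab.rb on each Praefect node, followed by a reconfigure:)

    # /etc/gitlab/gitlab.rb on each Praefect node (a sketch; setting names
    # taken from the Praefect docs, not shown verbatim in the video)
    praefect['failover_enabled'] = true
    praefect['failover_election_strategy'] = 'sql'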
All right, so, okay, the docs already told me failover, okay, so we've already got these. I'm surprised that this doesn't have a Gitaly section, though; that's a little weird, okay. So it's basically saying go back to the Praefect... I think it'll, yeah, so I guess I could combine this step, but do we really need to split them up, right?
B: So, the last time I read the doc, the reason it was split up is that after you configure, you basically verify that you can... so if we'd done them in another order, you wouldn't be able to do the verification step associated with the Praefect changes you just made, all right.
A: Okay, let's reconfigure that, and let's go back down there and see; there's a check in there too, right. So this is back to the Praefect docs, okay, so I'm just going to wait for that reconfigure to finish and do this again. Okay, that was fast, all right, the SQL side is happy, good. Gitaly nodes... actually, this is... I got ahead of myself because I'm confident. There's a lot of...
B: You could copy that and paste it somewhere... and I'll put it in the issue; like, I'll just paste this whole thing in.
A: Dump that in there, okay. Sorry for flooding the channel there. Okay, so that's it, so that's good: dial-nodes worked, failover is enabled, SQL is enabled, and I've already reconfigured. The question I have is whether the database migrations are up, and... I'm just going to show this off, because this is a relatively new thing.
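(For reference, the dial-nodes verification mentioned above can be run like this; a sketch assuming the default Omnibus paths:)

    # Run on a Praefect node; dials every Gitaly backend in the config
    sudo /opt/gitlab/embedded/bin/praefect \
      -config /var/opt/gitlab/praefect/config.toml dial-nodes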
A
Don't
really
need
to
run
it
all
them,
but
what
I
want
to
know
is
so
I
thought
there
was
a.
There
was
a
Omnibus
merge
request
to
make
this
automatic
I
may
not
have
been
merged
in
this
one,
so
I'm
gonna
not
broadcast
it
to
all
them
and
just
migrate.
Is
it
in
that?
It's
not
in
our
documentation
either
in
a
migrate
right?
...to run it with reconfigure automatically. So I guess I'm just not sure if I just didn't run it, or the merge request actually isn't in here. So I'm just going to run it for now, but we should check that. All right, so now that I have that, I have the status again and we are good. Okay, so I like that function, and thanks for merging that, Paul, because I was constantly getting annoyed at having to log into the database to figure out what was in it.
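(A sketch of the migration run and the status function being praised here, assuming the praefect subcommand names from the docs of that era:)

    # Apply Praefect's database migrations by hand (one Praefect node is enough)
    sudo /opt/gitlab/embedded/bin/praefect \
      -config /var/opt/gitlab/praefect/config.toml sql-migrate

    # Show which migrations have been applied, without logging into the database
    sudo /opt/gitlab/embedded/bin/praefect \
      -config /var/opt/gitlab/praefect/config.toml sql-migrate-status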
A: So the configuration looks good: the election strategy is there, failover is enabled, okay. So let's see... this is okay, so this is all the configuration. I've actually run this already, but I haven't actually... The one thing I'll note here is that, last time I did this demo, we weren't sure what to do here. This is actually the load balancer IP, so I think I had to run a reconfigure here.
Because I don't know if I actually saved it, but that IP is basically what you see here: 10.1 56. So it's cool, it actually shows me that Praefect is up, because before we started this it was actually unhealthy. So it is a TCP load balancer, so it's actually pinging port 2305, which is the gRPC port, and basically saying it's going to round-robin between those. So, cool, good; that reconfigured fine. Let's go back to the documentation.
A: Right, so the check worked great first try: GitLab can reach Praefect. So does it... oh wait, this will be interesting, because I've actually never run this with the load balancer in front. What's going to happen? We should get... it's faster, it's not as satisfying as the other ones, all right. So while that is cooking there, I... oh no, failed to connect. Praefect's okay, but... Gitaly, oh, Gitaly is...
B: I have an idea, yeah, yeah, yeah. So, well, in the docs I made the default the internal IP, I guess, so that you could move repositories between shards. So you have to actually set the listen address, because right now it's just localhost, but it's trying to find it on the internal IP.

A: I see.
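(A minimal sketch of the Gitaly-side fix being described, assuming the default Gitaly port, in /etc/gitlab/gitlab.rb on each Gitaly node:)

    # /etc/gitlab/gitlab.rb on each Gitaly node (a sketch): listen on all
    # interfaces instead of localhost so Praefect can reach Gitaly over the
    # internal network (8075 is the default Gitaly port)
    gitaly['listen_addr'] = '0.0.0.0:8075'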
A: So I'm just going to go ahead with this change, kick the Praefect now, and we're good, all right. So let's create a new one.
This will be... actually, while this is happening, let's take a look at Grafana, so I can at least see some metrics, if there are any. And I actually haven't set the password... is this admin/admin as the default? Yep. Now I have to set the root... oh.
This is my failed attempt to use Cloud SQL.
The replication latency... but we don't have the new replication delay metrics. Actually, Patrick did that one, right? Yeah, sorry, yeah, that one is merged, so we could add it. It's called like gitaly_praefect_ something for replication delay, so just change the 'latency' to 'delay'.
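(The full metric name isn't spelled out in the audio; as a sketch, a Grafana panel for the renamed metric might use a query like the following, assuming it's exported as a histogram under the gitaly_praefect_ prefix:)

    # Hypothetical PromQL: 95th percentile of Praefect replication delay
    histogram_quantile(0.95,
      rate(gitaly_praefect_replication_delay_bucket[5m]))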
B: Like, we've only really looked at testing, in all these demos, node up / node down, right? We haven't considered other kinds of failure modes, like a degradation, where one node becomes much slower than the others, right? I'm presuming that the load balancer doesn't do anything smart around that, it just...
C: It just... it just operates on the primary; everything else is async.
B: Let's say one of the Praefects notices degradation... I mean, yeah, on the Gitaly side, within the shard, from Praefect to Gitaly; but Praefect's not doing anything smart there. Then the other thing, that I think I created an issue about the other week, is: what happens if, say, the health check succeeds, but it doesn't... so, like, imagine the disk is full. What happens if we were able to fill the disk, or basically prevent the git user on the primary from writing?
Like, would the failover still happen? So, I don't know if we can do something to basically change the user permissions on a Gitaly node, the current primary, so that that primary no longer had permission to write to disk. I imagine the health check would still succeed, because Gitaly is still up, right; it just wouldn't have write permissions. So, like, right, there's a whole range of failure modes that we don't consider.
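(One hypothetical way to stage the experiment being proposed, assuming the default Omnibus repository path:)

    # On the current primary Gitaly node: leave the process healthy but make
    # the repository storage unwritable (a sketch, not a tested procedure)
    sudo chmod -R a-w /var/opt/gitlab/git-data/repositories

    # Undo after the experiment
    sudo chmod -R u+w /var/opt/gitlab/git-data/repositories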
C: We can't really do that with Praefect right now, because you'd have to have something that correlates, you know, which storages of the backends are part of this virtual storage. How do you do that in a really easy-to-manage way? Do you want people creating dashboards manually for every virtual storage and figuring out which Gitalys belong to it, or should that be automated to a certain extent?
B: The proposal I made in the issue, when I was thinking about this problem the other week, was: rather than using a health check, we should be logging failures of operations, so, like, error rates. And so, if the error rate over a very short time span, of like two seconds, on an active node with regular activity... like, if any read or write operation fails like five times in a row, mark the node out, right? Don't rely on a health check, because health checks only happen every...
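(This is just a proposal in the discussion, not existing Praefect behavior; a minimal Go sketch of the "N consecutive failures marks the node out" idea might look like this, with the threshold of five taken from the conversation:)

    // A toy sketch of the consecutive-failure idea discussed above
    // (not Praefect's actual code).
    package main

    import "fmt"

    type nodeBreaker struct {
        consecutiveFailures int
        threshold           int
    }

    // report records the outcome of a read/write operation and returns
    // whether the node should still be considered healthy.
    func (b *nodeBreaker) report(err error) bool {
        if err != nil {
            b.consecutiveFailures++
        } else {
            b.consecutiveFailures = 0
        }
        return b.consecutiveFailures < b.threshold
    }

    func main() {
        b := &nodeBreaker{threshold: 5}
        for i := 0; i < 5; i++ {
            healthy := b.report(fmt.Errorf("write failed"))
            fmt.Println("healthy:", healthy) // false on the 5th failure
        }
    }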
A: I did create an issue about that too, because right now Praefect doesn't consider... I mean, we're just electing whatever node is responding to health checks, right, which is not... like, what happens when a node falls behind, right? What happens if Gitaly 2 needs ten seconds to catch up and Gitaly 3 is up to date? Really, you want to go to 3, not 2, right? So there's all sorts of decisions we need to make about that.
B: Seems to me like the replication delay metric might also be artificially high... oh, that would explain those numbers, because the first... yeah, the replication delay, I think it's incorrectly high, because, essentially, if we put 10 jobs on the queue for the same repo, if we process one, all 10 jobs will probably be up to date, because we've mirrored the whole thing, so all 10 jobs are then no-ops. And so if you've got, like, a higher sustained write volume, right...
It'd be super interesting to know what the write operations per second are on a per-repository basis. Like, we know what they are at an instance level; well, I could get the Gitaly node level, but we don't know, like, for this specific... what's the cross-section like for a single repo? What's the 50th percentile for write ops per second by repo, and then the 99th percentile? Because really it's that 99th percentile of write operations per second at a repository level that's really, like, the limiting factor.
A: No, actually, no, no, you route it to a Gitaly node... sorry, because you're passing a project ID, okay, so it's... let's see, Gitaly...
B: But yeah, you're passing the project ID.
C: ...to replicate anything that's inconsistent with the current primary, or you can point it at whatever node you want. So there's actually some docs on that on the Praefect doc page; it's, I think, closer to the bottom. With multiple Praefects, since they all have the same configuration, you just need to do it on one of them.
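(This is describing the praefect reconcile subcommand; a sketch with hypothetical storage names, flags per the docs of this era, run on any one Praefect node:)

    # Reconcile repositories on gitaly-2 (hypothetical name) that are
    # inconsistent with the reference storage gitaly-1
    sudo /opt/gitlab/embedded/bin/praefect \
      -config /var/opt/gitlab/praefect/config.toml \
      reconcile -virtual default -reference gitaly-1 -target gitaly-2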
C: Now it is, yeah. The first step was making it manual, and we were talking about, when a node rejoins after a certain amount of time, we would run this as a sanity check, to make sure that the user didn't, like, remove the disk and swap it out, or, for example, there's a disk failure. The problem with that is you get into problems where you're defining arbitrary amounts of time to say that someone could have swapped the disk, or there could have been damage or something in the meantime.
C
We're
also
talking
about
putting
some
kind
of
token
on
the
fall
I
guess
we
have
a
filesystem
UUID
that
we
could
use,
and
if
that
you
ideas
change,
then
that
means
it's
not
the
same
file
system
anymore.
So
that
was
another
thing.
I
don't
know.
If
we
have
an
issue
open
for
that
right
now,
cuz
we
had
that
MVC
issue,
I'm,
not
sure
if
we
follow
it
up
off
the
chart.
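(For reference, one way to read the filesystem UUID being discussed; a sketch that assumes the repositories sit on their own mount at the default Omnibus path:)

    # Print the UUID of the filesystem backing the Gitaly storage
    findmnt -no UUID --target /var/opt/gitlab/git-data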
B: You'd want it to just be like, oh yeah, you would just be an inconsistent node; like, the next replication job would be like, okay, can't make quorum, let's resync it. And then the only concern is failover: how do you handle failover, like, until the node is fully in sync? You can't really fail over; that's, I guess, like, the problem. Yeah, see, it's stuck there.
But... "failed to persist replication job"? Oh, I wonder if it's having issues talking to the database.
B: Yeah, so I think... I'd really like it if anyone's got any other ideas on the kinds of failure modes that could occur that are not just, like, the server going dark. My guess would be it's, like, commissioning problems: say you incorrectly configured one of the Gitaly nodes, and, like, it was working, and then you push some config change to your fleet, and you're incrementally rolling out a config change, and some percentage of your nodes were, like, partially operational but essentially unusable, so, like, the data directory was...
B
You
commissioned
it
wrong
or
storage
was
full
trying
to
think
of
other
ones,
like
the
other
one
that
comes
to
mind
that
that
that
who
could
be
like
some
sort
of
a
latency
one,
we're
like
one
node
is
getting
really
slow
and
unresponsive.
I
mean
I.
Think
our
timeouts
are
quite
high.
So
like
we
wouldn't
catch
an
error
because
it
would
just
be
slow.
But
then,
if
you
like
got
a
different
shard,
it
would
be
fast
and
I.
...think that's the most interesting edge case I'm worried about, because somewhere in our, you know, interactions between the Praefects and the database, there could be some kind of bottleneck that isn't exposed until one of the nodes is really slow, because we're kind of doing distributed locking of this data set, and what happens when one of them is just really slow? Maybe we want to have some kind of slow mode where we, like, inject sleeps and stuff into various actions.
A: What we could do is, like, create a matrix or a spreadsheet of, like, failure modes, right? Like, this is what you do in the auto industry or some safety-critical thing: you write down the possible things that can go wrong and level the risk, and then maybe what you're going to do about it, kind of thing, right? So, I'd have to figure out what the term is, but it's basically a failure analysis, and we need to do something like that here, yeah.