From YouTube: GitLab Geo - Self-service framework discussion
B
It'd be a very informal discussion of how the Geo self-service framework works to replicate and verify everything that it does. It kind of would be nice to have a whiteboard.
A
Would slides maybe help describing things? If not, don't worry, we can…
B
Like this little diagram here. So first, Geo: the idea is, you have your GitLab deployment, and you have maybe another location with many developers at that location.
B
You know, I guess we may as well go through an overview of all of the pieces. What's underneath here is the Postgres database, which is the main database. All the data in GitLab there is being replicated with just standard Postgres streaming replication to the secondary site, and then it's kind of Geo's job to replicate everything else, which is Git repos and blobs of various types.
B
That seems very simple on its face.
B
But when you get into how do you make sure that everything is replicated, and how do you make sure that there was no corruption on the way over, for example, or how do you make sure on the secondary site... like, say a sysadmin accidentally deleted a whole directory of blobs or repos: how do you know about that? Because it really needs to be there for people to use it and for disaster recovery. So…
B
It worked, and we also added support for uploads, which are, for example: if you make a comment in an issue or merge request and you attach an image, that's an upload. If you…
B
Things
like
that,
a
lot
of
things
fall
into
that
table
and
then
we
also
replicated
lfs
objects
and
drive
artifacts,
so
that
was
all
working.
Okay,
except
that
every
time
we
added
a
new
type
like
job
artifacts,
it
was
like
a
massive
effort.
It
would
take
a
single
developer.
You
know
six
months
to
accomplish.
B
And meanwhile, we also added verification for Git repos, for project and wiki Git repos, and what that would do is it would take all of the…

B
We don't have everything on the secondary now... that's kind of... yeah.
B
So if we go into what the structure of a Git repository is: the refs are pointers to…
A
As an example: if there was a commit, there would be a ref for that commit. Would that be accurate? Or is there a ref for a commit, a ref for a comment, etc., etc.? Or have I misunderstood the concept?
B
So, you know, now you're testing my Git knowledge, so I'm just trying to remember what it's like. I think it's more like: say you have a Git repo with a whole bunch of commits, but only one branch, master or main, same name. And so what you're going to have in the structure of the .git directory is, you're gonna have…
B
So that's a good enough way to spot-check something. The refs are not enough to know that nothing is corrupted, but it's a pretty good way to say that if it doesn't match, then you definitely don't have everything. Gotcha. Yeah, so anyways, that verification piece was added, and as part of that, you know, just managing the process of: okay, checksum all of the repos on the primary, checksum all of the Git repos on the secondary.
B
…Which has tables like the project registry. So for every project, it's going to have a corresponding row in the project registry table on the secondary, and it's going to have data like: is this synced, or did it fail to sync? When was a sync attempted? If it failed, when should we retry a sync? When was verification attempted? Did verification fail? If so, when should we retry verification? All that kind of stuff. And there's a table for each: for projects, for uploads, for LFS objects, for artifacts, in the Geo tracking database.
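To make that shape concrete, a tracking-database registry table along these lines would hold the per-resource bookkeeping just described (a sketch; the column names are assumptions based on the fields mentioned, not the exact schema):

    # Sketch of the per-resource bookkeeping the Geo tracking database holds;
    # column names are illustrative assumptions based on the discussion above.
    class CreateProjectRegistry < ActiveRecord::Migration[6.1]
      def change
        create_table :project_registry do |t|
          t.integer  :project_id, null: false    # which project this row tracks
          t.integer  :state, default: 0          # is this synced? did it fail?
          t.datetime :last_synced_at             # when was a sync attempted?
          t.datetime :retry_at                   # if it failed, when to retry
          t.integer  :retry_count, default: 0
          t.datetime :verification_started_at    # when was verification attempted?
          t.text     :last_sync_failure          # why the last sync failed
        end
      end
    end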
B
For Git repos, the basic way that something is replicated to the secondary is that the secondary actually does something similar to what a user would do.
B
It
does
a
git
fetch
against
the
primary
site,
for
that
particular
git
repo
and
it
authenticates
with
jwt
authentication,
which
I'm
not
a
security
expert.
But
it's
it's
that
that
piece
was
needed
because
it's
like
it's,
not
a
user
on
the
secondary
doing
a
get
fetch.
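Roughly, the exchange might look like the following sketch. The token scheme, claims, and endpoint here are assumptions for illustration only; the actual Geo request signing is GitLab-internal:

    require "net/http"
    require "jwt"

    # The secondary signs a short-lived token with a secret shared between the
    # sites, instead of presenting user credentials.
    shared_secret = ENV.fetch("GEO_SHARED_SECRET")  # assumed way to obtain the secret
    token = JWT.encode(
      { scope: "geo_repository_fetch", repo: "group/project", exp: Time.now.to_i + 60 },
      shared_secret,
      "HS256"
    )

    uri = URI("https://primary.example.com/group/project.git/info/refs?service=git-upload-pack")
    request = Net::HTTP::Get.new(uri)
    request["Authorization"] = "GL-Geo #{token}"    # header scheme is an assumption
    response = Net::HTTP.start(uri.host, uri.port, use_ssl: true) { |http| http.request(request) }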
A
So, if I may: the secondary fetches all of these individually. So it'll get a list of objects, blobs for example, that the primary has, into its tracking database. Then it knows what it has synced and what it hasn't, and then it will fire off jobs to fetch the missing ones.
A
Let's take blobs as an example. To go fetch those, will there be a separate request for each item from the secondary? It wouldn't be like: I want these items as a collection? It would be individually firing off requests... yeah, yeah.
B
The nice thing with doing the Postgres streaming replication, that kind of underlying layer, is that the primary can send data to the secondary just through that, so we don't have to make calls for absolutely everything. So yeah, the primary will just checksum things and have it stored in the database.
B
Yep, and the secondary will see that. So that kind of gets into backfilling versus events... but yeah, do you have any other questions?
A
I think you've touched on it, so I'll wait for that topic to be discussed before I ask any questions. So yeah, let's get into it.
B
So, let's just go with... now we're going to just talk about the self-service framework and how that works, because we're just going to be talking about the logic, just like the logic of how we do this and do that. And we're migrating projects to the SSF currently, so we're pretty far along, and we don't…
B
Yeah, so there's a lot of moving pieces, so we'll start with kind of how the development of the SSF started and how it grew. So the beginnings of it were…
B
We know that we want the secondary to do things responsively. So if something changes on the primary, we want it to be reflected on the secondary as soon as possible, and that means that we really shouldn't be doing it with some kind of background process churning through every single thing looking for what needs to be synced, because that's too slow. So…
B
So we need to implement some kind of eventing system, and there are so many ways to do it. One of the choices that we made was, one, to not introduce new infrastructure, like RabbitMQ, you know, things like that... but it certainly was brought up, because it was built for that kind of thing.
B
We just built it on top of Postgres. We already had some kind of architecture doing that for the legacy logic, so we kind of just built on top of that.
B
Yeah, there should be a very high bar for introducing new architecture, new infrastructure. At GitLab, you know, one of the values is boring solutions, and I think that makes sense for our purposes, and it certainly has been working so far. Okay, so…
B
…And LFS objects. So we knew we wanted to make everything that could be reused, reusable. So…
B
So there is a document with kind of an overview of the self-service framework. Now, it hasn't been updated a lot in recent months, but it does still basically work the same, for the most part, yeah. So maybe before we start, it is worth just briefly talking about the names of things, because many of them are overloaded.
B
I mean, it was 'a model' before, because that was kind of confusing, so I'll say 'a resource': a resource is the thing that you want to replicate, basically. A data type is... okay, so we've kind of done a terrible job of sticking to this terminology.
B
Okay, a replicable: that's the thing you want to sync. It's a resource that Geo wants to sync. So…
A
Gotcha. So there isn't a data type that is a database today, but we could have something like that in the future?
B
Yeah... well, I don't think so, because for database kinds of data it'll always be different. If we were going to make the SSF handle some kind of thing like that, it would have to be like: for some reason you've got, like, a hundred container registries or something in GitLab. I don't know why it would be like that, but you know, otherwise I don't think we're gonna…
A
I don't quite follow, but we can come back to that. I was just curious, because I totally understand the Git repositories, the blobs... the database, obviously there is a Postgres database, but to me it feels like the Postgres database forms part of the SSF framework, rather than the SSF framework managing the Postgres database. So yeah.
B
That's definitely accurate. And, you know... so the SSF doesn't handle the container registry, but the container registry is basically a database.
B
But anyway, moving on: a replicator, that's kind of more for when you're in the code. If you're looking at the Ruby code, you need to know that the replicator is the object that knows how to replicate something.
B
You'll call a method on the replicator to fire events, which is from the primary, and then on the secondary you'll call a method on the replicator to consume those events.
B
Yeah, so because of this whole eventing idea, in order to keep the secondary responsive, the primary needs to, somewhere, somehow, when something's changed or created…
A
…Fire... put it into the database, which gets sent across to the secondary. So it's the primary creating the entry in the database, that's the producing part, and then it gets replicated. Then the consumer would be the secondary, when it actually sees the replicated data at that point. Gotcha, just wanted to make sure I understood the model. Thank you.
B
Yeah. Oh, and initially we tried to write this so that it could swap out the underlying eventing system if needed. I think it's turned out that we won't need that, but yeah.
B
Okay, that's something to know: the old code that we used to carry these events from one side to the other is not described in here, but remind me to talk about that later.
B
So now we're going to start looking at the Ruby code, the DSL that we have going on in the codebase.
B
Yeah, so to replicate something, you have to write a replicator for it. So that's going to be, like, a PackageFileReplicator class, and ideally you don't need to write a whole bunch of specific code yourself, because, as we know, downloading a package as an HTTP file transfer is the same process as downloading an upload or a job artifact. And, you know, this object…
B
If
you
call
dot
file
on
this
thing,
then
that's
going
to
give
you
that's
going
to
give
the
strategy
this
thing
that
it
needs
carrier,
wave,
uploader,
okay
or
the
model
here
packages
calling
colon
package
file
is
the
active
record
model
that
represents
the
packages
number
sport
package,
underscore
files
table.
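Roughly, the replicator being described looks something like this (a sketch modeled on the public SSF docs; exact class and method names may differ by version):

    module Geo
      class PackageFileReplicator < Gitlab::Geo::Replicator
        # Reuses the shared blob replication logic (HTTP file transfer).
        include ::Geo::BlobReplicatorStrategy

        # The ActiveRecord model backing the packages_package_files table.
        def self.model
          ::Packages::PackageFile
        end

        # Gives the blob strategy the CarrierWave uploader it needs.
        def carrierwave_uploader
          model_record.file
        end
      end
    end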
B
And
so
yeah,
all
the
all
the
reputation
logic
is
hidden.
You
know
you
only
need
to
to
say,
like
the
actual
specifics.
A
So you're inheriting from the superclass, as you say. All…
B
Yeah, yeah. And here, in that ActiveRecord model, the package file model, we tell you: you've got to include this thing, because that's going to provide the ability for, like, the replicator to call certain methods on this model, and also to say that the replicator that replicates this thing is PackageFileReplicator.
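The model side then looks roughly like this sketch (names modeled on the SSF docs; the include and the with_replicator declaration are the two pieces just described):

    module Packages
      class PackageFile < ApplicationRecord
        # Provides the methods the replicator needs to call on this model.
        include ::Geo::ReplicableModel

        # Declares which replicator class replicates this resource.
        with_replicator Geo::PackageFileReplicator
      end
    end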
B
So,
for
example,
if
you
have
in
if
we're
talking
about
rails
code
now
I
mean
you
know,
because
this
is
active
record,
you
can
do
package
file
dot,
find
that
id
you
can
call
dot
replicator
on
it,
and
it
gives
you
an
instance
of
package
file.
Replicator,
okay
and
package
file
replicated.
Oh
sorry,
go
ahead.
B
It's
the
id
in
the
table
of
that
row
a
particular
row,
okay,
so
like
in
in
in
your
in
your
browser
here,
for
example,
like
issues
number
11,
that's
the
iid
so
that
that
item.
B
And here it says you can use the replicator to generate events. So again, if we're on the primary: if you call publish created event on a replicator, like, if you got it by doing this... you have the package file model instance and then you do .replicator, and then you call publish created event, that will create an event in the database that says 'oh, this package file number four was created', so that the secondary can act on that event.
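Put together, that console flow is roughly the following (the method name follows the phrasing used in the discussion and may differ slightly in the code):

    # In a Rails console on the primary:
    package_file = Packages::PackageFile.find(4)
    replicator   = package_file.replicator   # => a Geo::PackageFileReplicator instance
    replicator.publish_created_event         # records "package file 4 was created"
    # ...as an event row in the database, for the secondary to act on.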
A
Say on the primary, initially, somebody added a whole bunch of files, all right... blobs, I'm going to stay generic: a hundred blobs.
A
Would the producer generate a hundred events immediately, or is there a throttling algorithm there to not overwhelm... for the secondary not to overwhelm the primary? Or does the secondary manage that on its own? What mechanism stops that from happening?
B
So, like, if you push to a Git repo…
B
I believe that's true in the case of Git pushes and Geo, like, repository created events being created... that's done asynchronously, so…
A
Yep, and the queue is serviced... but something's dispatching the work, right? So it'll pick... So I guess the Sidekiq queue is on the secondary side, right?
B
For creating those events on the primary, the producer side, right... I believe that's done asynchronously, so you could end…
A
Gotcha. So, just to make sure I understand this correctly: when you commit a change or create... let's say you create a project. That initiates a Sidekiq job that's responsible for writing to the database, which then gets replicated to the secondary. That's the flow of events, so nothing writes directly…
A
The task of writing to the database gets enqueued onto a Sidekiq job and... okay, understood. I believe so, yeah.
B
Cool, yeah. So see, we've got this PostReceive worker... post-receive is, like, the main... the big thing that happens on a push. Post-receive is like one of the Git hooks. So this is a worker, so…
B
Yeah,
this
is
a
scientific
worker
or
or
a
sidekick
job
yeah.
Basically,
and
in
here
we
have
a
hook
into
the
korean
open
source
code
after
project
changes.
B
If
this
is
a
geoprimary,
then
call
this
repository,
updated
service
and
in
there
we
create
that
record.
Okay.
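A simplified sketch of that hook path, assuming the names mentioned in the discussion (not the exact code):

    # Inside the PostReceive Sidekiq worker, after project changes are handled.
    def after_project_changes(project)
      return unless ::Gitlab::Geo.primary?   # only the primary produces events

      # Creates the repository-updated event record in the main database;
      # streaming replication then carries that row to the secondary.
      ::Geo::RepositoryUpdatedService.new(project.repository).execute
    end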
B
So, like, let's say the primary Sidekiq is, you know, fast enough, so a hundred repository created events get created in the span of a second. Postgres streaming replication has to bring those rows over to the secondary, and then... I guess here's the 'what else': this is where I asked you to remind me about what happens.
B
Here's where we get into a detail of the event system. On the secondary there's a process called the Geo Log Cursor, and all this process is doing is watching those tables. There's actually one event log table, and many specific event tables that feed into it... they are tracked by that one event log table, yeah, okay. So it's watching the event log table, actually, and when there's a new reference that it hasn't seen before…
B
…It calls the specific event code, the event-processing code for that specific type of event, like repository created. So specific code is plugged in for each of those types of events, and in the case of a repository created event, what happens is it…
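As a rough mental model of that loop (a conceptual sketch, not GitLab's actual code; GeoEventLog, handler_for, and load_last_processed_id are illustrative names):

    last_id = load_last_processed_id

    loop do
      # Poll the event log table for rows this cursor hasn't seen yet.
      events = GeoEventLog.where("id > ?", last_id).order(:id).limit(1000)

      events.each do |event|
        handler_for(event).process(event)  # e.g. repository created/updated code
        last_id = event.id
      end

      sleep 1 if events.empty?             # polling interval is illustrative
    end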
B
Oh, so that's a little bit different from the SSF, although in the end it's the same. The SSF has one event type, because we're trying to get away from, like, every time you create a new…
B
Yeah, data type or event type, you need to, like, create a new table, and it references the Geo event log. That's, like, a whole bunch of overhead that's not helpful.
B
So in the SSF there's only one event type: it's just Geo events. It had to be generic.
B
So,
like
a
new
event
comes
in
like
creatively
down
to
an
updated
event
or
deleted
event,
then
it
creates
a
geo
event,
job
where
it
where
that
job
has
all
of
those
arguments.
It
says
like.
Oh,
this
thing
was
created
or
this
thing
was
updated
or
this
thing
was
deleted,
and
so
in
the
end
you
get
a
sidekick
job
that
will
do
a
replication
or
deletion.
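A sketch of that generic shape (names here are illustrative): one event table, one event type, carrying which resource changed and what happened to it.

    event = {
      replicable_name: "package_file",   # which kind of resource
      event_name:      "created",        # created / updated / deleted
      payload:         { model_record_id: 4 }
    }

    # On the secondary, consuming the event enqueues a Sidekiq job that either
    # replicates the thing (created/updated) or removes the local copy (deleted).
    Geo::EventWorker.perform_async(event[:replicable_name], event[:event_name], event[:payload])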
A
Or
whatever,
okay,
I,
and
if
it's
a
deletion,
it
would
need
to
do
that
locally.
But
if
it
was
a
additional
replication,
then
you
would
need
to
establish
a
call
to
the
primary
and
fetch
the
data.
A
Okay,
yeah,
oh
I
guess.
If
it's
an
addition
or
a
deletion,
it
would
need
to
go
if,
in
the
case
of
blobs,
it
would
need
to
go
bring
that
would
the
blob
have
the
same
reference.
If
it
was
the
it
was
updated,
it
would
just
have
the
same
references
to
indicate
that
it's
been
updated.
Yes,.
B
Yeah. So almost every blob is immutable, which is super convenient for Geo. Not just convenient, it's, like, the right way to do it. And so there's no update event for them. But we actually have some blobs that are mutable, and it's terrible, and I think it's avatars: like, if you update your user avatar... there's some bug around that currently, okay, because they're just updated in place.
A
Right,
you're
right:
okay,
yeah,
okay,
but
in
general
it's
immutable,
so
there
will
just
be
add
and
delete
as
a
a
a
a
a
group,
just
assume
that
okay
cool
and
so
the
sidekick
is
responsible.
Ultimately,
it
is
the
worker
that's
responsible
for
talking
to
the
primary
and
managing
the
download
of
the
gotcha.
B
And
you
know
one
of
the
one
of
the
things
is
like
if
you
like,
the
the
the
web
notes
right
like
the
the
ui
the,
but
it
has
short
timeouts,
because
if
you
don't
do
that,
then
you're
susceptible
to
ddos
attacks
and
all
kinds
of
problems,
not
religious
ones,
but
sidekick.
Can
you
know
it's
not
ideal?
B
A
Yeah, yeah, so that's okay: Sidekiq can keep that channel open. Or is this a problem that we face at the moment?
B
I mean, it is a problem if that is actually happening. If you get any kind of, like... you end up being very susceptible to, like, corruption and whatever, but, you know... it's that kind of thing.
B
Yeah, so there have been discussions recently around, like, establishing expectations for Geo in different scenarios, and this is one area where, yeah, we haven't really done a lot yet. In the case where you've got, like, a terabyte of data in your primary, and your secondary is on the other side of the world with a 10 megabit connection, you're just not gonna have a great experience.
B
It
may
never
catch
up
it.
You
know
it
may
not
be
technically
possible
to
catch
up
sure.
A
Sure, gotcha, okay. So I've heard Workhorse being mentioned. So what's... and I'm kind of going off reservation here, apologies for that... you've got Sidekiq, which manages, you know, writes to the database, also manages calls out to the primary from the secondary, and manages the downloads.
B
So if you imagine, like, the web UI of GitLab: those requests are served by Rails, but in between we have a component written in Go called Workhorse, and the reason is, there are a number of very specific types of requests that can be handled much more efficiently not by Rails.
B
So
like,
for
example,
if
you
are
doing
a
get
pull
of
a
huge
repo,
we
really
don't
want
that
request
to
be
served
directly
by
rails.
B
And
and
and
so
what
happens
is
workforce
is
able
to
talk
to
italy
and
bypass
rails,
I
mean
it.
Workhorse
also
is
able
to
talk
directly
to
the
rails.
App
to
say,
like
is
this
user
authenticated
for
this
thing,
but
when
it
gets
that
authentication,
then
they
can
fast
get
away
directly
for
okay,
right
and
and
you
bypass
rails
completely.
So
for.
A
…At the moment, Geo doesn't use Workhorse; it's just Sidekiq?
A
You've got the single... sorry, Mike, I'm going into details here, but you've got a single API; something intercepts the API call and decides whether the job... whether the endpoint is pinned to Sidekiq or Workhorse, kind of?
B
Not Sidekiq. So the Rails app is running in, like, Puma…
A
Yeah, so... okay, cool. So when you ask for the repo, it basically just hits the Workhorse server, and it just gets it from there. So the API hangs off Workhorse?
A
So, my question again: for different API endpoints, do they have different... are they backed by different services? Like, let's say a Git request goes to Workhorse, but let's say a blob or something like that, right: does that get serviced by Sidekiq in the end? When they come in, you've got the same API, but they've got different endpoints within the API. So I guess you're hitting a request to replicate Git repositories, or we get another request for blobs: are they all hitting Workhorse?
B
…Or what. Okay, so, like, if you were imagining your typical GitLab deployment: nginx first, and then from there it gets to Workhorse, right, and then Workhorse will split off a number of things. Like, it'll handle Git requests itself.
B
Well, there's a couple of ways that it does that... but anyways, yeah. So it handles those, and then, if it's a blob download request…
B
…We wrote the API endpoints in Rails. So if it's not... if it's not stored in object storage... actually, I'm wondering now, maybe it is limited by a…
A
Let's go on... but that's a good insight. I've kind of learned a lot with regard to Sidekiq and Workhorse here, so thanks for that. But let's continue with the SSF; sorry for the digressions.
B
No, I mean, that is part of it... the endpoint for downloading blobs from the primary: it's the same, one endpoint for all the different kinds of blobs, and it's kind of all written within the blob…
B
So, outside of the event system, there's a couple of things that need to be handled. One is backfilling: if you have a large GitLab deployment and you've got a ton of data from the users already using it, and you're not using Geo, and then you add your secondary site, yep, then everything that you have needs to be replicated to the secondary, regardless of whether someone is actively updating those things or whatever.
B
So
that's,
that's
that's
one
thing
that
needs
to
happen.
Another
thing
that
we
learned
with
geo
is
that.
B
We can't depend on the eventing system one hundred percent, and that's for more than one reason. One is, like, if somebody does a push on the primary and there's some kind of transient infrastructure problem, or there's a bug in the current release where the post-receive worker, for example, raises under some specific condition before the Geo repository created event is created... exactly, whatever, there could be many reasons. Or a Sidekiq job is just lost, because that can happen. And…
B
If
you
do,
then
you'll,
eventually
see
or
you'll
have
customers
seeing
like
hey
this
thing
supposed
to
be
synced,
but
it's
not
synced
like
and
now
now
what
we
have
to
kind
of
assume
that
events
are
not
reliable.
Okay,.
B
So that's happening, and then there's another job that is checking only the registry table, because you don't want to do big cross-database queries.
B
I
see
you
don't
want
to
do
prosthetic
expiries
in
general,
like
they're,
expected
yeah,
they're,
expensive,
they're,
slow,
they're,
yeah,
there's
a
number
of
things
that
can
go
wrong
like
but
anyways,
so
there's
another
job
that
is
constantly
just
looping
over
their
registry
to
people.
Well,
now,
I'm
sorry
not
looping.
Under
the
registry
tables,
it's
doing
queries
on
them,
because
now
we
can
do
full
queries
like
give
me
everything
that
is
pending.
Sync,
that's
never
been
synced,
okay
in
queue,
jobs
give
me
everything
that
has
failed.
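A sketch of those registry-only queries (scope and worker names here are assumptions): because all the sync state lives in the tracking database, no cross-database join is needed.

    never_synced = Geo::PackageFileRegistry.where(state: 0)   # pending, never synced

    failed_and_due = Geo::PackageFileRegistry
      .where(state: 3)                                        # failed before
      .where("retry_at <= ?", Time.current)                   # and due for retry

    (never_synced.to_a + failed_and_due.to_a).each do |registry|
      Geo::SyncWorker.perform_async("package_file", registry.package_file_id)
    end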
B
So that's a consideration also: even if the deployment has tons of Sidekiq workers, you don't necessarily want to just saturate all of them with Geo sync jobs at the same time.
B
…Are set when you create a new secondary, and that default is, like, totally inappropriate for everything except for the specific size of deployment that it is good for, right, okay, yeah. So, like, it's inappropriate for the GDK; in fact, I always, like, go in and put, like, yeah, something like this, I see, or for the Geo tests. So yeah, that's a problem that we have, but you can tweak those, you know. Yeah, that's easily worked around if you have a problem like 'oh, I've got, you know, too many jobs'.
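For example, tweaking those limits for a small setup might look like this hypothetical Rails console sketch (the setting names are assumptions and vary by version; check the current Geo docs):

    node = Gitlab::Geo.current_node
    node.update!(repos_max_capacity: 2, files_max_capacity: 2)  # shrink for a GDK-sized site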
B
This we can... yeah, we can do tests on a specific setup that we can define, and we can say, like, it's on GCP. We can... I just added this: take a measurement of bandwidth between sites. So we can tell people, like: okay, if you've got A in GCP us-east and B in GCP europe-west, of these sizes, these reference architectures of GitLab, and your bandwidth, you know, is approximately like this…
B
This
is
the
behavior
that
you
can
kind
of
generally
expect
from
gm.
Now
it
depends
on
like
so
many
things
like
how
many
users
are
currently
using
your
primary
and
secondary.
What
are
they
doing?
B
Cool, yeah. Currently there's no guidance... there's, like, no way to give it.
B
Originally
geo
mostly
was
developed
for,
like
we've
got
a
single
node
for
each
site
and
you
know
maybe
it
was
like
not
a
small
node,
but
I
think
those
defaults,
probably
getting
out
of
that
since
you
set
up
on
the
bus
gitlab
on
both
sides
on,
like
a
you
know,
healthy
sized
node,
then
the
defaults
of
25
10
10,
like
you'll,
be
okay,
okay,
okay,
cool.
A
All right, so we discussed the concurrency settings with this.
A
Yeah, so, the backfilling: let's say you're syncing a large number of jobs; let's say you just started, and it's a huge repository or a large number of projects, right. Well... I think you've answered that question: the concurrency throttles you from overwhelming the primary, or overwhelming the secondary... overwhelming itself. So I think you've already answered that question there. In terms of information with regard to the sync activity: what type of information do we have with respect to, let's say, a repository sync?
A
What type of information do we have in relation to where the primary and secondary are with respect to each other? You know: how many outstanding requests, etc. What's the deviation between the two? How many are in flight, things like that? Do we typically have that type of info? This relates, or doesn't relate, back to the question, yeah. And you mentioned there was... is it this information that we have?
B
Fields, so we can talk about some of these. So, like, state: that's the big one, where it can have three states... no, another one, I think four states. Zero is pending; I mean, like, it just needs to be synced, it didn't necessarily fail before, yeah. And then one is started... because of race conditions, okay, a record can get stuck in one. Two is success, so it was successfully synced before. And three is failed. And those are... this is the big one, the main field to look at if you're wondering.
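In other words, the four sync-state values as described (a summary sketch; the registry models define these via a state machine):

    STATE_PENDING = 0  # needs to be synced; hasn't necessarily failed before
    STATE_STARTED = 1  # a sync is in progress; races can strand a record here
    STATE_SYNCED  = 2  # the last sync succeeded
    STATE_FAILED  = 3  # the last sync failed; retry fields govern the next try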
A
That's fine. If there's a single file, then I'm happy to kind of eyeball it, just to get an idea of the different values that these could have.
B
Yeah
so,
and
this
replicable
registry
is
included
in
all
of
the
registry
models.
B
Yeah
so
state
machines
provided
by
the
extinguishing
gem
to
give
this
give
us
this.
B
Dsl
domain
language
around
managing
states
so
like,
for
example,
if
you
in
the
code
remove
a
registry
to
start
it,
then
we
set
last
sync
dat
field
to
time:
dot
correct.
I
see
so
that's
that's
setting
like
when
did
it
start
and
then,
if
you
move
it
to
pending,
then
we
zero
out
the
failure.
We
try
out
and
retry
count.
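An assumed shape of that DSL in the registry models, based on the transitions just described (the exact GitLab definition may differ):

    state_machine :state, initial: :pending do
      state :pending, value: 0
      state :started, value: 1
      state :synced,  value: 2
      state :failed,  value: 3

      before_transition any => :started do |registry, _transition|
        registry.last_synced_at = Time.current  # record when the sync began
      end

      before_transition any => :pending do |registry, _transition|
        registry.retry_at    = nil              # reset the retry bookkeeping
        registry.retry_count = 0
      end
    end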
A
And you've got a last sync failure field; I'm assuming that's a free-flow text field?
B
But
yeah
so
they're
totally
inconsistently
set,
but
what
I
want
to
get
to
is
a
place
where
we
have
a
text
field
with,
like
maybe
4,
000
characters
allowed
and
we
drop
in
the
whole
back
trace
because,
like
it's
not
enough
to
just
give
me
a
one
little
last
error
message.
A
That was gonna be my question: is it possible to isolate the back... you know, the failure log messages for this particular event? It sounds like there is, which is great news.
B
Yeah, yeah... cool. All right, the number of places where the last sync failure is set... let me see... maybe not that many, so you'll see it below.
A
So, one other question: when you look at the logs, I'm assuming this information comes out of Rails? Is it the Rails logs that this information is coming from, yeah, yeah? If you look at the console, you wouldn't see this output, right... the Rails console, which... so…
A
When syncing fails, for example, and the backtrace... where would you find that information? Before we surface it anywhere else, where would I normally go to look for this information if I was a sysadmin?
B
The Sidekiq ones. So if you, like... if you go onto the Sidekiq node, if that's a good name, and you tail the Sidekiq logs, then I think you should see backtraces, or, you know, yeah.
A
And would it be possible to... I'm assuming Sidekiq does more than Geo work, right? Would it be possible to identify the Geo logs from there? Are there any kind of prefixes or anything that you could look for to isolate it, to filter on the Geo-specific logs?
B
Oh, you know what... I'm sorry, okay. So, if we... I might have misspoken about that. I think that's part of the problem with the way things are at the moment: if you transition something to failed…
B
Yeah, we need to double-check all that. Yes, because, yeah, we know, like, when there's a problem with syncing or something, we really need to see the backtraces.
B
So it helps if that can be found in multiple places. Even, like, if we store that and still raise, so that Sidekiq jobs are, you know, raising exceptions with backtraces, then Sentry tracks it; that would be good. That is something that we need to double-check on, what it does exactly, okay.
B
…After the call, it must be... okay, yeah, it must be swallowing it somewhere; I'm just missing it at the moment. Because, like, in order for failed to happen, right, it's got to be executed. So the error is not being raised... it's not being raised here. So yeah, I think it's being swallowed at the moment.
A
Okay... maybe that's something we could look at improving, like you say.
B
Awesome. Oh, so we were talking about the states... syncing.
B
I thought... oh no, yeah, okay. I'm working on this code at the moment, okay, and this is master, and it's not my code. So, yeah: after syncing, what happens is, we need to mark the verification state pending, so that, you know, regardless of whatever happened with verification before, we know that it just finished syncing, so it's probably changed, so we'll just say, like, this needs to be verified.
B
Yeah
and
it
doesn't
have
to
like-
we
don't
think
you
a
job
here
or
anything
for
gravitation,
because
it
really
doesn't
have
to
occur
that
the
point,
the
there's
already
a
verification
job.
That's
like
looking
for
things
that
are
hanging
just
like
the
backfill
one
or
something
it's
like.
So
anything
like
this
verification
editing.
Is
there
anything
about
verification,
that's
ready
for
retry.
B
So after sync, that happens, and I'm going to gloss over this other stuff. That's fine.
A
So it basically gets the checksum, which is synced across from the primary. The verification job runs a... is it checksums that we use for verifying? Yeah, yeah. So it's got the checksum from the original on the primary, because it came across in the streaming replication, and then it does it on the actual object that it's got a copy of, checks those, and if they match, we're all good, we're off.
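Conceptually, that comparison is just this (a sketch; the primary's checksum arrived via streaming replication, and SHA-256 here is an assumption about the algorithm):

    require "digest"

    def verified?(local_path, primary_checksum)
      # Recompute over the local copy and compare with the primary's value.
      Digest::SHA256.file(local_path).hexdigest == primary_checksum
    end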
B
Yeah
yep,
so
I
guess
we
can
just
get
into
verification
that
I
think
we're
people
there.
A
Yeah, and there was another one, related to... just bringing it up in case we forget it, Mike: it was, if the primary hadn't done the checksums. I think there was an issue if it hadn't done them, but it's been copied... how would... how is that even possible? But we can come back to that one. Right, let's talk about verification.
B
Right, well, okay, yeah. So, we've already covered the overview of verification: the primary basically checksums, the secondary then checksums and compares it. If it doesn't match, then the secondary says: verification failed, sync failed, because apparently this thing is not the same.
B
So, interestingly, what that means is: failed verification on the secondary side is not a valid state, because it should always immediately transition to failed sync. And if something's failed sync, verification doesn't come into play; it's not pending, it's not success... but it makes sense: we record the message for the verification failure, but, like, it's not really 'verification failed', it's really 'sync failed'.
A
So... but that bit of information might be useful for someone troubleshooting it, right? So we would hold that information, that it has failed, because, I guess, if it's happening fairly regularly, then it's more of a systemic issue, rather than 'oh, that just got corrupted on the way over' or something like that. If everything's failing verification, then there's probably something bigger going on.
A
They're just seeing syncing fail, but actually it's not just syncing: things have come across, but look, the verification is failing. So I guess, as long as the information is there, they can decipher what's gone wrong; but if we're hiding that information, I think that could be more difficult for someone to troubleshoot.
B
Who's... yeah, okay. So, verification checksum is 'what did I checksum on this site', and if a mismatch was detected, then this becomes true instead of false, and the primary side's checksum gets recorded in here. Right, okay, yeah... that's not totally obvious.
B
Yeah. Also, currently, when verification fails, it goes straight to failed sync.
B
Actually, we have some other fields here, so, to mention: force to redownload. For Git repos, there's, like, two approaches to syncing... well, now there's technically three, but: git fetch, versus tarballing the Git repo on the primary side and having the secondary download it from an endpoint.
B
And that's called snap... we call it snapshotting, and that's what we use during a redownload, okay. And that one... that one's come up, you know, not infrequently with customers, because it's possible for one approach or the other to be not working while the other one is working. And so force to redownload comes into play where, when you go to the UI on the secondary and you click the redownload for a particular thing…
B
Okay, yeah. Let me just…
B
But the problem is, like, too many times we ran into the case where, for some reason... well, there was a bug, there was one for a long time: redownload... the snapshotting way wasn't working, but git fetch was okay. So it's, right, extremely painful for customers to be like: you know, resyncing keeps trying to happen, but it just fails every time; but then we go in on a Rails console and make it sync with a git fetch, and it works.
B
I put that down, okay, good, yeah. So that's because projects are using the legacy stuff, but for framework repositories there are some services used by the SSF to sync them.
B
So, one thing to mention... oh, there's a big thing to mention: verification state. There's a verification state for resources on the primary side as well, right, to manage... we're gonna track them: failed, didn't try…
B
That's fine in many cases, but we have some cases, like job artifacts, where there are, like, over a hundred million job artifacts on GitLab.com, and so, like, extending the job artifacts table is bad for performance.
B
Bad form... well, mainly performance, but, like, you know, there's more than one reason why that is the case. But anyways, it's a huge table, so we don't really want to add these fields, especially if they're only, like... like, on GitLab.com they're not using Geo, so they're not even relevant.
B
…Or self-managed customers might not be using Geo. So anyways, I think it was for job artifacts specifically that we added the verification fields in a separate table.
B
That adds a whole bunch of complexity, but it was seemingly necessary, and one of the bits of complexity that came out of that was this worker. And so this worker iterates over the table, the source table…
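A sketch of that separate-table idea (the table and column names here are assumptions): rather than widening a huge source table, the verification columns live in a parallel table keyed one-to-one to it.

    create_table :ci_job_artifact_states, id: false do |t|
      t.bigint   :job_artifact_id, primary_key: true, null: false  # 1:1 with the artifact
      t.integer  :verification_state, default: 0, null: false
      t.datetime :verification_started_at
      t.binary   :verification_checksum
      t.text     :verification_failure
    end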
A
If you're pointing to another window, I can't see that... what can you see? I see your browser, so I can see the Geo reference architecture. Oh…
A
There we go, now we can see it.
B
All right, here's my IDE, which I was pointing to for a long time in the other discussions. The verification state backfill worker iterates over the resources' verification states…
A
Mike... oh, I was wondering if you…
B
Anyways... so the verification state table is separate for, well, especially these big tables, and I think we decided that this is just the way going forward for everything.
A
How about we do a follow-up session on this, Mike? I know we've been going for some…