Description
Yorick Peterse and Alessio Caiazza chat about how to approach filling a column in a database for GitLab merge request https://gitlab.com/gitlab-org/gitlab/-/merge_requests/27219.
A
Then there were some proposals to use a different approach, like a temporary ID column, and some PostgreSQL-specific approaches, so we're going to discuss the pros and cons of those and see if we can make things work. Let's see, so we have the temporary ID one; I think we, or at least I, realized it wasn't going to be as easy as we'd hoped.
A
Basically, then, the alternative was to use the PostgreSQL internal row and page IDs. Basically, for every row, PostgreSQL has a sort of internal ID: it's a tuple of the page number and the row number on that page that the row is on, so it's kind of like a composite primary key. The benefit there is that you can fetch rows using those IDs fairly quickly, kind of like a primary key. The downside...
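For reference, this internal ID is the hidden ctid column. A minimal sketch of looking at it, with table and column names assumed from the merge request under discussion:

    -- ctid is PostgreSQL's hidden (page number, row offset) tuple for each row.
    -- Table and column names are assumptions from the discussion.
    SELECT ctid, deployment_id, merge_request_id, environment_id
    FROM deployment_merge_requests
    LIMIT 5;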
B
I think what I understood, reading the original document that you linked, is that basically you can have holes in the sequence: if you delete something, that one just gets deleted, so in that case the row will have nothing there. So if you select the whole table and you also ask for the ctid column, which is this one.
A
Let me see, because if I just do a select count it times out, but we can get an estimate if we just use the table statistics. So it estimates, yeah, seventeen million rows; that's probably more than what there actually is. 2,049 rows per page, so then we have eight, nine thousand pages, basically.
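A sketch of pulling that estimate from the planner statistics rather than a full count (the table name is assumed):

    -- Estimated row count and number of heap pages, from pg_class,
    -- since a COUNT(*) over the whole table times out.
    SELECT reltuples::bigint AS estimated_rows,
           relpages          AS heap_pages
    FROM pg_class
    WHERE relname = 'deployment_merge_requests';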
A
Say we start with (0, 0), not sure if it's one-indexed; then you basically increment that up to (0, 2049), and then you start over: you go (1, 0), (1, 1), etc., and basically keep doing that. So I think what you could do, in theory, is, given a page number, fetch all the rows on it, which is up to 2,049, and basically load those into memory. I think the way you do that is with a cast.
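A sketch of fetching everything on one page that way, by casting (page, offset) pairs to the tid type (names and the page number are assumptions; row offsets within a page actually start at 1):

    -- Fetch all rows that live on heap page 123 by building their ctids.
    SELECT ctid, deployment_id, merge_request_id, environment_id
    FROM deployment_merge_requests
    WHERE ctid = ANY (ARRAY(
      SELECT format('(%s,%s)', 123, n)::tid
      FROM generate_series(1, 2049) AS n
    ));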
A
Yeah, there's a query here; let's see how that performs with explain analyze. So that's still... I can see a Tid Scan there, whatever it did. Yeah, so the IN would then contain 2,049 values for every page, and I think what you can then do is, given those rows, I guess you would fetch the unique deployment IDs and then do an update for every unique deployment ID to set the environment ID. So you get something like...
A
And then, like that, a bit bigger terminal, and that can go away. Then I'll share my screen again; I've just got the whole thing there. There you go. All right, so here on the right we have the SQL, so we basically end up at some point with something like: update deployment_merge_requests set environment_id = x where deployment_id in (...).
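A sketch of that update for one batch (table and column names are assumptions from the discussion, and the deployment IDs shown are hypothetical; the environment ID for each deployment would come from the deployments table):

    -- Fill environment_id for one batch of deployment IDs, taking the value
    -- from the corresponding row in deployments.
    UPDATE deployment_merge_requests dmr
    SET environment_id = d.environment_id
    FROM deployments d
    WHERE d.id = dmr.deployment_id
      AND dmr.deployment_id IN (101, 102, 103);  -- hypothetical batch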
A
Yeah, yeah, so we have to deal with the duplicates first, I guess. But I think removing the duplicates itself is a little easier, because what you essentially can do is a select count and then group by, I guess, the combination of deployment ID and merge request ID, or whatever the composite is, and then you get the rows where the count is greater than one and just remove them.
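A sketch of that duplicate check (column names assumed; the composite is whatever actually makes a row unique):

    -- Find (deployment_id, merge_request_id) pairs that occur more than once.
    SELECT deployment_id, merge_request_id, COUNT(*) AS occurrences
    FROM deployment_merge_requests
    GROUP BY deployment_id, merge_request_id
    HAVING COUNT(*) > 1;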
A
So you get the rows where the count is greater than one. Then, again, you have to fetch those rows, the duplicate ones, and do, I guess, a delete with a sub-select or something where you basically limit it, so that you only remove all but one row. Off the top of my head, that's something like a delete from the table where...
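One way to express that "all but one row" delete without a primary key is to lean on ctid again; a sketch under the same assumed names (on a table this size it would have to be batched rather than run as one statement):

    -- Keep one arbitrary row per (deployment_id, merge_request_id) pair and
    -- delete the rest, addressing the extras by their ctid.
    DELETE FROM deployment_merge_requests
    WHERE ctid = ANY (ARRAY(
      SELECT ctid
      FROM (
        SELECT ctid,
               row_number() OVER (PARTITION BY deployment_id, merge_request_id) AS rn
        FROM deployment_merge_requests
      ) numbered
      WHERE rn > 1
    ));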
A
Yeah, I think in this case we have to really break it down into, let's say, ten separate queries first and then see how we can sort of stitch those together. I'm just trying to think what the final query is going to look like and my head already explodes. Let's do this: how can we best test this without blowing stuff up? I will just start a Rails console, and there are probably some production engineers looking at this going: oh, production.
A
These are just all the rows, okay. Okay, so two things: we need to get rid of these duplicates, which means we have multiple rows where the merge request ID is the same and the deployment ID is different, but the environments they point to are the same. Yes; in other words, deployments A and B may point to environment A, and if there are, you know, two occurrences of a merge request, we only want one.
A
Normally, if you have a primary key, we can do something like: delete from blah where the ID is not the maximum of something. We don't have that here, so we have to delete by the ctids, I guess. Boy. Let's see, so, with 2,049 rows per page, how many pages do we still need to do? Let's say seven thousand... sorry, I think it was nine thousand, no, eight thousand six hundred, yeah.
A
226, interesting, so it keeps producing that. I wonder if, then, the...
A
Yeah, it's going to be tricky. The other option that I saw is that we can use cursors, which is basically stateful pagination on the database side, but you have to use a transaction for that, because they're scoped to a transaction. I think what we could do in that case is: you start a transaction, you get a limited number of rows, let's say 5,000 or whatever, do your updates, and then sort of move on, but I'm not sure how you'd figure out what sort of offset to use for the next transaction.
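A sketch of that cursor idea in plain SQL (in practice a migration would drive this from Ruby; the cursor name, table name, and batch size are assumptions):

    -- Cursors give stateful, server-side pagination, but only live inside
    -- a transaction, so the whole batch loop has to run within one.
    BEGIN;
    DECLARE dmr_cursor CURSOR FOR
      SELECT ctid, deployment_id
      FROM deployment_merge_requests;

    FETCH 5000 FROM dmr_cursor;  -- first batch; repeat until no rows come back
    -- ...run the updates for this batch, then FETCH the next one...
    COMMIT;                      -- ending the transaction closes the cursor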
A
So instead, it's an entirely valid solution to basically duplicate this table. There are some SQL statements we can actually use where you can create a copy of the table, including the indexes and everything; you fill it up and then, basically, you rename it. The issue then is that you have to keep those tables in sync, so you have to, say, use triggers to make sure of that, yeah.
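The statement being alluded to is presumably CREATE TABLE ... LIKE; a sketch, with a hypothetical name for the copy:

    -- Create an empty structural copy of the table, including indexes,
    -- defaults and constraints. The copied indexes get auto-generated names.
    CREATE TABLE deployment_merge_requests_copy
      (LIKE deployment_merge_requests INCLUDING ALL);
    -- Keeping the copy in sync with writes to the original while it is being
    -- filled would need triggers on the original table.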
B
So I'm not just replacing the table; I'm copying the incomplete data, in a complete form, into the new one. The duplicates will be deleted by this insert operation. In the meantime, the new code is already running, because we deploy it and it's a post-deployment migration. So the new code is already running, so every new deployment will get the full tuple, so also the environment.
A
So that could work. There's one issue there: if you do the delete of the old data and then insert the new data in two separate transactions, there's going to be a short period of time where merge requests won't be associated with deployments, because we've basically deleted them but haven't copied them over yet. That is effectively a gap.
A
That's going to take a while. I guess what you could do is sort of narrow that down, where instead of saying "delete everything, insert everything", you do it in groups: you basically iterate over the temporary table, grab, let's say, a thousand rows, and then you say, hey, in the target table, all rows with these deployment IDs, remove them, insert the new ones, etc. It still means that temporarily some of the data is not there, but the window is going to be much shorter.
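A sketch of one such group, under the same assumed names (the copy table name and the deployment IDs are hypothetical):

    -- Replace the rows for one batch of deployment IDs in the live table
    -- with the cleaned-up rows from the copy.
    BEGIN;
    DELETE FROM deployment_merge_requests
    WHERE deployment_id IN (101, 102, 103);            -- hypothetical batch

    INSERT INTO deployment_merge_requests (deployment_id, merge_request_id, environment_id)
    SELECT deployment_id, merge_request_id, environment_id
    FROM deployment_merge_requests_copy
    WHERE deployment_id IN (101, 102, 103);
    COMMIT;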
A
The table-swapping approach doesn't have that problem: assuming the schema matches, etc., you start a transaction, you just rename them, and you're done. That's a very cheap operation. The issue there being that, if you copy the table or set it up yourself, you probably also have to fix the names of all the indexes, sequences, etc.
A
So we use your approach, right: we create a temporary table and we fill it up with all the appropriate data, etc. Then at some point we determine, hey, we're ready to swap. We start a transaction, we basically lock both tables, then I guess we have to do one final check to make sure that any missing data is in this temporary table, we swap the names, commit, and then we delete the entire table that we have now, basically in favor of this new, quote-unquote, temporary table.
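A sketch of that swap step (table names are hypothetical, and the "final check" that copies over any stragglers is elided):

    -- Swap the freshly built table in for the old one. The renames are
    -- metadata-only and effectively instant; the lock blocks writes meanwhile.
    BEGIN;
    LOCK TABLE deployment_merge_requests, deployment_merge_requests_copy
      IN ACCESS EXCLUSIVE MODE;
    -- ...copy over any rows that arrived since the last sync...
    ALTER TABLE deployment_merge_requests RENAME TO deployment_merge_requests_old;
    ALTER TABLE deployment_merge_requests_copy RENAME TO deployment_merge_requests;
    COMMIT;

    DROP TABLE deployment_merge_requests_old;  -- and drop the old table afterwards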
A
That particular approach means, one, the table is compact, because we insert all this new data, so it might actually save some space; a benefit there. Also, the process of swapping is largely dominated by how long it takes to figure out, hey, do we have any remaining data that we need to take care of. From a coding perspective, the annoying part is mostly going to be fixing all the index names, sequence names, etc., so that they are the way Rails expects them to be.
A
I think PostgreSQL has a way where you can say "create table like this thing", and it will copy over the indexes and everything; they will just get some funny names. I think you can basically just do a little query that says: all these things, rename them to that. I think that is probably the least annoying approach; it certainly seems to be easier than the ctid approach.
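A sketch of that rename pass (both index names below are hypothetical, standing in for the auto-generated name and the Rails-style name the schema expects):

    -- After CREATE TABLE ... (LIKE ... INCLUDING ALL), rename the copied
    -- indexes back to the names Rails and the schema expect.
    ALTER INDEX deployment_merge_requests_copy_deployment_id_idx
      RENAME TO index_deployment_merge_requests_on_deployment_id;
    -- Renaming an index is a catalog-only change, so it is effectively instant.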
A
Let me see, so we swap the tables. I think technically you can take an existing index, for example, and change what table it points to, but then you have to start messing, I think, with the PostgreSQL catalogs internally, so it's probably just easier to recreate the index with a different name and then rename all of them, because, let me see, renaming indexes and everything is instantaneous.
A
That's a no-wait thing. And I think we've had this case in the past; we looked into that, I think, when we migrated the events table, and I think the approach makes a difference depending on the index types. Especially if you have trigram indexes, they take a long time to update, but for most regular b-tree indexes...
B
Yeah, I think that tomorrow I will also give it a chance and take a look at paginating over some other resources, because basically this table is linking three resources. So if I can paginate over, let's say, environments, and then use the ctid in the where clause to issue the update, maybe we can update this in place, right?
A
I think you could probably paginate by deployment ID, since, I mean, there are duplicates, but probably not that many, so that's probably saner and quicker. You'd have multiple rows with the same deployment ID, but if you basically sort them ascending, you can do a where clause with deployment ID greater than the last one. That might be even easier.
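A sketch of that keyset-style pagination (names, the starting ID, and the batch size are assumptions):

    -- Process the table in batches ordered by deployment_id: remember the
    -- largest ID in each batch and ask for the next batch above it.
    SELECT deployment_id, merge_request_id, ctid
    FROM deployment_merge_requests
    WHERE deployment_id > 0           -- last deployment_id from the previous batch
    ORDER BY deployment_id ASC
    LIMIT 1000;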
I'm not totally sure, but tomorrow I'll also take a look at the sort of table-swapping approach and see if that even makes any sense. All right, let me stop the recording.