Description
Andreas, Grzegorz and Yannis discuss alternative ways to improve the CI/CD data model.
A: Yeah, so hello everyone. This is a call about database partitioning, and I have the agenda document here. Let me paste that into the Zoom chat. There's nothing much in there for today; I just plan to take some notes.
A: So I wanted to discuss our plan for partitioning the ci_builds table, or another table that we are then going to migrate rows into, and I wonder what your thoughts are about that. I hope that you have had an opportunity to read the merge request and that you have some ideas, questions, or concerns. I would like to hear them.
B: Yeah, thanks for inviting us to this call and for bringing that topic up, because it has been a long, long-standing topic, right? We've been looking at that table for a long time, watching it grow and seeing the growing pains that come with it. We do have some troubles with that table and with the data model that surrounds it, so it's a great topic to look into.
B: I was looking at the recording that you did last week and also read the merge request and the discussion there. As a general comment to get started, it's perhaps worth saying that I think we need to look for a better data model for some parts of CI that are based on this table, and partitioning is one tool in our hands that we can use.
B: It's probably a good tool, and perhaps we can use it, but I think it's good to approach this with an open mind: look at what problem we're trying to solve and what changes we want to make to the data model, and partitioning can be one of those. We have more problems already identified that we want to discuss, right?
A: Yeah, I completely agree. Unfortunately, there are some constraints. For example, we cannot use just any database that we would like; we cannot introduce new technology, because that would mean shipping it on premises as well, and that limits us in many different ways.
A: So there are not a lot of tools we can choose from to model the CI/CD data better. We are already using object storage a lot, and we are using Postgres a lot, and if we can improve the data model using these tools, that would be great.
A: Otherwise it might be much more difficult and might go beyond the scope of the architectural change, so I try to focus on the main pain points. It's usually a good idea to see what the problems are and work from what we want to solve. We all know that the ci_builds table is very large and creates a lot of problems.
A: I plan to get them documented in the merge request and in the architectural blueprint. I have not done it yet, so I'm sorry for that. But as database experts you basically know what the problems are, so I'm not going to describe them all in this call. We're running out of primary keys (we are probably at around 50% capacity right now), and the table is super wide.
A: It's super long, and engineers are facing problems with statement timeouts. From what I've learned from Jose and Nikolai, there are many more problems stemming from the table size that happen inside PostgreSQL and are not really visible to engineers or even SREs: things that consume a lot of internal resources. That is not very tangible for people, but I'm sure such problems also exist.
B: Yeah, maybe a more tangible way of approaching this, which I know you mentioned before, is the performance problems we see with the table size. This is something where partitioning can help you, but you still have to understand how the application works with the data. Basically, if you partition it one way and the application works with it the other way, then it's still going to be bad, or maybe even worse, right?
A: I completely agree, and that's the reason I decided to get involved: I know how CI works, and while I'm not a database expert, that's why I hope to collaborate with you to make it happen, whether that's partitioning or something else that we decide to pursue.
A: But right now my best bet on what we should do is to create a new table, partition that table, redesign the application on the backend and frontend to actually make use of the other table as well, and gradually move old pipelines, which we would call archived pipelines, to that table.
A: We can also slice vertically and, for example, persist YAML, variables, commands, and options inside object storage for archived builds.
A: So these are my ideas, and I just wonder what you think about them, and what you think about this idea of introducing partitioning not by partitioning ci_builds itself, but by creating a separate table that we partition, and devising an iterative plan to move old pipelines and builds into the new partitioned table. This way we would leave ci_builds basically alone for now, but it would become much smaller and much easier to change in the future.
B: Perhaps it would help to expand a little bit on the experience we already have in GitLab with partitioning, because it's not the first time we're doing this. We have a previous example where we were working on the audit events model. That's what we picked at the time when we also introduced partitioning, and it works much in the way you described it: we had a very large table, the audit_events table.
B: We were trying to understand the application behavior: how do you access audit events, how do you work with them? In this case it was relatively simple, because it's a limited feature with relatively limited ways of accessing the data, and it turned out that the time dimension is always the dimension along which we work with the data.
B: That helped us make the decision that audit events are something we want to partition by time, because that's the dimension we always have when we access the data. And this is really where partitioning shines, because then you always have a key available that, when you query the database, lets you go to the right partition.
B: So you don't end up scanning all partitions, or one very large table, but only the one partition, or maybe the few, that you're interested in. That's where the performance benefits come from. And then what we did was, like you explained already, create a new table that is partitioned. It basically has an identical schema compared to the original table.
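A minimal sketch of the pattern described here, assuming PostgreSQL declarative range partitioning; the table and column names are illustrative, not the actual GitLab schema:

```sql
-- A large event-style table, range-partitioned on its time dimension.
-- Note the partition key has to be part of the primary key.
CREATE TABLE audit_events_part (
    id         bigserial,
    author_id  bigint NOT NULL,
    created_at timestamptz NOT NULL,
    details    text,
    PRIMARY KEY (id, created_at)
) PARTITION BY RANGE (created_at);

CREATE TABLE audit_events_202101 PARTITION OF audit_events_part
    FOR VALUES FROM ('2021-01-01') TO ('2021-02-01');

-- Because created_at appears in the filter, the planner prunes every
-- partition except the one covering this range.
SELECT *
  FROM audit_events_part
 WHERE created_at >= '2021-01-15' AND created_at < '2021-01-16';
```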
B: In that case we even had the same problem with the integer primary keys, so we turned the new table's primary key into a bigint, an 8-byte integer.
B: The new table is partitioned, so there is already code that handles creating monthly partitions, because that is something you have to do as you go: not with schema migrations, which are more of a static kind of thing, but on a monthly basis.
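For illustration, the recurring piece of DDL that such partition-management code has to generate each month might look like this (names follow the sketch above and are assumptions):

```sql
-- Created ahead of time, each month, so inserts never hit a
-- missing partition.
CREATE TABLE IF NOT EXISTS audit_events_202102
    PARTITION OF audit_events_part
    FOR VALUES FROM ('2021-02-01') TO ('2021-03-01');
```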
B: So there is code that handles this, and then there was a background migration running for a while, basically copying records over and making sure that updates are also carried over. Then at some point, actually relatively recently, we swapped the tables, so the application picked up the new, partitioned table.
B
If
I'm
wrong,
I
think
we're
still
have
to
drop
the
old
table
that
is
still
around
so
at
the
moment
it
works
vice
versa,
so
we're
inserting
and
updating
copying
stuff
to
the
old
table
just
for
as
a
backup
mechanic,
but
we're
basically
ready
to
drop
that
table.
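A minimal sketch of the copy-and-keep-in-sync batch being described, assuming the batched upsert pattern; the bounds and names are illustrative:

```sql
-- One batch of a background migration copying rows into the new
-- partitioned table. The upsert makes rerunning the same batch
-- harmless and also carries row updates over.
INSERT INTO audit_events_part (id, author_id, created_at, details)
SELECT id, author_id, created_at, details
  FROM audit_events
 WHERE id BETWEEN 1 AND 10000      -- batch bounds tracked by the job
ON CONFLICT (id, created_at)
DO UPDATE SET details = EXCLUDED.details;
```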
B: Postgres 12 is something that fixes this, so this is going to get easier in a couple of months when we drop Postgres 11 support. And then I think the key question that we have to look at first is: what is the application behavior? How do we access that data?
A
And
I
think
that's
a
very
valid
question,
but
I
think
it's
basically
too
early
to
answer
that
question,
because
the
partitioning
k
we
choose
is
going
to
depend
on
how
we
model
access
to
archived
pipelines,
and
we
can
model
that
on
many
different
ways.
A
The
this
is
something
that
does
not
exist
yet
because
I'm
pretty
sure
that
we
cannot
partition.
Ci
builds
in
in
the
form
that
this
table
exists
today,
like
it's
too
big,
too
complex,
it
has
too
many
foreign
keys
constraints
and
a
lot
of
you
know
statement
groups
that
are
addressing
statement
groups
yeah
that
we
do
have
hundreds
of
them.
Basically,
so
I
feel,
like
you
know
it's:
it's
not
really
possible
to
partition.
Ci
builds.
A
What
is
possible
instead
is
to
build
a
completely
new
storage
model
for
archived
pipelines,
and
the
difference
is
going
to
be
that
the
aircraft
pipelines
do
not
require
being
processable.
I
mean
you
know.
Whenever
a
pipeline
is
archived,
you
will
be
able
to
only
see
it
and
access,
perhaps
for
the
api,
but
you
will
never
be
able
to
trigger
a
manual
action.
You
will
never
be
able
to
retry
a
build.
A
You
will
never
be
able
to.
You
know,
do
anything
with
that
pipeline,
except
of
visualizing
all
the
data
that
we
store,
because
I,
in
my
opinion,
data
durability,
is
important
and
removing
data
like
that.
That's
not
good,
not
a
good
product
decision
in
case
of
pipelines
that
may
be
relevant
to
user
even
after
years,
they
might
want
to
go
to
a
pipeline
that
deployed
like
version
x
created
years
ago
and
want
to
see,
for
example,
volume
for
variables
like
that.
So
I
think
that
we
should
not
remove
data.
A
We
can
move
data,
move
data
to
a
different
table
partition
or
to
object
storage,
but
it
the
data
should
be
there.
So
how
I
basically
envisioned
the
new
feature
and
new
data
model
for
archive
pipelines.
I
can
tell
you
describe,
I
can
describe
you
that
in
a
moment,
but
eventually
it
might
not
be
only
my
decision.
A: Take the table that lists all the pipelines, the latest 20, because we paginate it. At the top of the page you have running pipelines, pending pipelines, all pipelines, and I envision adding an archived pipelines tab where you would see a dropdown, for example 2019, 2020, 2021, and you choose a time range. We could do partitioning yearly, for example, and the moment you choose the range is the moment we know which partition we are going to access, and we display all the matching pipelines when you click on it.
A: Or by month; it's not clear to me yet whether we should partition yearly or monthly, as it depends on the quantity of data and other aspects. And we basically need to model that in the frontend and the UI, and refactor and change the backend to make it possible to display pipelines that are kind of different: they are going to be stored differently, in a different table, perhaps using a different model.
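A sketch of how the yearly idea could map onto partition pruning, with illustrative names (this ci_builds_archive is hypothetical, not an existing table):

```sql
-- A yearly-partitioned archive table: the year picked in the UI
-- supplies the bounds, so the planner touches a single partition.
CREATE TABLE ci_builds_archive (
    id         bigint NOT NULL,
    project_id bigint NOT NULL,
    status     text   NOT NULL,
    created_at timestamptz NOT NULL,
    PRIMARY KEY (id, created_at)
) PARTITION BY RANGE (created_at);

CREATE TABLE ci_builds_archive_2020 PARTITION OF ci_builds_archive
    FOR VALUES FROM ('2020-01-01') TO ('2021-01-01');

-- Choosing "2020" in the archived-pipelines dropdown becomes:
SELECT id, status, created_at
  FROM ci_builds_archive
 WHERE project_id = 42
   AND created_at >= '2020-01-01' AND created_at < '2021-01-01'
 ORDER BY created_at DESC
 LIMIT 20;
```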
A: So that's basically how I see it. As for how we model it in the frontend, the product team will need to make a conscious decision about how users want to access their pipelines after they are archived, and this decision will actually influence which partition key or keys we are going to use. Right now it's too early to tell, but this is how I see it. Do you think it makes sense?
B: Sure. Like you said, this is not only a technical discussion; it's first and foremost a product decision to make. On the technical side alone, I wouldn't be able to see why, or whether, we would do it any differently if we didn't have archived pipelines. It could even be that, from a product side, it makes sense to always access your pipelines by month.
A: One reason is that pipeline queuing has been modeled at the database level, so whenever a runner asks GitLab to provide a build, we run a huge statement that is one page long if you print it, and it's very complex, using subqueries and complex joins. If we add partitioning to the equation, it might be very inefficient or even impossible to do that properly. Imagine queuing that spans multiple partitions.
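A simplified, hypothetical illustration of that conflict: if ci_builds were range-partitioned by time, a queuing query filters only on status and runner attributes, so no partition key is available and every partition has to be visited (the real statement is vastly larger):

```sql
-- No partition-key predicate, so the planner cannot prune: with N
-- partitions this fans out into N scans on every runner poll.
SELECT id
  FROM ci_builds
 WHERE status = 'pending'
   AND runner_id IS NULL
 ORDER BY id
 LIMIT 1
 FOR UPDATE SKIP LOCKED;
```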
A
What
happens
if
you
go
to
a
pipeline
that
you
created
a
year
ago
and
you
create
create
retrieve
build?
In
that
case,
the
new
build
is
going
to
be
created
and
it's
going
to
become
pending
and
which
partition.
Should
it
go
like
how
the
queueing
would
work
in
that
case.
So
this
is
one
of
many
many
caveats
of
how
ci
works
and
yeah.
B
Isn't
that
also
a
way
of
saying
when
we
look
at
the
current
data
model,
then
there
is,
we
basically
only
have
one
model
for
ci
bills
right.
This
is
also
the
reason
why
it's
so
wide,
because
we
cover
all
those
those
cases
and
it
sounds
a
bit
like
we're.
Also
mixing
up
different
needs
like
one
is,
storing
the
build
information
and
then
the
other
one
is
the
queuing
which
is
sort
of
a
way
of
saying
that
across
the
site
I
want
to
see
all
pending
builds.
B
This
is
not
going
to
work,
but
basically
maybe
we
can
make
that
a
part
of
the
discussion
that
we
say.
We
know
that
we
have
this
queuing
problem
and
we
also
need
to
store
the
data
in
some
way.
Perhaps
we
can
treat
those
as
separate
ones
and
find
a
good
solution
for
the
queueing
problem
and
also
for
the
for
the
data
storage
and
that
can
be
postposed
to
for
the
queueing
side.
But
it's
just
that.
We
split
those
needs,
would
that
make.
A
Sense,
I
completely
agree,
and
I
think
we
should
redesign
queueing
and
but
I
also
want
to
be
pragmatic.
A
It's
not
going
to
happen
overnight
or
before
we
devise
a
way
to
partition,
ci
builds
and
in
order
to
actually
redesign
queueing,
we
should
improve.
Ci
builds
table
anyway,
and
I
feel
like
these
are
like
separate
things
we
should
do
in
the
ci.
But
partitioning
has
a
bigger
priority
right
now,
because
of
the
all
the
database
quality
okrs
being
created
and
yeah.
It's
clear
that
you
know
we
are
also
running
out
of
the
primary
key,
so
redesigning
queueing
is
going
to
take
a
lot
of
time
as
well.
B: Can I ask a question?

A: Yes, of course.

B: Just from a strategy point of view: I know you were working on the sort of bigger-picture blueprint redesign for this, and it sounds a bit like we're narrowing the focus and concentrating only on the database side.
B: And saying, well, database partitioning is the one step that we want to take. How does it relate to the bigger picture? Is that something that's still in progress, or have you given that one up to focus on this?
B
So
we're
saying
that
we
need
to
do
database
partitioning
because
we
have
those
database
problems
right
and
at
the
same
time
I
know
there
is
a
lot
more
discussions
going
on
about
redesigning
ci
completely.
B
A: That's a good question. I think it's clear that we need to redesign CI in a way that can sustain further growth and scale even more, and there are limits to how efficient it can be. But I feel we should move incrementally, and in order to actually improve CI even more, we need a way to migrate.
A: We tried moving data out of the ci_builds table, while also updating every row, two or three years ago, and back then it was a big problem. It caused a ton of table bloat; it caused background migrations running for weeks. And right now it's even difficult to take a step towards migrating anything, because this table is too large.
A
So
in
my
mind
you
know,
addressing
partitioning
can
allow
us
to
slightly
improve
background
migrations
and
if
we
make
them
possibly,
we
should
make
it
possible
for
them
to
run
in
parallel.
For
example,
like
that's
many
one
of
many
ideas,
we
can
actually
resolve
that
problem
of
not
being
able
to
migrate
data
from
this
table
or
within
this
table,
and
this
can
actually
unblock
all
the
future
redesigns
and
improvements
for
the
ci
data
mode.
That
doesn't
make
sense.
B
It
does
yeah
and
it's
it's
very
painful
today-
to
do
these
kind
of
things
and
we
can
improve
that.
But
I
think
I
would
still
say
that
looking
just
looking
at
the
size
and
the
number
of
records,
if
we
want
to
migrate
the
data
and
copy
that
over
into
into
a
new
table
structure,
I
think
we
have
one
shot
of
doing
that
within
the
next
year.
Basically
or
you
know,
roughly
speaking,
but
that's
my
that's
my
feeling
about
this.
B
I
think
we
can
still
improve
on
the
on
the
background
migration
side,
copying
data
over
making
that
faster
and
all
that,
but
it's
still
a
huge
undertaking.
So
I
would
just
based
on
the
experience
with
audit
events
and
other
background
migrations
that
we
did.
I
would
expect
that
if
we
approach
that
we
copy
data
over,
we
have
a
good
chance
that
this
takes
a
long
time
for
us
to
come
to
finish
completely
right
and
that's
why.
B
I
think
we
need
to
be
very
careful
with
our
expectations
towards
partitioning
and
what
problems
this
is
solving
for
us.
A
Yeah
I
mean
transforming
data
and
copying
to
a
different
place
or
within
you
know,
all
creating
a
column
and
transforming
data
and
moving
it
to
separate
to
the
other
column
and
like
I
can
use
many
different
transformations
that
we
should
make.
For
example,
instead
of
having
a
hard
code
that,
like
hardcoded
like
environment,
referenced
as
a
character
varying
string,
we
could
create
a
separate
table,
have
a
environment,
you
know
row
and
then
reference
this
as
an
id
like
partially
is
done
already,
but
there
is
some.
B: I don't see how partitioning improves that for us. We would still update the full column; you still have to deal with all the records, and yes, those will live inside different partitions, but you still have to update all of them. So for those scenarios partitioning doesn't help us, I think, unless I'm missing something.
A
So
that's
that's
a
very
interesting
discussion
because
in
in
the
past,
I've
heard
about,
for
example,
being
able
to
run
more
vacuum
workers
to
make
it
easier
to
handle
table
bloat.
Whenever
you
update
that
many
records,
I've
heard
that
it
might
be
possible
to
actually
execute
migration
in
parallel
on
every
partition,
instead
of
doing
the
sequentially
record
by
record-
and
there
are-
you
know
a
few
other
things
that
I've
heard
could
be
possible
with
partitioning
that
are
not
possible
when
you
don't
have
partitioning.
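A hypothetical sketch of that per-partition parallelism, reusing the illustrative archive schema from earlier (a 2021 partition is assumed to exist alongside the 2020 one):

```sql
-- Separate workers can run these concurrently; each statement's cost
-- is bounded by its own partition, and autovacuum then cleans each
-- partition independently.
UPDATE ci_builds_archive_2020
   SET status = 'archived'
 WHERE status = 'success';

UPDATE ci_builds_archive_2021
   SET status = 'archived'
 WHERE status = 'success';
```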
B
That's
that's
true.
For
the
vacuum
side,
on
the
sort
of
more
the
operational
side,
this
becomes
easier
to
manage,
at
least
but
still
from
a
from
a
sort
of
application
perspective
from
how
the
how
you
deal
with
data.
How
do
you
write
those
migrations,
not
a
lot
of
changes,
so
you
don't
really
benefit
from
that
side
from
partitioning.
A
Like
I,
I
mean,
if
we
really
want
to
transform
cic
data
model,
we
would
need
to
improve
data
migration,
background
migrations
as
well
like
this
mechanism.
In
my
opinion,
it's
not
reliable
enough
right
now
and
yeah
and
there's
definitely
some
room
for
improvement.
If
you
want
to
make
it
much
faster
and
much
more
reliable.
A
And
I
I
think
that
you
know
partitioning
can
actually
help
with
migrating
data
data
once
we
improve
background
migrations
as
well.
But
what?
What
are
your
thoughts
about
that.
B
B
C: Partitioning won't solve every problem, but it can solve some problems. As you said, we could at some point start batching in a way that creates different batches over different partitions.
C: From my perspective, this is orthogonal to the discussion of how we can make background migrations better. We also agree that we have seen issues with background migrations not being reliable enough and not being able to be split properly: how do you define the batch size, how can you make batches dynamic?
C
Those
are
a
lot
of
problems
that
we
also
are
thinking
about,
but
I'm
not
sure
if
they
are
only
tied
to
partitioning
or
not
partition
tables.
This
is
something
we
have
to
work
towards,
but
this
is
also
something
vertical
to
all,
whether
we
need
partitioning
or
not
and
from
my
perspective
partitioning
is
needed
if
you
can
do
partition
pruning
properly,
and
that
means
that
when
you
clear
the
data,
when
you
are
trying
to
fetch
data,
that
you
can
focus
the
query
to
check
only
one
or
two
partitions
out
of
100.
B
C
Of
200
partitions
and
that
will
make
things
way
faster,
both
while
selecting
and
also
while
upgrading
the
updating
data,
and
I
think
that
that's
the
important
thing
to
focus
when
we
are
thinking
about
partitioning.
A
And
I
think
that
you
know
basically
I
think
that's
exactly
what
we
want
to
achieve.
We
want
to
make
ci
build
stable,
very
small,
because
every
time,
a
runner
processes
a
job.
We
do
see
a
ton
of
reads:
a
ton
of
rights
and
a
ton
of
filtering
and
searching
through
this
table
and
with
you
know,
introducing
another
table
called.
For
example,
ci
builds
archive
with
a
slightly
different
schema.
A
We
can
actually
make
ci
builds
much
smaller,
but
I
I
don't
think
it's
reasonable
approach
to
partition
ci
bills.
Right
now.
We
can
partition
ci
builds
archive,
but
we
might
decide
not
to
do
that
like
if,
if
the
benefit
is
not
going
to
be
significant-
and
it's
you
know
up
to
you
guys
to
tell
me
that
it
might
not
be
then
perhaps
the
first
reasonable
iteration
is
to
actually
extract
see.
A
I
built
archive
table
migrate,
a
ton
of
builds
to
that
table,
and
and
that's
it
perhaps
it's
going
to
make
ci
build
stable,
sufficiently
small
and
we
might
not
need
to
partition
the
cia
built
archive,
but
from
what
I've
heard
from
nikolai
and
jose.
B: Just as a general feeling, I think I still want to focus the discussion more on the problem we're trying to solve, or at least I don't understand what the exact problems are that we're trying to solve. To give an example, let's say we are concerned with the queuing mechanism that you described, where we have those gigantic queries that go across everything and are very slow, and presumably also pretty frequent.
B
If
that
was
the
problem
that
we're
trying
to
solve,
we
could
say
that.
Well,
we
leave
this
in
place
as
it
is
the
ci
builds
table,
but
we
find
a
good
model
that
allows
us
to
do
the
queueing.
B
We
make
that
as
as
narrow
as
possible,
and
then
we
we
have
duplicate
data,
because
that
information
presumably
also
lives
in
the
current
table.
That's.
B: In the case that I just explained, you don't migrate any data; you just start creating a new queuing mechanism and you start using it, and perhaps that takes a little bit of time to get into.
C: So if your approach is to go through in batches of ten thousand each time, it will take the same time. Some things will be faster: index lookups will be faster, other things will be faster. Maybe we can start discussing and thinking about a way to access multiple partitions at the same time, which would for sure make things faster, but it's not like it will make everything faster, because if you have to copy a billion things, you have to copy a billion things. That's it.
C
You
have
one
disk,
let's
say
one
disk
and
you
have
to
move
one
billion
things
from
one
place
to
another
place.
So,
however,
with
chunk
things,
the
cost
is
the
same.
The
lookups
will
may
be
faster
and
more
parallelized
version
may
be
available,
but
either
way
you
have
to
go
through
the
process
of
coping
a
billion
things.
C
That's
true,
that's
100
true,
but
if
you
and
the
and
we
agree
that
the
bloating
is
affected,
the
size
of
the
index
is
affected,
which
is
very
important
because
smaller
indexes,
you
have
a
bigger
partner
in
memory,
et
cetera,
et
cetera,
et
cetera.
But
this
is
not
like
it
will
move
you
from.
C
It's
not
like
you
will
go
down
to
two
days.
You
will
go
down,
but
what
we
have
to
understand
here
is
that
this
is
not
a
magical
one
and
what's
very
important
here-
is
that
with
partitioning,
because
this
is
a
physical
partition,
it's
a
physical,
changing!
You
win
something
you
lose
something
so,
for
example,
if
we
partition-
let's
say
by
creating
that,
but
you
want
to
process
by
expire
that,
because
you
want
to
to
process
expired
pipelines
if
those
are
correlated
in
this
case,
but.
C
You
will
have
to
visit
marketing
partitions,
and
this
is
a
random
example
I'm
giving
now,
which
means
in
that
case
you
won't
gain
anything
so,
for
example,
for
the
cleanup
jobs
that
go
in
the
expire,
job
pipelines,
those
won't
be
affected
and
maybe
worse
from
an
execution
perspective.
A
I
think
I
understand
you
know
how
to
actually
what
to
do
to
benefit
from
partitioning,
and
I
understand
that
partitioning
can
go
things
much
worse
when
we
are
not
mindful
about,
for
example,
adding
the
you
know
keys,
we
partitioned
right
with
the
workloads
right
that
that's
clear
what
I'm
trying
to
understand.
A: But it's not clear to me what message you're trying to convey, so can you elaborate on that?
B
I
think,
in
order
to
to
say
something
about
the
quality
of
a
partitioning
approach,
so
basically
pick
some
key
partition
a
table.
You
really
have
to
understand
what
what
is
the
problem
that
we
are
solving
like?
What
are
the
queries
that
we
are
making?
How
do
we
access
that
data?
How
does
that
partitioning
approach
help
us
there?
B
What
is
our
goal
right
if,
if
table
bloat
is
our
only
problem,
then
partitioning
is
probably
a
really
good
tool
right,
but
I
think
I
I
but
I
mean
we
know
we
have
more
problems
than
table
load
and
maybe
that's
not
the
most
important
one.
So
that's
what
I'm,
what
I'm
still
missing
from
that
conversation
is
sort
of
saying
what
is
what
what
are
ways
that
we
access
this
data?
B
A: We're seeing problems related to the size of the table in that every statement is very slow. We might even be using an index, but sometimes it's not enough; the table is just too large, and engineers are affected by that, because something works in the GDK, but then when it hits production it doesn't work, because the table is just too big.
A
We
have
had
like
a
lot
of
merch
requests
when
someone
wanted
to
migrate
the
data
because,
for
example,
this
column
is
not
used
anymore
and
we
would
like
to
extract
this
data
to
somewhere
else
or
you
know,
there
is
a
a
column
that
is
basically
nil
and
we
would
like
to
backfill
the
data
in
there
and
there.
A
You
know
many
problems
like
that
are
affecting
development
velocity
and
the
idea
I
had
is
to
actually
you
know,
introduce
the
ci
builds
archive
table
that
is
going
to
have
a
different
schema
and
if
we
move
95
percent
of
builds
to
that
table,
the
ci
builds
table
is
going
to
be
much
more
manageable.
It's
going
to
be
much
smaller.
The
index
size
is
going
to
be
much
smaller.
The
amount
of
row
and
reference
from
external
tables
is
going
to
be
much
smaller.
B
It
does
yeah,
but
it's
also
a
way
of
saying
that
everything
depends
on
the
data
access
yeah.
So
let's
say
you
have
a
very
large
table.
That
itself
is
not
a
problem.
It's
just
a
way
of
saying
how
much
data
there
is
right.
If
you
have
a
data
access
type.
That's
that's
always
looking
up
records
by
its
primary
key,
not
a
problem.
You
can
have
very
large
tables
for
that.
B
If
you
have
in
addition
to
that,
you're
also
scanning
on
a
column
and
have
a
very
expensive
query
where
you
can't
make
use
of
the
one
index,
only
large
table
becomes
much
bigger
of
a
problem.
So
basically
that's
what
I'm?
What
I
mean
with
looking
at
the
problems
that
we're
just
trying
to
solve
is
by
which
perspectives
do
you
have
on
the
data?
How
do
you
access
with
that
data.
A
So,
and
that's
that's
a
very
good
question,
if
you
formulate
this
way,
because
I
think
the
answer
is
that
we
have
so
many
patterns
of
using
this
table,
that
we
can't
even
tell
how
we
are
accessing
data,
we
presumably
do
have
hundreds
of
queries
that
are
accessing
ci
builds
and
there's
almost
no
way
to
optimize
that,
given
the
current
size
of
the
table
and
observability
mechanisms
like,
I
think
it
was
jose
or
nikolai
that
tried
to
collect
or
the
statement
groups
with
cia
build
stable
involved,
and
although
we
can
probably
duplicate
this
set
of
statement
groups
even
more
like
it's
quite
clear
that
if
there
is
more
than
a
few
hundred
queries,
different
queries
that
somehow
join
or
use
the
ci
builds
table.
A
So
it's
clear
that
you
know
trying
to
address
how
we
are
using
this
table
is
not
going
to
help
us
or
you
know
like
it
might
be
a
few
times
effort
there
is
like
the
application
is
too
complex.
Already
the
coupling
to
the
table
is
just
too
big.
What
we
can
do
instead
is
to
tell
the
application
that
look.
95
of
that
of
these
builds
are
archived.
A
We
can
move
them
to
a
completely
different
schema,
completely
different
table
and
you
will
need
to
deal
with
like
only
the
five
percent
and
that's
something
that
I
feel
is
attainable.
That
is,
you
know
something
we
can
do
because
optimizing
all
the
queries
optimizing
all
the
usages
patterns,
it's
simply
impossible,
given
the
size
of
this
application
in
the
amount
of
complexity
and
cutting.
B
That's
a
bit
surprising,
though
I
mean
you
know
just
coming
in
from
from
having
an
ideal
idea
about
how
to
approach
it
when
you
think
about
ci
data.
This
is
always
within
a
pipeline,
probably
within
a
project
or
at
least
within
a
namespace
right.
B
So
when
you
look
at
queries
today
that
we're
making,
what
I
would
expect
to
see
is
a
lot
of
queries
that
happen
inside
the
namespace.
You
go
to
the
pipeline,
tab
and
stuff
like
that,
and
it
can
well
be
the
case
that
these
queries
don't
have
any
additional
namespace
id
equals
or
project
id
equals,
filter
and
stuff
like
that
today,
right,
I.
A
Think
that's
not
needed.
The
only
thing
that
can
not
be
scoped
to
a
namespace
or
project
is
queuing
because
is
an
instance-wide
thing
right.
A
So,
however,
we
address
that
the
queuing
will
eventually
need
to
access
multiple
partitions
and
that's
the
reason
why
we
cannot
really
partition
by
the
namespace
or
project
we
might,
you
know
actually
need
to
rework
queuing,
but
the
extent
of
changes
would
be
so
huge
that
we
probably
would
need
to
build
a
separate
service
that
would
have
has
it
its
own
database
and
whenever
there
is
a
build,
we
would
need
to
push
the
push
the
bill
to
the
service
and
the
queuing
could
happen
there.
A
So
this
way
you
know,
we
could
actually
make
the
queuing
possible
when
we
partitioned
by
the
namespace
or
project.
But
I
we
try
to
like
there's
the.
If
there's
this
issue
about
extracting
cicd
daemon,
we
wanted
to
make
it
like
a
golang
based
service
with
a
separate
database,
a
separate
queuing
mechanism
that
would
all
only
hold
active
builds
that
are
being
either
processed
or
enqueued
right,
but
we
know
that
it's
not
going
to
happen
anytime
soon.
A
The
effort
and
the
investment
would
be
enormous,
and
that
does
not
seem
like
something
we
can
do
anytime
soon.
We
might
need
to
do
that
one
day,
but
not
anytime
soon.
A
So
with
with
that,
you
know
in
mind
that
we
cannot
really
partition
by
project
or
namespace.
B
Do
you
see
what
I'm
saying
when,
when
we
talk
about
looking
at
the
current
increase
and
not
being
able
to
spot
the
common
key
that
we
could
use,
this
doesn't
really
say
that
there
is
no
such
common
key
right.
B: I mean, I see what you're saying about queuing, and this goes back to maybe doing that differently or splitting those models. But perhaps we can take that one step further and, when looking at all those queries, figure out whether it would be possible to use project_id as a key, or something like that, just to explore those options a bit further.
A
Actually,
I
have
more
like
practical
question:
what
is
the
cost
of
having
a
partition
if
we
decide
to
partition
by
a
namespace
or
a
project,
and
we
have
like
at
one
of
three
users
that
are
going
to
have
one
pipeline?
We
are
going
to
create
a
partition
for
them
because
they
do
have
a
new
project
is
what
is
the
cost
of
maintaining
such
a
partition?
That
has
only
a
handful
of
entries.
B: What you can do is hash partitioning based on the project_id, for example. That doesn't give you same-size partitions, and there can be hotspots, so this can have some problems, but I would not recommend creating partitions for selected projects or something like that.
C: Projects will be grouped together roughly equally.
A
But
wait
like
if
we
devise
a
hashed
based
partitioning
key
and
we
have
let's
say
500
partitions.
Yet
we
do
have
millions
of
projects.
So
does
it
mean
that
it's
possible
that
multiple
partitions
are
going
to
hold,
builds
for
the
same
project.
C: No. If it was hashed by project_id, everything from the same project will be in the same partition, but you will have, whatever your ratio is, that many different projects in the same partition.
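A minimal sketch of the hash-partitioning idea just described, assuming PostgreSQL hash partitions (four partitions here instead of the 500 discussed; names are illustrative):

```sql
-- Every project_id hashes to exactly one partition, so a project's
-- builds are never split across partitions, but each partition
-- holds many different projects.
CREATE TABLE ci_builds_hashed (
    id         bigint NOT NULL,
    project_id bigint NOT NULL,
    status     text,
    PRIMARY KEY (id, project_id)
) PARTITION BY HASH (project_id);

CREATE TABLE ci_builds_hashed_0 PARTITION OF ci_builds_hashed
    FOR VALUES WITH (MODULUS 4, REMAINDER 0);
CREATE TABLE ci_builds_hashed_1 PARTITION OF ci_builds_hashed
    FOR VALUES WITH (MODULUS 4, REMAINDER 1);
CREATE TABLE ci_builds_hashed_2 PARTITION OF ci_builds_hashed
    FOR VALUES WITH (MODULUS 4, REMAINDER 2);
CREATE TABLE ci_builds_hashed_3 PARTITION OF ci_builds_hashed
    FOR VALUES WITH (MODULUS 4, REMAINDER 3);

-- A project_id filter prunes to the single matching hash partition.
SELECT count(*) FROM ci_builds_hashed WHERE project_id = 42;
```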
A
And
so
how
do
we
model
the
multi-level
nested
groups
in
that
case,
because
this
is
how
gitlab
works
right
there
are
multi-level,
like
group
can
include
a
group
and
include
a
group,
and
you
know
there
is
some
nesting
involved,
and
this
can
actually
affect
the
strategy.
But,
like
the
strategy
like,
we
can
explore
that
of
course.
But
how
do
we
fix
the
wing
in
that
particular
case?.
B
Right,
so
this
is
basically
there's
this
partitioning
scheme
where
you
always
have
the
project
id.
So
this
is
great
right
and
then
there
is
this
additional
perspective
where
you
say
like
give
me
all
the
pending
builds
based
on
their
status.
B
That
is
pending
right,
and
in
this
case,
what
you
do
is
actually
split.
Those
like
have
two
different
data
models
and
sort
of
make
make
one
it's
an
optimization
right.
It's
it's
a
way
of
saying
that
we
optimize.
A
Yeah,
so
it
means
that
you
know
we
we
cannot
have
a
separate
partition
for
pending,
builds
only
or
active
bills
like
for
queueing,
it
might
work,
but
it
it's
not
going
to
work
to
for
pipeline
visualization
right,
because
someone
might
want
to
see
a
pipeline
and
it
does
not
make
sense
to
scan
multiple
partitions
to
have
only
one
pipeline
graph.
If
we
you
know
make
pending,
builds,
be
written
to
a
separate
partition.
A
So
it
feels
like
you
know
that
two
mechanisms
are
not
very
like
compatible
right
now,
because
we
might
need
to
rework
cueing.
But
this
is
not
going
to
be
a
simple
problem
to
solve.
B
Yes,
but
that's,
I
think,
that's
ultimately.
The
point
is
that
we
have
different
ways
of
accessing
the
data
and
if
we
choose
partitioning
we
always
choose,
we
always
favor
one
right,
which
might
mean
that
it
could
still
work
for
for
the
other.
But
in
many
cases
it's
just
going
to
break
right
because
it
scans
all
those
partitions.
Well.
A
Not
not
really
because
we
can
have
two
different
tables
like
right
now
we
have
ci
builds,
but
we
can
have
ci
builds
and
use
it.
The
old
way
and
we
can
have
ci,
builds
archived
that
has
a
different
schema,
different
access
patterns
and
it's
basically
something
that
is
kind
of
you
know
separate
from
ci
builds.
We
might
you
know
at
the
end
of
the
month
migrate.
C
And
what
you're
saying
is
100
valid,
but
the
the.
B
C
There
is
that
this
is
unrelated
partitioning,
so
having
multiple
entities
or
marketing
models
that
we
are
going
to
use.
Those
are
valid
solutions
that
we
may
have
to
do
while
also
thinking
about
partitioning,
but
our
feedback
there
is
that
partitioning
is
not
going
to
solve
all
problems.
You
may
also
need
to
have
you
know
materialized
tables
or
secondary
models,
or
what
you
are
saying
so
that
you
can
support
multi
and
different
access
patterns
and
yeah.
A
I
completely
already,
I
think
that
no
one
says
here
that
you
know
partitioning
is
going
to
be
the
only
valid
solution.
I
mean
that
we
need
to
redesign
our
application
in
some
way
or
another.
The
idea
with
the
archive
build
seems
the
most
attainable
to
me,
and
then
you
know
these
are
like
having
separate
table
for
the
archived
builds
is
not
an
orthogonal
problem
to
partitioning,
because
if
we
want
to
partition
the
separate
table,
we
need
to
set
up
and
configure
partitioning
from
the
day
zero.
A
Otherwise
going
back
in
and
partitioning
the
table
once
we
move
all
the
data
in
there
like
it's
going
to
be
again
almost
impossible.
So
if
you
want
to
create
a
ci
builds
archive
table,
we
will
need
to
device
a
partitioning
model
for
it
before
we
actually
start
to
rebuilding
the
application
on
backhand
and
front
end,
because
the
partitioning
model
is
going
to
dictate
the
changes
we
need
to
make
on
backhand
and
front.
C
From
my
from
my
perspective-
and
I
assume
that
others
will
agree
in
this
call
as
a
database
guy,
if
you
could
give
me
a
solution
where
we
only
have,
we
have
a
partitions
that
are
below
50
gigabytes
and
I
will
be
happy.
So
it's
not
like
we
don't
like
partitions.
C
My
tables
small,
as
small
as
possible-
that's
the
best
case.
Possibly
we
I
have
a
fast
indexes.
Lookups
are
the
best,
but
we
have
to
think
about
how
to
manage
to
do
that.
And
then
there
are
a
lot
of
details.
I
want
to
to
address
something
that
jose.
C
Partitioning
multi-level
partition,
I
think
that
we
can
dive
into
those
as
we
move
forward
and
for
sure
there
are
solutions
like
you
know.
You
can
start
by
name
space
and
then
going
a
time
based
or
do
whatever,
but
most
probably
that's
after
we
discussed
the
core
accessing
mechanism.
B
Maybe
coming
back
to
the
blueprint
that
you
described,
I
think,
would
would
be
awesome
to
have
as
a
as
a
sort
of
document
describing
those
things
as
basically
the
how
do
we?
How
do
we
plan
to
access
data
or
how
do
we
access
data
today?
How
do
we
plan
to
change
that?
B
What
is
the
partitioning
strategy
and
how
does
that
help
for
those
cases
and
then
perhaps
also
dive
into
the
retention
strategy?
If
that
is
possible
like
do
we
need
to
keep
data
forever?
Where
do
we
not
do
that,
like
in
the
what
you
described,
we
have
the
archive
table
and
the
new
table
like
there
is
a
retention
strategy
on
a
new
table.
How
that
is
implemented.
A
So
that's
interesting
because
in
my
mind,
all
the
usage
patterns
for
the
new
table
that
is
going
to
hold
the
archive
builds
depend
on
how
we
partition
the
table,
because
how
we
partition
the
table
will
depend
on
the
foreign
keys.
We
can
have
constraints
we
can
have,
and
all
the
caveats
and
limitations
of
partitioning
will
need
to
be
addressed
in
the
application
in
how
we
can
model
access
to
that
table.
A: We do have patterns for ci_builds, but we know that these patterns are so chaotic and so unpredictable that we cannot really describe them and think about partitioning this particular table easily without completely reworking the patterns, and this rework would result in the new table of builds that are no longer active.
A
Processing
pipelines
how
eq
builds
how
we
update
them,
how
you
know
the
processing
works?
What
are
the
dependencies?
What
is
dark,
what
is
like
stage,
and
all
these
things,
like
that's
90
and
remaining
10
percent-
is
perhaps
how
we
visualize
that
and
moving
to
a
new
new
like
table
is,
is
going
to
require
reworking
this
10
percent
of
how
we
visualize
and
the
90
of
how
we
process
pipelines
is
not
going
to
be
changed
because
only
what
is
left
in
the
ci
builds
table
will
need
to
be
processed.
A: I don't know if you understand how it looks in my mind, but I feel that having a separate table is the only way forward that will allow us to iterate on this by making two-way-door decisions.
C
Can
I
ask
something:
how
are
we
going
to
what's
your
plan
about
moving
to
the
archives
debbie,
I
know
that
we
discussed
it
already,
but
can
you,
how
are
we
going
to
move
for
data?
Are
we
going
to
say
that
we
haven't.
A
Going
to
get
background
migration,
I
think
it
needs
to
be
an
in
application
logic.
It's
not
going
to
be
a
migration,
because
we
will
need
to
have
a
worker,
presumably
working
on
being
scheduled
by
some
kind
of
a
chrome
job
or
basically,
you
know
a
crown
worker
that
is
going
to
gradually
find
old,
builds
and
move
them
to
the
new.
A
Table,
okay:
in
that
way,
we
would
actually,
you
know,
wouldn't
need
to
be
concerned
about
using
backward
migrations,
especially
on
premises,
and
this
would
be
this
kind
of
mechanism
that
works
all
the
time
every
day,
every
every
week,
every
month,
it's
actually
moving
old
builds
with.
You
know
this
actually
message
to
a
user
that
such
build
moved
to
the
new
table
is
never
going
to
be
processable
again.
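A hypothetical sketch of what one batch of such a cron-driven mover could look like in SQL; the cutoff, batch size, statuses, and column list are illustrative assumptions, not the actual design:

```sql
-- Move one bounded batch of old, finished builds into the archive
-- table and delete them from the hot table in a single transaction.
WITH candidates AS (
    SELECT id
      FROM ci_builds
     WHERE finished_at < now() - interval '3 months'
       AND status IN ('success', 'failed', 'canceled')
     ORDER BY finished_at
     LIMIT 1000              -- bounded batch; the cron job reruns it
     FOR UPDATE SKIP LOCKED
), moved AS (
    DELETE FROM ci_builds b
     USING candidates c
     WHERE b.id = c.id
 RETURNING b.id, b.project_id, b.status, b.finished_at
)
INSERT INTO ci_builds_archive (id, project_id, status, finished_at)
SELECT id, project_id, status, finished_at FROM moved;
```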
A: That's a very good question, and I feel that data durability is important. In my opinion, and that's my personal opinion, we should not remove them. We can move them somewhere else, for example object storage, but they should be there, and we should also make it possible for users to filter them by month or by year through the API. So essentially I envision a completely new API endpoint for archived builds.
A: But ultimately, again, it's not an engineering decision; it's a product decision. We can tell product managers what is easier or better, but if they tell us that, for example, users should have access to enumerating all the builds they have in one POST or GET call, it means that partitioning this table this way might not be possible. So that's again this very interesting situation in which I believe we cannot really partition anything without input from the product team.
A: What is a detail and what is not a detail is very debatable. For example, people might want to enumerate failed builds, so is a build's status a detail or is it not? Is the runner ID a detail, or is it not? I do agree that no one wants to find a build by a full-text search of a script or a YAML variable used; this is not something you can do right now, and presumably it is never going to be something we will allow users to do.
A: We can archive those in object storage and only make them available whenever someone wants to visualize a particular build, but everything else that we store in the ci_builds table, and eventually in the ci_builds_archive table, can be considered not a detail. So that's tough, and this is my realization: it's incredibly surprising that a ton of product decisions need to be made before we actually know how to partition something. That's a very interesting insight.
B
I
just
thought,
maybe
just
as
an
idea
when
you,
assuming
that
normally
people
don't
interact
with
archived,
builds
a
lot
like
this.
Is
you
know
if
we
get
the
timing
right,
at
least
what?
If
you
can
go
into
your
project
and
sort
of
be
able
to
say
like
I
want
to
retrieve
all
my
builds
from
2017,
and
I
accept
that
this
takes
a
couple
of
seconds
minutes
whatever
we
grab
that
from
the
cheap
storage
and
we
import
that
into
the
database,
so
you
can
interact
with
that.
A
This
way
you
cannot
search
them
like
you,
you
cannot
really
like.
You
might
need
to
make
a
api
request,
for
example,
because
I'm
thinking
about
api,
because
people
tend
to
automate
a
lot
of
stuff
for
the
ci,
and
this
way
you
would
need
to
make
a
request
to
the
api
to
tell
gitlab
that
it
needs
to
hydrate,
see
I
build
stable,
given
like
the
time
span,
and
only
then
you
can
filter
that
and
then
after
some
time
it
would
remove.
Data
like
it
feels
complex
and.
A
But
it's
better
to
have
a
simple
solution
done
an
easy
solution
and
keeping
stuff
in
postgresql
feels
still
like
a
simple
solution.
A
B
Also,
the
most
expensive
one
in
a
sense,
that's
what
I
wanted
to
add
about
the
retention
strategy:
you're,
not
not
dropping
data
from
the
database.
That's
I
understand
the
reasoning
for
that,
but
it's
also
a
very
expensive
decision,
and
you
know
we.
B
I
think
we
from
just
from
a
global
perspective
on
the
database,
but
we
still
discuss
two
lessons
retention
strategies
because,
especially
for
free
users,
keeping
data
forever
is
expensive
right
and
it's
not
only
like
the
storage
cost,
but
it's
also
the
like.
We
can
see
the
engineering
side,
that's
a.
A
Good
question,
I
think,
that's
a
very
valid
discussion
and
we
should
have
more
discussions
about
data
durability
versus
you
know
the
cost
of
retaining
everything-
and
this
is
perhaps
discussion
we
should
have
with
product
team
about
you
know
how
to
model
that
for
users
like
because
we
want
him,
we
would
need
to
factor
in
the
plan
that
user
is
on.
Then
it's
going
to
be
it.
A
We,
it
will
need
to
behave
differently
in
on-premises
versus
on
github.com,
and
what
should
we
do
with
companies
that
are
huge
consumers
of
ci,
but
are
on
premises
like
in
in
their
case?
Is
that
like
fine
to
remove
data
or
move
it
somewhere
or
like
you
know,
and
suddenly
it
becomes
a
very
big
and
complex
discussion
about
how
to
model
even
such
a
simple
thing,
as
data
data
retention
for
c
builds
and
and
the
other
I
feel
like
it,
we
cannot
really
remove
or
move
anything
without
having
discussions
like
that.
A
Like
it's
not
expected
to
remove
data
from
database
for
users
running
their
gitlab
installations
on
premises
right,
they
don't,
they
might
not
even
have
object,
storage
configured
because
I'm
not
even
sure
that's
actually
a
required
setting
right
now
so
and
then,
like
is
the
object,
storage
the
place
we
should
move
data
to,
or
should
we
have
a
document
database
for
example,
or
something
like
that
I
can
now
we
start
discussion
about
introducing
a
separate
database
because,
in
my
opinion,
like
object,
storage
is
great,
but
it
might
not
be
the
best
database
out
there.
B
I
just
had
a
similar
thought:
is
it
right
to
think
that
the
archived
builds
data
is
read
only
then
so?
This
is
something
that
never
changes.
Yes
after
it's
been
archived.
Yes,
yes,
so
in
a
sense,
you
could
even
even
argue
that
this
is
sort
of
a
analytical
approach,
or
you
know
we
talked
about
similar
problems
when
you
have
an
analytical
queries,
analytical
data.
B
This
is
mostly
data
that
is
read-only
and
you
access
that
in
a
different
way
and
sort
of
we're
also
arguing
that.
Well,
maybe
we
should
model
for
that
right.
We
should
maybe
have
a
different
different
approach
than
we
are
used
to
with
with
the
application
that
has
has
data
that
is
rewrite
and
changes
all
the
time
which
eventually
goes
to
like
hey.
Do
we
actually
need
a
different
database
to
support
that
better,
even
though
postgres
is
really
good
at
that
too?.
A
Yeah,
so
that
that's
really
interesting,
so
I
I
just
wonder
if
one
action
point
after
this
meeting
could
be
scheduling
a
call
with
a
product
team
member
to
actually
discuss
expectations
around
data
retention
and
archiving
old
builds
like
because
you
know
engineers
we
might
be
surprised
by
what
the
requirements
are.
We
do
not,
you
know,
know
or
understand
all
the
usage
patterns
of
how
big
customers
are
using.
A: There are things that might be very important for them, for example using the API to get data out of GitLab to draw charts about their CI usage: how many failed or successful builds there are, and things like that. So again, it's kind of surprising how important the feedback from the product team might be on this.
B: It's probably going to be a lot. I think that's a really interesting topic, and we're only starting to dig in and understand it.
A: What do you think about rescheduling this call for next week at the same time? There is this agenda; I will link it to the meeting. Perhaps I will have more questions, perhaps you will have more questions, and perhaps I will gather some feedback from the product team, which might actually be interesting.
A
Okay.
So
thank
you
very
much.
I
would
like
to
you
know.
Thank
you
a
lot
for
joining
the
call.
I
think
it
was
very
useful,
like
this
is
a
complex
topic
and
we
need
to
you
know,
align
our
perspectives
and
expectations
and
hopes,
and
eventually
we
might
be
able
to
actually
devise
a
good
strategy
because
I
still
feel
like
my
idea
about
the
separate
table
is
just
an
idea
and
I'm
not
a
database
expert,
and
I
don't
pretend
to
be
so.
B
Yeah-
and
it's
also
the
other
way
around
for
sure
this
is
kind
of
kind
of
the
typical
problem
that
we
have
as
a
database
group.
We
know
very
little
compared
to
you
about
about
ci
domain
and
how
it
works
and
let
alone
any
product
considerations.
So
we
can,
you
know,
I
think
we
understand
our
partitioning
works
or
you
know
how
to
look
at
that
data
model
and
maybe
how
to
approach
it,
but
we
wouldn't
be
able
to
talk
about
anything
cfci
related
without
that
input.
So
that's
why
it's
great.
A
Thank
you
will
schedule
the
next
call
and
thank
you
very
much
and
have
a
great
day.