From YouTube: 2021 05 11 TimescaleDB POC
A: All right, so this is a discussion about a TimescaleDB proof of concept, or demonstration. With that wonderful introduction I'll just hand it over to Nick, and we can start talking about it. So, Nick, you said during the Database Scalability working group that it would be quite easy to put together a TimescaleDB proof of concept, because we had not actually talked about using it for the time-decay blueprint.
B: Right. So for time-series data, Timescale has by now become the de facto standard in the Postgres ecosystem; more and more companies are using it, and it's been developed very well. They've implemented various interesting things related to compression of old data.
They also have a lot of automation — for example, automatic deletion of old chunks. They call them chunks, but they're basically partitions, so partitioning management is fully automated. They also have additional functionality like user-defined actions and materialized views we could use, for example, to track some counters or some aggregates. So they have a lot of functionality and it's improving. I was really impressed by their compression work — it's very interesting; they have a good article.
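For reference, the automated deletion of old chunks mentioned above is exposed in TimescaleDB as `drop_chunks` and, since 2.0, as a background retention policy. A minimal sketch, assuming a time-partitioned hypertable and an illustrative three-month threshold (neither is from the POC):

```sql
-- One-off: drop every chunk whose data is entirely older than 3 months.
SELECT drop_chunks('ci_builds', older_than => INTERVAL '3 months');

-- Or register a background job so old chunks are dropped automatically.
SELECT add_retention_policy('ci_builds', drop_after => INTERVAL '3 months');
```

Because chunks are whole partitions, dropping them is a metadata operation rather than a row-by-row `DELETE`.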
B: I can send it to you — "timescale compression", you can just google it. It was a very deep and impressive piece of material, and it works.
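The compression being referred to is TimescaleDB's native columnar compression of older chunks. A rough sketch of how it is configured — the segment-by and order-by columns here are illustrative choices, not taken from the POC:

```sql
-- Enable compression on the hypertable: rows in a compressed chunk are
-- grouped by project_id and stored ordered by id within each group.
ALTER TABLE ci_builds SET (
  timescaledb.compress,
  timescaledb.compress_segmentby = 'project_id',
  timescaledb.compress_orderby   = 'id DESC'
);

-- Background policy: compress chunks once they are older than a week.
SELECT add_compression_policy('ci_builds', compress_after => INTERVAL '7 days');
```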
It's also open source. Last year, as I remember, they made a decision to make everything — all functionality — open source, and only to require that you not build cloud solutions based on Timescale. So unlike Citus, which distinguishes between open-source and enterprise versions, Timescale decided to have all functionality in the open-source version; they only say, don't build—
B: As I understand it — maybe I'm wrong, I'm not a lawyer — they just forbid creating a cloud offering for time-series data based on Timescale, because they are using their own product for that. What else? We can easily check some tables. As I said, we can check ci_builds in its current shape: split the table into chunks — partitions, basically — with Timescale. I'm sending a link to a very quick-and-dirty proof-of-concept activity I did several days ago. It's in bad form, a snippet, but you can see the results in the comments there. I had some interesting observations; let me point you to an additional document.
B: I was reading different materials about what to do with plans and what to do with ci_builds, and I was curious what the current workload for ci_builds looks like, so I also created this document — I'm sending it to our Zoom chat too. Let me share my screen as well, and we will discuss it.
B: This is basically a dump of pg_stat_statements for ci_builds, ordered by calls in descending order. We can see these queries — I was checking them manually to understand the patterns here; we can automate that, of course. Basically, for some queries we have a single lookup by primary key. I'm sure other people have already done similar analysis, but just to be on the same page.
B: So we have some queries which are basically primary-key lookups based on id, and many, many queries search by commit_id. And commit_id, as I understand, is not a commit id — it's a pipeline id, right? I see a foreign key to the pipelines table, so this commit_id is something special, as I understand, from when we worked on CI builds. That's why, in my experiment, I decided to treat commit_id as time, because I don't see created_at here at all, so we cannot choose created_at as our time column.
B: If we want to partition by time with Timescale — and again, I did a very quick, independent experiment, so consider it dirty data, a dirty proof of concept — I used commit_id as our time, because it grows monotonically: a lower commit_id means older data, a higher one means newer data. It's an integer, of course, but Timescale can work with both timestamps and integers; people do it, and it's okay. Moreover—
B: Right, and this is interesting: if we choose time — for example, we say we want monthly chunks — then old months will have much less data than new months, because rates are increasing. If we choose commit_id instead — I took, I believe, one million as the chunk size, the chunk_time_interval. It's called "time", but only because originally they created it for time; integers are also supported, so we need to ignore that word here. So I took 1 million, but I worked only with partial data.
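A sketch of the setup just described — a hypertable on an integer "time" column with a chunk interval of one million. The exact POC snippet may differ; the integer-now function is only required if background policies are used on an integer-time hypertable:

```sql
-- Turn ci_builds into a hypertable partitioned by commit_id,
-- with each chunk covering 1,000,000 commit_id values.
SELECT create_hypertable('ci_builds', 'commit_id',
                         chunk_time_interval => 1000000,
                         migrate_data        => true);

-- For integer "time" columns, policies need to know what "now" means.
CREATE OR REPLACE FUNCTION current_commit_id() RETURNS bigint
LANGUAGE sql STABLE AS
$$ SELECT coalesce(max(commit_id), 0) FROM ci_builds $$;

SELECT set_integer_now_func('ci_builds', 'current_commit_id');
```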
B: You can see it here — I filled it with only the last two months, I believe, March and April, because it's also a lot of data; you can see a lot of data here. After I did that, I checked some queries very briefly, and I saw that, of course, if a query includes our—
B: I don't have it here — I'll put it here, or wherever you want to discuss it, I can put it there as well. I mean, if a query has a filter on commit_id, it's very fast — super fast. Of course, planning time is increased, but I believe it's the same problem we observed when we implemented the custom partitioning: when we have a lot of partitions, the first time we call a query—
B
It
has
very
big
planning
time.
So
this
is
a
separate
problem.
We
need
to
discuss
it.
There
are
some
approaches
to
it,
but
it's
like
it's
independent
on
the
tool
we
we
use
like
our
own
solution
for
partitioning
or
timescale,
but
the
second
column
and
others
were
very
well
like
I
compared
like
querying
original
table
and
partition
table
with
timescale.
It
worked.
It
worked
well
an
interesting
thing.
Additionally,
we
can
use
two
dimensions
for
partition.
B: It's a step towards sharding, because in Timescale 2 they also have sharding. We can use not only time — commit_id in our case — but also project_id. ci_builds doesn't have a namespace column, so for the dirty proof of concept I used project_id, and I said we want to split it into three parts by project_id, and then into chunks of one million each by commit_id.
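The second, project_id dimension described above maps onto TimescaleDB's `add_dimension`; a minimal sketch with the three hash partitions mentioned in the POC:

```sql
-- Add project_id as a second (hash) partitioning dimension:
-- each commit_id range is further split into 3 parts by project_id.
SELECT add_dimension('ci_builds', 'project_id', number_partitions => 3);
```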
B: So it's two-dimensional partitioning. Additionally, I checked experiments with sharding. It was a little bit more difficult: I had three shards based on project_id, and inside each we had partitions of one million by commit_id. But that's already beyond this discussion, so I'm not going to advocate Timescale for sharding — that's a more difficult topic. But for partitioning, I see we could benefit from it.
B: At the least, we need to explore it, because what they do looks interesting, and the features they have — either we need to use them, or we need to implement the same features ourselves.
C: The document that you shared with us is very interesting, because it shows, you know, the workload over ci_builds. So this is something that we should do either way. But other than that, how is TimescaleDB going to help us on top of using pure Postgres, now that we have the tools?
B: So it's some additional helper mechanism that helps you keep aggregates pre-calculated. Then they have user-defined actions — like background jobs on chunks; I'm not sure that's needed, because we have our own background-job system. Then there's automatically ordering data in chunks physically by an index — kind of a CLUSTER for Postgres, but for chunks only. So they have these features to work with very large volumes of data.
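Hedged sketches of the two features just mentioned — a continuous aggregate for pre-calculated counters, and a reorder policy for the CLUSTER-like physical ordering. The view name, bucket width, and index name are illustrative, not from the POC:

```sql
-- Continuous aggregate: build counts per project per 1M-wide
-- commit_id bucket, maintained incrementally in the background.
CREATE MATERIALIZED VIEW builds_per_bucket
WITH (timescaledb.continuous) AS
SELECT time_bucket(1000000, commit_id) AS bucket,
       project_id,
       count(*) AS builds
FROM ci_builds
GROUP BY bucket, project_id;

-- Reorder policy: like CLUSTER, but applied chunk by chunk
-- (hypothetical index name).
SELECT add_reorder_policy('ci_builds', 'index_ci_builds_on_commit_id');
```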
C: If most of your queries are in what we would call the latest partition — because they are doing a lot of work to keep the latest partition in memory.
C: So if we don't have that case — for example, we just partitioned web_hook_logs by date, and in the case of webhook logs we always query the last seven days; in that case, I assume, TimescaleDB will be perfect. But for all the other cases, if we want to do, you know, normal partitioning — let's say tenant partitioning or whatever — where you don't always access the same, latest partition: would TimescaleDB be useful for those cases, or will it only be for the case where we have, you know, the time-constrained data?
B: Well, it's a good question, actually. First of all, about cache: I don't think the only case where Timescale — or any time-based partitioning — works well is when you access only fresh data. Postgres caches pages, not whole tables, right? So we can cache some old partitions partially and still benefit from it. It also makes maintenance tasks like index recreation better.
B: Why? Because we can parallelize it and utilize more cores to process one single big entity. But with respect to cache, it's page-level, so I don't think you cannot query old data. But of course, with Timescale — and any time-series approach—
B: I would say it's not about SELECTs; it's about UPDATEs, and maybe DELETEs. It works less and less well if you update old data a lot, or delete old data — not just pruning a partition by dropping it completely, but when you need to delete some of a partition's records.
B: Right, right — because updating rows, especially when you need to update some old row and your update moves it to a different partition, is the worst case. But the same thing applies to time-based partitioning in plain Postgres, and to non-time-based partitioning. The Timescale folks open their talks — I've seen it — by saying that all data in the world is time-series data.
B: That's an interesting claim — I'm not 100 percent convinced, but it's interesting to think about it that way. And if you think about GitLab, most of the data — of course not everything, but especially the big tables where we have a lot of data — is produced over time, so we can think about it as time series.
C: Yeah, but that's not the case if we want to shard or partition by namespace, for example, or—
B: Well, yes — two things here. First, if we think about using, for example, project_id as time — as a single dimension to partition by, thinking about it as time — it would not be a good idea, I think, because there is no very old tail that we could compress or even delete.
B: I agree with you, right — but that's true of all time-series partitioning. I know some people try to think: let's take Timescale because we want to deal with partitioning, since it still needs a lot of things implemented, like pruning and management of partitions, right — so let's take it and use it. That's not a good direction to use it in. But they also have the ability to create a second dimension for partitions.
B: So if it's not time-series, time-decay data at all, it's probably not a good idea — that's why I'm not pushing this as a tool for sharding. I will probably raise it a little bit later, if we start thinking about sharding parts of the database — vertical sharding, right? If we go in that direction, we can take the part of the database which truly has a time-series nature, like CI data. In that case, maybe it will be interesting to consider Timescale.
B: My answer is, I think, no restrictions — but I will check with… I don't know who our experts on licenses are.
B: You create the extension — this is what I did; the snippet I sent to our chat should actually have the instructions. If it doesn't, let me know where we'll discuss it — an issue or something — and I will put in instructions for trying it quickly. It's quite easy: we create the extension, adjust shared_preload_libraries — a restart is needed, of course, because it's an extension that requires a restart — and that's it. Then we can create a so-called hypertable — they call it a hypertable; it's a table automatically split into chunks — and start moving data.
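The installation steps just listed amount to the standard TimescaleDB setup — nothing here is specific to the POC:

```sql
-- In postgresql.conf (a server restart is required afterwards):
--   shared_preload_libraries = 'timescaledb'

-- Then, in the target database:
CREATE EXTENSION IF NOT EXISTS timescaledb;
```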
B
There
interesting
question
how
to
migrate
online,
but
it's
it's
so
they
don't
provide
the
tools
for
migrate
online
migration
of
fortunately,
so
you
need
to
develop
it.
That's
it
so
like.
C: So — and I agree — why build the additional tooling? But we have to understand what the problems with enabling it are, and—
C: So is this a plugin that needs building, or is it supported by the—?
B: I think so — we just installed the extension, so it's easy. And what I wanted to say: it also has an interesting timescale schema, like information_schema — you know, the SQL-standard API for meta information about the database, like seeing which columns tables have and so on. They created their own additional schema to expose some stats and so on about what is happening with the data.
A: Okay, I've got to drop off here — I've got another meeting to run to — so I'll create an issue where we can discuss this. It doesn't sound like it's blocking our implementation of anything right now, but it's something we should discuss for accelerating any kind of time decay that we may want to do in the future.