From YouTube: Database Office Hours - 2019-09-28
A
So the question was: there are two operator classes you can use, one is for the trigram index and the other one is for text patterns, and the difference is basically that one is actually a GIN index. So the trigram index is a GIN index, whereas the pattern operator class is typically used on a B-tree index. And the question was whether we should prefer one or the other, and apparently we have a mix of those.
A
So we have mostly trigram indexes, but a few occasions where we use the pattern ops class, and I think it really depends on what we're trying to do. The trigram index is much more powerful in a sense, because you can actually use regular expressions, and a lot of those searches can be answered by using the trigram index. Whereas the pattern operator class on a B-tree only allows for prefix searches.
A
So when you do ABC-wildcard, that's a good way to use the pattern ops class. But if you do wildcard-ABC, then that's a problem, because you can only go into the B-tree from the top, basically, and that works for any prefix search. So even if you do foo, wildcard, ABC, wildcard, it would basically perform a prefix lookup for foo, so for the first string, and then scan all the records and can't make use of the index anymore.
A
Basically, when we're only interested in prefix searches, then the pattern ops is fine. If we want to do more than that, and we typically do wildcard searches at the beginning and the end (so when you search for something, you do wildcard, search term, wildcard), then I think the trigram index is more useful.
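To make the contrast concrete, here is a minimal Python sketch of the trigram idea (my own illustration, not pg_trgm's actual implementation; pg_trgm's word-boundary padding and index recheck machinery are simplified away):

```python
def trigrams(s: str) -> set[str]:
    """Extract the set of three-character substrings (trigrams).
    (pg_trgm also adds word-boundary padding; that is left out here.)"""
    s = s.lower()
    return {s[i:i + 3] for i in range(len(s) - 2)}

def candidate_match(indexed_value: str, search_term: str) -> bool:
    """A row can match '%search_term%' only if its trigram set contains
    every trigram of the search term. This is how a trigram index narrows
    the candidate rows; survivors are rechecked against the real pattern,
    because the containment test alone can produce false positives."""
    return trigrams(search_term) <= trigrams(indexed_value)

# Infix search: answerable via trigrams, unlike a B-tree prefix scan.
assert candidate_match("postgresql", "gres")
assert not candidate_match("hello", "world")
```

Because the index only stores which trigrams occur where, it does not care whether the wildcard is at the start, middle, or end of the pattern.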
A
GIN stands for generalized inverted index, if I'm not mistaken (let me look it up), but basically, you can call it an inverted index, and you can think of it as a hash map where you have a key, and then the key points to all the records that somehow contain this key.
A
For example, text searches in Lucene or Elasticsearch also use inverted indexes, where you basically extract all the terms from a document, so all the words from the documents that you have, and then you create an inverted index that contains each of those words as the key, pointing to a list of documents that contain this word. And then you can do fast lookups by word.
A
So if you're interested in knowing which documents contain a particular word, you just go to that sort of hash map, retrieve that list, and then you're done, basically. But it can also get more complex: let's say you're interested in two words, or a phrase; then you can basically make those two lookups.
A
You get two lists of documents, one for the one word and one for the other word, and if you're interested in knowing all the documents that have both, you basically AND those lists, in a sense. That's sort of how the general index structure works. For a trigram index, it basically means that the keys are trigrams, so consecutive three-letter sequences, and then it is quite interesting to look at how a regular expression gets translated into those lookups.
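The hash-map-with-posting-lists picture above can be sketched in a few lines of Python (a toy illustration of the inverted-index idea, not GIN's actual on-disk structure):

```python
from collections import defaultdict

def build_inverted_index(docs: dict[int, str]) -> dict[str, set[int]]:
    """Map each word (the key) to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

docs = {1: "the quick brown fox", 2: "the lazy dog", 3: "quick dog"}
index = build_inverted_index(docs)

# Single-word lookup: one hash-map access returns the posting list.
assert index["quick"] == {1, 3}

# Two-word query: AND (intersect) the two posting lists.
assert index["quick"] & index["dog"] == {3}
```

For a trigram index the keys would be trigrams instead of whole words, but the lookup-and-intersect mechanics are the same.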
A
I don't think that would work, because the trigram is something you extract from your document, right. It's not that the B-tree... or let's maybe look at how the B-tree works in this case for the prefix search.
C
Hi, hey. So yeah, a B-tree is only for values on one single axis, right, because it's one-dimensional; an R-tree is for two-dimensional values. For trigrams we actually deal with sets, like arrays, so collections of three-letter objects, sets of character objects. It's possible to use trees, but it would be a so-called Russian doll tree, an RD-tree, so we impose this...
C
...had their own operator class, right. Yeah, yeah, I know, I understand what you mean. If we have something like a search term followed by a percent mark, so if the percent mark goes only at the end, it's possible to use a B-tree, but under the hood it will actually be two comparisons, right. So actually we will have two iterations, a greater-than and a less-than, so again it will be single-dimensional.
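That point can be sketched as follows: a prefix search on a one-dimensional ordered structure is just two boundary comparisons. Here a sorted list stands in for the B-tree, and the `"\uffff"` upper bound is a simplification of how a real planner derives the range:

```python
import bisect

def prefix_range_scan(sorted_keys: list[str], prefix: str) -> list[str]:
    """Emulate LIKE 'prefix%' on an ordered structure: descend once to the
    lower bound (>= prefix), then scan until the upper bound (roughly,
    the smallest string greater than every string with that prefix)."""
    lo = bisect.bisect_left(sorted_keys, prefix)
    hi = bisect.bisect_left(sorted_keys, prefix + "\uffff")
    return sorted_keys[lo:hi]

keys = sorted(["foo", "foobar", "food", "bar", "fop", "fo"])
assert prefix_range_scan(keys, "foo") == ["foo", "foobar", "food"]
```

A trailing wildcard maps cleanly onto this one range; a leading wildcard gives you no lower bound to descend to, which is exactly why the B-tree approach stops working there.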
A
In any case it's going to be a problem. No, but say today you have a default value of ten and a lot of records in there. You change it to 20, and new records are going to get 20, unless, you know, they're using the default. And only if you create a new column and put a default value on it: today, with 9.6, you would basically rewrite the table, but with 11 that's not even necessary.
A
Know,
that's
when
you,
when
you
add
a
new
column
right,
you
can,
you
can
either
add
a
column
and
say
it's
it's
nullable.
So
all
the
records
will
be
null.
That's
a
cheap
operation
to
do,
but
as
soon
as
you
want
those
records
to
have
a
default
value,
the
then
all
those
existing
records
they
would
get
this
default
value
right
and
then
for
for
the
absolute
end
version
for
post
quiz.
A
That doesn't mean that, as soon as we add this column with a default, we're rewriting the whole table; with 11 it's a very cheap operation to do that, even with the default value. I think it just keeps track internally of the new default value, without touching existing records.
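A toy Python model of that PostgreSQL 11 behavior as I understand it (the class and method names here are made up; the real mechanism stores the "missing value" in the catalog rather than rewriting rows):

```python
class Table:
    """Toy model of PostgreSQL 11 'fast defaults': adding a column with a
    default records the default once instead of rewriting every row."""
    def __init__(self):
        self.rows = []               # each row is a dict of column -> value
        self.missing_defaults = {}   # column -> default for pre-existing rows

    def insert(self, row: dict):
        self.rows.append(row)

    def add_column_with_default(self, name, default):
        # O(1): no existing row is touched; the default is remembered
        # in the "catalog" and filled in when old rows are read.
        self.missing_defaults[name] = default

    def read(self, row_index, column):
        row = self.rows[row_index]
        return row.get(column, self.missing_defaults.get(column))

t = Table()
t.insert({"id": 1})                    # row written before the column existed
t.add_column_with_default("retries", 10)
t.insert({"id": 2, "retries": 20})     # new row stores the value directly
assert t.read(0, "retries") == 10      # old row: default supplied at read time
assert t.read(1, "retries") == 20
```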
A
That's a pretty interesting one. It's about creating an index on an expression, and the expression consists of basically figuring out what the greatest timestamp is of things. So we have three different timestamps, some with time zones and some without, and basically we want to index the greatest timestamp of those three. And that's a problem, because the expression isn't immutable, and you have to use immutable expressions for indexes. I haven't figured that one out, so I don't know if you want to take a look, but it's kind of interesting.
C
No, if you have timestamp without time zone, it should work for indexes; it should index easily, because it's kind of like text. It's just a value, and that's it; it doesn't depend on our session variables or on our time zone or anything. But with time zone it will be tricky to index, because the value that goes into the index depends on our time zone setting. So we need to convert it to the same time zone, UTC for example.
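A small Python illustration of that point: a timestamp without time zone is just a value, while rendering a zoned timestamp depends on which zone you pick, so only a conversion pinned to a fixed zone is deterministic (in PostgreSQL the analogous pinning would be an explicit conversion such as `AT TIME ZONE 'UTC'`, which is how one might make the expression index-safe):

```python
from datetime import datetime, timezone, timedelta

ts = datetime(2019, 9, 28, 12, 0, tzinfo=timezone.utc)  # one fixed instant

# Rendering the same instant in a session-dependent zone gives different
# wall-clock values, so an index built on that would be unstable:
berlin = timezone(timedelta(hours=2))   # stand-ins for session time zones
tokyo = timezone(timedelta(hours=9))
assert ts.astimezone(berlin).hour != ts.astimezone(tokyo).hour

# Pinning the conversion to UTC is deterministic: same input, same output.
assert ts.astimezone(timezone.utc).hour == 12
```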
D
So now that we have our single codebase, we need to be very careful of the... well, long story short: on GitLab FOSS we are not executing the tests, the tests specific to GitLab FOSS, which means that we need to catch these errors ourselves, as reviewers, and specifically for background migrations. Background migrations that live in the GitLab background migration directory: we need to ensure these ones are not referencing EE-specific models, and the same for the specs.
D
The specs for these ones should not reference EE-specific factories, because the pipeline does not catch this type of error. They are going to be green, but when these changes are transferred to GitLab FOSS, they are going to fail, because they are referencing a model that does not exist on GitLab FOSS. So we are discussing ways we can automate this, but until that gets done, this is going to be a manual job.
A
All right, I wanted to call out something that recently got a bit of traction: we generally want migrations to be reversible. We already require that for regular migrations, so whenever you write a migration, you want to look at the up method as well as the down method, so we can revert it easily. That isn't always true for heavier migrations, where we have post-deploy migrations that do data migrations, and it's never the case currently for background migrations.
A
So for a background migration, you just kick off the job and get that scheduled, but you don't have a way of stopping it gracefully or reverting it in the end. So whenever we run into issues on the production side with a background migration, we basically manually kill all those jobs, right. We don't have a good interface to say: stop this migration, and now please revert it, because it's causing issues. And we're starting to discuss requiring that reversibility of migrations.
A
All right, and the other topic I wanted to call out: we were discussing the different database roles, or database maintainer responsibilities, and I was hoping we can clarify what the responsibilities of a maintainer are, and maybe also how that relates to the other roles that we have for database. I just wanted to ask if there is still some unclarity around those roles or the responsibilities, and if yes, where we can clarify, because it wasn't totally obvious to me where we can put that information.
D
Like, maybe we can just add a paragraph there explicitly saying that the database maintainer is a person that is knowledgeable about the GitLab codebase, that is not related to an infrastructure role, and that has a good eye for performance improvements; it can be a backend engineer or a database engineer, this new role.
D
Well, I think a backend maintainer and a database maintainer are similar, but, as an example: as a backend maintainer, when I do a backend maintainer review, I don't normally review the migrations very thoroughly, because I trust the database maintainer to do that, and I also don't review the background migrations. So when I am reviewing a merge request as a database reviewer, I do check the background migration, I do check the timing, how many records are going to be updated on GitLab.com, the specs and whatever.
D
Yeah, and well, the database maintainer role is described on the database review page, but it does not mention what a database maintainer is; it just mentions what you are supposed to do. If we explicitly say that the database maintainer is someone that has knowledge of the GitLab codebase, like specific knowledge of the GitLab codebase, I think that would be very helpful.
A
That makes sense; I'll draft something and send it later. Maybe it's worth saying that there are also a lot of changes going on currently for those primary roles. So we've only had the database reliability engineer until recently, and that was... or maybe let's only talk about how that changes. I think it's changing in the direction that this role is focusing on infrastructure concerns: running the site, running the database, owning the database infrastructure. And then, on the other hand, we got that database team going, so that has been approved and we're...
A
...currently writing those role descriptions, and there we talk about a database engineer, where you're basically a backend engineer, but you also bring deep database knowledge, and you're kind of working on, let's say, foundational database code changes, in a sense: not on features, but rather on those aspects and on database performance. And then we have those...
E
You know the GitLab database, there's the Snowplow tracking database, and you know there's Periscope there. So the data team has an issue opened and assigned to me, in which they want to see the historical versions of some tables in GitLab Enterprise Edition or Community Edition, and those tables are subscriptions. Currently, you know, when a user changes a subscription, it always overwrites the previous one; it doesn't keep historical subscriptions.
E
The data team sees this as a blocker, and they want to have historical versions of these tables. Okay, the idea, which seems obvious, is against all the rules of database design that I read in the handbook, which is, you know: I want to keep a JSONB column where I'm going to store, in a polymorphic table, a previous version of the subscription table, just for data purposes. So this is the first approach, the one requested by the data team. The second approach is the one I'm going to use...
E
Enterprise users on their self-hosted instances initially will not see their own historical versions; they will not see their previous subscription levels or members, but our data team is going to be able to make historical analyses. You know, simple cases like: we changed your subscription level, you went from Bronze to Gold. Right now, that history is lost immediately.
A
So I think the event-based implementation is much better. We have different approaches for doing analytics; for example, we do cycle analytics, and what we try to do is basically look at the data we have and figure out different metrics from that. That's already very painful when you only look at the last three months (say we had a migration recently), and it's just that the database itself isn't really built for that.
A
In a sense, it's not an analytical database, it's a transactional one, and it's really painful to run those analytics on top of that and then extract that information. I think a solution like you described, where you basically emit an event that says all this changed, we keep track of that event stream, and later on you can run analytics on that event stream, makes a lot more sense to me than trying to store all this information in our main database, because we are already growing fast.
A
We
need
to
manage
that
in
some
way
and
I
think
if
we,
if
we
had
a,
we
had
this
event
based
system
in
place
where
you
you
can
consume
those
events
later
on
I,
think
that
would
tremendously
help
in
different
areas
as
well,
not
only
for
the
show
subscriptions
but,
like
I
said
for
cycle
analytics
or
anything
like
that.
That
would
be
very
helpful.
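A minimal sketch of that event-stream idea (all names and fields here are hypothetical, chosen just to mirror the subscription example above): instead of overwriting the current row, each change is appended as an event, and both history and the current state are derived from the stream.

```python
# Hypothetical append-only event stream: each subscription change is
# recorded as a new event, so nothing is ever overwritten.
events = [
    {"namespace": 42, "plan": "bronze", "at": "2019-01-01"},
    {"namespace": 42, "plan": "gold",   "at": "2019-06-01"},
    {"namespace": 7,  "plan": "silver", "at": "2019-03-01"},
]

def history(namespace_id: int) -> list[str]:
    """All plans a namespace has had, in order, derived from the stream."""
    return [e["plan"] for e in sorted(events, key=lambda e: e["at"])
            if e["namespace"] == namespace_id]

def current_plan(namespace_id: int) -> str:
    """The transactional 'current state' is just the last event."""
    return history(namespace_id)[-1]

assert history(42) == ["bronze", "gold"]   # the bronze-to-gold change is kept
assert current_plan(42) == "gold"
```

In practice the analytics consumer would read such a stream out of band, keeping the historical workload off the main transactional database.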