From YouTube: Create Deep Dive #6: GitLab ElasticSearch integration
Description
In this Deep Dive session, Mario de la Ossa, Backend Engineer on the Plan team at GitLab, shares his knowledge of GitLab's ElasticSearch integration.
Download the slides: https://docs.google.com/presentation/d/1H-pCzI_LNrgrL5pJAIQgvLX8Ji0-jIKOg1QeJQzChug/edit?usp=sharing
Learn more about Deep Dive knowledge sharing sessions: https://about.gitlab.com/handbook/communication/knowledge-sharing/#deep-dive-sessions
Find out when the next Create Deep Dive is taking place: https://gitlab.com/gitlab-org/create-stage/issues/1
---
Read more about our product vision: http://bit.ly/2IyXDOX
Learn about FOSS & GitLab: http://bit.ly/2KegFjx
Get in touch with Sales: http://bit.ly/2IygR7z
A: So, let's get started. We're going to go into what Elasticsearch is and why we want it. We're going to talk about the differences between database search and Elasticsearch search, and we're going to talk about why we're still not using it on GitLab.com — which we're trying very hard to do; by the way, we do want to use it.
A: So, first of all, what is Elasticsearch? It's a search and analytics engine built on Apache Lucene, which is a, you know, full-text search engine. It's open source, it's RESTful and it's distributed: you can have multiple shards that talk to each other, so you distribute the data and the work of finding the data. It's actually the most popular search engine for both log analytics and full-text search, and with good reason — it is very, very good at what it does.
A: It accepts JSON documents via its API or via ingestion tools such as Logstash, which itself feeds data into Elasticsearch. It automatically stores the original document and adds searchable references to that document in the cluster's index. That then permits us to search for and retrieve the document using the Elasticsearch API, and you can also use Kibana to visualize your data and build interactive dashboards on top of it.
A: It's very high performance thanks to being distributed, which enables it to process large volumes of data in parallel, and it's near real-time: reading and writing data usually takes less than a second to complete. By the way, we do not want to use Elasticsearch for all of our data, for the usual reasons: it's not a relational database, so it doesn't hold any sort of relational data between your documents. It's purely document storage, and what I mean by that is that it's closer to, say, a MongoDB than it is to a relational database.
A: So, we talked about what it is; let's talk about the differences between database search and full-text search engines. The main difference, and the main reason why we want to implement Elasticsearch on GitLab.com, is that it allows for global code and commit search. Right now, if you go to GitLab.com and you try to search for code or commits, you cannot do this globally. You have to drill down into a particular project before you can search for code.
A: So, as you can see in the screenshot on the right, we are searching through every group and every project. The group-based and filtered search — which is when you are actually searching for issues inside of a project or a group — does not currently use Elasticsearch. What you see on the right is the global search, the /search page that you get when you type any words into the top search bar anywhere on GitLab.com.
A: Easier indexing, and an easier way of knowing if the index is stale: for example, we don't have a good way to do zero-downtime deploys right now. At a minimum it requires a full reindex if we change the schema itself. The problem is analogous to database migrations, but we actually don't have any tooling around it.
A: So, we now have a way to enable Elasticsearch just for a few groups or just for a few projects, so we'll be using that to enable it only for the gitlab-org group on GitLab.com itself, to try things out. So let's start talking about how to set it up initially. First things first, you have to install Elasticsearch, right? We have the requirements for each version available in our documentation.
A: We then move on to the initial indexing of content after you've installed it. We currently do it via rake tasks, but soon we will be adding this to the admin console — it'll just be a button that you can press to reindex. Currently we're using the gitlab:elastic:index rake task, which runs all the indexing operations, except for repository indexing. Repository indexing is a little bit special, and I'll talk about it a little bit more in the following slides.
A: This is suitable for all but extremely large instances, which must run each indexing operation separately in order to avoid overloading Sidekiq. This is because, when we start our repository indexing, we actually start enqueuing as many Sidekiq jobs as we have Sidekiq workers available. And finally, we enable indexing and searching with Elasticsearch in the admin console. This is what you see when you're in the admin settings area: you can enable Elasticsearch indexing and searching with Elasticsearch as two separate checkboxes.
A: Most of our search results require all of your projects to be indexed — and I'm not talking about the repository, but rather the project metadata itself — because most of our queries rely on the project data in order to check permissions in Elasticsearch itself before we return any results. You can set the URL of the Elasticsearch cluster, the number of shards, the number of replicas: the normal things. We also have indexing restrictions in place for Elasticsearch, where you can limit the namespaces and the projects that can be indexed.
A: So if you don't want your entire instance to be indexed in Elasticsearch, you can set restrictions, and then only when you are inside of that group or project will the code paths use Elasticsearch for search results. This does mean that you lose global search, of course, because not everything would be indexed — so for global search we use the database when this is enabled.
A: You can have multiple indexes, and each of those indexes would hold one document type — and that is usually the way people use it, right? You would have one document type for issues, one document type for projects, one document type for merge requests, and that would keep everything separate and it would, you know, keep the indexes a little bit smaller. We don't do that, though.
A: Every document has a type field, and that's where we keep whether it's a parent, an issue, whether it's a merge request, etc. Now, that does mean that all of our types share all the fields, so we have a lot of sparse fields, which means we could have a lot of wasted storage. But thankfully, since Elasticsearch 6.0 there have been great storage improvements for sparse fields, so we do not get a big storage penalty — it's actually negligible.
A
We
should
probably
move
to
one
and
expert
type,
but,
like
I
said
we
lose
the
ability
to
filter
by
project
attributes
or,
alternatively,
we
are
forced
to
denormalized
project
data
into
every
data
class
type.
So
we
would
balloon
storage
usage
because
then
we
have
to
copy
the
same
project
information
that
we
require,
for
example,
of
access
levels
for
issues
or
access
levels
from
Earth
requests
or
whether
or
not
you
have
permissions
to
see
a
project.
We
would
have
to
put
that
into
every
single
document
story.
Don't
you
think.
A
So
analyzes
are
where
the
search
magic
actually
happens
and
analyze
the
comparison,
data
for
better
searching
and
each
analyser
increases
and
storage
needs,
because
what
an
analyser
does
is
well.
What
we
the
way
we're
doing
it
is
every
analyzer
has
an
option
to
keep
the
original
data
and
we
do
want
to
eat
the
original
data.
So
we
have
that
on
and
then
each
analyzer
molds
the
data
into
different
into
a
different
format,
basically
and
they're,
proposed
to
organizers
and
filters.
For
example,
an
analyzer
that's
used
very
often
is
just
a
plain
English
analyzer.
A: We also turn terms into n-grams, and an edge n-gram would actually turn "fox" into "f", "fo" and "fox". That's so that we can actually match when somebody searches for partial words, because Elasticsearch will not match a partial term unless you actually have tokens that contain that partial term.
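The edge n-gram idea can be sketched in a few lines of Ruby. This is an illustration of the concept only — the real work is done by Elasticsearch's edge n-gram filter on the cluster, not in application code:

```ruby
# Illustrative sketch of an edge n-gram filter: expand a token into
# all of its prefixes between min_gram and max_gram characters, so a
# search for a partial word like "fo" can still match "fox".
def edge_ngrams(token, min_gram: 1, max_gram: 10)
  longest = [token.length, max_gram].min
  (min_gram..longest).map { |len| token[0, len] }
end

edge_ngrams("fox") # => ["f", "fo", "fox"]
```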
A: For most models we use the standard tokenizer, which is what I was talking about, and we have three filters. You have the standard filter, which actually doesn't do much — it's just there for Elasticsearch in case in the future they have to add something; it's just an easy way for them to add an extra filter if they need it. Actually, in the latest Elasticsearch the standard filter has been removed, so I guess they never needed it, and they decided to do it differently.
A
We
had
a
lower
case
filter
which
normalize
it
starts
to
lower
case
and
also
when
researching
we
normal
aspects,
the
lower
case.
So
this
means
that
it's
a
lot
easier
to
match,
search
terms
as
everything
is
normalized
to
lowercase.
We
don't
have
to.
We
do
not
need
sensitive
and
searching,
and
we
have
a
custom
stemmer
filter
that
we
called
my
stomach,
because
it's
the
only
one
we
have
their
uses,
the
light,
English
stammer
and
that's
the
one
that
knows
how
to
separate
English
words.
For
example.
A
Then
also
to
have
my
Engram
analyzer.
Now
again,
you
can
see
on
the
Engram
analysis.
We
need
it,
so
we
never
need
their
proper
name
and
it
creates
two
or
three
grams
or
projects
for
the
project's
name
with
namespace
itself
and
toward
three
grams
I
mean
so
you're,
just
getting
your
getting
partial
strengths
of
two
or
three
characters.
A
We
also
have
way
more
interresting
analyzers
for
repositories
and
commits
we
do
with
one
of
tokenizing
with
a
ski
holding
and
lowercase
cultures.
Ascii
folding
basically
turns
every
utf-8
character
that
could
be
asked
me
into
ASCII
and
Lotus.
Just
you
know,
makes
everything
more
paste.
We
don't
have
special
filter
for
the
code
analyzer,
it
uses
an
edge
and
burn
filter
that
creates
grams
mean
to
120
characters
wide,
and
that
is
also
so.
A
We
used
to
this
for
code
and
basically
means
that
if
you
have
a
very
long
string
with
a
lot
of
periods,
you
know
like
if
you
call
a
very
long
function
and
then
you
add
parentheses,
and
you
add
some
arguments.
Sometimes
we
want
to
just
find
everything
that
has
the
very
line
functioning,
just
a
function.
So
if
we
did
not
have
this
edge
Engram
filter,
we
would
actually
not
be
able
to
find
that
function
or
if
you
want
the
pirate
partial
function-
and
you
know
like
functions
that
have
specific
port
inside
of
them.
A
We
need
this
in
order
for
elasticsearch
to
be
able
to
find
it.
We
also
have
a
filter
with
a
ton
of
reject
patterns,
so
we
are
basically
separating
camelcase
function
names.
So
then
you
can
find
different
words
inside
of
a
concave
function.
We
also
extract
all
the
digits.
We
also
extract
terms
and
some
quotes.
We
separate
the
terms
on
periods
and
we
separate
patterns.
You
know
term
to
do
slash,
and
this
is
again
because
if
we
don't
do
that,
then
you
cannot
search
for
just
that
term.
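As a rough illustration of that kind of token splitting — the real filter is a set of Elasticsearch pattern-capture regexes, so this single Ruby regex is a simplified stand-in, not the actual patterns from the codebase:

```ruby
# Simplified stand-in for the code analyzer's regex patterns: split a
# camelCase identifier into its component words and digit runs so each
# piece becomes an independently searchable token.
def split_identifier(identifier)
  identifier.scan(/[A-Z]?[a-z]+|[A-Z]+(?![a-z])|\d+/)
end

split_identifier("parseHTTPResponse2") # => ["parse", "HTTP", "Response", "2"]
```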
A: We also have a custom sha_analyzer, which tokenizes using an edge n-gram from 5 to 40 characters, and that directly maps to, you know, how Git uses SHAs to identify specific commits. So if you put a commit SHA into the search bar, we use this analyzer to turn a 40-character SHA into a 39-character one, a 38-character one — every length all the way down to five characters — and that's how we can allow you to search for a specific SHA no matter how many characters you're giving us.
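The SHA analyzer's effect amounts to taking every prefix of the SHA from 5 up to 40 characters. Here is a sketch of that idea in Ruby — again, the real tokenization happens inside Elasticsearch:

```ruby
# Sketch of the sha_analyzer idea: an edge n-gram from 5 to 40
# characters turns a full 40-character commit SHA into one token per
# prefix, so any abbreviation of at least 5 characters can match.
def sha_prefixes(sha, min_gram: 5, max_gram: 40)
  longest = [sha.length, max_gram].min
  (min_gram..longest).map { |len| sha[0, len] }
end

full_sha = "deadbeef" * 5     # stands in for a 40-character SHA
sha_prefixes(full_sha).first  # => "deadb"
sha_prefixes(full_sha).length # => 36
```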
A: So how do we interact with Rails models? Well, we use a customized elasticsearch-rails gem to link up our models with Elasticsearch. We needed to customize it a bit because our way of doing the document type is not normal — you know, we have a single document type for all of our models, and that's not what people usually do. So we had to customize it a little bit, and you can find those customizations under — oops, sorry, I wanted to get out of here.
A
So
we
have
an
application
search
module,
which
is
the
entry
point
that
defines
cold
ice
and
shared
methods
for
everything,
except
for
repositories,
after
that
each
class
defines
their
own
search
module.
So
we
have
project,
search,
issue,
search,
Merguez,
search
and
note
search,
/
temple
and
these
classes
to
find
the
basic
elapsing
search.
Query
structure
and
any
special
indexing
concerns,
so
application
search
defines,
for
example,
the
basic
project
filter.
A
You
we
define
the
basic
project
filter
right
here.
You
can
see
that
we
are
defining
all
of
our
settings,
such
as
the
number
of
charge,
the
filter.
That
was
something
about
the
lady
whose
temer
organizer
for
project
path
and
we
go
down.
These
are
all
the
fields
that
we
have
available
to
us.
This
joint
field
here
is
actually
how
we
do
parent-child
relationships,
so
the
project
can
be
a
parent
and
then
issue.
Merge,
request,
milestone
your
block.
We
can
Bob
and
commit
in
each
child's
multiple
project,
and
we
need
this
sorry.
A
I'll
talk
about
after
commits
in
a
few
seconds,
but
what
I
really
want
to
show
you
here
is
the
basic,
very
catch
here
we
go.
So
this
is
the
absolute
fitness
hash
we
ever
send
to
all
asset
search.
We
have
query
of
boolean
type
that
must
match
certain
fields
with
the
query
and
we
use
the
and
operator
so
the
more
in
terms
you
give
us,
you
know
they
all
have
to
match.
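The shape of that simplest query hash looks roughly like the following. The field names and boost here are placeholders for illustration, not the exact ones used in the codebase:

```ruby
# Rough shape of the simplest query hash sent to Elasticsearch: a bool
# query whose `must` clause matches the user's terms against a set of
# fields, joined with the AND operator so every term has to match.
def base_query(search_term)
  {
    query: {
      bool: {
        must: {
          simple_query_string: {
            fields: %w[title^2 description], # placeholder field list
            query: search_term,
            default_operator: :and           # all terms must match
          }
        }
      }
    }
  }
end
```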
A: And we have a project ID filter, where the parent is any project and the query has to match a project ID query, which is right here. You can see that we check whether the user can read cross-project — which basically means the user is an admin; then we don't send any project IDs, because we want them to be able to read everything. Otherwise, we pick projects by membership, and we pick projects by visibility, because if the user is not a member of a project, the project has to be public.
A: If we are going to limit by membership only, then we need to make sure that the feature is enabled and the user is a member; otherwise, we just need the feature to be enabled for public projects. This is actually used heavily in issues, merge requests, etc., to make sure that we are only returning issue results, for example, for projects that actually have issues enabled. We also have ElasticsearchGitRepository — that's the one that defines how blobs, wiki blobs and commits interact with Elasticsearch.
A: We need a separate module because they're completely different: repositories are not in the database, they're actually on disk, so we need to talk to Gitaly in order to get the actual blobs and the actual commits. We only index the default branch, otherwise the cost would skyrocket — the only branch we ever index is master, or whatever other default you have set. And we currently have two indexers.
A: We have a Ruby script that's actually very slow and is due to be removed soon, and we have gitlab-elasticsearch-indexer, which is written in Go and knows how to talk to Gitaly directly, and it's a lot faster — we measured improved speed by almost 10 times for certain scenarios — and it lowers the resource usage, and that's just because Go is better at memory handling. Though it is still memory-hungry: we're still loading all the blobs into memory.
A: It's a compiled binary in this case, and it also allows us to hide from the Sidekiq memory killer. Before, we had problems with our Rails script because it gets run from Sidekiq — so since it runs from Sidekiq, it gets killed from Sidekiq. Since it's a Ruby script, it shows up as part of how much memory that worker is using, and Sidekiq actually has a limit on how much memory we can use, and we would start getting killed by the memory killer.
A: So, by pushing this off to another process entirely, we can hide from that, and so we can use as much memory as we actually need to. Now, it indexes blobs — which includes wiki blobs — and commits. The wiki blobs part is actually a new thing, because we found out last milestone that we basically had a broken indexer for anybody that was no longer using NFS.
A: That is, anybody that migrated over to Gitaly — because wiki blobs are blobs there, and they are on disk, not in the database, so we actually have to treat them exactly the same as all the other blobs. By the way, whenever I say "blobs" I do mean files — so all of the code files in a repository, or any other file you have in your repository. And it's a good idea to note right now that we will not index binary blobs.
A: We only index things that can be detected as text, so any images, of course, get ignored, as do any binaries, as in executable files. We also do not index very large files — I do not quite remember what the limit is right now in the codebase; I think it's one megabyte, but that could be wrong. That is actually a very, very large text file, so it's a very reasonable limit. So the indexer talks to Gitaly and gets a diff between the last commit it found in the index status and the current SHA. The IndexStatus is a Rails entity, and it really is just there to know what the last commit was that was indexed for that particular project, because we don't want to reindex everything.
A
What's
the
last
commit
that
we've
already
indexed
and
only
index,
all
the
new
files
that
have
come
in
after
that
and
by
new
files,
I
also
mean
updated
files,
so
we
can
catch
all
the
updates
that
have
happened
between
the
last
commit
and
the
current
head
or
the
current
Shalit
can
get
wish
to
us
and
any
deletions
as
well
happen
at
this
point,
and
the
only
problem
here
is
that
we
are
assuming
that
only
humans
are
ever
added.
That
commits
are
only
ever
added
that
they're
not
removed.
A: But coming back to ApplicationSearch: that one defines callbacks, right, for incremental indexing when a model gets updated. So we have on-create, on-update and on-destroy callbacks triggering an Elasticsearch update via the ElasticIndexerWorker — whoops, sorry, I keep forgetting how to use the slides. So, as we had seen a few seconds before, we have hooks on create, on update and on destroy, and this actually gets included in all of our indexed models.
A
In
this
case,
a
project
need
to
live
in
the
same
start
as
the
parent,
which
does
mean
that
we
can
get
a
very
unbalanced
storage
usage
where,
if
you
have
a
project
with
10,000
issues,
then
that
shard
is
going
to
be
full
of
issues
that
were
cleaning
full
of
files
and
osteogenic.
Some
worker
really
is
just
in
queueing
index
reference
service
or
deleting
the
file
directly,
and
this
really
again
just.
A: This just sends a direct operation to Elasticsearch, whether it is index or update, and it can detect if this is the first time we've ever indexed a project. If this is the first time we've ever indexed a project, there are some extra things to do. Usually this would just, for example, get an issue and update the issue description, the issue title, etc., or it gets a project that has already been indexed before, so it gets an update operation.
A: Then again, it just makes sure that all of your project features are updated and that the project's metadata is up to date, and that update is completed. But if this is the first time we've ever seen this project — and the way we detect that is by whether you're sending an index or an update operation — then we have to do an initial index, where we have to grab every single indexed association, which is the issues, the merge requests, etc., and we've got to import them into Elasticsearch.
A: If we don't import them into Elasticsearch, then they're never going to be imported at all. Basically, it is the initial import: when, for example, you import a project into GitLab itself, we would grab all of the new issues and all the merge requests, and otherwise you'd have an empty project — we'd never find anything to index. And the repositories actually get updated via the Git post-receive worker hooks.
A: We check that we are on the default branch and that the project is enabled for Elasticsearch — remember, we can filter which project or which group gets indexed — and we actually have a Redis lock here, because we were having some trouble if somebody created a new project as the indexing was going on. So, if we should index a commit, we just enqueue the ElasticCommitIndexerWorker, and the ElasticCommitIndexerWorker is the simplest of our workers: it just calls the Gitlab::Elastic::Indexer, and this one...
A: The only thing it has to know is whether we're going to use the new experimental indexer or not. The way it chooses whether to use the experimental indexer or not is by checking the application settings: if we have the setting to use the experimental indexer enabled, and if the binary exists in your path — the binary must exist in your path for you to be able to enable that feature — then it'll use the new indexer. We should really rename this, because it's not experimental anymore.
A
It's
basically
a
data
will
to
be
a
release,
but
you
know
we
to
be
change
the
documents,
but
nobody
looks
at
this.
Hopefully,
but
anyhow,
we
set
up
some
environment
variables.
You
know
we
said
from
what
chart
to
watch
we
want
to
index
and
the
only
thing
we
do
here
is
update
the
index
status
and
run
the
industry
so
running
a
mixer.
It
really
is
just
pulling
the
indexer.
If
you
don't
have
the
experiment
when
these
are
enabled
it'll
go
to
the
Ruby
script.
A
That's
really
it.
We
just
have
some
extra
thanks
to
get
the
index
status
and
what
attributes
we
have
to
update,
because
now
that
the
wiki
also
goes
through
the
go
indexer
and
we
want
to
do
incremental,
we
need
to
actually
send
weekly,
commit
and
weekly,
and
it's
that
information
as
well
so
yeah.
The
last
commit
that
was
indexed
escaped
on
the
database
in
the
ending
status.
A: So, how does search work? An Elasticsearch query is a JSON structure, and it can contain multiple filters. We can see an example of it on the right side: we can say, hey, this is a query that must contain the term "kimchi" under the user attribute, and let's filter out anything that has the value "tech" under the tag attribute — it must not contain it.
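That example can be written out as the hash it corresponds to. This is a reconstruction from the description of the slide, not a copy of it:

```ruby
# Reconstruction of the slide's example: a bool query that must match
# the term "kimchi" in the user attribute and must not match documents
# whose tag attribute contains "tech".
example_query = {
  query: {
    bool: {
      must:     [{ term: { user: "kimchi" } }],
      must_not: [{ term: { tag: "tech" } }]
    }
  }
}
```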
A: A "must" has to match; a "should" is basically the same as a "must" match, but it gives it a lower score, so those results would show up later in the results. We implement permissions as boolean filters, and we can filter for projects that a user has access to, or projects with certain features enabled — so if we want to filter, for example, on the issue tracker being enabled, we can do that. And highlighting is given to us by Elasticsearch; if you click there, you'll be able to see the Elasticsearch documentation on how highlighting works.
A
You
just
send
in
highlight
field
and
a
query
with
the
field
to
pilate,
and
all
that's
in
search
will
tell
you
where
exactly
in
that
field,
it
matched
the
the
query
you
gave
it.
So
the
response
comes
back
with
a
highly
element
for
each
search
hit
and
it
has
the
actual
fragment
of
the
document
where
it
matched.
So
if
you
pray
performs-
and
you
get
a
quick
brown
boss,
something
it'll
give
your
fragments
it's
with
Fox
and
it'll
like
it'll,
have
to
match
right
there
and
where
you
should
be
highlighting.
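A request with highlighting enabled looks roughly like this — the `content` field name is an assumption for illustration, and the full set of options is in the Elasticsearch highlighting docs:

```ruby
# Sketch of a search request that asks Elasticsearch to highlight
# matches: the response's hits then carry a `highlight` element with
# fragments of the document, each matched word wrapped in the tags.
highlight_request = {
  query: { match: { content: "fox" } },
  highlight: {
    fields:    { content: {} },   # which field(s) to highlight
    pre_tags:  ["<mark>"],        # wrap matches in <mark>...</mark>
    post_tags: ["</mark>"]
  }
}
```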
A: We expose Elasticsearch's simple query string. We do this because it's a little bit simpler for us, but it also allows the user to use boolean operators, so you can do a little bit more interesting things than our normal database search — we can do exact search matches — and it is complex, but it is very powerful.
A
It
allows
us
to
filter
by
path,
filename
or
extension,
and
the
way
they
are
implemented
is
extremely
simple.
Again
we
have
to
the
repository
I
believe
yep.
So
we
need
to
have
this
search.
Query
class.
Then
you
add
filters
so
file
name
is
the
following:
filter
then
path
allowing
people
to
my
path
and
extension
options
from
the
by
extension,
and
this
is
just
how
to
partisans
or
whether
it
be
input
matches
or
not.
And
if
you
go
into
the
query
class
and
see
how
exactly
these
filters
happen.
A
Sorry
about
that
I
completely
forgot
to
create
a
thought,
so
I
asked
when
restricting
by
group
of
project
and
something
global
search
is
disabled.
What
does
that
mean?
For
the
good
luck
word
uncommon.
This
is
enable.
Can
you
devil
in
the
staging,
so
I
see?
Nick
has
already
been
answering
a
few
of
these,
so
thank
you
so
much
Nick.
You
want
to
take
this.
One
I
think.
B: That's it — other people are going to be curious as well, so it's good to go through them. In this case, it's just asking, I guess, about the limited projects that we saw right at the start. In the admin settings you can say: only index the gitlab-org group, which is what we're going to do on GitLab.com, because we can afford to just index the gitlab-org group — it's about three and a half gigabytes for just one group, so it's not too expensive.
B: If you have just a group or just a project indexed and set up in the Elasticsearch settings, then, if you're on the search page for that group or for that project, you get all the extra Elasticsearch magic. If you're in any other group or any other project, or at the global level, then you just get the regular, ordinary database search with none of the extra features — but that search does still work, which is the important thing. It doesn't disable global search for everybody, which I think is what I was most worried about.
D: The only difference in the UI is, in the search box, once the results are displayed, there's a note underneath that just says you're using advanced global search, I think — or advanced search, whatever we call it. So the indication is there to the user that they're using Elasticsearch, yeah.
A: Basically, right now, the only way to know is whether or not you have the words "Advanced search functionality is enabled". I have an example right over here — you see this "Advanced search functionality is enabled" right below the search query. That's how you know whether Elasticsearch is enabled or not. And that's fair, we should probably give it a badge or something; we're actually thinking about a few different ways to surface whether Elasticsearch is working or not.
A: Yeah — like, the Code and Commits tabs will not show up if Elasticsearch is disabled for global search. So if you have indexing enabled only for a few groups or a few projects, then sure, for the group you'll see Code and Commits and that'll be different, but for any other project it'll look exactly the same.
A
So
sorry,
let's
go
ahead
and
switch
over
to
Cody.
Then
he
says
we
kick
up
as
many
sidekick
jobs
as
we
can
so.
Elasticsearch
innocence
has
taken
down
your
love
before
our
self-managed.
So
is
there
a
way
to
be
nicer,
so
we
were
actually
thank
youing.
Even
more
jobs
nowadays
leave
that
we
used
to
because
yeah
we
just
do
just
do
a
matching
a
couple
of
thousand
projects
and
then
keep
going
from
there,
but
I'm
pretty
sure
that
the
we
fix.
This
is
not
really
us
being
nicer,
but
rather
the
siding
configuration
itself.
A: And we are actually not fixing issues in the old one. So right now wikis, for example, are broken on the old one — the wiki indexing — if you don't have NFS anymore. A lot of our users are moving on to Gitaly, and if you use Gitaly and you're not using the Go indexer, you're going to have a bad time. Okay.
B: I think you showed off some of those permission filters earlier. There is a huge amount of duplication there, and, to a great extent, it's necessary. Just for a start: to support pagination of all things, if Elasticsearch doesn't know which issues, etc. the user can and cannot see, then it can't give you the correct results and you can't paginate them properly. So, yeah, we duplicate all that logic again in Elasticsearch — there's a lot of it.
B: It's very sensitive, and there have been bugs in it before. I don't know of any bugs that expose data right now; if you find one, then it's probably going to have to be a P1, since we're going to start turning this on on GitLab.com at some point — so feel free to look for them. I would love to find them if they do exist.
A: I do have the relevant methods here on screen right now. We have to re-implement all this because we have to send it all as a JSON document in order to be able to query against it — and we query against it instead of, you know, filtering in memory, because it would be extremely prohibitive to do this in memory. I mean, we're talking about — sometimes we return, you know, thousands of results, and if we're going to paginate this, we'd have to re-run the query just to get, like, the first 20, and then we could paginate again to the second 20. And then how do you keep all of those pagination things in sync? Because every single time you go to the next page, you lose all of your state — you don't have state in a RESTful environment.
B: And one thing probably worth calling out explicitly is: if you're not an administrator, but you are a member of, say, 20,000 projects, every Elasticsearch query you do will send those 20,000 project IDs to the Elasticsearch cluster as part of the query. It's horrible, but it does work. I think we only send the project IDs for private and internal projects — that's right, yeah. We don't do a huge amount ever; otherwise we would bring the cluster down, or close to it.
A: So right now we don't actually have any alerts when indexing fails, but, I mean, we do need to figure out how to make this easier for infrastructure, to be able to tell when something's going wrong. Every single time you edit your project, we do trigger indexing updates through an after-commit callback, but, yeah — if that...
B
Leak
and
we
rely
on
sidekick,
not
losing
jobs
and
geo
has
the
same
problem
at
times
him
there
are
times
when
psychic
does
leave
jobs,
we
retry
the
jobs
a
couple
of
times.
If
that
fails,
they
go
into
the
dead
job
queue
and
from
their
last
10,000
of
those,
and
if
the
index
gets
badly
out
of
sync,
your
only
real
option
is
to
V
index
from
scratch,
which
is
part
of
why
making
me
indexing.
Easy
is
so
important
right.
A: So that's a very, very hard problem, because every single time we add another filter, we are adding more storage as well. You're talking about, you know, either adding something that will detect what language each document is in before sending it to Elasticsearch — which I do not know if it's possible, because we then have Elasticsearch analyzers that run on the Elasticsearch cluster itself, with no way of really telling Elasticsearch: hey, don't use that analyzer, use this one instead for this particular document.
H: Are there any case studies, like — because I would imagine that Elasticsearch is being used to index, you know, code, and search for code, other places? Maybe that's not the case, but I wonder if there is — this is just me kind of spitballing — some resource out there that we may be able to tap into.
D: I'll just say, from a product perspective — and I think I got tagged on that issue the other day — I think the concern is super valid, and I think it's something that we need to look into, along with the impacts on, like, index size and what all we'd have to do; and adding more filters spawned a bunch of other discussion. It's one of those things that I think we've just got to keep testing and iterating on to get broader coverage.
D: I think it makes sense to strive for broader coverage on different languages, especially in some of those use cases where the search just flat-out can't return it, because we've already decided it's some other language than it is. So I think it's something that we're gonna have to keep top of mind and keep working towards. All right.
A: Whenever we change our schema, that requires a full reindex, yeah, and, you know, that's because Elasticsearch just doesn't really have a good way to transform data, and we would have to go into the database anyway to know how to transform it. So far we usually have not — I mean, we don't force a reindex, normally. It's just...
A
It meant that nothing would work if you tried to use the old index that you had. And we do try to batch up whenever we make a schema change. So we had a few small ones — there was a typo in one attribute, and we were not actually indexing the internal ID for one of our objects — but they were very mild, only used for display things. So you know, we didn't really say: hey, you need to reindex.
A
C
Could you perhaps share your screen? You know, I've found it hard sometimes to visualize our data without using Kibana, and I don't know how many people actually use it, but I will share what I see. I have a simple Kibana instance that has indexed some data. So I think, to your point earlier about how everything is in one type — I think this is what it's showing, right? Oh.
B
If they don't have it, they'll have to install it if they're installing from source. As DJ said, we're already installing it by default in Omnibus, and that's the path that we really want to make easy for people. If they're running on Debian stretch and they need Go 1.11, they'd just have to install it as well, right?
J
I think it's more of a documentation issue, because if you go to the official Elasticsearch integration docs, which I linked, there is no mention of it being installed in Omnibus, and it does directly instruct the user to git clone, make, and make install — which, if they do not have that dependency installed, will fail. And it's not just Debian stretch: it's CentOS 6, CentOS 7, Ubuntu 16.04, and I believe a default Ubuntu 18.04 would also lack that.
A
Matt, we only index the master branch. You know, we could do some sort of interesting stuff with only indexing things that have changed between master and another branch, but then there would be a lot of trouble keeping it updated. Plus —
A
The specified default branch, actually. So if you go to your project, you see the default branch setting; it defaults to the name master, but you can change that, and we would honor it.
A
Matt wants to know how long we will retain data in the Elasticsearch index. You know, that's for as long as the document itself exists: as long as it still exists in the repository, it will be there; as long as the issues themselves exist in the database, they will be there. Even if you close them, they still exist — they just have the tag closed on them.
A
To be honest, I'm not sure. Nick has some thoughts about that, but it'll be more our infrastructure team that will know more as well. Nick, do you want to jump in? He might have left already — it was getting late for him. I'm sorry, I'm going to have to ask you to follow up with either Nick or our infrastructure team, who have been dealing with the new staging stuff.
A
So Lois wants to go back to Cody's first question about self-managed clients. Well, yeah — if they had a large instance, they'd need a Sidekiq cluster when implementing Elasticsearch; an instance as small as this one doesn't really need it, it'll be super quick. We do need at least, you know, a few Sidekiq workers to be able to keep everything updated, but that's a requirement for the whole GitLab application anyway, right? And I'm not sure who it is that's writing right now — oh, Cody. Thanks, Cody, yeah.
A
And feel free to jump in if you feel like I'm not answering the question completely. So, Matt —
F
That's the one I'm pulling up for just a second — sorry to stop there. No, I think we can just take this async and, like, research it some more. I'm just calling this out for the other people who are curious about it: I think we need to dive in a little bit further and just figure out how to determine whether the Sidekiq queue is causing pain for other sorts of basic actions. So I just wanted to mention that real quick.
A
To go back, then: Matt wants to know, considering the upgrade path for Elasticsearch, would we expect that we'll need to reindex at some point in the future? Yes — and that's why we are trying so hard to decouple the Elasticsearch setup from the codebase as well. We want zero-downtime indexing. We might have to do one —
A
You know, a maintenance downtime in order to update to that in the future. Hopefully we don't — hopefully it's a matter of just, you know, disabling indexing and then doing our thing, with the rolling update being enough. But we currently do not have zero-downtime updating; like, we require a reload of the classes in order to pick up the new changes.
A
And: "You showed some clever use of ngrams to handle both partial search terms and reduce the need for language-specific stemmers." So that question — you know, it's the same as above: are we going to add more stemmers for other natural languages? And the answer is most likely not, sadly, because of the problem of storage. You know, every single thing you add requires more storage; we're copying the data. And tokenizing specific programming languages would be interesting, but we're trying to keep everything general, and that's why we have all of those.
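The ngram idea mentioned here can be sketched in a few lines: index every character trigram of a word, and a partial query term matches whenever its trigrams are a subset — no language-specific stemmer needed. This is just the concept in miniature; Elasticsearch's ngram tokenizer does the real work server-side:

```python
# Minimal sketch of how character n-grams allow partial-term
# matching without stemmers: index every trigram of a word, then
# a query matches if all of its trigrams appear.

def ngrams(text, n=3):
    text = text.lower()
    return {text[i:i + n] for i in range(len(text) - n + 1)}

indexed = ngrams("searching")
query = ngrams("search")
# Every trigram of "search" appears in "searching", so the
# partial term matches without stemming "searching" -> "search".
print(query <= indexed)  # True
```

The trade-off the speaker describes is visible here too: storing all trigrams of every token multiplies the index size compared with storing whole words.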
A
Let me see if I can find them real quick. There — this is what we have, a little list of patterns. So we're trying to, you know, separate CamelCase terms, separate things in parentheses, separate terms like quotes, separate extension periods and path terms. That tends to work well for most of the cases. There are some edge cases, of course — I think if you go for Objective-C, for example, you've got those square brackets to deal with — so that would be something we are not really capturing well with these different regexes, and I —
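The kind of splitting those patterns aim for can be approximated in a short sketch — break CamelCase, paths, and punctuation into searchable terms. The regexes below are illustrative only, not the actual patterns GitLab ships to Elasticsearch:

```python
import re

# Rough sketch of code-search tokenization: split CamelCase and
# treat path separators, dots, parens, and quotes as term breaks.
# Illustrative regexes, not GitLab's real analyzer patterns.

def code_terms(source):
    # Split CamelCase: "FooBar" -> "Foo Bar"
    source = re.sub(r'(?<=[a-z0-9])(?=[A-Z])', ' ', source)
    # Path separators, dots, parens, quotes, whitespace all split.
    return [t for t in re.split(r'[/\.\(\)"\'\s]+', source) if t]

print(code_terms('lib/gitlab/ElasticIndexer.find_all("foo")'))
# ['lib', 'gitlab', 'Elastic', 'Indexer', 'find_all', 'foo']
```

The Objective-C caveat from the talk shows up directly: square brackets are not in the split set here, so `[obj message]`-style calls would not tokenize cleanly without extending the pattern.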