Description
An introduction to the GitLab (file-based) Import/Export tool by George Koltsov.
Hello, and welcome to this file-based Import/Export overview. This is a GitLab feature that allows you to migrate your projects and groups from one GitLab instance to another. It can be found under this URL, where there's more information, like how to use the API, what it does, and so on.
So I guess the first thing we're going to do is show a quick demo of how it works. I have a project page here. If you go to the general settings, under Advanced, there's an option to export the project. That kicks off a Sidekiq background job which creates a tarball and makes it available under a certain URL, so you can generate a new export or download it.
I've done it more than once already. If you import this project, it should recreate the original state of the project that you exported.
They look alike: for instance, there are eight issues here and eight issues here. So that's the quick demo, and the same applies to groups. For example, I have a group right here; you can go to Settings > General and do exactly the same procedure. You get the idea, so I'm not going to show it in full.
So, where is it used? I mentioned the first two already: project import/export and group import/export. File-based import/export is also used in custom templates, that is, instance-level and group-level project templates, which allow you to set up templates for your projects.
Underneath that feature, import/export is what's actually used. If you have a project and you set it as a template, then when you create a new project based on that template, GitLab exports the template project under the hood and imports it into the new destination, so you have a copy of the template project to work with.
So, in a nutshell and at a very high level, two things are happening. On the export side, we serialize all the data to JSON and stream that content into an NDJSON file. NDJSON is essentially just JSON that is newline-delimited, and what it allows us to do is export and import projects with memory optimization in mind.
The point is that our memory usage doesn't grow. Before, we had a single project.json file that was one massive object. Imagine it contains an array of issues, an array of merge requests, an array of milestones and so on, and let's say you have a project with ten thousand labels, ten thousand issues and ten thousand merge requests.
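The NDJSON idea can be sketched in a few lines of plain Ruby. This is illustrative only, not GitLab's actual writer; the point is simply that each record is its own line, so the consumer never has to hold the whole collection in memory:

```ruby
require "json"
require "stringio"

issues = [
  { "title" => "First issue",  "state" => "opened" },
  { "title" => "Second issue", "state" => "closed" }
]

# Export side: each record becomes its own line in the NDJSON stream.
io = StringIO.new
issues.each { |issue| io.puts(issue.to_json) }

# Import side: consume one line (one complete JSON object) at a time,
# so memory stays bounded regardless of how many records there are.
restored = io.string.each_line.map { |line| JSON.parse(line) }
```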
A
All
of
that
is
one
massive
object
that
we
need
to
load
in
memory,
which
is
not
very
memory
efficient,
so
instead,
so
we
serialize
it
to
json
right
so,
and
you
can
imagine
it
being
something
like
this
project
that
issues
to
json
as
simple
as
that,
and
we
pass
a
list
of
options
which
sub
relations
to
include-
and
this
is
an
example.
This
is
not
an
accurate
representation
of
what's
included,
the
actual
list
of
included
relations
is
pretty
big
and
I'll.
Show
you
where
to
find
it
later
and.
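As a rough sketch of what such an options-driven serialization does, here is a tiny allow-list serializer over plain hashes. The field names are made up for illustration; in GitLab, Rails' to_json handles this with :only and :include options:

```ruby
require "json"

# Hypothetical issue hash with one nested sub-relation, roughly the
# shape the exporter works with. All field names here are illustrative.
issue = {
  "title"     => "Bug report",
  "author_id" => 42,
  "notes"     => [{ "note" => "first comment" }]
}

# Minimal stand-in for a to_json call driven by allow-list options:
# keep only the named attributes and sub-relations, drop the rest.
def serialize(record, only:, include_relations:)
  record.slice(*(only + include_relations))
end

json = serialize(issue, only: %w[title], include_relations: %w[notes]).to_json
```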
That's the high level on the export side. Here's the exported issues NDJSON file and, as you can see, every line is a valid JSON object, and every line represents an issue.
On the import side, again at a high level (I'll go into a bit more detail later), we read those files and consume them line by line, and then we do some processing on each line that we read. For instance, we convert the JSON object into an ActiveRecord object, including the sub-relations, recursively. So, as you can imagine, you have an issue, and I've included a screenshot here to better illustrate what I'm trying to say.
So we have the relation object, which is an issue. As you can see, it doesn't have an ID yet and it's not persisted in the database; I put a breakpoint in the code just to illustrate it better.
From the example above, you have an issue, and that issue includes notes. It's kind of hard to read, but you can find where it says notes: one issue can have notes, and one note can have award emoji associated with it. So every object, when it's serialized, can be quite big, and it has nested relations in it.
So on import, what happens is that we convert that string recursively: we identify what's included and what kind of relation it is, and we convert everything into objects. Let's say you have an issue with ten notes in it: we convert the issue into an issue object, and every note is converted into a note object as well.
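A tiny stand-in for that conversion, using plain Ruby Structs instead of ActiveRecord models (in GitLab these would be unsaved Issue and Note instances):

```ruby
require "json"

# Plain-Ruby stand-ins for the models; illustrative only.
Note  = Struct.new(:note, keyword_init: true)
Issue = Struct.new(:title, :notes, keyword_init: true)

# Sketch of the import-side conversion: parse one NDJSON line, then
# build the issue object together with its nested note objects.
def build_issue(line)
  attrs = JSON.parse(line)
  notes = attrs.fetch("notes", []).map { |n| Note.new(note: n["note"]) }
  Issue.new(title: attrs["title"], notes: notes)
end

issue = build_issue('{"title":"Bug","notes":[{"note":"first"},{"note":"second"}]}')
```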
And, as you can see here, every note has an author, and in this case that author is root, the importing user; I'll touch on user mapping a little bit later as well. Once we convert all of the sub-relations into objects, the record gets persisted, and we move on to the next object.
So all of these exporters write to disk; then all of that gets packaged into the tarball, compressed, and uploaded for you to download. So where do these things come from? How do we know what to export and what to import? Well, there is a main file for both project import/export and group import/export: the import_export.yml file. This file defines what we are going to include and what kind of relations we are interested in.
It also defines what we are going to exclude and what kind of attributes we want to see in the exported file. To show a little more detail: we have a project import_export.yml file and, as you can see, it's quite big. For groups it's a little bit smaller, but for projects it's quite big. We also have a section for EE.
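To give a feel for the format, here is a tiny illustrative fragment in the spirit of import_export.yml. This is not the real file; the actual one is far larger and its exact keys and entries may differ:

```yaml
# Illustrative sketch only, not GitLab's actual import_export.yml.
tree:
  project:
    - :labels
    - :milestones
    - issues:
      - :notes
excluded_attributes:
  project:
    - :runners_token
```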
One thing on debugging: there are two ways to observe what happened if something goes wrong. Let's say something is missing: you import a project and one issue is missing. There are two things that can help you identify what went wrong. The first is the importer.log file in your local development environment.
This file is located under log/importer.log. On production we have Kibana; I think there's a tag you need to filter on, something like importer, I don't remember exactly, but it is possible to view importer-only records. I think it should be in the Sidekiq index, although I remember there was a catch:
some of the logs are not where you expect them to be. But anyway, in the development environment it should be under importer.log. There's also an import_failures database table that tracks all of the failures that happened. Say an issue failed to be created: there should be a record in this table as well, describing what failed.
For example, if I show the relation tree restorer, just to show you where this happens: here's the method called process_relation_item, and here's where we call save on the relation object that I showed you previously, the issue that doesn't have an ID yet. We simply rescue the exception and create an import failure record in the database for it.
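The rescue-and-record pattern looks roughly like this. The names are illustrative, not GitLab's actual classes; the real code persists the failure to the import_failures table rather than an array:

```ruby
# Sketch of the save-and-record-failure pattern: if persisting a
# relation raises, rescue the error and record an import failure
# instead of aborting the whole import.
def process_relation_item(relation_key, relation_object, failures)
  relation_object.save!
  true
rescue StandardError => e
  failures << { relation: relation_key, error: e.message }
  false
end

# A stand-in object whose save! always fails, like an invalid issue.
broken = Object.new
def broken.save!
  raise "Validation failed: Title can't be blank"
end

failures = []
process_relation_item("issues", broken, failures)
```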
Okay, next: some of the key components when it comes to import/export. The export side is quite simple: we have an export service.
It has a number of exporters, which I actually showed previously. Each exporter gets called and has its own responsibility, and it writes its file. Once all of them have succeeded, we package that up; that happens in the saver. We compress and save, and we upload the result using the CarrierWave gem.
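Here is the compression step in isolation, as a round-trip sketch in memory. The real saver tars the whole export directory first and then uploads the archive via CarrierWave; both of those parts are omitted here:

```ruby
require "zlib"
require "stringio"

# Gzip a serialized payload in memory, then decompress it again to
# show the round-trip. Payload content is illustrative.
payload = '{"title":"Exported issue"}'

gzipped_io = StringIO.new
Zlib::GzipWriter.wrap(gzipped_io) { |gz| gz.write(payload) }
compressed = gzipped_io.string

# Decompressing returns the original payload unchanged.
restored = Zlib::GzipReader.new(StringIO.new(compressed)).read
```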
That's all handled for us, and then we remove the temp directory. One thing to note is that, before we upload to object storage, all of these exporters need to write their data somewhere, and we write that to a temp directory on disk. But it's not the system tmp directory; it's a temp directory that actually lives under the shared path.
Especially on GitLab.com, that can suggest this is an NFS mount, but import/export does not have an NFS dependency; it just happens to write under this path. I'm not too sure why, but this is more of a legacy, historical reason. Import/export itself is a single process: it does not require the files it writes to be on NFS, and it doesn't share any files with any other processes.
So what the streaming serializer does is serialize the relations in batches and use the JSON writer to write them to the destination.
The to_json call is basically what it's doing: it parses the whole YAML file, constructs the list of options for the to_json call, and then performs these operations in batches as it retrieves records.
To accomplish that, as you can see here, exportable.public_send(key) is essentially project.issues. Then, depending on the type of the relation (whether it's a has-many relation with an array of data, or a single relation, and we have a few of those), it's handled differently. There has to be a distinction between a collection of data, like project labels, versus something like the service desk setting or CI/CD settings.
Something like the Auto DevOps setting is not a collection; sometimes it's just a single object of data, and there's a distinction between those. That's all handled here, depending on the type of data. Here, for example, we have a method that serializes the many-type relations.
We do that in batches, and we preload whatever needs to be preloaded. When it comes to individual records, right here is where it happens: record.to_json, and we pass in the options. The JSON writer then does the writing, and the end result is the serialized entity.
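The batching idea can be sketched like this. It's a stand-in, not GitLab's streaming serializer: the real code batches against the database (find_each-style) and preloads sub-relations per batch, but the memory argument is the same:

```ruby
require "json"

# Serialize records in fixed-size batches, handing each serialized
# record to a writer, so memory is bounded by the batch size rather
# than the size of the whole collection.
BATCH_SIZE = 2

def serialize_in_batches(records, writer)
  records.each_slice(BATCH_SIZE) do |batch|
    # Preloading of sub-relations would happen here, once per batch.
    batch.each { |record| writer << record.to_json }
  end
end

lines = []
serialize_in_batches([{ "iid" => 1 }, { "iid" => 2 }, { "iid" => 3 }], lines)
```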
For import, it gets a little bit more difficult, because there are a lot of things that need to happen. On export you just serialize everything to JSON and that's it; for import, there are a few more things. I guess the very first thing that happens is this importer.
This is similar to the export service with its list of exporters: the importer has a list of what we call restorers. You import the file and call each of those restorers. The very first step is importing the repository, then the wiki, then the project tree. The project tree is going to be my main focus here.
So here's the tree restorer; this is the one that gets called, and the tree restorer is essentially the entry point. Well, that's not entirely true, because there's a code path leading to this point, but this is where the tree restorer kicks in and restores data from the NDJSON files.
And the relation reader is the NDJSON reader: we open the file for reading, we consume relations from it one by one, and then we pass each one to the relation tree restorer.
I guess this is where everything starts to get even more difficult than before because, as you can see, even the method signature is huge: we have a members mapper, an object builder, a relation factory, a reader. What are all those things? I'll try to mention what they are, but obviously I won't be able to cover all of them in detail.
The main thing is the relation tree restorer, and this is where all of the relations are, not just consumed but, I guess, processed and traversed, for every relation hash.
Well, this is where we call the class that is doing the work; the actual replacement happens in the relation factory, but this is the class that holds all of these objects and either passes them down the line or uses them for something else.
So in this case, the relation tree restorer is actually the place where the save happens on the transformed object, like the one I showed you previously: relation_object.save is where we persist it in the database. But all of the conversions happen in this class too, in the sense that we traverse the data depending on the type of data we identify.
If there is an array inside our object, then we need to recursively process every item in that array, and if an item has another array in it, we have to go and process those too. So this is where we do the traversal: building relations and transforming sub-relations.
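The traversal just described can be sketched as a short recursive walk. This is illustrative, not the restorer's actual code; it only shows the depth-first visit over nested arrays of sub-relations:

```ruby
# Walk a parsed relation hash depth first, processing every nested
# array of sub-relations, the way an issue's notes and each note's
# award emoji would be handled one by one.
def process_relation(name, hash, visited)
  visited << name
  hash.each do |key, value|
    next unless value.is_a?(Array)
    value.each { |sub_hash| process_relation(key, sub_hash, visited) }
  end
end

visited = []
issue = {
  "title" => "Bug",
  "notes" => [{ "note" => "hi", "award_emoji" => [{ "name" => "thumbsup" }] }]
}
process_relation("issue", issue, visited)
```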
And before I move on, I just wanted to note, as I mentioned before, that if the save fails, we create an import failure record in the database, which can be helpful for identifying what went wrong. Regarding the relation factory, I guess this is where a string like this gets converted into a label like that.
Yeah, it has a lot of stuff in here, because it's a relation factory: it tries to accommodate every possible scenario. For instance, we have a huge list of overrides, which is essentially a map.
For example, if you have project.ci_pipelines, we map ci_pipelines to this class: we map what a CI pipeline actually is. Some of the relations don't require this, but some do, and that's the reason for it: ActiveRecord doesn't know what a commit author is, for example.
So we help it identify that a commit author is actually this model, and there are a number of other things that we help the relation factory with, for example the existing-object relations.
But yeah, the relation factory is where all of the modifications happen to the hash that you have. Depending on the relation type, we perform certain steps: if it's a note, we set up the note appropriately; if it's a pipeline, we do something else. As you can see here, there aren't a lot of models that require such modifications; most of the time the exported hash is enough.
We take the relation name and we convert the string to the object; I guess that's actually happening here, in the subclass of the base relation factory. There's a lot going on, but we replace user references, which is the responsibility of the members mapper, and I'll touch on that in a second.
We remove duplicate assignees, we reset tokens, and we remove any occurrences of encrypted attributes as well. I just want to quickly show you where the actual creation of the object happens, from the hash to the actual object.
It's essentially right here, in the existing-or-new-object step: relation_class.new, as simple as that. There are a few more things that happen before that, like cleaning the attributes, which is security-related.
Because it's a tarball, it's user-editable: a user can, maliciously or otherwise, edit the tarball, make it look legit, then upload it and look for some sort of unexpected behavior.
So there is a process for that as well. If you're curious, the attribute cleaner is the one responsible for making sure the incoming hash is clean and all the references to IDs are removed. For example, here is a list of prohibited references: everything that ends in _id, _ids or _html gets removed, except for a few things that are explicitly allowed, but I won't go through them.
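The prohibited-reference cleanup boils down to something like this. It's a sketch, not the real attribute cleaner, and the allow-list the real code keeps is omitted here:

```ruby
# Drop any attribute whose name ends in _id, _ids or _html before the
# object is built, so IDs and rendered HTML from the source instance
# never survive the import.
PROHIBITED_SUFFIXES = /(_id|_ids|_html)\z/.freeze

def clean_attributes(hash)
  hash.reject { |key, _value| key.match?(PROHIBITED_SUFFIXES) }
end

cleaned = clean_attributes(
  "title"            => "Bug",
  "author_id"        => 42,
  "label_ids"        => [1, 2],
  "description_html" => "<p>rendered</p>"
)
```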
Now, this is where the transformation happens, and there are a few cases: if it's a unique relation, we want to do one thing; if it's an existing object, we want to do something else. So there are a few things we do here to accommodate different types of relations.
So we have an object builder, and it's somewhat simple, because what we do is take a list of attributes and the class and, at a high level, we either find an object using the list of attributes that we passed, or we create a new one.
For the project object builder, I think the attributes are things like title, description, created_at and the project ID or group ID, and this has proven to be enough to identify existing objects instead of creating duplicates.
So if there are two issues: issue one creates the label; issue two comes along and there's already a label with title A, description B and this created_at. We check whether that label already exists, and if yes, we use it instead of creating a new one.
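An in-memory sketch of that find-or-create idea (the real object builder queries the database; this stand-in just keeps labels in an array):

```ruby
# Look a label up by a small set of identifying attributes and only
# create it when no match exists, so two issues referencing the same
# label end up sharing one record instead of creating a duplicate.
class LabelStore
  def initialize
    @labels = []
  end

  def find_or_create(attrs)
    @labels.find { |label| label == attrs } || attrs.tap { |a| @labels << a }
  end

  def count
    @labels.size
  end
end

store  = LabelStore.new
first  = store.find_or_create("title" => "bug", "description" => "red label")
second = store.find_or_create("title" => "bug", "description" => "red label")
```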
And that relates to something I mentioned before: user references. Obviously, we cannot preserve these; we need to replace them with something else. A user with ID 1 on my instance is root; on GitLab.com it's someone else. I cannot just upload the project and suddenly have authored everything, right? So on any occurrence of these references, we replace them.
We replace them with the references that we built using the members mapper. The members mapper runs at the very beginning: we don't re-run the mapping every time we see a new occurrence of a user reference; we build it once and pass it along to the relation factory.
Every exported relation has some sort of user reference, and we use the public email that was exported with the user, or the email for older exports. There has been a recent change to use the public email only and not the email, but in order to provide backwards compatibility with older exports, we use the email as well.
We use a verified primary email, not a secondary or unverified one; it should match the primary email address on the source instance. One thing to note: this whole member-mapping functionality only works if you're an admin. If the user who initiates the import is an admin, then user mapping will work.
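The mapping step can be sketched like this: build the exported-user-ID to local-user-ID map once, matching on public email, and fall back to the importing user for anyone who can't be matched. All names and IDs below are made up:

```ruby
# Build the user map a single time, before any relations are restored,
# so every later user reference is resolved with a plain hash lookup.
def build_user_map(exported_members, local_users_by_email, importer_user_id)
  exported_members.each_with_object({}) do |member, map|
    local_id = local_users_by_email[member["public_email"]]
    map[member["id"]] = local_id || importer_user_id
  end
end

user_map = build_user_map(
  [{ "id" => 10, "public_email" => "dev@example.com" },
   { "id" => 11, "public_email" => "ghost@example.com" }],
  { "dev@example.com" => 7 },  # local users keyed by verified primary email
  1                            # the user running the import (e.g. root)
)
```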
So yeah, there's a lot going on in import/export, especially in import, because export is quite a simple procedure, but on import there are a lot of things you need to cover, and the relation factory is quite extensive. There's a lot to read and understand, and it can be quite overwhelming to grasp everything. But I hope this was a helpful overview of what import/export does, what these classes are, and what they are responsible for.
Other things to mention: like I said before, all of the above also exists for group import/export, with separate files and separate importer classes, mostly subclasses, because they all share a lot of functionality. There's an import/export API that you can use to initiate imports and exports, so it's not only available in the UI but also via the API. And one last note: for exports, especially when done via the API, you can pass an upload URL and upload params in order to upload the tarball to the provided URL. For example, if you have a long-running export and you want to upload it directly to storage, something like an AWS S3 bucket, you can provide that URL and the file will be uploaded there. In a way, it's like a hook.