Description: An introduction to the GitHub Import tool by George Koltsov.
Hello and welcome to this GitHub importer overview. This is a GitLab feature that allows you to migrate projects from GitHub to GitLab. You can find more details under these URLs: the user guide, as well as the developer documentation.
What it does: like I said, it migrates GitHub repositories to GitLab projects, and it migrates this list of data. Mostly it's, you know, pull requests, issues, labels, milestones, stuff like that, and comments and LFS objects.
Yes, it's highly asynchronous, and I will show more detail, but when we execute this importer, the importer overall happens in different stages; it imports data in stages. So the first stage is the repository. This is the first step for many of our importers; that's the step we need in order to migrate merge requests, so it's pretty central to all of our importers.
So, as you can see here, the parallel importer enqueues the job for the import repository worker, which does the import. Then the next stage is base data, and base data is essentially labels, milestones and releases; it executes every one of those importers within this single job, and then it enqueues the pull requests worker.
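As a rough sketch of that chain (class names here are simplified stand-ins for the real workers under `Gitlab::GithubImport::Stage`, and the bodies are illustrative):

```ruby
require 'sidekiq'

# Simplified sketch of the stage chain described above. The real workers live
# under Gitlab::GithubImport::Stage and do considerably more.
class ImportPullRequestsWorker
  include Sidekiq::Worker

  def perform(project_id); end # the parallel pull request stage, covered below
end

class ImportBaseDataWorker
  include Sidekiq::Worker

  def perform(project_id)
    # Labels, milestones and releases are small collections, so each of those
    # importers runs inline, one after another, inside this single job...
    # ...and only then is the next stage scheduled.
    ImportPullRequestsWorker.perform_async(project_id)
  end
end

class ImportRepositoryWorker
  include Sidekiq::Worker

  def perform(project_id)
    # Clone the GitHub repository into the GitLab project (omitted here),
    # then hand over to the next stage.
    ImportBaseDataWorker.perform_async(project_id)
  end
end
```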
So that's the next stage, and after that, for all of these, we have an advance stage worker: that's the worker that is used in order to wait for a number of jobs to complete before we go on. Prior to the advance stage worker, all of these workers would, you know, as the last step, enqueue the next worker; but in the case of the advance stage worker, that's no longer the case, because of the way we import data. And here's the list of stages that we go through.
Yeah, it's not necessarily that all the jobs have been completed, but it waits for jobs to be completed, and then it re-enqueues itself. So the main collection worker executes an importer. Okay.
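Condensed into code, that waiting loop looks something like this (a sketch of the pattern; the real class is `Gitlab::GithubImport::AdvanceStageWorker` and its exact signature and intervals differ):

```ruby
require 'sidekiq'

class AdvanceStageWorker
  include Sidekiq::Worker

  # Maps stage names to the worker that starts the next stage (illustrative).
  STAGES = {
    'pull_requests' => ImportPullRequestsWorker
  }.freeze

  # waiters: { job_waiter_key => number_of_jobs_still_expected }
  def perform(project_id, waiters, next_stage)
    remaining = wait_for_jobs(waiters)

    if remaining.empty?
      STAGES.fetch(next_stage).perform_async(project_id)
    else
      # Some jobs have not reported back yet: re-enqueue ourselves and check
      # again later instead of blocking a Sidekiq thread for a long time.
      self.class.perform_in(30, project_id, remaining, next_stage)
    end
  end

  private

  def wait_for_jobs(waiters)
    waiters.each_with_object({}) do |(key, count), still_pending|
      waiter = Gitlab::JobWaiter.new(count, key)
      waiter.wait(10) # drains finished job IDs from the Redis list
      still_pending[key] = waiter.jobs_remaining if waiter.jobs_remaining > 0
    end
  end
end
```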
Every importer has this module, ParallelScheduling. All of the importers across the entire GitHub importer follow a similar pattern, which utilizes inheritance heavily, as well as defining these kinds of classes. So there's a number of methods that you need to define when you include the ParallelScheduling module. What ParallelScheduling does is fetch a resource collection endpoint from GitHub page by page and, for every object to import, use a separate job.
We take that and we enqueue the Sidekiq worker for each individual object, whatever it may be: issue, note, pull request, etc. Right, and then we create a waiter for it, and then we return it; I will cover the waiter later on.
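Here is that flow in miniature (a condensed sketch of what `Gitlab::GithubImport::ParallelScheduling` does; the paging helper and its options are approximations of the real code):

```ruby
module ParallelScheduling
  # Pages through the GitHub collection endpoint and schedules one Sidekiq
  # job per object, all tracked under a single JobWaiter key.
  def parallel_import
    waiter = Gitlab::JobWaiter.new

    each_object_to_import do |object|
      sidekiq_worker_class.perform_async(
        project.id,
        representation_class.from_api_response(object).to_hash,
        waiter.key # each job notifies this key when it finishes
      )
      waiter.jobs_remaining += 1
    end

    waiter # returned so the stage worker can hand it to AdvanceStageWorker
  end

  def each_object_to_import
    page = 1

    loop do
      batch = client.public_send(
        collection_method, project.import_source, page: page, per_page: 100
      )
      break if batch.empty?

      batch.each { |object| yield object }
      page += 1
    end
  end
end
```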
Actually, let me cover it right now: the JobWaiter is used to keep track of state.
If you've seen my previous recordings on import/export, for example, there we use a database record to keep track of import state, but here we use Redis state to keep track of all these jobs. Because, if you imagine, if you have a project with 10,000 pull requests, each pull request is a separate Sidekiq job; and if you had a separate database record for every single pull request to keep track of its state, that would be a lot of extra load on the database.
So yeah, here is Gitlab::JobWaiter; it utilizes Redis for this.
And that's how it keeps track of the progress that's been made. You know, the advance stage worker checks this list by its key: if it sees it being empty, then that indicates to the worker that the job is done; if there is something in the list, then it knows that it's not done yet.
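In usage terms the contract is small. A sketch of how `Gitlab::JobWaiter` is used, as I understand it (`ImportIssueWorker` here is just an illustrative consumer):

```ruby
waiter = Gitlab::JobWaiter.new # generates a unique Redis key

# Producer side: schedule jobs and count them against the waiter key.
issues.each do |issue|
  ImportIssueWorker.perform_async(project.id, issue.to_hash, waiter.key)
  waiter.jobs_remaining += 1
end

# Worker side: the very last thing each job does is push its job ID onto
# the Redis list behind the key:
#   Gitlab::JobWaiter.notify(waiter_key, jid)

# Consumer side: block for up to 10 seconds, popping finished job IDs off
# the list; jobs_remaining tells you how many never reported back in time.
waiter.wait(10)
puts waiter.jobs_remaining # 0 means this batch is done
```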
Okay, and every importer, yeah, like I said, every importer class kind of looks like this, where it includes ParallelScheduling and defines the importer class. So this is the collection importer, right, but it defines the individual object importer, the representation class, the Sidekiq worker class, the collection method (this is the method that is used to fetch data from GitHub), and so on.
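So a typical collection importer reads something like this (a sketch loosely based on the issues importer; each real importer defines a few more methods, for example for caching already-imported IDs):

```ruby
module Gitlab
  module GithubImport
    module Importer
      class IssuesImporter
        include ParallelScheduling

        # Imports a single issue into the project.
        def importer_class
          IssueImporter
        end

        # Wraps the raw GitHub API payload in a plain Ruby object.
        def representation_class
          Representation::Issue
        end

        # The per-object Sidekiq job scheduled by parallel_import.
        def sidekiq_worker_class
          ImportIssueWorker
        end

        # The GitHub client method used to fetch the collection, page by page.
        def collection_method
          :issues
        end
      end
    end
  end
end
```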
You can have a look at ParallelScheduling. I mean, at first it is kind of difficult to grasp and navigate, but over time it becomes easier to locate the certain bits of logic that you're looking for when something is wrong with the import. But that's... oh no, never mind.
Bulk insert. So, unlike import/export, where for everything that we insert into the database, for everything that we save, we run ActiveRecord callbacks (except for a few exceptions, but the overall picture is that all of the callbacks run; we never skip anything, well, almost anything), in the GitHub importer it's exactly the opposite, actually. Why?
Well, primarily it's for historic reasons: that's how it was introduced. But mainly, yeah, because there might be tens of thousands of notes, and in order to relieve the database from additional pressure, we bulk insert. And here's the example of the diff note importer, where we format the note and then we use bulk insert to insert these attributes.
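That path boils down to something like this (a sketch; the attribute list is trimmed, and `bulk_insert` plus the two lookup helpers stand in for GitLab's internal equivalents):

```ruby
# Build plain attribute hashes instead of ActiveRecord objects: no
# validations, no callbacks, no one-INSERT-per-row overhead.
rows = diff_notes.map do |note|
  {
    noteable_type: 'MergeRequest',
    noteable_id: merge_request_id_for(note), # hypothetical lookup helper
    project_id: project.id,
    author_id: author_id_for(note),          # hypothetical user mapping
    note: note.note,
    created_at: note.created_at,
    updated_at: note.updated_at
  }
end

# One multi-row INSERT for the whole batch of notes.
bulk_insert(Note.table_name, rows)
```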
I guess the main difference between this and import/export is that import/export has a lot more nested associations. You know, you can imagine a merge request with notes; inside of notes you have award emojis, user references, system note metadata, events, whatever it might be. It can be a bit of a bigger package as opposed to what we fetch from the GitHub API.
Now, you may have noticed that the advance stage worker has these two stages: pull requests merged-by and pull request reviews. Like, why? Why are they needed? Why not just have them as part of, let's say, diff notes, right, or notes? And the main reason is because these are separate API endpoints coming from GitHub.
They only have individual API endpoints and not collections. So, let's say we imported 10,000 merge requests: for every merge request, we need to go back and, one by one, fetch the merged-by information, which can be quite slow. So yeah, it does not have a collection API, so we have to fetch MRs one by one, which is not efficient.
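In client terms, that one-by-one shape is roughly this (Octokit-style calls, since the importer's client wraps Octokit; the update helper is hypothetical, and I'm assuming the imported MR iid matches the GitHub PR number):

```ruby
# No collection endpoint exists for merged-by, so every imported merge
# request costs one extra API round trip.
project.merge_requests.each do |merge_request|
  pull_request = client.pull_request(project.import_source, merge_request.iid)

  # The detailed PR payload carries merged_by, which the list payload lacks.
  update_merged_by(merge_request, pull_request.merged_by) # hypothetical
end
```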
We have here, obviously, issues and diff notes combined: so we have the issues importer together with the diff notes importer in one stage.
I guess another thing worth noting is ObjectImporter; that's another module that is being used to execute the import of each individual object.
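The per-object workers are thin shells around it; approximately (a sketch of the ObjectImporter idea, not its exact method names):

```ruby
# ObjectImporter drives the actual import and reports back to the waiter.
module ObjectImporter
  def perform(project_id, hash, waiter_key)
    project = Project.find(project_id)
    object = representation_class.from_json_hash(hash)

    importer_class.new(object, project).execute # real code also passes a client
  ensure
    # Whatever happens, tell the waiter this job is accounted for.
    Gitlab::JobWaiter.notify(waiter_key, jid)
  end
end

# Each per-object worker just declares what to import and how to rebuild it.
class ImportIssueWorker
  include Sidekiq::Worker
  include ObjectImporter

  def representation_class
    Representation::Issue
  end

  def importer_class
    IssueImporter
  end
end
```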
Well, the difference is this: let me just open them side by side. If you take a look at the collection method, this one uses pull requests comments, plural, versus pull request comment, which indicates that this method of importing fetches diff note comments one by one, MR by MR. Okay: MR number one, give me all the comments; MR number two, give me all the comments.
While this endpoint returns all the comments across all of the merge requests, right? So it's convenient, because you have one endpoint, but you need to sort out which diff note belongs to which merge request yourself, and that's what we do during the import.
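With the Octokit client that the importer wraps, the two shapes look like this (repository name and PR number are just examples):

```ruby
require 'octokit'

client = Octokit::Client.new(access_token: ENV['GITHUB_TOKEN'])

# Collection endpoint: every review comment in the repository, in one
# paginated stream; the importer must work out which MR each one belongs to.
client.pull_requests_comments('nodejs/node')

# Single-object endpoint: review comments for one pull request at a time,
# so the mapping is implicit but the import needs one call per MR.
client.pull_request_comments('nodejs/node', 1)
```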
It will tell you this: "In order to keep the API fast for everyone, pagination is limited for this resource." Meaning that you simply cannot fetch all the available comments for this project (and this is the node.js project, by the way). So we stumbled upon this problem somewhat recently, where an import was missing comments, and after a lot of digging around, we found that the GitHub API just simply does not return
all of the comments that there are. So we had no choice but to introduce an alternative way of importing notes into the GitHub importer, and this way fetches them one by one, right, which is way slower, but that's the trade-off. If, let's say, you have a massive project, you want to import all of the comments, and the current approach does not provide all the comments, then an alternative solution is to try this way of importing. And this is behind two feature flags.
Number one is the single endpoint notes import flag, which changes the way we import notes. And the other one is the lower per-page limit, and that's another GitHub limitation, where sometimes, if you fetch a page with the default page size of 100, GitHub will return a 500 error and will not return the result. So another feature flag was added to just reduce the page size, in order to try and help with stability and try and fetch all the data there is, even though the price is speed.
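Reconstructing the flag names from the description above (they should be close to `github_importer_single_endpoint_notes_import` and `github_importer_lower_per_page_limit`, but treat both names and the wiring below as a sketch):

```ruby
# Sketch of how the two escape hatches could plug into a notes importer.
def collection_method
  if Feature.enabled?(:github_importer_single_endpoint_notes_import, project)
    :issue_comments   # per-object endpoint: slow, but returns everything
  else
    :issues_comments  # repository-wide endpoint: fast, but may be capped
  end
end

def per_page
  # A full page of 100 sometimes makes GitHub respond with a 500, so this
  # flag trades speed for stability by shrinking each page.
  Feature.enabled?(:github_importer_lower_per_page_limit, project) ? 50 : 100
end
```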