GitLab Verify Group, 4 Jun 2019

From YouTube: Verify Deep Dive: CI jobs scheduling for runners codebase walkthrough

Description

Tomasz (https://about.gitlab.com/company/team/#TomaszMaczukin) and Fabio (https://about.gitlab.com/company/team/#fabiopitino) take a deep dive into the CI job scheduling algorithms for GitLab CI/CD.

A

We are recording, and I will share my screen.

A

You should see it now.

A

Okay, so the question for which we are meeting today was how jobs are being scheduled to the runners.

A

The first thing that needs to be remembered is that it's always the runner that starts the communication. It's not GitLab.

A

We now have the CI web terminal, where GitLab starts some requests. This is a little change in the general architecture, but for a very, very long time there was no communication from GitLab to the runner. It was always from the runner to GitLab, and then getting something in the response.

A

Now, with the first request for a job, we are sending information on how you can access the web terminal, so there is a way for GitLab to connect back to the runner, if it is configured properly, of course. But we will not look at this today. For scheduling jobs, updating the trace, and updating the job state, we can just assume that it's always GitLab Runner that

A

sends a request to GitLab.

A

First, very briefly, let's look into the Build model. We don't use the name "build" now in GitLab CI terminology. We use "job", but in the model it is still named Build, so...

B

Sorry, so builds and jobs are equivalent?

A

Yeah. At the very beginning, GitLab CI was managing builds. But at some moment we decided that "build" may give someone the wrong context, because it suggests that it is all about building something: compiling, building images, apps. And GitLab CI doesn't limit you to only doing this. In the script you can do

A

anything that you want. You can deploy something, you can create something, you can do some tests. And at some moment there was a product decision that we will rename build to job.

A

"Job" is a more generic word, a more generic description of what this part of CI is. But in both the Rails and the Runner codebases, the word "build" was used in so many places that it's still there. In particular, the model is named Build because the table in the database, when it was created, was named ci_builds. So in the documentation and in any written artifacts that we provide outside,

A

we talk about CI pipelines and CI jobs, but inside the codebase, if you want to see something about the job, you will in most cases want to look for the word "build" in many different places.

B

I found that confusing, in fact, and I didn't know exactly whether build maps to pipelines, sometimes, or to jobs.

A

Yeah. Build is definitely the job. It was just not renamed in the code in the end. So, not focusing now on what the lifetime of the job is (we had such discussions a few weeks ago, and I hope we will have some documentation created from this), what needs to be known for the scheduling is that the jobs have different states, and

A

the scheduling for the runner happens only for jobs that were transitioned to the pending state. GitLab creates a job when it creates the pipeline, and the job may be started by a few things. It may be a job that is just started because it doesn't have any special

A

configuration and it's in the first stage. It may be a job in a later stage of the pipeline; then the job is created and it waits until GitLab decides that the first stage is finished, and then it starts all of the jobs from the next stage. And we have manual jobs, where you need to manually click the start button or make an API call to start a job. What this start...

B

Sorry, can you repeat that again? So originally the job is in the created state, yes? And then when does it transition into pending?

A

We create all of the jobs for a pipeline at once, when creating the pipeline. When you have a pipeline that is running, you can see this graph where some of the jobs are not working; they are there, waiting. It's not only a graph that prints what jobs there will be. The jobs are already created, the objects are stored in the database, all of the connections are made; the jobs are just in the created state. Now, a job may be started by three things.

A

First, you can have a manual job. It needs to be explicitly marked as manual, and then you need to either click the start button in the UI or use the API to start this job. The second thing is that when the job is not in the first stage of the pipeline, it will be started by GitLab automatically when the previous stage of the pipeline is finished. GitLab then transitions the state of the stages in the pipeline, and it starts all of the jobs from the new stage.

A

And the third option is when the job is not manual and it's in the first stage, or the only stage, of the pipeline; then it is started at the pipeline start. And what "start" does is that it transitions the job to the pending state. This is important for us in the context of job scheduling, because scheduling jobs for runners works only with jobs that are in the pending state.
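
The three ways of starting a job described above can be sketched as a small state model. This is only an illustration in Python, not the GitLab Rails code: the `Job` class, the function names, and the `stage_index` field are hypothetical, and only the created and pending states are modeled.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    stage_index: int          # 0 means the first stage of the pipeline
    manual: bool = False
    status: str = "created"   # every job is created together with the pipeline


def start(job: Job) -> None:
    """Transition a created job to pending, which makes it schedulable."""
    if job.status == "created":
        job.status = "pending"


def on_pipeline_start(jobs: list) -> None:
    # Non-manual jobs in the first stage start together with the pipeline.
    for job in jobs:
        if job.stage_index == 0 and not job.manual:
            start(job)


def on_stage_finished(jobs: list, finished_stage: int) -> None:
    # When a stage finishes, the non-manual jobs of the next stage are started.
    for job in jobs:
        if job.stage_index == finished_stage + 1 and not job.manual:
            start(job)


def play(job: Job) -> None:
    # A manual job is started explicitly (UI button or API call).
    if job.manual:
        start(job)


def schedulable(jobs: list) -> list:
    # Only pending jobs are ever considered when a runner asks for work.
    return [job for job in jobs if job.status == "pending"]
```

In this model a job that stays in the created state is never returned by `schedulable`, which mirrors the point that runners only ever see pending jobs.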

B

Okay, nothing else. Yes, because if everything was in created, then potentially we could assign jobs that are further down the line.

A

Okay, for example.

A

The second thing that is important to remember is that most of the logic of CI is implemented in the Rails app. I always say that the runner is a little stupid binary that has only one job: it gets a job definition, it executes it, and it reports the state. Nothing else, nothing complicated. It does, of course, have some logic of its own about how the executors are configured and how the job may be executed in different settings, but in the context of CI,

A

most of the business logic is implemented in Rails, and the scheduling part especially is implemented in Rails. So, knowing that we are working only with jobs in the pending state when thinking about scheduling, let's look at how the scheduling works. I said that it is the runner that starts the communication, and it does this with the special API that we have for runners. When you look into lib/api, we have runner.rb. This file contains the API used by GitLab Runner to work. The second one, runners, is the API set for the user.

B

A

So you can get information about your runners, you can pause and unpause a runner, do a lot of these things. This is mostly prepared to be used by the users of GitLab. We are interested in runner.rb, because this is the definition of the API that the runner uses: registration of a new runner, requesting a job, updating a job, patching the trace, downloading and uploading artifacts in the context of a job, not in the context of user usage.

A

This is the API, and we are interested in this endpoint. So the runner sends the request to an address that looks like this. Let's just use gitlab.com as the GitLab installation.

A

api/v4/jobs/request, and this is the POST HTTP method. With this, we are sending a token that authorizes the request. This is the token that is unique for every GitLab Runner registered in GitLab.

B

Would that be the equivalent of the registration token?

A

No. The registration token is used to register the runner. You have a registration token on the level of GitLab (the instance level), on the level of each group, and on the level of each project. The registration token is used only this one time, when you register a new runner. The runner shows this token, and

A

GitLab then knows that this is an authorized request and what it is authorized for: whether you are trying to create a runner for the instance level, for the project level, or for the group level. In the case of group and project, it finds which group or which project to connect this runner to. And when the runner is registered, when it is connected to the instance, group, or project, it receives its own unique token, which is saved in the runner's configuration file, and this token is then used to authenticate requests for new jobs.

A

And, I think, only for this. Well, no, because we also have an API for checking if the runner is still registered in GitLab, so this token is used there as well. But after that, updating the job, downloading and uploading artifacts: this is done with the CI job token, which is unique for each job and is valid only while the job is in the running state.
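
The division of labor between the three tokens can be summarized in a short sketch. The operation names and the `is_authorized` helper below are hypothetical labels chosen for illustration; they only restate which token the discussion above assigns to which kind of request.

```python
from enum import Enum


class Token(Enum):
    REGISTRATION = "registration"  # instance, group, or project level; used once
    RUNNER = "runner"              # unique per registered runner
    JOB = "job"                    # unique per job


# Which token each kind of runner request is expected to carry.
REQUIRED_TOKEN = {
    "register_runner": Token.REGISTRATION,
    "verify_runner": Token.RUNNER,
    "request_job": Token.RUNNER,
    "update_job": Token.JOB,
    "upload_artifacts": Token.JOB,
    "download_artifacts": Token.JOB,
}


def is_authorized(operation, token, job_status=None):
    """Return True if the token kind fits the operation.

    The job token is additionally valid only while its job is running.
    """
    if REQUIRED_TOKEN.get(operation) is not token:
        return False
    if token is Token.JOB:
        return job_status == "running"
    return True
```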

A

So with this token we are authorizing the request for new jobs. Last update: this is used...

A

There is a little caching around this endpoint, because by default GitLab Runner will request a new job every three seconds. Last year, when I was checking this, we had 16,000 runners connecting to GitLab each day, each of them creating at least one request every 3 to 30 seconds, so we have a huge number of such requests on GitLab.com alone.

A

So we have some sort of caching on the level of Rails and on the level of Workhorse, which is the proxy, and this field is used to connect all of these three parts and to decide whether we need to send a response or not.

B

A

That it returns.

B

Three or four every now and then from the.

A

Getting the logger.

B

A

Let's not focus on this right now. You can find where this last update is used here, in this endpoint. You can look a little through GitLab Workhorse. If you need help, we can just jump into this and go through the Workhorse code to see how the API is configured. For now, let's just note that on the GitLab Rails side and on the GitLab Workhorse side

A

we are limiting the number of requests and queuing them a little, just to make the database and the system less hammered by the requests. With info, we are sending some metadata about the runner: its version, its configuration, et cetera. And session is the new thing:

A

we are sending information on how you can connect to the CI web terminal. Okay, going fast forward: if the runner is authenticated properly, and if it is not outside the limits of the last update, everything that we check here to prevent requests that could affect the system stability,

A

we enter this line. There is the register job service, which contains the scheduling algorithm. When it returns a job... oh, not a job, it returns a result that may be valid or not.

A

It can contain a build (a job), or it can contain information that we don't have a job for this runner. Based on this, we are sending one of these three responses, and in the case of this one, this is the payload that the runner will receive in order to start executing a job.
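
As a rough illustration of the exchange described above, the sketch below assembles the body of a job request and interprets the three possible outcomes. The field names `token`, `last_update`, `info`, and `session` come from the discussion; the helper names and the concrete status codes (201 for a job, 204 for no job, 409 for an invalid result) are assumptions made for this example and are not stated in the recording.

```python
import json


def build_job_request(runner_token, last_update=None, info=None, session=None):
    """Assemble a JSON body for POST <gitlab>/api/v4/jobs/request."""
    body = {"token": runner_token}        # the runner's own unique token
    if last_update is not None:
        body["last_update"] = last_update  # used by the caching layers
    if info:
        body["info"] = info                # runner metadata, e.g. version
    if session:
        body["session"] = session          # how to reach the web terminal
    return json.dumps(body)


def interpret_response(status, payload=None):
    """Map the three outcomes to what the runner does next.

    The status codes are assumed for illustration only.
    """
    if status == 201:
        return ("run", payload)    # valid result containing a job
    if status == 204:
        return ("wait", None)      # valid result, no job for this runner
    if status == 409:
        return ("retry", None)     # invalid result, ask again
    raise ValueError(f"unexpected status: {status}")
```

No network call is made here; the point is only the shape of the request and the three-way branch on the result.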

A

Okay, so to see how the job is scheduled, we need to look at the register job service, and here at the execute method. Again, I will not describe now how services work in the Rails app of GitLab.

A

For now, we need to know that this execute method is what is being executed to handle this request. So at first, in the API, we authenticated the runner, and while authenticating the runner, we found exactly which runner registered in GitLab this is. This is passed to the service, and here we are checking what type of runner this is, because for the instance level, so the shared runner, we have one query to the database. For the group level

A

we have another query, and the only case left is the project-level runner, where we also have a separate query. This is because the jobs queue is stored in the database, and it is recognized by the pending state of the job. Okay, so we are executing one of these three queries, the basic queries,

A

that are next additionally checked against the tags, the runner tags. This step here checks whether, in the list of jobs that we initially took for the runner that is asking, we have any jobs that match the tags specific to the runner. Do you know the tags field in the .gitlab-ci.yml file?

B

I've seen the tags when we registered runners, but I don't know how that interacts with...

A

Okay. You can mark each runner with tags; for example, I think all of our shared runners on GitLab.com are tagged with the docker tag. And then in your .gitlab-ci.yml file, for each job, you can specify a list of tags that need to be present on the runner to make it possible to run this job. This is how you can define which runners should execute which job. Let's assume that you have a project where

A

the tests may be executed on any platform that you want, but the binary can be built only on a Windows machine. In the case of GitLab.com, we don't have Windows shared runners; we have only Linux-based shared runners. In your project, you can decide that you don't want to manage your own runner for this; you just want to use the shared runners that GitLab.com provides to execute some tests. So you can just do nothing.

A

Shared runners on GitLab.com are configured in such a way that they will execute each job that doesn't have tags, and if you have the shared runners enabled in your project, they can take this job and execute it. But for the compilation, your project requires a Windows machine. So you add the Windows machine to your project. But now, if you have this Windows machine there, it may start catching some of the jobs that it should not. So to this Windows

A

runner you add some tag, for example windows, and in the compilation job you add the tags array, where you specify windows. The rule is that the runner needs to have at least all of the tags that you specified for the job in order to execute this job. So if you have a runner which is tagged with linux, docker, and ubuntu, and in your job you specify the linux tag, it will work.

B

A

If you have a runner that is tagged with linux, but in your job you specify ubuntu, linux, and docker, then this will not be executed on this runner, because this runner is still missing the two tags that you require for this job.
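
The tag rule just described is a subset check: every tag on the job must be present on the runner. A minimal sketch in Python follows; the function name and the `run_untagged` parameter are illustrative names for this example, not taken from the GitLab codebase.

```python
def runner_accepts(runner_tags, job_tags, run_untagged=True):
    """Return True if a runner with runner_tags may execute a job with job_tags.

    run_untagged models a runner that is configured to also take jobs
    without any tags, as described for the shared runners on GitLab.com.
    """
    required = set(job_tags)
    if not required:
        # A job without tags is taken only by runners that allow untagged jobs.
        return run_untagged
    # The runner needs at least all of the tags the job asks for.
    return required <= set(runner_tags)
```

With the examples from the discussion: a runner tagged `linux`, `docker`, `ubuntu` accepts a job tagged `linux`, while a runner tagged only `linux` does not accept a job tagged `ubuntu`, `linux`, `docker`.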

B

A

This is how tags help you manage which runner should execute the job. And this step here, from the initial list of jobs

A

listed for the specific runner,

A

gets only those jobs that match the tags of the runner that we are working with. And this is why, if you don't have any available runner with such specific tags, this job will not be taken, and at some moment we will see in the UI that the job is stuck because you have no runner with these tags, and the list of tags is shown in the notification.

A

However, we also have the case where the runner may be explicitly configured to execute any job that doesn't have tags specified, and this is how shared runners on GitLab.com are configured. You can check what tags we provide on them and set them for your jobs, or you can do nothing and just make sure that the shared runners are available. But for some other runners, we have this option disabled. So such a runner will handle only jobs that match its tags. If the job doesn't have tags,

A

this rule removes the job from the list. The can pick check on the runner is more or less the same; we will look at the end at what this method does. Because here we have the final list of jobs that should be able to be executed on the runner that requested a new job. So we are iterating over this list, taking the first one which can be picked, and assigning it to the runner. In the case of a race condition where two runners requested a job

A

at almost the same time, two runners that are so alike that the same jobs were taken into the list, both runners will try to take the same first job. We have transactions, and the first one, whichever wins this competition by milliseconds, will lock the transaction, be assigned, and update the state to running. The second one will receive one of these two exceptions, and

B

A

in this case, just to speed things up, we are allowing the loop to be executed once again. So if we had an exception only because there was a race condition, we try to assign another job from the list to this runner, just to not require another request to be made. If such a situation is repeated, we return an error, and then the runner needs to do a request again. So let's maybe look at this can pick. It basically checks

A

against the protected feature, because you can mark some tags and branches as protected in the project, I think.

B

A

In that case, the runner needs to be... or maybe, I know, it's the opposite thing. If you mark the runner as protected, then it will execute only jobs on protected branches or tags. So you can have, for example, a runner that is doing production deployment, and you don't want anyone with access to the project to be able to start the job. You want only some users that you trust to be able to do this, so you make them maintainers in the project.

A

Then they can start jobs on the protected branches and tags, which will be running on the protected runner.
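
The assignment loop, the race between two runners, and the protected check can be put together in one sketch. Everything here is a simplified stand-in: the class names, `StaleJob`, and the in-memory `assign` function are hypothetical and only imitate what the database transaction does in the real service.

```python
from dataclasses import dataclass
from typing import Optional


class StaleJob(Exception):
    """Raised when another runner took the job first (hypothetical name)."""


@dataclass
class Job:
    id: int
    protected: bool = False      # job runs on a protected branch or tag
    status: str = "pending"
    runner: Optional[str] = None


@dataclass
class Runner:
    name: str
    protected: bool = False


@dataclass
class Result:
    job: Optional[Job]
    valid: bool


def can_pick(runner: Runner, job: Job) -> bool:
    # A protected runner executes only jobs from protected branches or tags.
    return job.protected or not runner.protected


def assign(job: Job, runner: Runner) -> None:
    # Stand-in for the transaction: fails if the job is no longer pending.
    if job.status != "pending":
        raise StaleJob(job.id)
    job.status = "running"
    job.runner = runner.name


def pick_job(runner: Runner, candidates: list) -> Result:
    conflict = False
    for job in candidates:       # candidates are ordered, oldest first
        if not can_pick(runner, job):
            continue
        try:
            assign(job, runner)
            return Result(job, True)
        except StaleJob:
            conflict = True      # lost the race; try the next candidate
    # Nothing assigned: invalid if we lost a race, so the runner asks again.
    return Result(None, not conflict)
```

A runner that loses the race on the first candidate simply moves on to the next one in the same request, which is the speed-up mentioned above.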

A

Okay, in very short: accepting tags is again checking if the tags requirement is met, because here we remove jobs that have no tags matching the runner, but this still leaves us not quite sure whether the job ideally matches the runner tags, so this will check it. And then it is also checking against groups and projects, because if you have a group runner, then it should be available for all subgroups and projects.

A

So these are some checks additionally done just to make sure that the runner requesting this job is allowed to execute it. And now, looking at these three basic queries, the simplest is the project one.

A

So if the runner is a project runner, we get the list of projects assigned to this runner. We remove from it all projects that are pending deletion and all projects where the pipelines feature is not enabled. We then get new builds. That is also something we need to look at, because we are looking at pending and unstarted jobs, and in the case of a protected runner, we are looking only at the jobs that are marked as protected. And then, in the case of projects,

A

we additionally limit this to the projects to which the runner is assigned, and we are ordering the jobs in ascending order, so first in, first assigned. On the group level, the query is a little more complicated, because we need to get the group hierarchy here and get jobs from all projects in all groups and subgroups to which the runner is assigned. And then again we are ordering this by ID.
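
The project-runner query can be mimicked with plain list filtering. This is a sketch over in-memory records, not the actual SQL; the `Project` and `Build` classes and their field names are assumptions chosen to mirror the conditions listed above.

```python
from dataclasses import dataclass


@dataclass
class Project:
    id: int
    pending_delete: bool = False
    builds_enabled: bool = True   # stands for the pipelines feature


@dataclass
class Build:
    id: int
    project_id: int
    status: str
    protected: bool = False


def builds_for_project_runner(builds, projects, runner_project_ids,
                              runner_protected=False):
    """Return the candidate jobs for a project runner, oldest first."""
    allowed = {
        p.id for p in projects
        if p.id in runner_project_ids
        and not p.pending_delete
        and p.builds_enabled
    }
    selected = [
        b for b in builds
        if b.status == "pending"
        and b.project_id in allowed
        # A protected runner only sees jobs marked as protected.
        and (b.protected or not runner_protected)
    ]
    # Ascending ID: first created, first assigned.
    return sorted(selected, key=lambda b: b.id)
```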

A

This is a rule that is repeated in all cases. So when we do all of the checks (whether the tags match, whether the protected flag matches, and, for the instance level, in a moment we will look at the fair scheduling algorithm,

A

whether this fair scheduling matches), we finally have a list of jobs that is ordered by ID, so we are assigning from the oldest one. And now, I've mentioned the fair scheduling, which is done for the instance type of runners, the shared runners, and

A

if you check this query and analyze what it is doing, we basically prioritize jobs from projects that are at this moment not using shared runners. For example, you have a project and I have a project. You pushed something that created thousands of jobs, and the shared runners that we have on our GitLab

A

instance are able to execute only ten of them at once. And now I push something that creates one job. If we did not do this fair scheduling, but only handled jobs in the order of their creation, I would need to wait for all thousands of your jobs to finish before mine would be executed.

A

So here we are trying to distribute this more evenly between the projects. If your project at this moment doesn't have any job running on the shared runners, you are moving up in the list.

A

If you have some jobs, then you are placed in the list depending on how many jobs you already have running on the shared runners. And then within this list we again order by the ID of the jobs. So if you pushed a hundred jobs and I pushed ten jobs, in a very simple scenario it should go like this: for every request from a runner, GitLab will assign your job number one,

A

my job number one, your job number two, my job number two, your job number three, my job number three, up to ten. And then, because you still have jobs, I have no jobs left, and there are no other jobs in this GitLab, it will be able to assign your job number eleven, twelve, thirteen, fourteen.
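
The interleaving in this example falls out of a simple ordering rule: sort pending jobs first by how many jobs their project already has running on shared runners, then by job ID. The sketch below is a toy model of that rule, not the real SQL query; the function names are hypothetical, and the simulation assumes that no job finishes while the requests come in.

```python
from collections import Counter


def fair_order(pending, running_project_ids):
    """Order pending jobs for a shared runner.

    pending: list of (job_id, project_id) tuples.
    running_project_ids: one project_id per job currently running
    on shared runners.
    """
    running = Counter(running_project_ids)
    # Projects with fewer running jobs come first; ties go to the oldest job.
    return sorted(pending, key=lambda job: (running[job[1]], job[0]))


def assign_sequence(pending, requests):
    """Simulate consecutive runner requests; return job IDs in assignment order."""
    pending = list(pending)
    running = []
    order = []
    for _ in range(requests):
        if not pending:
            break
        job = fair_order(pending, running)[0]
        pending.remove(job)
        running.append(job[1])   # the project now has one more running job
        order.append(job[0])
    return order


# "Your" project pushed jobs 1-3 first, "my" project pushed jobs 4-5 later.
queue = [(1, "yours"), (2, "yours"), (3, "yours"), (4, "mine"), (5, "mine")]
print(assign_sequence(queue, 5))  # [1, 4, 2, 5, 3]
```

Ordering by job ID alone would have produced 1, 2, 3, 4, 5, which is the starvation case the fair scheduling is meant to avoid.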

B

A

So this is how we try to make usage of the runners fair among all of the projects when they are using the shared runners. We are not doing this on the group level or on the project level. We just assume that

A

when you have this for your groups, you don't fight over this resource, because it is your resource. For the shared runners, we want to make it available for all users of the GitLab instance. So this is how this is implemented here. We also have some metrics that are tracked during the job scheduling. And what needs to be remembered is that it's the runner that creates the request and asks for a job. If you disable all runners... or maybe not disable,

A

if you have some network problem that makes the runner unable to communicate with GitLab, you will have an always-growing pending jobs queue, because no one is asking for jobs and no job is being scheduled. After some time, there is a worker,

A

the stuck CI jobs worker. This is a worker that will at some moment decide that your job is stuck, and

B

A

it will drop it with a proper error message in the job.
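
What such a worker does can be sketched as a periodic sweep over pending jobs. The timeout value, the dictionary layout, and the `failure_reason` text below are assumptions for illustration; the recording only says that a job is dropped with an error message after some time.

```python
from datetime import datetime, timedelta


def drop_stuck_jobs(jobs, now, timeout):
    """Fail pending jobs that have waited longer than timeout.

    jobs: list of dicts with "id", "status", and "queued_at" keys.
    Returns the IDs of the jobs that were dropped.
    """
    dropped = []
    for job in jobs:
        if job["status"] == "pending" and now - job["queued_at"] > timeout:
            job["status"] = "failed"
            # Illustrative message; the real wording is not given in the talk.
            job["failure_reason"] = "stuck: no runner picked up this job"
            dropped.append(job["id"])
    return dropped
```

Jobs that are already running, or that have been pending for less than the timeout, are left untouched.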

A

So, if there is no request from runners, no jobs are scheduled, because this is the basic requirement: there needs to be a runner that will ask for a job. When the runner is asking for a job, we are first checking which type of runner this is, and then we are creating a basic list of jobs depending on which runner asked for it. In the case of a shared runner, we are trying to distribute this fairly among all of the projects.

A

A

Then we are checking if the tags match, if the protected flag matches, if the

A

group or project is allowed in this case. We are filtering out all of the jobs that do not meet the requirements, and then we have the final list of jobs ordered from the oldest one, and we are trying to schedule for this request the first one from the list. But in the case when you have more than one runner,

A

it finally ends up that the assignment is quite random. If you have in your project a few shared runners that are enabled and match all of the requirements for the job, if you additionally have a few project-level runners that also match this job, and you have a few group-level runners on different groups that your project belongs to,

A

if you have these 50 runners, for example, and each of them is able to handle this job, it will be assigned totally randomly, because you can't say which one of the runners will be the one that asks for a job at the moment when this job lands at the top of this final list and gets assigned.

A

Okay, so I think I said everything I wanted to say. I showed you all four files that I think are important here.

B

I think it's very, very clear, and I need to dig into more detail to digest the queries. But regarding something we have recently been discussing, about pre-assigning jobs to a runner: now, if we want

B

to reuse the workspace, that means that, rather than assigning the first job, registering success, and then returning the result, we might need to do something there where we actually don't exit right away, but try to schedule all the possible jobs we can give to this runner. It can be a little bit complicated, I guess.

A

Yes, that's why I said that working on the workspace issue will not be only a runner thing. It also needs to be done on the Rails side, because this changes how jobs are being scheduled and assigned to the runner. If you look at the workspace issue, in the discussion there

A

I've mentioned this, and then it was proposed that in that case we could start preparing separate queues for pre-assigned jobs. This is one of the ideas; maybe we will find a better one. But this will definitely need to be changed if we start working on workspaces.

B

So yeah, it looks like we will have scenarios where, apart from the job that is currently being handled by the runner, which will be the only one pending, we have other jobs down the line that will still be in the created state but somehow already assigned to a runner. So next time, when the runner finishes the first job and requests the next job, it will simply trigger a change in the state machine from created to pending, but

B

we already know that it is already pre-assigned to that runner.

A

Don't mix up the states. The transition from created to pending is done by GitLab, and this is what needs to be done to even make the job visible for the runner. The runner, when requesting a job, is transitioning the job from pending to running. So finally we will have a list of jobs in the pending state that are already assigned to some runner, not in created.

B

Okay, I see, I see.

A

Because if the job is in the created state, the runner will not get it, because it's only created. So transitioning from created to pending will be happening like it's happening now: when the job is manual, you will need to trigger this by hand, by the API or by the button. When the job is not manual, then in the first stage it will be started automatically; in each next stage, it will wait until GitLab decides that the previous stage was fully finished,

A

that there are no failures that are not marked as allowed to fail, and then it will start another stage.

B

So, in that case, we'll change the next job from created to pending?

A

Yes. That allows the job to be queried by the scheduler.

A

And what we will need to change in the case of the workspaces work is that this pending job will need to be assigned in a different way than what we have here for the default case, because it will need to be assigned to the one specific runner to make it able to reuse the workspace.

B

So it seems like we can't change all the jobs that could be assigned to a specific workspace, that are all part of the same workspace, into pending at one time, because we might risk picking jobs that are further down the line and assigning them to the runner, right? We still want to keep the sequence of events.

A

Yes, and that's why we said last time that, at least for now, the workspaces will not work for parallel jobs, because in the case of parallel jobs we still need to make this somehow sequential. So finally we will need to... I

A

don't know how we'll go with workspaces there. There are two possible ways. We allow you to configure multiple jobs in one pipeline stage, but then we'll need to take them and order them sequentially, so you have a hundred parallel jobs that are executed sequentially. The other option is that we will not allow creating two parallel jobs assigned to the same workspace at the same time. But this is a different level of how we will handle the workspace from the runner's side.

A

What we will need to change in the scheduling part is that we will need to pull these jobs out of the general queue and create some sort of queue specific to the runner and workspace pair, so jobs specific to this workspace can be executed only on this specific runner, which was the first one that handled this workspace. For this we will need to have a separate query, and we will need to update this mechanism to somehow recognize that this runner is asking for a job that may be

A

in this specific queue. Yeah, so this will be the challenge that we need to resolve.

A

B

That's been very enlightening, so thanks for your time. I'll definitely dig into more detail on the queries and everything else, and if I have any questions...

A

In a moment I will send you the documentation page where the fair scheduling algorithm is described. When looking at the query together with the documentation description, it may be a little easier to understand the query. When you execute this query on your local GitLab development instance, the EXPLAIN ANALYZE output from Postgres is looping and looping.

A

It has four or five levels of nesting, so the query is really, really complicated, and looking at the documentation that describes how this should work made it easier for me to work on the query and understand how this is in fact working.

B

Excellent, okay, cool! Thank you very much. If you manage to send the recording, I'll probably go through it once again, in case I've missed anything.

A

Sure.

B

Yeah, thank you very, very much for your time. Okay, see you later. Have a good day.