From YouTube: Taskscaler 101
Description
This is a code walkthrough and demo of the Taskscaler library being integrated into the GitLab Runner. This is part of the work to replace Docker Machine and evolve the Runner infrastructure. See https://docs.gitlab.com/ee/architecture/blueprints/runner_scaling/index.html.
Okay, so this is a recording of my understanding of fleeting, taskscaler, and the plugin mechanism, as they relate to autoscaling in the GitLab Runner. It's based on the original proof-of-concept merge request that Aaron put out. It's a point-in-time artifact, so some of the details will change, but I'm going to focus on the large pieces that should remain in place even as we continue to develop this.

First, I'm going to take a few minutes and walk through the way that autoscaling works now, with Docker Machine. This may be something you already know, but I found it really helpful to have that context, and it helps to show what the differences are.
Then I'm going to show you what happens inside the runner with this merge request, where it actually gets integrated. I'm going to show you what taskscaler is, where it fits in, and what its responsibilities are. I'm going to show you what fleeting is, what the plugins are, and how they relate to taskscaler and to the runner. And then I'm going to do a little demo to walk you through what it looks like when it actually runs, just so you can check your understanding.
So, first of all, let's talk about the way that autoscaling works now. There's an Executor interface. This is the first level of abstraction here: an Executor is something that can be prepared and can then run some sort of thing somewhere. An ExecutorProvider is the thing that gives you an Executor. Together, these two allow you to run something without knowing exactly where it's running.
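To make that concrete, here's a minimal sketch of the two abstractions in Go. These are not the Runner's actual interfaces (the real ones live in gitlab-runner's common package and carry many more parameters); the shape is what matters.

```go
// A minimal, illustrative sketch of the Executor / ExecutorProvider split.
package main

import "fmt"

// Executor is something that can be prepared and can then run some sort of
// thing somewhere.
type Executor interface {
	Prepare(options string) error
	Run(cmd string) error
	Cleanup()
}

// ExecutorProvider is the thing that gives you an Executor, so the caller
// can run something without knowing exactly where it runs.
type ExecutorProvider interface {
	Create() Executor
}

// shellExecutor is a trivial stand-in implementation.
type shellExecutor struct{}

func (shellExecutor) Prepare(options string) error { fmt.Println("prepare:", options); return nil }
func (shellExecutor) Run(cmd string) error         { fmt.Println("run:", cmd); return nil }
func (shellExecutor) Cleanup()                     { fmt.Println("cleanup") }

type shellProvider struct{}

func (shellProvider) Create() Executor { return shellExecutor{} }

func main() {
	// The build logic only ever sees the interfaces.
	var p ExecutorProvider = shellProvider{}
	e := p.Create()
	if err := e.Prepare("demo"); err != nil {
		panic(err)
	}
	if err := e.Run("echo hello"); err != nil {
		panic(err)
	}
	e.Cleanup()
}
```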
We get one of these when the runner actually calls GetExecutorProvider, right here. Look at where we are: we're inside gitlab-runner, in common, which is the common logic that actually drives the build. It says: get me the executor provider for this executor, where the executor is some string, a discriminator used to dispatch to the appropriate executor. And of course, for that call to work, executors are registered on import. The way GetExecutorProvider is able to find executors is that we import them somewhere, and they each call RegisterExecutorProvider with some string and their implementation.
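The register-on-import pattern looks roughly like the sketch below; the names only approximate the Runner's real API.

```go
// An illustrative register-on-import registry, not the Runner's actual code.
package main

import "fmt"

type ExecutorProvider interface{ Name() string }

var providers = map[string]ExecutorProvider{}

// RegisterExecutorProvider is called by each executor package when it is imported.
func RegisterExecutorProvider(name string, p ExecutorProvider) {
	if _, dup := providers[name]; dup {
		panic("executor already registered: " + name)
	}
	providers[name] = p
}

// GetExecutorProvider dispatches on the configured executor string.
func GetExecutorProvider(name string) ExecutorProvider {
	return providers[name]
}

type dockerProvider struct{}

func (dockerProvider) Name() string { return "docker" }

// In the real Runner this init() lives in the executor's own package,
// which is imported purely for this side effect.
func init() { RegisterExecutorProvider("docker", dockerProvider{}) }

func main() {
	fmt.Println(GetExecutorProvider("docker").Name())
}
```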
Okay, so Docker Machine. Drilling down specifically into one of these providers: it has a provider, and it is a provider, because it's a wrapper. You can see over here that it has all the methods necessary to be a provider, but under the hood it holds the actual provider as well. What it does is create a VM and then use that inner provider to get an executor to run inside the VM. It's a layer of indirection which allows autoscaling to be injected.
The Docker Machine provider has the autoscaling in it, which kind of makes sense: you configure things like the idle count and the idle scale factor, and it looks for free machines. When a job gets assigned to this machine provider and you try to acquire a machine, it looks for one and gets you one. At the same time, it also makes sure that there's additional capacity for the next time. So autoscaling, the instance lifecycle, and job assignment are all pretty tightly coupled in this one little provider. It then uses the machine you get back: in the prepare stage, it delegates to the underlying executor. That all kind of makes sense. So that's the existing state of autoscaling: tightly coupled with Docker Machine, with all of those various considerations coupled together.
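The wrapping pattern is worth seeing in miniature. Below is a hedged sketch of a provider that both is a provider and has a provider, injecting the autoscaling concern before delegating; it illustrates the shape, not Docker Machine's actual code.

```go
// Illustrative "provider that wraps a provider" with autoscaling injected.
package main

import "fmt"

type Executor interface{ Run(cmd string) error }

// Provider is the inner abstraction: it can build an executor for a given VM.
type Provider interface {
	Create(vm string) Executor
}

type vmExecutor struct{ vm string }

func (e vmExecutor) Run(cmd string) error {
	fmt.Printf("running %q on %s\n", cmd, e.vm)
	return nil
}

type innerProvider struct{}

func (innerProvider) Create(vm string) Executor { return vmExecutor{vm: vm} }

// machineProvider owns the machine lifecycle and idle capacity, and
// delegates actual execution to the inner provider.
type machineProvider struct {
	inner     Provider
	idleCount int
	free      []string
	created   int
}

func (m *machineProvider) newVM() string {
	m.created++
	return fmt.Sprintf("vm-%d", m.created)
}

// Acquire finds or creates a machine for the job, then tops up idle capacity
// for the next job: the autoscaling concern lives right here.
func (m *machineProvider) Acquire() string {
	if len(m.free) == 0 {
		m.free = append(m.free, m.newVM())
	}
	vm := m.free[0]
	m.free = m.free[1:]
	for len(m.free) < m.idleCount {
		m.free = append(m.free, m.newVM())
	}
	return vm
}

func (m *machineProvider) Create(vm string) Executor { return m.inner.Create(vm) }

func main() {
	p := &machineProvider{inner: innerProvider{}, idleCount: 2}
	vm := p.Acquire()
	_ = p.Create(vm).Run("echo hi")
	fmt.Println("idle machines ready:", len(p.free))
}
```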
So let's take a look at this pull request, excuse me, merge request, and let's see what changes it makes. First of all, it adds a new folder: inside executors there's a new internal folder, with a new provider and a new executor. What is this provider? Well, it's another layer of indirection, because it has a provider.
If you look at this provider, this is how we get taskscaler and fleeting into the existing Runner. This other provider is able to delegate to some other concrete provider, and it has taskscalers, so when it delegates, it actually uses taskscaler. (I'm not actually sure why there's more than one; maybe I'm missing something.) One way or another, it uses taskscaler to do the underlying work, so this is really just a shim that gets all of this into the existing Runner. Taskscaler itself is a new library, a new project.
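A compact sketch of that shim idea follows. The per-runner map is only my guess at why there's more than one taskscaler (one per runner configuration); treat the names and the keying as illustrative assumptions.

```go
// Illustrative shim: a Runner-facing provider that forwards to taskscaler.
package main

import "fmt"

// Taskscaler stands in for the external library that owns the real work.
type Taskscaler interface {
	Acquire(key string) error
}

type fakeScaler struct{}

func (fakeScaler) Acquire(key string) error {
	fmt.Println("taskscaler acquired capacity for", key)
	return nil
}

// shimProvider satisfies the provider shape the Runner expects, but
// forwards everything to a taskscaler. One scaler per runner token is an
// assumption made for this sketch.
type shimProvider struct {
	scalers map[string]Taskscaler
}

func (p *shimProvider) Acquire(runnerToken string) error {
	ts, ok := p.scalers[runnerToken]
	if !ok {
		ts = fakeScaler{} // the real code would construct one from the runner's config
		p.scalers[runnerToken] = ts
	}
	return ts.Acquire(runnerToken)
}

func main() {
	p := &shimProvider{scalers: map[string]Taskscaler{}}
	if err := p.Acquire("runner-1"); err != nil {
		panic(err)
	}
}
```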
So what does this thing do? Well, obviously, it's a scaler of some sort. Here it has this loop, where it continuously checks the number of machines that we want and scales to that number of machines. The primary job of taskscaler is to run in a loop on the side and keep the autoscaling going: pay attention to how many instances are in use and what idle factors we want.
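That reconcile loop might look like the sketch below. The arithmetic for the idle headroom is my own illustration of "idle count plus idle scale factor", not taskscaler's actual formula.

```go
// Illustrative scale-to-desired reconcile loop.
package main

import (
	"fmt"
	"time"
)

// pool tracks what the scaler cares about: how many instances are in use,
// how many exist, and how much idle headroom is wanted.
type pool struct {
	inUse, total    int
	idleCount       int     // idle machines to keep warm
	idleScaleFactor float64 // idle machines as a fraction of in-use capacity
}

// desired is an illustrative formula: in-use plus whichever idle target is larger.
func (p *pool) desired() int {
	idle := p.idleCount
	if scaled := int(float64(p.inUse) * p.idleScaleFactor); scaled > idle {
		idle = scaled
	}
	return p.inUse + idle
}

// reconcile scales toward the desired count, in either direction.
func (p *pool) reconcile() {
	want := p.desired()
	switch {
	case p.total < want:
		fmt.Printf("scale up: %d -> %d\n", p.total, want)
	case p.total > want:
		fmt.Printf("scale down: %d -> %d\n", p.total, want)
	default:
		fmt.Println("steady at", want)
	}
	p.total = want
}

func main() {
	p := &pool{inUse: 3, total: 2, idleCount: 2, idleScaleFactor: 0.5}
	for i := 0; i < 2; i++ { // stand-in for the timer-driven loop on the side
		p.reconcile()
		time.Sleep(10 * time.Millisecond)
	}
}
```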
In addition to the autoscaling, of course, it needs to provide access to the underlying pool, so that the runner can actually call Acquire, get an instance, and get the information about how to connect to it, just like it did before, actually. So what is an instance? An instance is just a concrete VM with a little bit of lifecycle metadata around it: is it acquired, how many times has it been used, and, if it can fit more than one job, how many jobs can it fit right now, stuff like that.
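Here's a hedged sketch of an acquisition against such a pool, with that lifecycle metadata spelled out as struct fields; the field and method names are illustrative, not taskscaler's.

```go
// Illustrative instance pool with lifecycle metadata and Acquire.
package main

import (
	"errors"
	"fmt"
)

type ConnectInfo struct {
	ExternalAddr string
	InternalAddr string
}

// Instance is a concrete VM plus lifecycle bookkeeping.
type Instance struct {
	ID       string
	Info     ConnectInfo
	Acquired bool
	UseCount int // how many times it has been used
	Slots    int // jobs it can still fit, if it can fit more than one
}

type Pool struct{ instances []*Instance }

// Acquire hands out an instance with free capacity and records the use.
func (p *Pool) Acquire() (*Instance, error) {
	for _, inst := range p.instances {
		if inst.Slots > 0 {
			inst.Acquired = true
			inst.UseCount++
			inst.Slots--
			return inst, nil
		}
	}
	return nil, errors.New("no capacity")
}

func main() {
	p := &Pool{instances: []*Instance{{
		ID:    "vm-1",
		Info:  ConnectInfo{ExternalAddr: "203.0.113.10", InternalAddr: "10.0.0.5"},
		Slots: 1,
	}}}
	inst, err := p.Acquire()
	if err != nil {
		panic(err)
	}
	fmt.Printf("acquired %s at %s (use #%d)\n", inst.ID, inst.Info.ExternalAddr, inst.UseCount)
}
```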
So what you have inside taskscaler is the autoscaling logic on one side, and then some lifecycle as well, but only our specific part of the lifecycle; a lot of it is delegated out to the actual fleeting implementation, and we'll get to that in just a minute. VMs actually get created by calling fleeting. Here you can see that there's a fleeting provisioner, and we're able to request n instances and it will create them for us, which is great. This takes a couple of concerns and puts them outside of the runner, and downstream it puts the actual machine-creation concerns in a plugin. Let me show you how that works. A fleeting provisioner basically just wraps an instance group.
Inside of fleeting, instance groups are another layer of indirection, which says: here's a pool of VMs, and you can increase it to get more, decrease it, and get connection info for a particular VM (of course we need that). There's also Update; I'm not sure what that does, but it does something. Now, what's hidden behind this interface is going to be all the cloud providers.
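The interface as described is roughly the following shape. This mirrors the description above, not fleeting's exact definition, and the toy in-memory implementation is purely illustrative.

```go
// Illustrative instance-group abstraction: a pool of VMs you can grow,
// shrink, and query for connection details.
package main

import (
	"context"
	"fmt"
)

type ConnectInfo struct{ Addr string }

type InstanceGroup interface {
	Increase(ctx context.Context, n int) (int, error) // ask for n more VMs
	Decrease(ctx context.Context, ids []string) error // remove specific VMs
	ConnectInfo(ctx context.Context, id string) (ConnectInfo, error)
	Update(ctx context.Context) error // not covered in the walkthrough; presumably refreshes state
}

// memoryGroup is a toy in-process implementation.
type memoryGroup struct {
	next int
	vms  map[string]ConnectInfo
}

func (g *memoryGroup) Increase(_ context.Context, n int) (int, error) {
	for i := 0; i < n; i++ {
		id := fmt.Sprintf("vm-%d", g.next)
		g.vms[id] = ConnectInfo{Addr: fmt.Sprintf("10.0.0.%d", g.next)}
		g.next++
	}
	return n, nil
}

func (g *memoryGroup) Decrease(_ context.Context, ids []string) error {
	for _, id := range ids {
		delete(g.vms, id)
	}
	return nil
}

func (g *memoryGroup) ConnectInfo(_ context.Context, id string) (ConnectInfo, error) {
	return g.vms[id], nil
}

func (g *memoryGroup) Update(context.Context) error { return nil }

func main() {
	var group InstanceGroup = &memoryGroup{vms: map[string]ConnectInfo{}}
	_, _ = group.Increase(context.Background(), 2)
	info, _ := group.ConnectInfo(context.Background(), "vm-0")
	fmt.Println("connect to", info.Addr)
}
```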
Every VM provider ends up behind that interface. So it's a little bit different from Docker Machine, because with Docker Machine you shell out to Docker Machine and you tell it: hey, I want this machine in this provider, and you give it all the relevant details, and it makes it for you and gives it back to you. So the runner itself, the Docker Machine executor, has to know everything about the VM. You couple the details of a specific machine with that already very tightly coupled Docker Machine implementation in the Runner. Here, you just have an interface, so that's super great.
What is it? It's a plugin. That's kind of weird: why do we actually need to have a plugin? Well, there are n different VM providers, right? It'll be all the major clouds and then a bazillion other weird places you can run things. You don't want to import all of those SDKs into your Go binary, because you're going to have to resolve all the dependencies together, and that might not even be possible. And Go's existing plugin mechanism isn't really much better, because it requires you to have the exact same set of dependencies. So this is where HashiCorp's go-plugin comes in. It's a way to isolate all the dependencies in the build of the plugin from the actual thing that uses it, but it provides a really nice mechanism and protocol for bridging the two, especially from Go to Go, where you pretty much just use it like it was local. You won't be able to trace it, you won't be able to step through it, because it's in a separate process, but the usage is actually very natural.
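On the host side, go-plugin usage looks roughly like this. The handshake values and the plugin name are made up for this sketch (fleeting defines its own), and the instanceGroupPlugin bridge is left as a stub where the generated gRPC client and server would be wired in.

```go
// Hypothetical host-side wiring for hashicorp/go-plugin.
package main

import (
	"context"
	"log"
	"os/exec"

	"github.com/hashicorp/go-plugin"
	"google.golang.org/grpc"
)

// The handshake must match between the host and the plugin binary.
var handshake = plugin.HandshakeConfig{
	ProtocolVersion:  1,
	MagicCookieKey:   "EXAMPLE_PLUGIN", // illustrative values
	MagicCookieValue: "example",
}

// instanceGroupPlugin bridges a Go interface over gRPC. In a real plugin the
// two methods register the generated gRPC server and wrap the generated
// gRPC client; both are stubbed here.
type instanceGroupPlugin struct{ plugin.NetRPCUnsupportedPlugin }

func (*instanceGroupPlugin) GRPCServer(*plugin.GRPCBroker, *grpc.Server) error { return nil }
func (*instanceGroupPlugin) GRPCClient(context.Context, *plugin.GRPCBroker, *grpc.ClientConn) (interface{}, error) {
	return nil, nil
}

func main() {
	client := plugin.NewClient(&plugin.ClientConfig{
		HandshakeConfig:  handshake,
		Plugins:          map[string]plugin.Plugin{"instancegroup": &instanceGroupPlugin{}},
		Cmd:              exec.Command("./fleeting-plugin-example"), // the plugin binary
		AllowedProtocols: []plugin.Protocol{plugin.ProtocolGRPC},
	})
	defer client.Kill()

	rpcClient, err := client.Client()
	if err != nil {
		log.Fatal(err)
	}
	// Dispense returns the value produced by GRPCClient; from here the host
	// calls it like a local object while every call crosses the process boundary.
	raw, err := rpcClient.Dispense("instancegroup")
	if err != nil {
		log.Fatal(err)
	}
	_ = raw
}
```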
Sorry, this is not where it goes in the plugin; this is on the fleeting side, where it makes the call. You can see there's some gRPC server stuff. You don't need to worry about the gRPC stuff unless you're creating a plugin, and even then it's fairly standard boilerplate code.
A lot of it is already taken care of by the library. So let's take a look at the actual plugin. We're going to jump over here to a new project, the Google Compute fleeting plugin (fleeting-plugin-googlecompute), and it starts up as a binary, like this.
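A plugin binary's entry point, sketched with the same illustrative handshake as before: plugin.Serve performs the handshake over stdio and then serves the implementation over gRPC until the host kills the process.

```go
// Hypothetical plugin-side entry point for hashicorp/go-plugin.
package main

import (
	"context"

	"github.com/hashicorp/go-plugin"
	"google.golang.org/grpc"
)

var handshake = plugin.HandshakeConfig{
	ProtocolVersion:  1,
	MagicCookieKey:   "EXAMPLE_PLUGIN", // must match the host
	MagicCookieValue: "example",
}

// instanceGroupPlugin is the same bridge type as on the host side; here its
// GRPCServer method would register the concrete cloud implementation (for
// example, calls to a compute API's instance group managers).
type instanceGroupPlugin struct{ plugin.NetRPCUnsupportedPlugin }

func (*instanceGroupPlugin) GRPCServer(*plugin.GRPCBroker, *grpc.Server) error { return nil }
func (*instanceGroupPlugin) GRPCClient(context.Context, *plugin.GRPCBroker, *grpc.ClientConn) (interface{}, error) {
	return nil, nil
}

func main() {
	plugin.Serve(&plugin.ServeConfig{
		HandshakeConfig: handshake,
		Plugins:         map[string]plugin.Plugin{"instancegroup": &instanceGroupPlugin{}},
		GRPCServer:      plugin.DefaultGRPCServer,
	})
}
```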
A
You
know
it
just
kind
of
says
start
this
plugin
and
the
place
where
it
actually
shows
up
is
here,
so
you
remember
that
call
to
increase
on
the
leading
side
while
it
gets
sort
of
transported
it
got
serialized
through
grpc
sent
through
standard
in
to
this
process,
deserialized
and
then
materialized
into
this
concrete
instance
group
where
it's
actually
able
to
do
things
like
call
a
specific
cloud
provider,
like
instance
group
managers
and
requests
that
you
create
some
instances.
So
yeah
the
serialization
is
a
little
bit
magical.
It's nice that it's gRPC, because we can revise the parameters. There's a proto file: you just add something new to the proto file, generate the Go files, and then you can use it on both sides. So you can use the usual service mechanisms for iterating on the API itself; you don't have to worry about the wire protocol, you just modify the proto. So that's where VMs come from, which is an important question. All right, let's actually see how this thing works.
Okay, I'm going to take you over here to my workspace, where I have all of these various projects checked out, using Go workspaces to wire them together. I'm going to run the runner in Delve, just so we can debug it. So I'll go ahead and tell it to start up, and the first thing it's going to do is fire up the fleeting provider, which is going to fire up the plugin.
A
The
plugin
is
actually
directly
hard
coded,
which
is
actually
good
for
the
these
purposes,
because
it
means
I
can
actually
step
through
it
and
it's
going
to
actually
get
you
know
kind
of
it
actually
cleans
out
all
the
old
instances
and
creates
new
instances.
So
you
can
see
it's.
It
wants
to
have
five
and
it's
got
its
five
here.
So
it's
it's
ready
to
run
jobs.
Now, the executor: when we run a job, it's going to ask the provider to please give it an executor, and that's where we're going to pick up the thread.
Let me come over here. This is connected to my little dev project on gitlab.com, and I've got a kind of Hello World pipeline, so I'm just going to tell it to run the pipeline, and it's configured to only use this runner. So here we are, great: we got an executor. Let's take a quick look at the stack; you can see we're inside the common build logic.
It says: hey, create me an executor, and this particular job is being routed to our internal autoscaler. So this is the shim that gets us into this new code path. Great, super duper. All right, let's go to the next step. Okay, great, we're going to return this executor. Now we're going to run Prepare on this executor, so we're stepping into the prepare step.
We've got an acquisition. What did we get? Have a look. Okay, nothing yet; we got a key, a context, and so on. Okay, here we go, this is an interesting bit. We're going to ask the provider to get a runner taskscaler, and after we get the taskscaler we're going to call Acquire on it, to say: hey, give me a machine. So let's run Acquire, and hey, look at that: we have some IP addresses.
A
We
have
some
connect
info
so
now
we
actually
have
a
place
to
run
our
to
run
our
job,
so
I'm,
just
going
to
kind
of
say,
continue
and
it's
going
to
run
the
first
step
in
my
pipeline
blah
blah
blah.
This
is
super
great
okay
and
the
next
job
is
going
to
run
and
then
we're
going
to
be
right
back
here
at
the
executor,
so
that
kind
of
consumes
a
machine
actually.
So
you
can
sort
of
see
here
that
machine
that
ran
the
job
has
been
now
being
stopped
and
thrown
away.
So we do actually need to replace it, right? Every time, we need to make sure we autoscale. Let's take a look at the autoscaling logic. I'm going to go into the fleeting provisioner here, go ahead and let this job and that job run, and we've picked up the new thread, where the autoscaling loop actually asked for more VMs.
If we look at the stack here, you can see we've got the fleeting provisioner, which is kind of running at the top level, and we've asked it to provision, which is what fleeting provisioners do. Okay, let's step into Increase and see what actually happens here.
So we want some instances, okay, great. And actually, I don't know if you noticed this or not, but we jumped straight into the Google plugin from the provisioner. At this point, if we were actually using the separate binary (which we will as soon as this lands; that's the plan, and of course the value of this),
A
We
would
have
actually
kind
of
gotten
stopped
at
the
grpc
server
and
then,
but
but
as
as
it
is,
we
actually
can
can
see
what
you
know
we're
actually
inside
of
the
fleeting
plug-in,
which
is
you
know
where
we
would
end
up,
and
it
just
kind
of
just
creates
in
the
usual
way.
So
that's
kind
of
that's
kind
of
it
VMS
get
replaced.
Jobs
are
on
pipeline,
succeeds
so
clear,
all
the
rest
of
the
break
break
points
and
let
it
just
kind
of
go
to
town.
So that's a high-level summary of how these pieces fit together. In the runner, we shim in a new provider, which delegates to the taskscaler.
Taskscaler can take in any other executor provider, so you can actually take any of the providers that we have (docker, shell, etc.) and ultimately give those to taskscaler, because taskscaler's job is just the autoscaling. Then, when you actually configure the runner, you tell it what plugin it should be using, and it will start the appropriate plugin.
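So the final composition, sketched below with illustrative names only: any executor provider for running the job, taskscaler for the scaling decisions, and a named fleeting plugin for the machines themselves.

```go
// Illustrative composition of the three pieces described in the summary.
package main

import "fmt"

// ExecutorProvider is any of the Runner's existing providers (docker, shell, ...).
type ExecutorProvider interface{ Name() string }

type shell struct{}

func (shell) Name() string { return "shell" }

// Taskscaler stands in for the scaling library, constructed here for a
// named fleeting plugin.
type Taskscaler struct{ pluginName string }

func NewTaskscaler(pluginName string) *Taskscaler {
	return &Taskscaler{pluginName: pluginName}
}

// autoscalerProvider composes the two: taskscaler owns scaling and instance
// lifecycle; the inner provider knows how to run the job on an instance.
type autoscalerProvider struct {
	inner  ExecutorProvider
	scaler *Taskscaler
}

func main() {
	p := autoscalerProvider{
		inner:  shell{},
		scaler: NewTaskscaler("fleeting-plugin-googlecompute"),
	}
	fmt.Printf("running %s jobs on instances from %s\n", p.inner.Name(), p.scaler.pluginName)
}
```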
So you can run anything anywhere. And fleeting is sort of the additional logic, like how many times you've used things: the layer that we put on top of the actual concrete instance group. So that's pretty much it. You may see some changes to the code, but I think the concepts should be fairly stable.