Description
A summary of what was discussed and gathered in https://gitlab.com/gitlab-org/gitlab-runner/-/issues/27061
This is mostly about what's needed to start working on a replacement for Docker Machine. As you might know, Docker Machine is no longer maintained by Docker itself or by any other community member, and we have our own fork to fix some security and cost-affecting bugs, but we still need to figure out a way forward. So before going into solutions, we wanted to gather some requirements and get a better understanding of what our current situation is. First of all, we went through all the runner deployments that we see our customers, and us as well, using. The first one is customers self-hosting the GitLab runners: they are mostly in a data center environment with just raw VMs, and they don't have anything like a Kubernetes cluster and so on and so forth.
Most of these environments are air-gapped, so either they don't have network access or they have compliance regulations and things like that.
They're sometimes managed by OpenStack, which is quite good in the sense that they can easily scale VMs up and down for specific workloads. They're running trusted code, in the sense that they're only running private code from their own repositories; they're not running untrusted code from some random user somewhere on the internet. And sometimes they implement all the scaling through OpenStack
or, as we mentioned earlier, Nomad or any other hypervisor feature that they have. What we usually suggest to our users is to use the Docker executor, either on one large VM or on multiple small VMs that horizontally autoscale with the hypervisor, or to use the Docker Machine executor with the Hyper-V or VMware driver, for example: you use VMware to create a machine to run the job, then reuse that machine or delete it after the job is done, provided Docker can be installed on the VMware VM.
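As a rough sketch of that Docker Machine plus VMware path, the config.toml fragment below shows the general shape; the driver options follow docker-machine's vSphere driver, but every value here is a placeholder assumption, not a tested setup.

```toml
# Hypothetical sketch: Docker Machine executor with a VMware vSphere driver.
# All values are placeholders.
[[runners]]
  name = "vmware-machine-runner"
  url = "https://gitlab.example.com/"
  token = "RUNNER_TOKEN"
  executor = "docker+machine"
  [runners.docker]
    image = "alpine:latest"
  [runners.machine]
    MachineDriver = "vmwarevsphere"
    MachineName = "ci-%s"   # %s becomes a unique machine name
    MachineOptions = [
      "vmwarevsphere-vcenter=vcenter.example.com",
      "vmwarevsphere-username=ci-user",
      "vmwarevsphere-password=CHANGE_ME",
    ]
    IdleCount = 2     # keep two warm VMs around for reuse
    IdleTime = 1800   # delete a VM after 30 minutes idle
```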
If Docker can't be installed for certain reasons, you have to use the shell executor. This is less than ideal, of course, because the shell executor has its own problems, but this is what we provide at the moment. Then we have self-hosted customers who can also be in a bare-metal data center kind of environment, but they either have Kubernetes experience, are already running Kubernetes, or are using a cloud provider with a Kubernetes offering.
What we usually suggest to them is to just use the Kubernetes executor, either via our Helm chart or via the operator that we provide. Most of the time this works quite well for them, apart from all the issues that come with Kubernetes, like the stability issues that we are having at the moment and other known bugs that we have. And then there's self-hosted using cloud providers, but with VMs.
What we usually suggest is to use the Docker Machine executor, which is the most popular option with our users. They just set up Docker Machine and point it to the right driver; so if you're using AWS, you just use the AWS driver and it will scale the machines up and down depending on the jobs.
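For illustration, a minimal config.toml for that AWS case might look like the fragment below; the region, instance type, token, and idle numbers are placeholder assumptions rather than recommendations.

```toml
# Hypothetical minimal AWS setup for the Docker Machine executor.
concurrent = 20   # total jobs the runner manager will run at once

[[runners]]
  name = "aws-machine-runner"
  url = "https://gitlab.com/"
  token = "RUNNER_TOKEN"
  executor = "docker+machine"
  [runners.docker]
    image = "alpine:latest"
  [runners.machine]
    MachineDriver = "amazonec2"
    MachineName = "ci-%s"
    MachineOptions = [
      "amazonec2-region=us-east-1",
      "amazonec2-instance-type=m5.large",
    ]
    IdleCount = 5   # machines kept warm, waiting for jobs
    IdleTime = 600  # seconds a machine may sit idle before deletion
```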
Some more advanced users are using the Kubernetes executor: they use the cloud provider's Kubernetes offering and just use that. And some other people are also starting to explore the auto scaling groups provided by the cloud provider.
So, for example, on AWS there are EC2 Auto Scaling groups, and they will spin up and spin down a number of runners depending on CPU or job utilization and things like that. Apart from that, that's all about the runners that the users manage themselves. For gitlab.com we provide shared runners, which GitLab itself handles, and they are usable by anyone who has an account on gitlab.com.
A
Ideally,
if
something
works
on
their
laptop,
it
should
work
on
our
ci
infrastructure.
Our
current
solution
is
using
gcp
using
the
docker
machine.
We
run
one
vm
per
job
for
security
reasons,
and
we
provide
just
one
vm
for
any
kind
of
job.
We don't provide any different VMs whatsoever, and what I'm talking about right now is just unprivileged containers. Privileged containers come into play when you want to use tools like docker-in-docker to build a Docker image through our CI, and our solution there is pretty much the same as before; we use the same solution for both scenarios.
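For reference, the privileged docker-in-docker case boils down to one toggle in the executor's Docker section; this fragment is a sketch of the commonly documented setup, not a hardened recommendation.

```toml
# Sketch: privileged mode for docker-in-docker builds.
[[runners]]
  executor = "docker+machine"
  [runners.docker]
    image = "docker:24.0"
    privileged = true            # required so the job can run its own dockerd
    volumes = ["/certs/client"]  # share TLS certs with the dind service
```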
Kubernetes is good for a greenfield project; it's pretty easy to get started with. Just use our Helm chart or operator. Most cloud providers support Kubernetes and support autoscaling out of the box, and then you can just use the Kubernetes executor to schedule pods and get all the benefits of Kubernetes, with bin packing, termination, and so on and so forth.
Where it lacks at the moment, and the pain point most users are seeing, is that we don't provide any guidance on requests or limits, in the sense that we don't inform users that they should set them and that it's best practice to set them. So most of the time you end up seeing one job eating up a whole node just to run that job, because the administrators of the runner or the users of GitLab CI do not set the correct limits.
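The guidance could be as small as a documented fragment like this one; the [runners.kubernetes] keys are real config.toml options, but the values are illustrative, not recommended defaults.

```toml
# Hypothetical requests/limits for build pods; values are examples only.
[[runners]]
  executor = "kubernetes"
  [runners.kubernetes]
    namespace = "gitlab-runner"
    cpu_request = "500m"       # what the scheduler reserves per build pod
    cpu_limit = "1"            # a single job can no longer eat a whole node
    memory_request = "1Gi"
    memory_limit = "2Gi"
    helper_cpu_limit = "250m"  # cap for the helper container
    service_cpu_limit = "500m" # cap for service containers
```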
There are some usability issues and bugs inside the executor itself that we're still working on; for example, the entrypoint of the build container not running, or pods sometimes being left behind just because of some networking issues. And sometimes we also end up picking up a job when the cluster is at capacity: we schedule a pod, and the job is running from a GitLab perspective, but we're still waiting for the pod to be scheduled.
So we end up with issues like a job being timed out when it didn't actually run, and sometimes debugging a problem on Kubernetes is really hard, just because we don't provide any guidance or any ways to debug things. That is something we should improve upon as well.
Apart from Kubernetes, there is the more popular and more simplistic scenario of using Docker Machine. This is what we use on gitlab.com as well. It autoscales depending on the job queue: if you have a hundred jobs, it's going to scale up by a certain amount; if you have 10 jobs, it's going to kill a certain amount.
You can also schedule autoscaling, in the sense that you can have a certain amount of machines from, for example, 9 AM to 5 PM, because that's when most of the CI fleet is going to be used. The other reason we use it for gitlab.com is that it provides stronger isolation for us, in the sense that we create one VM per job, so each job is contained at the virtualization layer: if a job breaks out of our containers, it's still locked inside the VM.
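Both of those behaviors map onto existing config.toml knobs; in this sketch the schedule and counts are invented examples, while MaxBuilds = 1 is the setting that gives the one-VM-per-job isolation.

```toml
# Sketch: one VM per job, plus a larger warm pool during working hours.
[[runners]]
  executor = "docker+machine"
  [runners.machine]
    MaxBuilds = 1    # a VM is deleted after running a single job
    IdleCount = 10   # default warm pool outside the periods below
    IdleTime = 1800
    [[runners.machine.autoscaling]]
      Periods = ["* * 9-17 * * mon-fri *"]  # 9 AM to 5 PM, weekdays
      IdleCount = 50                        # bigger pool during work hours
      IdleTime = 3600
      Timezone = "UTC"
```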
The whole point of this presentation is Docker Machine. There are a lot of bugs and feature requests when it comes to talking to the cloud provider, like setting a specific size for the disk, labels on the machine, and so on and so forth. Each cloud provider has its own methods of doing things, and we have to interact with each cloud provider's API, which is a lot of maintenance cost.
As we already mentioned, Docker Machine is no longer maintained, and high availability is something that the user has to handle themselves. With Kubernetes you get HA automatically, because it restarts the pod and so on and so forth, but with Docker Machine you don't, unless you put the Docker Machine runner in an auto scaling group or something like that. There are also some strange concurrency behaviors from time to time.
A
There
are
no
bugs
like,
for
example,
when
you
start
up
talking
machine
for
the
first
time,
there's
some
certificate
issues
or,
if,
like
you,
get
a
bunch
of
jobs
scheduled
at
the
same
time
on
the
same
runner,
the
idle
machines
will
start
removing
or
deleting
while
adding
more
machines
which
trash
structures
some
instances,
and
then
the
killer
becomes
more
of
a
security
risk
than
it
already
is
just
because
it's
not
just
contacting
gitlab
for
jobs.
It's
also
communicating
with
your
cloud
provider,
so
it
can
delete
and
create
machines.
So if those tokens get exposed, it's much easier for an attacker to delete instances and things like that.
There are also some customers using the cloud provider's VM autoscaling, in the sense that they just have an autoscaling template: a VM with GitLab Runner installed, using the Docker executor.
And then they scale up and down depending on CPU and memory usage, for example. This has most of the autoscaling capabilities that Docker Machine has, because you can also use specific metrics to decide how much to scale. The problems with this are, one, it's not very popular, and two, it depends on the cloud provider, because each cloud provider has its own autoscaling features and it can be confusing, especially in multi-cloud deployments; for example, if you deploy both to GCP and AWS, both of them have different autoscaling terminologies and behavior.
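With that approach the runner configuration itself stays completely static, because all the scaling happens outside the runner; a hypothetical config.toml baked into the autoscaling template's VM image might be as small as this, with placeholder name, URL, and token.

```toml
# Hypothetical static config baked into an autoscaling-group VM image.
# The cloud provider adds or removes whole VMs; the runner never scales itself.
concurrent = 4   # jobs per VM; the group size multiplies total capacity

[[runners]]
  name = "asg-node"
  url = "https://gitlab.com/"
  token = "RUNNER_TOKEN"
  executor = "docker"
  [runners.docker]
    image = "alpine:latest"
```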
So between the two you might not get consistent behavior, and GitLab does not provide any guidance on how to do this, why you should do this, or when you should choose this. Another problem is that some cloud providers do not have all the autoscaling capabilities, so if I'm using such a cloud provider, I can't really implement autoscaling there.
This is a comparison of autoscaling across the cloud providers (most of the cloud providers that we see, and also some self-hosted software like OpenStack and Nomad), and a comparison with Docker Machine. As you can see, no cloud provider has the same features as another; everyone has their own implementation. And if there's something that you don't understand from the features section, there's a more detailed description about it in the issue.
For Windows we have our own implementation called autoscaler. It works really well for the gitlab.com scenario, where we just create one VM per job. It follows a lot of the methodology behind Docker Machine: just create a VM, run the job in that VM, and delete it. And since it follows that methodology, it still has most of the problems of Docker Machine, in the sense that we still have to talk to the cloud API, and if we want to support more cloud providers, we have to set up a new integration with each provider.
We're redoing some of the work that Docker Machine does, but for Windows, and it's not really clear how to set it up for self-hosted customers, just because it uses the custom executor and the drivers methodology, which is a bit more convoluted to set up than plain GitLab Runner.
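To give a feel for that, below is a hypothetical sketch of how a driver gets wired through the custom executor's lifecycle hooks; the config_exec, prepare_exec, run_exec, and cleanup_exec keys are the real custom executor options, but the driver.exe path and its subcommands are invented for illustration.

```toml
# Hypothetical custom executor wiring for a VM-per-job driver on Windows.
[[runners]]
  name = "windows-autoscaler"
  url = "https://gitlab.com/"
  token = "RUNNER_TOKEN"
  executor = "custom"
  [runners.custom]
    config_exec  = "C:\\autoscaler\\driver.exe"   # invented path
    config_args  = ["config"]
    prepare_exec = "C:\\autoscaler\\driver.exe"
    prepare_args = ["prepare"]    # boots a fresh VM for the job
    run_exec     = "C:\\autoscaler\\driver.exe"
    run_args     = ["run"]        # runs the job script inside that VM
    cleanup_exec = "C:\\autoscaler\\driver.exe"
    cleanup_args = ["cleanup"]    # deletes the VM afterwards
```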
Then there's also autoscaling by the cloud provider for Windows; this has all the same benefits as autoscaling by the cloud provider on Linux.
We just don't provide any guidance on it. And yeah, macOS: we don't really have any solution for it, nor any experience with it at the moment. We are currently working on providing something, and that will probably use the same autoscaling methodology as Windows, just because most macOS cloud providers don't have an autoscaling solution out of the box; that is something we have to provide ourselves.
So those are all the solutions that we have now. There are still some problems that we're interested in solving, and some of them are closely related to security, with Docker Machine and things like that. There is little to no observability into what's actually happening on the machine: it's an ephemeral VM that sometimes isn't monitored properly, and if it's using privileged containers, users can easily escape from the container and turn off the monitoring if they want to do some harm.
There's no alerting if there's any unexpected behavior on the machine; for example, if someone escapes from the container, there's a really high chance they're not doing anything legitimate. And there's no termination on well-known bad behavior, for example bitcoin mining and things like that.
We always have to spend the first few minutes understanding what the actual setup looks like, and also, when a user sets it up for the first time, we don't provide any checks or tests that things are working as expected. It's up to them to actually do all this, and we provide no guidance on which setup to use.
So we went through all the setups, like Kubernetes, Docker Machine, Docker, shell, or using the autoscaler from the cloud provider, but we don't provide any solution or guidance to the user; we don't tell them "hey, for your scenario you should use X", for example. And most of the time they also have to think about HA themselves, in the sense that, yes, Docker Machine provides autoscaling, but it does not provide high availability.
Ideally, both self-hosted and gitlab.com shared runners should use the same kind of solution, just for dogfooding purposes. There's too much of a disjoint between gitlab.com and a self-hosted customer, just because we use our own homegrown configuration management with Chef, for example, and they use something else, so we can't really provide any guidance.
There's cost management around running the runner managers themselves, in the sense that they should be cost effective and should not be a massive cost burden for the customer. Even running the jobs themselves should not be a burden for anyone, in the sense that if there's a way to use preemptible VMs, or spot instances on AWS, they should be able to do so. And high availability should come out of the box with what we provide, not something that the user has to opt into or think about extra after they deploy.
One of the biggest pain points with the first runner as well is that the disk ends up being filled when we run a lot of jobs on the same machine, and we need to provide some kind of hooks or cleanup mechanisms in our setup so that users never see this issue. And now, from a security perspective, we need to be able to stop any bitcoin miners. It's not just us; it can even happen on their own runners.
For example, a rogue employee just runs a bunch of jobs to mine bitcoin. We need to provide some kind of limits on CPU and memory; right now we're already doing that with VMs, but we can do a bit more. For example, we need to limit bandwidth and network access, so that, for example, certain jobs can't access the network for specific domains, or can only use a certain amount of the bandwidth.
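CPU and memory caps are already expressible per job container today; bandwidth and domain-level network limits are not, which is the gap. A sketch of the existing knobs, with example values:

```toml
# Sketch: per-job resource caps with the docker executor; values are examples.
[[runners]]
  executor = "docker"
  [runners.docker]
    image = "alpine:latest"
    cpus = "2"          # limit each job container to two CPUs
    memory = "4g"       # hard memory cap per job container
    memory_swap = "4g"  # same value as memory disables extra swap
```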
And for gitlab.com, if we are using a container-based solution, we should still use a locked-down host OS, in the sense that even if they break out of the container, once they are on the host OS they can't really do much, because it's locked down. And all this should be done in a way that everything is observable, meaning the security team can check what kind of syscalls and binaries are being executed, so they can do better analysis, and even the network behavior, like what kind of domains
and ports users are using, and things like that. And we should have fail-closed scenarios: if something goes wrong in a job or something like that, we don't just keep running the job; we terminate it and then leave it up to the security team to investigate further. And we should provide multiple layers of isolation, multiple layers of security, so that if one layer is broken, the attacker still has another layer to go through.
So we went through a lot of things. There are a few things that we didn't go into: scheduling changes, by which I mean job scheduling, so, for example, having one runner at a higher priority than another runner. Say runner A is running on AWS,
one runner is running 200 jobs and the other runner is running 500; we did not go into how we can balance those out. We also did not investigate cost optimization, and the reason for that is that cost optimization comes when you have a solution at hand, and right now we do not have a solution for a replacement for Docker Machine, for example.
So this is the last slide. This is the main issue: there's a lot more detail about all of this in the main issue, and also some links to Google Docs about interviews with some customers (this is an internal link, for GitLab employees only), and also some interviews with GitLab employees that deal with customers every day.
So that's it. Feel free to open up the issue and ping me directly on it if you have any more suggestions, ideas, or feedback. And yeah, thank you.