From YouTube: GitLab Runner Autoscaling - AWS
Description
Main Epic: https://gitlab.com/groups/gitlab-org/-/epics/2502
AWS Epic: https://gitlab.com/groups/gitlab-org/-/epics/5223
A: So today we want to discuss what our options are for replacing Docker Machine.
A: To give a bit of context: we started looking at one specific scenario, which is AWS autoscaling, and the reason is that it's pretty much our highest-risk scenario at the moment. Most people are using the upstream Docker Machine version, which is not maintained anymore, and most people don't use our fork either, unless we tell them to do so. We're also not fully AWS experts with Docker Machine.
A: To give a bit more context to Grakash: we want to replace Docker Machine because it's in maintenance mode and we don't want to keep maintaining that code base ourselves. We have a fork to fix some problems, and we don't want to keep maintaining it, because it's very, very costly on our end.
A
Okay,
perfect,
so
I
I
want
to,
if
you
have
any
questions,
don't
ask.
I
think
what
I'm
proposing
at
the
moment
would
be
to
provide
a
terraform
template
to
use
autoscaling
groups
for
aws
and
basically
what
this,
how
this
is
going
to
work.
Is
you
have
a
vm
with
gitlab
runner
installed
and
it
can
pick
up
jobs,
and
then
you
have
auto
scanning
groups
from
aws
configured,
so
it
will
scale
up
scale
out
and
scale
and
depending
on
some
metrics
right,
it
can
be
cpu
usage
memory
usage.
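(Illustration: a minimal Terraform sketch of the setup described above, assuming a prebuilt runner AMI; the variable names, registration flags, instance type, and target value are placeholders, not the actual template under discussion.)

```hcl
# Launch template for VMs that come up with GitLab Runner installed.
resource "aws_launch_template" "runner" {
  name_prefix   = "gitlab-runner-"
  image_id      = var.runner_ami_id # hypothetical AMI with gitlab-runner preinstalled
  instance_type = "c5.xlarge"

  # Hypothetical first-boot registration; URL and token are placeholders.
  user_data = base64encode(<<-EOT
    #!/bin/bash
    gitlab-runner register --non-interactive \
      --url "https://gitlab.example.com" \
      --registration-token "${var.registration_token}" \
      --executor docker \
      --docker-image "alpine:latest"
  EOT
  )
}

# The autoscaling group that owns the fleet of runner VMs.
resource "aws_autoscaling_group" "runners" {
  name                = "gitlab-runners"
  min_size            = 1
  max_size            = 20
  vpc_zone_identifier = var.subnet_ids

  launch_template {
    id      = aws_launch_template.runner.id
    version = "$Latest"
  }
}

# Scale out and in on average CPU, one of the metrics mentioned above.
resource "aws_autoscaling_policy" "cpu_target" {
  name                   = "runner-cpu-target"
  autoscaling_group_name = aws_autoscaling_group.runners.name
  policy_type            = "TargetTrackingScaling"

  target_tracking_configuration {
    predefined_metric_specification {
      predefined_metric_type = "ASGAverageCPUUtilization"
    }
    target_value = 60
  }
}
```

Memory-based scaling, also mentioned above, would need a custom CloudWatch metric published from the instances, since ASG target tracking only predefines CPU and network metrics.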
A: So we are basically migrating the autoscaling logic to the cloud provider, and the plan is to implement this for all the major cloud providers, which is AWS, GCP, and Azure. The benefit with this: we also get autoscaling for Windows out of the box, without any more integration work.
C: [inaudible]
A: No, no, gitlab.com. We first started... I first started investigating gitlab.com, and the PoC that I was going to do was with Kubernetes and Firecracker VMs with Kata Containers.
A
But
we
discussed
that
aws
autoscaling
for
vms
is
like
a
more
important
problem
for
us
to
solve,
because
right
now
for
gitlab.com,
we
own
the
docker
machine
fork
and
we
have
full
control
over
the
stack.
But
our
customers
don't
at
the
moment,
because
they
can't
modify
a
docker
machine
of
the
aws
api
is
broken
or
something
like
that,
but
with
gcp
we
can
because
we
have
the
knowledge
and
we
have
the
the
bandwidth
to
do
so,
but
for
aws
we
don't
now.
We
also
have
to
keep
in
mind.
A
This
is
strictly
solving
one
problem,
which
is
vm,
auto
scaling
like
the
cloud
provider
that
users
can
use
the
kubernetes
executor
for
autoscaling
as
well,
and
we
have
plans
to
improve
the
kubernetes
executor.
So
it's
more
stable
and
more.
It
has
feature
parity
with
all
the
other
executors,
but
some
people
are
like
they
don't
want
to
handle
a
kubernetes
cluster,
even
if
it's
managed
by
amazon
or
gcp
with
gke
right.
A: So if you look at my epic... not my epic, at the epic. Let me share my screen quickly, so we're clear on what we're talking about.
A: I identified deployment methods. Some of them are self-hosted: on data center servers, running manually, on Kubernetes, and self-hosted on cloud provider VMs. And then there's gitlab.com, which is privileged containers. What we're trying to solve with this specific proposal is self-hosted cloud provider VMs.
B: Okay, the last question that I have is: does it mean that a user who basically doesn't want to use Kubernetes will instead need to understand how to use Terraform?
A: It just, like, authenticates with the Google API and so on and so forth for you. So we can try and hide it as much as possible. But then, with Terraform, we can instruct users using infrastructure as code, so we can integrate with GitLab's Terraform integration and so on, so we can have GitLab manage GitLab Runners. So yes, we will be exposing users to Terraform, but at the moment what we are exposing users to is documentation and copy-pasting and so on and so forth.
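(Illustration: one way the GitLab Terraform integration mentioned here can come in is keeping the template's state in GitLab via its HTTP state backend; the project ID and state name in the comment are placeholders.)

```hcl
terraform {
  backend "http" {
    # Usually configured at init time, e.g.:
    #   terraform init \
    #     -backend-config="address=https://gitlab.example.com/api/v4/projects/<project-id>/terraform/state/runners"
    # plus lock_address, unlock_address, and credentials.
  }
}
```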
D: This solution is built outside of Runner. It's built around the Runner, which makes it a totally different experience for the user than what we have with the Docker Machine executor, where you just install the Runner. The only requirement for autoscaling right now is that you need to install the Docker Machine binary, and your cloud provider needs to be supported by Docker Machine, either natively or by a plugin.
D: You just install the binary, optionally a plugin binary, then put in a configuration copied one-to-one from the documentation, and it works. You have one runner manager that you need to care about, and nothing else. With this, we change that experience, because the user doesn't touch the Runner at all. So they need to know how to use Terraform, or maybe not (we may hide it, as you said), but they need to use Terraform.
D: That Terraform will then inject the configuration, so the configuration flow is way, way different. Right now you can just install the Runner on the host, create config.toml by hand, and it just works. Here you need a tool that will magically scale things up and down and put them into place. So this is my first concern.
D
Like
in
this
case,
I'm
only
concerned
about
the
selling
point-
okay,
it's
it's
it's
different.
When
you
have
a
user
that
already
uses
gitlab
runner
knows
how
gitlab
runner
works,
how
github
ci
works.
Now
he
wants
to
auto
scale,
and
you
say:
okay,
install
docker
machine
or
anything
else
that
we
will
switch
to
and
use
this
configuration
or
just
use
this
configuration
and
run
and
restart
auto
scaling.
D
So
you
don't
need
to
change
your
infrastructure
when
you
want
to
move
from
a
static
configuration
to
auto
scale
configuration
and-
and
this
was
what
we
did
with
docker
machine
like
you-
could
use
exactly
the
same
host.
You
just
changed
the
executor.
You
installed
one
binary
and
you
started
to
have
auto
scaling
with
exactly
the
same
runner
that
was
running
for
past
one
year
and
and
my
own
concern
is
that
it
is
a
way
different
experience
way.
B
Users
selling
interesting,
but
what
do
you
think
about
the
maintenance
point
because
maintaining
a
terraform
plus
a
custom
solution
for
aws
it
might
actually
get
extended
into
maintaining
the
same
or
similar
solution
for
gcp
many
other
like
providers
and
then
suddenly
you
need
to
maintain
terraform
compatibility
for
all
of
them.
Then
you
suddenly
need
to
maintain
the
complexity
related
with
the
differences
between
them
and
suddenly
you
can.
You
know,
have
this
proliferating,
complexity
and
maintenance,
valuable.
D
Yet
this
is
this
is
true,
but
I
would
not
be
as
afraid
when
it
goes
for
this,
because
with
docker
machine
or
with
any
other
existing
or
treated
by
us
solution
that
replace
docker
machine,
you
still
need
to
support
multiple
clouds.
So
if
we
need
to
support
it
by
ourselves,
I
don't
think
it's
a
big
difference.
If
these
are
terraform
templates
or
cloud
api
implementation
in
in
go
as
we
started
doing
with
autoscaler.
B: Okay, so my question is: it's clear to me why Docker Machine was a great solution, but it feels like that solution is starting to decay, it's basically disappearing. How is the industry solving this? It seems like a very valuable solution that is no longer viable. I can only guess that the industry might be solving it with Kubernetes these days, but is there a different solution that people who want to use the cloud, but not necessarily Kubernetes, are using? Or is it disappearing with nothing in return?
E: So Docker actually made a new thing called InfraKit that was the replacement for this. Then they realized the problem actually wasn't the code and the architecture: having all these APIs supported in one place is just unmanageable. So they just walked away from it, and they've deprecated their replacement, and no one has stepped in to provide another one, because I think it's such a massive maintenance burden to provide a generic, multi-cloud, kind of general-purpose, cloud-agnostic orchestration tool.
A: For example, AWS is providing their own Kubernetes solution, right? Even DigitalOcean, which for the longest time never had an autoscaling solution, provides that now with Kubernetes. They're also abstracting Kubernetes away with container runtime as a service: for GCP it's Cloud Run and Functions, and for AWS it's Fargate, and now Fargate is even more powerful by allowing containers, and also Lambda.
A
This
lambda
now
allows
you
to
build
your
own
containers
and
run
those
containers,
and
that's
somewhat
like
the
kubernetes
experience
right.
You
just
build
a
container
and
the
transit
and
you
don't
care
where
it's
running
kind
of
thing
and
the
re
now
I
know
the
question
comes
up
like.
Why
aren't
we
using
fargate
or
why
aren't
we
using
cloud
ram
for
these
cis
and
because
those
are
lockdown
platforms
built
for
handling
requests
or
http
requests
or
like
as
like
timed
applications?
A
It's
not
for
building
jobs,
for
example,
we
need
privileged
containers.
Sometimes
we
need
like
jobs
that
can
run
for
six
to
seven
hours
and
if
you
look
at
fargate
it
does
not
support
that
long
time
execution.
So
does
that
answer
your
question
in
regards,
or
is
it
still
not
clear.
B
It
kind
of
does,
but
it
feels
that
you
know
the
maintenance
burden
that
elliot
is
talking
about,
is
exactly
something
that
we
want
to
take
on
ourselves.
If
we
proceed
with
building
this
solution,
because
we
will
need
to
figure
it
out
how
to
model
that
on
other
platforms
and
then.
A: No, I agree, and that is a real concern. For example, one thing that is an issue at the moment: AWS provides timed autoscaling, so you can scale up at 8am, for example, but GCP does not, and that is a real feature-parity difference. And you don't have any real solution for that at the moment, even with this thing that I'm proposing. But then, maybe we don't really need timed autoscaling, because, for example, you don't scale up your Kubernetes instance on a schedule.

A: You scale up your Kubernetes instance depending on the load, and that's why I'm assuming GCP never really implemented timed autoscaling: they use autoscaling behind the scenes for Kubernetes, so they just exposed it as its own service. But yeah, we're still going to end up integrating no matter what; it's just a matter of the level of integration we want to achieve, I guess.
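(Illustration: the AWS half of the parity gap just described. Scheduled scaling is a first-class resource there; this sketch reuses the hypothetical group from the earlier example, and the times and sizes are made up.)

```hcl
# Scale the runner fleet up at 08:00 UTC on weekdays, down in the evening.
resource "aws_autoscaling_schedule" "workday_up" {
  scheduled_action_name  = "runners-8am-scale-up"
  autoscaling_group_name = aws_autoscaling_group.runners.name
  recurrence             = "0 8 * * MON-FRI"
  min_size               = 5
  max_size               = 20
  desired_capacity       = 10
}

resource "aws_autoscaling_schedule" "workday_down" {
  scheduled_action_name  = "runners-evening-scale-down"
  autoscaling_group_name = aws_autoscaling_group.runners.name
  recurrence             = "0 18 * * MON-FRI"
  min_size               = 1
  max_size               = 20
  desired_capacity       = 1
}
```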
B: Yeah, that's interesting, and I wonder if there is actually a way to simplify this somehow. Of course, I've not been thinking about it a lot recently, but perhaps we could restrict ourselves to supporting only a few of the most popular platforms.

B: And this way, you know, the integration scope for every platform would be much smaller: just creating and removing nodes, right? There are multiple ways of actually, you know, reducing the amount of integration work and expanding the amount of generic work. And yeah.
A: Adding nodes and things like that, that's an interesting question. Looking at just AWS, right: say they add a new ARM CPU, a new instance type, a new availability group, or a new CPU architecture. At the API level, if we want to add nodes and things like that, we have to add support, like: hey, we want to request this CPU architecture, and so on and so forth. But with autoscaling groups it's a matter of just pointing to an AMI, and that AMI is configurable, right? But then we also have to look at intricacies like the flags that we need to pass. For example, recently we added GPU support for GCP in our Docker Machine fork, and that required us to send some new data to the API, to request the accelerators and so on and so forth, and the same for every new machine type that comes into play.
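(Illustration: the kind of extra data the GCP API needs for GPUs, shown here as a Terraform resource for comparison rather than as the docker-machine flags the fork actually passes; machine type, zone, and accelerator values are placeholders.)

```hcl
resource "google_compute_instance" "gpu_runner" {
  name         = "gitlab-runner-gpu"
  machine_type = "n1-standard-4"
  zone         = "us-central1-a"

  boot_disk {
    initialize_params {
      image = "debian-cloud/debian-11"
    }
  }

  network_interface {
    network = "default"
  }

  # The accelerator request itself...
  guest_accelerator {
    type  = "nvidia-tesla-t4"
    count = 1
  }

  # ...plus the host maintenance policy that GPU instances require:
  # exactly the sort of per-provider intricacy being described.
  scheduling {
    on_host_maintenance = "TERMINATE"
  }
}
```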
B: On the other hand (I'm sorry to interrupt you, but I know we don't have a lot of time), another idea of mine is that we should start with gitlab.com, because this will dictate building the most efficient solution. And then, you know, we tell our users: you need to use this method. It might not be the simplest one, you might not be able to model it on a cloud without Kubernetes, but we decided to give you only this, because it will make everything much more efficient for you.

B: We are running this on gitlab.com, and if you do it our way, you are going to save a huge amount of money on cloud. Kubernetes has been designed to be cheap, because it provides, you know, better utilization primitives that we can use, and it's designed for increasing utilization so that it's more efficient.
A: If we look at what Dimitriy (DZ) is doing: he's creating a new five-minute deploy application on AWS EC2, because some people find Auto DevOps with Kubernetes and so on way too complex. The Release team is adding auto-deploy to EC2, because auto-deploy to Kubernetes is complex as well, and users can't just magically migrate their applications from EC2 to Kubernetes either. So it feels like we might end up going in the same direction.
A: We have experience integrating with cloud APIs, with GCP for Windows autoscaling and with Orka. Even though Tomasz had a lot of knowledge of the GCP API, it still took around two months for us to finalize all the work, and the same with Orka: it's still taking two months to integrate each cloud provider. So we also have to think about that. Yes, it needs to be a simple solution, but a simple solution also needs to be easily maintained. But that makes sense.
B: Yeah, I think it makes a lot of sense, but we just need to be mindful about it, and we should understand that it might actually mean we are going to spend a lot of time on maintenance and on making this solution available on a bigger number of platforms. It's just, you know, this ratio, this balance, between providing the simplest solution for users and maintaining all those solutions. That's what it boils down to.
A: Yeah, I agree. To keep conscious of time: do you want to move on to the next point, Tomasz, the one you have in the agenda?
D: If we think we've said all there is to say here, then okay. My second concern is way more technical. From what I understand of what you're describing, this autoscaling would work like this: we have, in this case, AWS autoscaling groups that, based on some metrics, are creating VMs with GitLab Runner installed, and this GitLab Runner is registering to GitLab with a configuration that we templated in Terraform, etc., etc.
B: [inaudible]
D: Yeah, and this is what I started thinking about. Looking at gitlab.com runners, they are running up to around 3,000 jobs at peak during the day. So consider that architecture in bigger installations, and I know that at least some of our customers have way bigger loads than what we see on gitlab.com.
D: It would be harder to look in the UI and check all of the runners in the table, but that could be fixed. Our scheduling, however, is based on the database: everything starts with finding the runner and then calculating things in an SQL query, and having thousands of new runners during an hour means the complexity and weight of that SQL query grows by an order of magnitude. That will become a technical problem that will be very, very hard to solve, because we are already at the limits of what we can do with this scheduling.
A: That's a concern I never thought of, and that's very true regarding scheduling and so on and so forth. But we also have to keep in mind: we probably won't see thousands of runners registered. I mean, these machines will run like 200 to 400 concurrent jobs on the same machine, so it's just one runner running around 200 jobs at the same time on that machine. So it's just one runner.

A: So if you're running 10,000 jobs at the same time, it's not going to mean 10,000 runners registered; it might mean a thousand runners registered, right? But again, I don't have an answer to that, and that's a really good concern that I never thought of. So thank you for that.
D: Yeah, and currently, with the runner manager approach, this is quite simple, because we have a runner manager that doesn't do much, so it doesn't need huge resources for its instance, but it handles communication with GitLab and schedules all of the jobs onto autoscaled VMs. Now we're switching to a model where every runner is a separate, really separate, runner.
D: Now, if you want to handle multiple jobs on one machine with, for example, the Docker executor, for 100 to 400 jobs, you would need a VM that has thousands of CPUs to make it efficient. I don't think that will work. So the number of concurrently running jobs will definitely be smaller. This will differ between environments: if someone has tests that run for 50 seconds and are not using CPU and RAM at all, they can probably put a thousand jobs on one bigger machine and it will work.
D: And then the number of concurrent jobs that we can run on a single VM gets more and more limited, which means that we'll have more and more registered runners, and we are back to the SQL problems. So if we want to go in this direction, we need to keep this in mind and think about how we can resolve it when it becomes a problem, because it will.
E: So I guess what I'm really trying to say is: what are we trying to do here? Are we trying to figure out how to solve this in a generic way? Are we trying to figure out how to solve this in the most AWS-ecosystem-friendly way, so that people who manage AWS resources are going to be comfortable?
E
Or
are
we
trying
to
figure
out
what
we're
trying
to
do,
and
I
don't
mean
to
put
you
on
the
spot
to
even
like
kind
of
beat
you
up
on
that
one,
but
I
I
it
was
kind
of
it
comes
back
to
like
my
point
around
the
docker
machine
and
I
think
everyone
I.
A
Yes,
I
agree
on
what
kind
of
problem
we
want
to
I'm
just
laughing
at
aaron's
message.
Sorry
to
what
problem
we're
trying
to
solve,
and
I
would
say
this
specific
problem
with
aws
vs
gaming
is
only
for
this
small
developer
team
that
is
running
20
hundred
concurrent
jobs.
At
the
same
time,
if
users
really
want
scaling
like
at
what
thomas
was
saying
like
thousands
of
jobs
at
the
same
time,
then
we
probably
need
to
direct
them
to
use
kubernetes,
because
kubernetes
is
more
geared
towards
that.
B
Are
we
certain
that
a
small
team
of
developers
actually
do
need
autoscaling?
Perhaps
they
can
just
install
github
runner
on
multiple
instances
on
vms
and
that's
it
do
they
need
to
auto
scale
like
yeah?
I
cannot
understand
that
they
might
want
to
reduce
the
cost
and
especially
on
the
night
on
weekends,
but.
B
Like
I
feel
like
it's
a
lot
of
work
that
we
would
need
to
make
to
support
small
teams,
and
then
it
might
it's
divisible
if
we
would
be
able
to
solve
that
correctly
for
them
and
maybe
for
a
very
long
time.
D
And
and
I'm
having
this
in
mind,
moving
to
one
of
the
indentations
that
we've
made
in
this
point,
I
don't
think
we
should
like.
I,
like,
I
said
I
get
what
we
want
to
achieve
with
this
auto
scaling
in
aws
with
the
vms,
but
I
don't
think
we
should
name
that
this
is
docker
machine
replacement,
because.
D
We
are
targeting
different
cases
here
where
you,
when
you
really
need
auto
scaling
when
I,
when
I
hear
that
someone
needs
out
of
scaling
and
thinking
about
thousands
of
jobs
per
day,
because
it's
it's
not
only
about
costs,
it's
about
the
compute
resources
that
you
need
to
create
to
handle
that
load
when
you
need
to
scale
to
handle
hundreds
of
jobs
that
will
go
in
total
during
the
day
and
who
are
not
working
outside
of
the
working
hours.
D
Then
then,
what
you
hear
propose
to
have
these
few
vms
that
will
be
scaled
up,
install
the
runner
register,
start
handling
jobs
and
then,
when
the
company
is
is
going
off
the
day.
Everything
turns
down
yeah,
it
would
work,
but
it's
not
a
solution
for
the
problem
that
that
docker
executor,
docker
machine
executor
was
was
targeting
it's
not
this.
This
order
of
magnitude
and
and
and
when
we
were
discussing
this
in
the
agenda
document,
at
least
in
two
places.
We
we've
got
to
opposite
thinking
that
you
say
it's.
D
If
we
want
to
target
to
a
smaller
scale,
then,
let's
not
name
it
docker
machine
replacement.
It
will
make
our
discussion
easier,
at
least
for
me,
and
I
think
it
will
be
also
easier
to
to
to
the
customers
to
understand
what
we
are
talking
about,
because
if
we
want
to
propose
this
to
a
environment
like
cern
where
they
run
thousands
of
heavy
jobs
every
hour,
they
will
quickly
quickly
get
back
to
us
and
say
we
were
looking
for
something
different.
B
And
I
wonder
if
we
again
should
start
with
kubernetes
and
github.com
scale,
and
then
we
can
think
about
users
of
this.
This,
like
vast
landscape
of
cloud
providers,
and
I
can
imagine,
building
a
small
service
that
they
will
deploy
somewhere
with
the
requirement
that
they
need
to
write
two
simple
scripts:
to
create
a
machine
and
destroy
this
machine,
and
for
them
it
might
be
a
viable
investment
if
they
want
to
reduce
some
cost,
and
we
would
not
need
to
maintain.
B
You
know
this
part
of,
like
you
know,
integration
to
create
machine
or
destroy
machine.
We
just
tell
them
that
if
they
want
to
use
vms,
they
need
to
write
a
script
for
their
particle
infrastructure,
their
particular
cloud
provider
or
whatever
they
are
using
to
create
or
destroy
a
machine
or
perhaps
like
get
the
amount
of
machines
running
like
it
depends
on
how
the
this
thing
should
go.
Isn't.
B
Like
I
guess
that
you
know
the
the
scheduler
itself,
like
it's
99
percent
of
effort,
how
to
build
that
correctly,
then
creating
a
machine
and
destroying
it
might
be
like
one
or
one
percent
of
effort.
I
don't
know
perhaps
it's
like
90
to
10
because
as
far
as
I
know
that
how
you
know
runner
works
and
how
it
auto
scales
like
there
is
a
lot
of
business
rules
and
call
it
in
there.
E
So
one
thought
on
the
like:
let's
not
call
this
replacing
doctor
machine,
I
think
a
really
probably
a
really
helpful
way
to
look
at
this
whole
challenge.
E: What I think we're trying to do, and what I think Steve's getting at here, is: let's make more attractive solutions, so that the use cases that are currently on Docker Machine have a more attractive thing to move to, to the point where Docker Machine is a feature that nobody uses, like Parallels (Peter, that was a joke). Like, can we solve the small dev teams' problem in a way that's not using Docker Machine and works better for them in some capacity? And then we've still got to solve the gitlab.com use case, the big deployments use case, all these reasons that people are using Docker Machine; we have to solve them and give people options, and then we can just remove Docker Machine and no one notices, kind of thing. You know what I'm saying? Just don't think of it as replacing Docker Machine: let's provide a more attractive option to some subset of users, and then we'll keep doing that until there's no subset left.
E: Wait, I mean, maybe that's the elephant we go hunting for first: let's solve it for the big user, let's solve it for us, and then we have a case study on why that works, and then we can do a bit of examination, like: at what sort of scale does it become impractical? Right? If I'm, if I'm...
F: The way I'm thinking about it, based on my interactions with customers, is: say today I have a very solid pattern at scale, whatever we define scale to be; 2,000 concurrent jobs, let's call that scale, that's the threshold. If I had that pattern today in my conversations with customers, and a customer says, "well, hey, I'm not as big as that, I'm only a small shop," to me, I'd still want the big pattern.
B: Like, there are probably a few ways there: one is Kubernetes, another one might be an in-house build service that will do the scaling on GCP, but yeah.
G: No matter what we do, we're going to end up re-implementing something that already exists in GitLab Runner, because it solves so many problems. Its job is to execute a task, but it's also an autoscaling solution, and even though we've got a custom autoscaler, it's still Runner that calls that autoscaler. It's not like the two are separate, and they perhaps should be. You know, our competitors seem to have, like, an agent, right? And that agent's job is to execute a task.
G
So
if
we
applied
that
to
runner
and
forgetting
about
it
being
a
manager,
it
would
accept
a
job.
It
would
be
able
to
maybe
support
running
services.
So
you
could
almost
join
runner
and
docker
together,
let's
just
say,
they're
a
unit
and
you
give
it
a
unit
of
work
and
that's
its
job,
and
then
you
deal
with
how
it
does
that
on
every
other
platform.
G
Without
so,
if
it
needs
to
be
auto
scale
that
binary
that
includes
docker
can
run.
Services
can
execute
that
job
needs
to
run
in
a
bunch
of
other
environments.
So
in
auto
scaling
groups,
it's
you
know.
That
instance
ready
to
accept
the
job
on
a
bunch
of
auto
scaling
instances
in
aws
in
kubernetes.
It's
the
same
thing.
It's
a
replica
set
that
you
can
scale
up.
That's
ready
to
accept
that
job,
because
at
the
moment
we
are
auto
scaling.
A: [inaudible]
G: Scaling is handled by... I know, but GitLab Runner itself can be an autoscaler, right? But not with this setup. I know you're installing it, but you're not using it as an autoscaler exactly. It's a really difficult story to tell, because, honestly, I think a lot of people approach this organically. They just want to say: this is GitLab Runner, it executes a task. I'm coming from a different background, where there's an agent that you install.
G
Let's
say
jenkins,
you
know
you
put
jenkins
in
a
container,
you
put
it
in
a
vm
when
you
come
from
places
like
that,
when
you
see
gitlab
runner,
your
first
reaction
is
just
oh.
This
is
just
a
replica
set
I'll
just
scale
this
up
to
100,
and
then
it
will
accept
jobs.
That's
fine,
and
then
you
start
reading
the
documentation
and
it's
very
confusing
coming
from
that
background,
because
it
is
also
an
auto
scaler.
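(Illustration: the "just scale the replica set" mental model G describes, sketched with the Terraform Kubernetes provider to stay consistent with the earlier examples; the image tag and replica count are placeholders. Note the caveat from the scheduling discussion above: each replica registers as its own runner.)

```hcl
resource "kubernetes_deployment" "runner" {
  metadata {
    name = "gitlab-runner"
  }

  spec {
    # The "scale this up to 100" instinct, expressed directly.
    replicas = 100

    selector {
      match_labels = {
        app = "gitlab-runner"
      }
    }

    template {
      metadata {
        labels = {
          app = "gitlab-runner"
        }
      }

      spec {
        container {
          name  = "runner"
          image = "gitlab/gitlab-runner:alpine"
          # Each pod would still need registration config mounted in,
          # e.g. from a Secret; omitted here for brevity.
        }
      }
    }
  }
}
```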
B: Yeah, I think I understand what you're saying. You're saying that we should have a separate product that would be used to autoscale runners, whatever platform you install them on. But I feel like it's an orthogonal problem, because basically, however the autoscaler (whatever we call it) is going to work, we still need to know what platform it's going to run against and on which platform it's going to scale the runners.
G
I
I
think
that
what
I'm
trying
to
get
is
that
I
would
prefer
if
we
pushed
that
problem
a
lot
more
onto
the
customers
so
that
we
just
provide,
I
mean
I.
I
definitely
think
we
should
also
provide
an
auto
scaling
solution
and
we
need
one
for
ourselves
but
get
lab
runners
job
of
just
you,
give
it
a
job,
and
it
supports
that
the
feature
set
of
services
running
anywhere
and
executing
a
job
anywhere.
G
So
sometimes
people
will
do
what
steve
has
done.
They'll
create
auto
scaling
groups,
and
it
just
contains
gitlab
the
sort
of
executor
side
of
it
to
execute
the
job.
Some
people
will
create
replica
sets
in
kubernetes
and
they'll
just
scale
it
to
100
and
they'll.
G
Have
it
except
those
jobs,
are
other
solutions
of
auto
scaling
like
I
thought,
the
sort
of
custom
auto
scaler
would
be
a
separate
prod
product
that
pushes
that
get
lab
executed
binary
onto
remote
machines
it
it's
it's
really
similar
to
what
we
have,
because
it
is
what
we
have,
because
yet
lab
runner
currently
does
everything
it's
an
auto
scaler,
and
this
executor
so
approaching
this
is,
is
really
difficult.
G: I think it's just difficult to explain; I'm explaining it badly. Steve's also scared.
G
Yeah
and
I
think
that
the
or
the
custom
auto
scaler
was
like
the
wrong
solution,
because
it's
still
get
lab
runner
executing
that
auto
that
custom,
auto
scaler,
ideally
gitlab
runner.
That
would
be
just
concerned
with
the
execution
of
the
of
the
unit
of
job.
That's
in
the
script,
the
services
that
it
runs
and
we
try
to
get
that
repeatable
in
a
bunch
of
different
environments,
whether
that's
using
docker,
locally
micro,
vms
or
whatever,
even
inside
of
kubernetes.
D: Okay, okay. So, like Docker Machine, right: it's an autoscaler inside of the Runner. Runner keeps track of how many machines it has, whether it should create one or destroy one. And this is something that we wanted to remove with the autoscaler custom executor: the plan was that the autoscaler would contain the scaling logic and Runner would just talk to it, "execute me this job," with whatever happens around autoscaling staying outside of it. And if we want to look at how to resolve the gitlab.com scale problem, my question would be: what execution environments do we want to target? Because for now, GitLab Runner, if we look at it as a binary that runs locally, has Shell, VirtualBox and Parallels, SSH, and Docker. I'm leaving Kubernetes and the Docker Machine executor aside, because they are sort of autoscalers themselves.
D
Do
we
want
to
support
all
of
these
environments,
or
do
we
want
to
focus
on
one
of
them
like
shell
or
like
container,
maybe
even
not
naming
it
docker,
but
container?
Because
with
that,
I
I
would
still.
I
would
still
push
us
to
the
concept
of
custom
executor
and
custom
docker
provider
or
container
provider.
D
Solution
also
for
both
the
big
and
the
in
the
smaller
scale,
because
because
we
could
use
what
steve
now
is
working
on
a
simple
aws,
auto
scaling
groups
to
create,
for
example,
docker
environments
that
single
runner
can
attack
with,
or
we
can
work
on
something
more
powerful.
That
will
create
even
more
things
or
maybe
we
will
end
with
the
fact
that
that
aws,
auto
scaling
group
is
enough
to
to
to
work
for
both
10
concurrent
jobs
and
10
000,
concurring
jobs
and,
and-
and
I
I
would
really
really
look
in
this
direction.
B
So
I
I
don't
know
how
much
time
we
have
in
this
meeting,
just
just
one,
just
one
idea
occurred
to
me
and
I
I
think
that
in
this
particular
case
we
should
in
invoke
the
architectural
workflow.
We
should
create
a
blueprint
and
what
it
gives
us
is
that
it
will
like
the
architectural
workflow,
requires
an
architectural
evolution
coach
to
be
assigned
that.
B
Currently
it's
a
distinguished
engineer
on
engineering
fellow
and
I
wonder
if
we
would
actually
see
the
valuable
to
have
an
input
from
an
engineering,
fellow
or
distinguished
engineer
in
this
case,
because
I
feel
like
a
bit.
We
are
working
in
circles
and
it
we
might
benefit
from
a
decision
made
and
like
it's
better
to
move
forward
to.
You
know,
learn
more
get
inside
like
because
I
I
think
it's
extremely
difficult
problem.
A
Up
a
blueprint
because
I
was
going
through
learning
this
process
this
morning-
actually
yeah.
We
need
to
take
a
decision
at
some
point
and,
like
a
blueprint,
might
give
us
a
good
way
forward.
But
my
question,
like
my
goal
from
this
meeting,
was
like:
what
are
we
actually
writing
in
the
blueprint?
Are
we
writing
auto
scaling
groups
or
are
we
writing
kubernetes
that
that
was
the
idea
behind
this
meeting
and
race
yeah.
B
So
I
I
think
it's
like
a
blueprint
starts
with
an
issue
in
the
gitlab
architecture
tasks
and
that's
the
moment
when
I
can
help
you
with
finding
an
architecture.
Evolution,
coach,
okay,
help
with
defining,
what's
the
scope
of
the
blueprint,
how
to
do
that.
Perhaps
this
this
will
help.
E: Yeah. As I say those words, I need to put them back in my mouth. I think it's a good idea; let's do that, creating the blueprint. I keep forgetting that that's actually an official process and not just, like, a word we use for saying "here's our plan."
E
This
actually
might,
if
we
think
about
this
as
like,
if
we
pivot
this
a
bit
and
like
how
do
we
solve
this
for
gala.com,
this
might
become
relevant
for
the
call
we
have
coming
up
in
like
another
half
hour,
we
might
be
able
to
like
merge
a
couple
of
thoughts
there
and,
like
a
couple,
visions
sure,
if
you're
welcome
to
join
that
too.
If
you
want
it's
around
the
like
job
scheduling,
side
of
of
ci
job
scheduling,
just.
E
Yeah
otherwise,
I
think,
let's,
let's
wrap
this
up
thanks
steve
for
setting
us
up,
I'm
sorry.
We
ended
up
just
ganging
up
on
you.
C
Well,
more
questions
is
better
like
if
it's
just
a
bunch
of
yes
moments.
It's.
A
Not
a
good
solution,
so
thank
you,
so
I
guess
the
next
action
point
for
us
would
be
opening
up
a
blueprint
issue
and
start
talking
to
an
evolution
coach.
So
we
can
discuss
this
further.
Does
that
sound
good.