From YouTube: How Netflix Autoscales CI - Rahul Somasunderam, Netflix
Description
Speakers: Rahul Somasunderam
Netflix's CI currently builds about 45k unique build configurations and about 600k builds/wk. We use Spinnaker for CD and most of our infrastructure runs on AWS. In this talk, we will discuss how autoscaling is being used to improve efficiency and developer experience.
For more Continuous Delivery Foundation content, check out our blog: https://cd.foundation/blog/
AWS has Auto Scaling groups, ASGs for short. On each ASG you can set a min and a max, and then AWS will figure out what the desired size is and adjust the current size by either spinning up a new instance or killing some running instance.
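The min/max/desired mechanics just described can be sketched in a few lines of Python. This is a simplified model of what an ASG does, not AWS code:

```python
def reconcile(current: int, desired: int, min_size: int, max_size: int) -> int:
    """Clamp the desired size to the [min, max] bounds, the way an ASG does,
    and return how many instances to launch (positive) or kill (negative)."""
    target = max(min_size, min(desired, max_size))
    return target - current

# Wanting 12 instances with a max of 10 launches only up to the cap:
assert reconcile(current=7, desired=12, min_size=1, max_size=10) == 3
# Scaling down below the min is similarly clamped:
assert reconcile(current=5, desired=0, min_size=2, max_size=10) == -3
```

Everything else in the talk builds on this loop: something picks a desired size, and the ASG converges the current size toward it within the bounds.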
Spinnaker calls those things server groups. Spinnaker has a naming convention that maps the server group to a more structured coordinate. At the top level there's an application, and within each application you can have multiple stacks.
v123 is the version, and that whole thing is the name of the server group. The cluster is jenkins-unstable-agent_highlander; the version is not part of it.
Planning for CI infrastructure is in some ways similar to planning for other applications. You need to plan ahead of time to make sure you've got enough capacity.
Then you keep revising your estimate as you go along. The downside of this is that you will have lots of idle capacity, and depending on the size of your company, you could be wasting a lot of resources. However, for a smallish company this is a great solution; the resource overhead may not be significant.
The users of your CI solution are going to be happy: they get instant build starts. Your capacity planning and budgeting teams might not like the cost of the solution at some point. Or you could assume infinite patience. Let's say you know how many instances you need in a median hour; you could plan for that. Most of the time you will have instant build starts.
However, there will be several moments when developers are waiting for other builds to finish so theirs can start. Eventually, developers will become very unhappy. In some cases you can plan for instant resources. This assumes that there is a shared pool that you can get resources from; it works really well with containers, for example on Kubernetes. But not all builds can be containerized.
Also, Jenkins doesn't handle too many agents particularly well: it's easier to run 100 agents with 10 executors each than it is to run a thousand agents with one executor each. Finally, there's autoscaling. This approach tries to contain costs while still trying to provide instant build starts. We have been fairly successful with this.
Almost all autoscaling tries to match some metric indicating demand with an appropriate supply. Let's look at what metrics we can use. There are some system metrics we often associate with autoscaling; the nice thing about these is that they are natively supported by cloud providers and most metric collection solutions.
However, this doesn't really work well for CI. There are times when one or all of those metrics are really low, but your build is still running and holding on to an executor on Jenkins. If you have many such builds, you will need to scale up to start new builds. More importantly, you cannot scale down just because one or all of those metrics are low.
Let's see what it takes to measure agent utilization correctly. When we launch a new agent, we have it launch with many labels. We don't expect users to target some of these labels; they can, but they don't. These labels are useful for us to collect metrics. We report the placement of the ASG; in this case we are reporting that the ASG is on AWS.
Hello, sorry about the AV issues, but I think I'm back, so let me start going through the questions. Jay was asking what our logic is to create Jenkins controllers.
We initially tried doing it per team. Eventually we decided that that's not the best utilization of controllers, in that some teams do not need as many resources as others do. So we ended up just creating arbitrary blocks of controllers and moving teams into these blocks.
So it really depends on how old things are; some teams still have a single controller for their own use, and we are trying to gradually move away from that. Marcel asks if we have evaluated Kaniko. No, we have not. We have strong support from the Spinnaker team for continuous delivery, and we've got a long history of running things on Jenkins for CI.
So we are not trying to disrupt any of that by choosing a new set of tools that we might have to support ourselves.
Let's see. One question I've been asked before is what kind of utilization numbers we are looking at. The truth is, when we started this, it was mostly a hunch that utilization was bad.
We started gathering metrics, at which point we learned that our utilization was around four percent, which is not great. So we went ahead and started doing this, and if you look at the target we are setting, the best utilization we can expect to get is 25 percent.
So it's a very small number, but once you start measuring things, you realize that very few CI setups have better efficiency numbers than that, unless you have something like Kubernetes or another solution where you have a shared pool of executors that you can rely on.
For the most part it is our CI team, but we have some other teams who we do not directly serve because their use cases are too specialized; they tend to use all the tooling that we have developed and get the same benefits.
Okay, Docker without Docker: so we do use Spinnaker, and we do not directly use Kubernetes. I think Kubernetes is too low level for a lot of engineers to use.
So we have something called Titus, which is a layer that was initially built on top of Mesos and then eventually adapted to use Kubernetes.
That is the way we submit workloads, through Titus. So yes, it's interesting; I'll possibly have to take a look at it and maybe follow up on how that would work for us.