From YouTube: How Netflix Autoscales CI - Rahul Somasunderam, Netflix
Description
Speakers: Rahul Somasunderam
Netflix's CI currently builds about 45k unique build configurations and about 600k builds/wk. We use Spinnaker for CD and most of our infrastructure runs on AWS. In this talk, we will discuss how autoscaling is being used to improve efficiency and developer experience.
For more Continuous Delivery Foundation content, check out our blog: https://cd.foundation/blog/
AWS has Auto Scaling groups, ASGs for short. On each ASG you can set a min and a max, and then AWS will figure out what the desired size is and adjust the current size by either spinning up a new instance or killing some running instance.
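The min/max/desired mechanics just described can be sketched in a few lines of Python. This is a simplified model of what an ASG does, not AWS code:

```python
def reconcile(current: int, desired: int, min_size: int, max_size: int) -> int:
    """Clamp the desired size to the [min, max] bounds, the way an ASG does,
    and return how many instances to launch (positive) or kill (negative)."""
    target = max(min_size, min(desired, max_size))
    return target - current

# Wanting 12 instances with a max of 10 launches only up to the cap:
assert reconcile(current=7, desired=12, min_size=1, max_size=10) == 3
# Scaling down below the min is similarly clamped:
assert reconcile(current=5, desired=0, min_size=2, max_size=10) == -3
```

Everything else in the talk builds on this loop: something picks a desired size, and the ASG converges the current size toward it within the bounds.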
Spinnaker calls those things server groups. Spinnaker has a naming convention that maps the server group to a more structured coordinate. At the top level there's an application, and within each application you can have multiple stacks.
v123 is the version, and that whole thing is the name of the server group. The cluster is jenkins-unstable-agent_highlander; the version is not part of it.
Planning for CI infrastructure is in some ways similar to planning for other applications. You need to plan ahead of time to make sure you've got enough capacity.
Then you keep revising your estimate as you go along. The downside of this is that you will have lots of idle capacity, and depending on the size of your company, you could be wasting a lot of resources. However, for a smallish company this is a great solution; the resource overhead may not be significant.
The users of your CI solution are going to be happy: they get instant build starts. Your capacity planning and budgeting teams might not like the cost of the solution at some point. Or you could assume infinite patience. Let's say you know how many instances you need in a median hour; you could plan for that. Most of the time you will have instant build starts.
However, there will be several moments when developers are waiting for other builds to finish so theirs can start. Eventually, developers will become very unhappy. In some cases you can plan for instant resources. This assumes that there is a shared pool that you can get resources from; it works really well with containers, for example on Kubernetes. But not all builds can be containerized.
Also, Jenkins doesn't handle too many agents particularly well: it's easier to run 100 agents with 10 executors each than it is to run a thousand agents with one executor each. Finally, there's autoscaling. This approach tries to contain costs while still trying to provide instant build starts. We have been fairly successful with this.
Almost all autoscaling tries to match some metric indicating demand with an appropriate supply. Let's look at what metrics we can use. There are some system metrics we often associate with autoscaling; the nice thing about these is that they are natively supported by cloud providers and most metric collection solutions.
However, this doesn't really work well for CI. There are times when one or all of those metrics are really low, but your build is still running and holding on to an executor on Jenkins. If you have many such builds, you will need to scale up to start new builds. More importantly, you cannot scale down just because one or all of those metrics are low.
Let's see what it takes to measure agent utilization correctly. When we launch a new agent, we have it launch with many labels. We don't expect users to target some of these labels; they can, but they don't. These labels are useful for us to collect metrics. We report the placement of the ASG; in this case we are reporting that the ASG is on AWS.
Hello, sorry about the AV issues, but I think I'm back, so let me start going through the questions. Jay was asking what our logic is to create Jenkins controllers.
We initially tried doing it per team. Eventually we decided that that's not the best utilization of controllers, in that some teams do not need as many resources as others do. So we ended up just creating arbitrary blocks of controllers and moving teams into these blocks.
So it really depends on how old things are; some teams still have a single controller for their own use, and we are trying to gradually move away from that. Marcel asks if we have evaluated Kaniko. No, we have not. We have strong support from the Spinnaker team for continuous delivery, and we've got a long history of running things on Jenkins for CI.
So we are not trying to disrupt any of that by choosing a new set of tools that we might have to support ourselves.
Let's see. One question I've been asked before is what kind of utilization numbers we are looking at. The truth is, when we started this, it was mostly a hunch that utilization was bad.
We started gathering metrics, at which point we learned that our utilization was around four percent, which is not great. So we went ahead and started doing this, and if you look at the target we are setting, the best utilization we can expect to get is 25 percent.
So it's a very small number, but once you start measuring things, you realize that very few CI setups have better efficiency numbers than that, unless you have something like Kubernetes or another solution where you have a shared pool of executors that you can rely on.
For the most part it is our CI team, but we have some other teams who we do not directly serve because their use cases are too specialized; they tend to use all the tooling that we have developed and get the same benefits.
Okay, Docker without Docker: so we do use Spinnaker, and we do not directly use Kubernetes. I think Kubernetes is too low level for a lot of engineers to use.
So we have something called Titus, which is a layer that was initially built on top of Mesos and then eventually adapted to use Kubernetes.
That is the way we submit workloads, through Titus. So yes, it's interesting; I'll possibly have to take a look at it and maybe follow up on how that would work for us.