Description
A summary of what was discussed and gathered in https://gitlab.com/gitlab-org/gitlab-runner/-/issues/27061
This is mostly about what's needed to start working on a replacement for Docker Machine. As you might know, Docker Machine is no longer maintained by Docker itself or by any other community member, and we have our own fork to fix some security and cost-affecting bugs, but we still need to figure out a way forward. So before going into solutions, we wanted to gather some requirements and get a better understanding of what our current situation is. First of all, we went through all the runner deployments that we see our customers, and us as well, using. The first one is customers self-hosting the GitLab runners: they are mostly in a data center environment with just raw VMs, and they don't have anything like a Kubernetes cluster and so on and so forth.
Most of these environments are air-gapped, so either they don't have network access or they have compliance regulations and things like that.
They're sometimes managed by OpenStack, which is quite good in the sense that they can easily scale VMs up and down for specific workloads. They're running trusted code, in the sense that they're only running private code from their own repositories; they're not running untrusted code from some random user somewhere on the internet. And sometimes they implement all the scaling through OpenStack
or, as we mentioned earlier, Nomad or any other hypervisor feature that they have. What we usually suggest to our users is to use the Docker executor, either on one large VM or on multiple small VMs that horizontally autoscale with the hypervisor, or to use the Docker Machine executor with the Hyper-V or VMware driver, for example: you use VMware to create a machine to run the job, then reuse that machine or delete it after the job is done, provided Docker can be installed on the VMware VM.
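As a rough sketch of that Docker Machine plus VMware path, the config.toml fragment below shows the general shape; the driver options follow docker-machine's vSphere driver, but every value here is a placeholder assumption, not a tested setup.

```toml
# Hypothetical sketch: Docker Machine executor with a VMware vSphere driver.
# All values are placeholders.
[[runners]]
  name = "vmware-machine-runner"
  url = "https://gitlab.example.com/"
  token = "RUNNER_TOKEN"
  executor = "docker+machine"
  [runners.docker]
    image = "alpine:latest"
  [runners.machine]
    MachineDriver = "vmwarevsphere"
    MachineName = "ci-%s"   # %s becomes a unique machine name
    MachineOptions = [
      "vmwarevsphere-vcenter=vcenter.example.com",
      "vmwarevsphere-username=ci-user",
      "vmwarevsphere-password=CHANGE_ME",
    ]
    IdleCount = 2     # keep two warm VMs around for reuse
    IdleTime = 1800   # delete a VM after 30 minutes idle
```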
If Docker can't be installed for certain reasons, you have to use the shell executor. This is less than ideal, of course, because the shell executor has its own problems, but this is what we provide at the moment. Then we have self-hosted customers who can also be in a bare-metal data center kind of environment, but they either have Kubernetes experience, are already running Kubernetes, or are using a cloud provider with a Kubernetes offering.
What we usually suggest to them is to just use the Kubernetes executor, either via our Helm chart or via the operator that we provide. Most of the time this works quite well for them, apart from all the issues that come with Kubernetes, like the stability issues that we are having at the moment and other known bugs that we have. And then there's self-hosted using cloud providers, but with VMs.
What we usually suggest is to use the Docker Machine executor, which is the most popular option with our users. They just set up Docker Machine and point it to the right driver; so if you're using AWS, you just use the AWS driver and it will scale the machines up and down depending on the jobs.
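For illustration, a minimal config.toml for that AWS case might look like the fragment below; the region, instance type, token, and idle numbers are placeholder assumptions rather than recommendations.

```toml
# Hypothetical minimal AWS setup for the Docker Machine executor.
concurrent = 20   # total jobs the runner manager will run at once

[[runners]]
  name = "aws-machine-runner"
  url = "https://gitlab.com/"
  token = "RUNNER_TOKEN"
  executor = "docker+machine"
  [runners.docker]
    image = "alpine:latest"
  [runners.machine]
    MachineDriver = "amazonec2"
    MachineName = "ci-%s"
    MachineOptions = [
      "amazonec2-region=us-east-1",
      "amazonec2-instance-type=m5.large",
    ]
    IdleCount = 5   # machines kept warm, waiting for jobs
    IdleTime = 600  # seconds a machine may sit idle before deletion
```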
Some more advanced users are using the Kubernetes executor: they use the cloud provider's Kubernetes offering and just use that. And some other people are also starting to explore the auto scaling groups provided by the cloud provider.
So, for example, on AWS there are EC2 Auto Scaling groups, and they will spin up and spin down a number of runners depending on CPU or job utilization and things like that. Apart from that, that's all about the runners that the users manage themselves. For gitlab.com we provide shared runners, which GitLab itself handles, and they are usable by anyone who has an account on gitlab.com.
A
Ideally,
if
something
works
on
their
laptop,
it
should
work
on
our
ci
infrastructure.
Our
current
solution
is
using
gcp
using
the
docker
machine.
We
run
one
vm
per
job
for
security
reasons,
and
we
provide
just
one
vm
for
any
kind
of
job.
We don't provide any different VMs whatsoever, and what I'm talking about right now is just unprivileged containers. Privileged containers come into play when you want to use tools like docker-in-docker to build a Docker image through our CI, and our solution there is pretty much the same as before; we use the same solution for both scenarios.
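For reference, the privileged docker-in-docker case boils down to one toggle in the executor's Docker section; this fragment is a sketch of the commonly documented setup, not a hardened recommendation.

```toml
# Sketch: privileged mode for docker-in-docker builds.
[[runners]]
  executor = "docker+machine"
  [runners.docker]
    image = "docker:24.0"
    privileged = true            # required so the job can run its own dockerd
    volumes = ["/certs/client"]  # share TLS certs with the dind service
```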
Kubernetes is good for a greenfield project; it's pretty easy to get started with. Just use our Helm chart or operator. Most cloud providers support Kubernetes and support autoscaling out of the box, and then you can just use the Kubernetes executor to schedule pods and get all the benefits of Kubernetes, with bin packing, termination, and so on and so forth.
Where it lacks at the moment, and the pain point most users are seeing, is that we don't provide any guidance on requests or limits, in the sense that we don't inform users that they should set them and that it's best practice to set them. So most of the time you end up seeing one job eating up a whole node just to run that job, because the administrators of the runner or the users of GitLab CI do not set the correct limits.
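The guidance could be as small as a documented fragment like this one; the [runners.kubernetes] keys are real config.toml options, but the values are illustrative, not recommended defaults.

```toml
# Hypothetical requests/limits for build pods; values are examples only.
[[runners]]
  executor = "kubernetes"
  [runners.kubernetes]
    namespace = "gitlab-runner"
    cpu_request = "500m"       # what the scheduler reserves per build pod
    cpu_limit = "1"            # a single job can no longer eat a whole node
    memory_request = "1Gi"
    memory_limit = "2Gi"
    helper_cpu_limit = "250m"  # cap for the helper container
    service_cpu_limit = "500m" # cap for service containers
```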
There are some usability issues and bugs inside the executor itself that we're still working on; for example, the entrypoint of the build container not running, or pods sometimes being left behind just because of some networking issues. And sometimes we also end up picking up a job when the cluster is at capacity: we schedule a pod, and the job is running from a GitLab perspective, but we're still waiting for the pod to be scheduled.
So we end up with issues like a job being timed out when it didn't actually run, and sometimes debugging a problem on Kubernetes is really hard, just because we don't provide any guidance or any ways to debug things. That is something we should improve upon as well.
Apart from Kubernetes, there is the more popular and more simplistic scenario of using Docker Machine. This is what we use on gitlab.com as well. It autoscales depending on the job queue: if you have a hundred jobs, it's going to scale up by a certain amount; if you have 10 jobs, it's going to kill a certain amount.
You can also schedule autoscaling, in the sense that you can have a certain amount of machines from, for example, 9 AM to 5 PM, because that's when most of the CI fleet is going to be used. The other reason we use it for gitlab.com is that it provides stronger isolation for us, in the sense that we create one VM per job, so each job is contained at the virtualization layer: if a job breaks out of our containers, it's still locked inside the VM.
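Both of those behaviors map onto existing config.toml knobs; in this sketch the schedule and counts are invented examples, while MaxBuilds = 1 is the setting that gives the one-VM-per-job isolation.

```toml
# Sketch: one VM per job, plus a larger warm pool during working hours.
[[runners]]
  executor = "docker+machine"
  [runners.machine]
    MaxBuilds = 1    # a VM is deleted after running a single job
    IdleCount = 10   # default warm pool outside the periods below
    IdleTime = 1800
    [[runners.machine.autoscaling]]
      Periods = ["* * 9-17 * * mon-fri *"]  # 9 AM to 5 PM, weekdays
      IdleCount = 50                        # bigger pool during work hours
      IdleTime = 3600
      Timezone = "UTC"
```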
The whole point of this presentation is Docker Machine. There are a lot of bugs and feature requests when it comes to talking to the cloud provider, like setting a specific size for the disk, labels on the machine, and so on and so forth. Each cloud provider has its own methods of doing things, and we have to interact with each cloud provider's API, which is a lot of maintenance cost.
As we already mentioned, Docker Machine is no longer maintained, and high availability is something that the user has to handle themselves. With Kubernetes you get HA automatically, because it restarts the pod and so on and so forth, but with Docker Machine you don't, unless you put the Docker Machine runner in an auto scaling group or something like that. There are also some strange concurrency behaviors from time to time.
A
There
are
no
bugs
like,
for
example,
when
you
start
up
talking
machine
for
the
first
time,
there's
some
certificate
issues
or,
if,
like
you,
get
a
bunch
of
jobs
scheduled
at
the
same
time
on
the
same
runner,
the
idle
machines
will
start
removing
or
deleting
while
adding
more
machines
which
trash
structures
some
instances,
and
then
the
killer
becomes
more
of
a
security
risk
than
it
already
is
just
because
it's
not
just
contacting
gitlab
for
jobs.
It's
also
communicating
with
your
cloud
provider,
so
it
can
delete
and
create
machines.
So if those tokens get exposed, it's much easier for an attacker to delete instances and things like that.
There are also some customers using the cloud provider's VM autoscaling, in the sense that they just have an autoscaling template: a VM with GitLab Runner installed, using the Docker executor.
And then they scale up and down depending on CPU and memory usage, for example. This has most of the autoscaling capabilities that Docker Machine has, because you can also use specific metrics to decide how much to scale. The problems with this are, one, it's not very popular, and two, it depends on the cloud provider, because each cloud provider has its own autoscaling features and it can be confusing, especially in multi-cloud deployments; for example, if you deploy both to GCP and AWS, both of them have different autoscaling terminologies and behavior.
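With that approach the runner configuration itself stays completely static, because all the scaling happens outside the runner; a hypothetical config.toml baked into the autoscaling template's VM image might be as small as this, with placeholder name, URL, and token.

```toml
# Hypothetical static config baked into an autoscaling-group VM image.
# The cloud provider adds or removes whole VMs; the runner never scales itself.
concurrent = 4   # jobs per VM; the group size multiplies total capacity

[[runners]]
  name = "asg-node"
  url = "https://gitlab.com/"
  token = "RUNNER_TOKEN"
  executor = "docker"
  [runners.docker]
    image = "alpine:latest"
```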
So between the two you might not get consistent behavior, and GitLab does not provide any guidance on how to do this, why you should do this, or when you should choose this. Another problem is that some cloud providers do not have all the autoscaling capabilities, so if I'm using such a cloud provider, I can't really implement autoscaling there.
This is a comparison of autoscaling across the cloud providers (most of the cloud providers that we see, and also some self-hosted software like OpenStack and Nomad), and a comparison with Docker Machine. As you can see, no cloud provider has the same features as another; everyone has their own implementation. And if there's something that you don't understand from the features section, there's a more detailed description about it in the issue.
For Windows we have our own implementation called autoscaler. It works really well for the gitlab.com scenario, where we just create one VM per job. It follows a lot of the methodology behind Docker Machine: just create a VM, run the job in that VM, and delete it. And since it follows that methodology, it still has most of the problems of Docker Machine, in the sense that we still have to talk to the cloud API, and if we want to support more cloud providers, we have to set up a new integration with each provider.
We're redoing some of the work that Docker Machine does, but for Windows, and it's not really clear how to set it up for self-hosted customers, just because it uses the custom executor and the drivers methodology, which is a bit more convoluted to set up than plain GitLab Runner.
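To give a feel for that, below is a hypothetical sketch of how a driver gets wired through the custom executor's lifecycle hooks; the config_exec, prepare_exec, run_exec, and cleanup_exec keys are the real custom executor options, but the driver.exe path and its subcommands are invented for illustration.

```toml
# Hypothetical custom executor wiring for a VM-per-job driver on Windows.
[[runners]]
  name = "windows-autoscaler"
  url = "https://gitlab.com/"
  token = "RUNNER_TOKEN"
  executor = "custom"
  [runners.custom]
    config_exec  = "C:\\autoscaler\\driver.exe"   # invented path
    config_args  = ["config"]
    prepare_exec = "C:\\autoscaler\\driver.exe"
    prepare_args = ["prepare"]    # boots a fresh VM for the job
    run_exec     = "C:\\autoscaler\\driver.exe"
    run_args     = ["run"]        # runs the job script inside that VM
    cleanup_exec = "C:\\autoscaler\\driver.exe"
    cleanup_args = ["cleanup"]    # deletes the VM afterwards
```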
Then there's also autoscaling by the cloud provider for Windows; this has all the same benefits as autoscaling by the cloud provider on Linux.
We just don't provide any guidance on it. And yeah, macOS: we don't really have any solution for it, nor any experience with it at the moment. We are currently working on providing something, and that will probably use the same autoscaling methodology as Windows, just because most macOS cloud providers don't have an autoscaling solution out of the box; that is something we have to provide ourselves.
So those are all the solutions that we have now. There are still some problems that we're interested in solving, and some of them are closely related to security, with Docker Machine and things like that. There is little to no observability into what's actually happening on the machine: it's an ephemeral VM that sometimes isn't monitored properly, and if it's using privileged containers, users can easily escape from the container and turn off the monitoring if they want to do some harm.
There's no alerting if there's any unexpected behavior on the machine; for example, if someone escapes from the container, there's a really high chance they're not doing anything legitimate. And there's no termination on well-known bad behavior, for example bitcoin mining and things like that.
We always have to spend the first few minutes understanding what the actual setup looks like, and also, when a user sets it up for the first time, we don't provide any checks or tests that things are working as expected. It's up to them to actually do all this, and we provide no guidance on which setup to use.
So we went through all the setups, like Kubernetes, Docker Machine, Docker, shell, or using the autoscaler from the cloud provider, but we don't provide any solution or guidance to the user; we don't tell them "hey, for your scenario you should use X", for example. And most of the time they also have to think about HA themselves, in the sense that, yes, Docker Machine provides autoscaling, but it does not provide high availability.
Ideally, both self-hosted and gitlab.com shared runners should use the same kind of solution, just for dogfooding purposes. There's too much of a disjoint between gitlab.com and a self-hosted customer, just because we use our own homegrown configuration management with Chef, for example, and they use something else, so we can't really provide any guidance.
There's cost management around running the runner managers themselves, in the sense that they should be cost effective and should not be a massive cost burden for the customer. Even running the jobs themselves should not be a burden for anyone, in the sense that if there's a way to use preemptible VMs, or spot instances on AWS, they should be able to do so. And high availability should come out of the box with what we provide, not something that the user has to opt into or think about extra after they deploy.
One of the biggest pain points with the first runner as well is that the disk ends up being filled when we run a lot of jobs on the same machine, and we need to provide some kind of hooks or cleanup mechanisms in our setup so that users never see this issue. And now, from a security perspective, we need to be able to stop any bitcoin miners. It's not just us; it can even happen on their own runners.
For example, a rogue employee just runs a bunch of jobs to mine bitcoin. We need to provide some kind of limits on CPU and memory; right now we're already doing that with VMs, but we can do a bit more. For example, we need to limit bandwidth and network access, so that, for example, certain jobs can't access the network for specific domains, or can only use a certain amount of the bandwidth.
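CPU and memory caps are already expressible per job container today; bandwidth and domain-level network limits are not, which is the gap. A sketch of the existing knobs, with example values:

```toml
# Sketch: per-job resource caps with the docker executor; values are examples.
[[runners]]
  executor = "docker"
  [runners.docker]
    image = "alpine:latest"
    cpus = "2"          # limit each job container to two CPUs
    memory = "4g"       # hard memory cap per job container
    memory_swap = "4g"  # same value as memory disables extra swap
```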
And for gitlab.com, if we are using a container-based solution, we should still use a locked-down host OS, in the sense that even if they break out of the container, once they are on the host OS they can't really do much, because it's locked down. And all this should be done in a way that everything is observable, meaning the security team can check what kind of syscalls and binaries are being executed, so they can do better analysis, and even the network behavior, like what kind of domains
and ports users are using, and things like that. And we should have fail-closed scenarios: if something goes wrong in a job or something like that, we don't just keep running the job; we terminate it and then leave it up to the security team to investigate further. And we should provide multiple layers of isolation, multiple layers of security, so that if one layer is broken, the attacker still has another layer to go through.
So we went through a lot of things. There are a few things that we didn't go into: scheduling changes, by which I mean job scheduling, so, for example, having one runner at a higher priority than another runner. Say runner A is running on AWS,
one runner is running 200 jobs and the other runner is running 500; we did not go into how we can balance those out. We also did not investigate cost optimization, and the reason for that is that cost optimization comes when you have a solution at hand, and right now we do not have a solution for a replacement for Docker Machine, for example.
So this is the last slide. This is the main issue: there's a lot more detail about all of this in the main issue, and also some links to Google Docs about interviews with some customers (this is an internal link, for GitLab employees only), and also some interviews with GitLab employees that deal with customers every day.
So that's it. Feel free to open up the issue and ping me directly on it if you have any more suggestions, ideas, or feedback. And yeah, thank you.