From YouTube: Delivery: Hands-off production deployments
Description
Recorded on 2020-09-15
Slides: https://docs.google.com/presentation/d/1dfV5LDTAeLxIwpy5P4rIi3tCNTo1U2gFpiyMtkPHFyY/edit?usp=sharing
Main epic: https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/280
A
Also, we don't have a good strategy for rollbacks, which is very important for the human decision of a release manager, because they need to be around for an extended period of time to make sure that, if something goes wrong, they are available to help resolve the situation and provide all the contextual information about the deployment.
A
Now we have a new process, and we are keeping track of active incidents and active change requests with GitLab issues. So we could evolve release tools to check that information and make the first decision about whether it is a good moment to start a deployment, but we still have a manual check, which is error-prone.
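The automated check described above could look roughly like the sketch below. It is only an illustration: the label names, the `Issue` class, and the function names are hypothetical, and the issues are plain in-memory objects rather than the result of any real API call.

```python
from dataclasses import dataclass

# Hypothetical label names. The talk only says that active incidents and
# change requests are tracked as GitLab issues.
BLOCKING_LABELS = frozenset({"incident", "change"})


@dataclass(frozen=True)
class Issue:
    title: str
    state: str  # "opened" or "closed"
    labels: frozenset


def blocking_issues(issues):
    """Return the open issues that carry at least one blocking label."""
    return [
        issue
        for issue in issues
        if issue.state == "opened" and issue.labels & BLOCKING_LABELS
    ]


def can_start_deployment(issues):
    """A deployment may start only when no blocking issue is open."""
    return not blocking_issues(issues)


issues = [
    Issue("Database failover", "closed", frozenset({"incident"})),
    Issue("Upgrade load balancer", "opened", frozenset({"change"})),
    Issue("Typo in docs", "opened", frozenset({"documentation"})),
]
for issue in blocking_issues(issues):
    print("blocked by:", issue.title)
print("can deploy:", can_start_deployment(issues))
```

Keeping the decision in one pure function makes it the same check every time, which is the point of replacing the manual step.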
A
So, looking at our production dashboards, we have other metrics that we can use for checking for errors. We are using multiple burn-rate alerting windows: we have a monthly error budget, and we count the number of errors over a period of one hour and of six hours. Basically, if in one hour we are burning more than two percent of our monthly budget, we will end up violating the budget by the end of the month. The same goes for the six-hour window: if we are over five percent of the monthly budget, we will violate the budget by the end of the month as well.
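The arithmetic behind the two windows can be sketched as follows. The two percent and five percent figures come from the talk; the 30-day month, the steady-traffic assumption, the SLO target in the example, and all function names are assumptions made for illustration.

```python
PERIOD_HOURS = 30 * 24  # assumption: the "month" is 30 days long

# (window length in hours, fraction of the monthly budget) as stated in the talk.
WINDOWS = ((1, 0.02), (6, 0.05))


def burn_rate(window_hours, budget_fraction, period_hours=PERIOD_HOURS):
    """How many times faster than 'exactly on budget' the budget is burning."""
    return budget_fraction * period_hours / window_hours


def hours_to_exhaustion(rate, period_hours=PERIOD_HOURS):
    """Hours until the whole monthly budget is gone at a sustained burn rate."""
    return period_hours / rate


def budget_fraction_burned(errors, requests, slo_target, window_hours,
                           period_hours=PERIOD_HOURS):
    """Fraction of the monthly budget consumed in one window.

    Assumes traffic is roughly steady over the month, so the window's error
    ratio can be compared directly with the ratio the SLO allows.
    """
    error_ratio = errors / requests
    allowed_ratio = 1.0 - slo_target
    return (error_ratio / allowed_ratio) * (window_hours / period_hours)


def violated_windows(samples, slo_target):
    """samples maps window_hours -> (errors, requests) seen in that window."""
    return [
        window
        for window, threshold in WINDOWS
        if budget_fraction_burned(*samples[window], slo_target, window) > threshold
    ]


for window, fraction in WINDOWS:
    rate = burn_rate(window, fraction)
    print(window, "hour window: burn rate", round(rate, 1),
          "budget gone in", round(hours_to_exhaustion(rate)), "hours")

# Hypothetical traffic numbers with a 99.9% SLO target.
samples = {1: (200, 10_000), 6: (300, 60_000)}
print("violated windows:", violated_windows(samples, slo_target=0.999))
```

Under these assumptions, two percent in one hour is a pace that would exhaust the budget in about 50 hours, and five percent in six hours would exhaust it in about five days, which is why both lead to a violation before the month ends.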
So we came up with this idea of having a new threshold which is more sensitive than the one that we are already using for production deployment. So right now we have three thresholds. The first one is for the customer SLA.
A
If we take a look at this diagram here, there's a timeline of a deployment. Basically, we have a machine, the deploy box, that runs migrations. So we can think about this timeline: we have the database running schema A, then the deploy box runs the migrations and the schema is schema B. As soon as we have the new schema, all the machines can start rolling out the new code, so we are switching from version N to version N+1.
A
So we start running post-deployment migrations as soon as every machine is running the new code, version N+1. This means that when we are on schema C, it is no longer backward compatible with the old version of the code. So this is a point of no return: as soon as we have schema C, we can't go back to the old version of the code.
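The timeline above can be modelled as a small sequence of stages. This is a sketch only: the stage names and the function are hypothetical, and it assumes that schema B still works with version N, since the fleet keeps running the old code while the rollout is in progress.

```python
from enum import Enum


class Stage(Enum):
    SCHEMA_A_CODE_N = 1         # before the deployment starts
    SCHEMA_B_CODE_N = 2         # the deploy box has run the regular migrations
    SCHEMA_B_CODE_N_PLUS_1 = 3  # every machine runs the new code
    SCHEMA_C_CODE_N_PLUS_1 = 4  # post-deployment migrations have run


def can_roll_back_code(stage):
    """Version N can be restored only while schema C is not yet in place."""
    return stage is not Stage.SCHEMA_C_CODE_N_PLUS_1


for stage in Stage:
    print(stage.name, "-> code rollback possible:", can_roll_back_code(stage))
```

The last stage is the point of no return described in the talk: once the post-deployment migrations have produced schema C, the old code no longer matches the database.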
A
Then, it's clear to everyone that we need the ability to safely roll back, and we need to be able to prevent a deployment if the system is not healthy. We also need to be able to promptly detect anomalies and commence an automated rollback. Then, phased rollouts with baking time are useful for spotting anomalies before completing the production deployment, which means that we are restricting the blast radius of a change that introduced new errors to the first machines in the fleet, instead of going all in, deploying to the whole infrastructure, and then having to roll back every single machine. Another interesting tool is load testing in production. Basically, it's the ability to steadily increase the traffic to one specific node (we can think about canary, for instance, in our case) so that we can compare metrics between the new version and the rest of the fleet and make decisions based on that.
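The idea of ramping traffic to canary and comparing its metrics with the rest of the fleet can be sketched like this. Everything here is an assumption made for illustration: the traffic steps, the tolerance, the floor, the function names, and the fake metrics source stand in for whatever the real tooling would use.

```python
def error_ratio(errors, requests):
    """Errors per request, treating an idle node as error free."""
    return errors / requests if requests else 0.0


def ramp_canary(traffic_steps, observe, max_relative_increase=0.5, floor=0.001):
    """Increase canary traffic step by step and stop at the first bad step.

    observe(percent) returns two (errors, requests) pairs: one for canary and
    one for the rest of the fleet. Canary is tolerated while its error ratio
    stays within max_relative_increase of the fleet's ratio (or under floor).
    Returns the last traffic percentage that looked healthy, or 0 if none did.
    """
    last_healthy = 0
    for percent in traffic_steps:
        canary, fleet = observe(percent)
        allowed = max(error_ratio(*fleet) * (1 + max_relative_increase), floor)
        if error_ratio(*canary) > allowed:
            break
        last_healthy = percent
    return last_healthy


def fake_observe(percent):
    """Hypothetical metrics: canary degrades once it gets over 5% of traffic."""
    fleet = (10, 10_000)
    canary = (1, 1_000) if percent <= 5 else (30, 1_000)
    return canary, fleet


print("last healthy step:", ramp_canary([1, 5, 10, 25], fake_observe))
```

Stopping at the first unhealthy step keeps the affected traffic small, which is the same blast-radius argument made for phased rollouts.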
So, hands-off deployment: why now?
A
In the last few months we reached our average MTTP goal of 24 hours. We could lower the goal and increase the auto-deploy branch creation frequency, but it's unrealistic to have more than two or three hands-on deployments each day. The span of attention required from an engineer, a release manager, during a deployment is really high, and it would not really be safe to do more than what we are doing now. Also, increasing the frequency will increase the likelihood of skipping releases, basically because we may end up deploying some releases only on canary, and then, when we want to promote something, we aren't sure, because we already have something new ready. It also affects other performance indicators, because we can imagine that our deployment metrics will give us an edge on mean time to detection, because we have...
A
Two of them are the assisted phase, and two of them are the automated phase. The assisted phase is broken down into assisted deployment and assisted rollback. The first one is about creating a consistent way to decide whether it's time to promote a build or not, and assisted rollback is about providing a manual rollback option in case of an incident. Once we have refined those assisted techniques, we will move on to the automation part with automated deployment, so that we will create a predetermined release window, and within that window...