From YouTube: 2021-03-16 Delivery team weekly rollbacks demo
Description
Discussion on possible rollback failure scenarios and how we could test them
A
Tough, tough times, but still glad your test is negative. That's halfway! Right, awesome. Scarborough is in that incident, so I'm not sure if he'll make it; we should start without him, I think.
C
Okay, let's start then. We don't have plans for demos today, so this is kind of... Basically, the reason behind this is that initially we thought about testing this in production, and maybe it wasn't really clear what happened last week, but basically we have always tested the happy paths. We were always lucky, except for the missing machine, but that wasn't in the cleanup phase.
C
So yeah, it's more of an open question: how can we test this? Let's try to make a plan together.
B
Basically, yeah, I just commented there that the idea would just be to create a file to fill up the disk, right? I mean, that's what we can do, and it works as a manual step; I don't think you need any automation for this. So: on one selected node, create a big enough file in the temp directory, then run the rollback or deployment and see how it fails.
B
I mean, I would do this in staging, and then maybe send a note to the SREs that we are going to fill up a disk for testing, but I think there's not much more needed; it's staging anyway, so that shouldn't be an issue. And in production we have often enough had these issues with API nodes failing because they were full, so we know what happens there. Testing in staging should be good enough, I think. Okay, so.
A
I mean, so if we fail the rollback: do you have the permissions, Henry, to actually remove that file from the machine afterwards as well?
B
Oh, by the way, I don't know what permissions the rest of us have. The release managers, like developers: do you all have root?
C
The point is that the access grants, the role-based entitlements (I don't even remember the right name)... Basically, the definition of what a backend engineer in Delivery is, the permissions they have, was changed after the creation of the team. So I think each of us is really in a different situation, but in theory we should be allowed to have root access on staging machines, but not on production.
C
I re-normalized my situation myself, just kidding. These are the new rules, and I have most of them, so...
C
Oh yeah, it was a real approval, but it was kind of just stamping it, right? Just saying: yeah.
D
Well, we can discuss it in the issue, but should we also have admin access on staging? I think I have... yeah, I have admin access on staging.
C
You should have it; it's all defined in the thing: on which instances you should have admin access, whether it's with your own user or you need an extra user. It's very detailed. It also goes down into what access you should have in AWS and in DigitalOcean, if you're using them. So there's a lot of detail.
C
So, Henry, are you willing to just try to frame this? I mean, we can do this together, but you started with your idea, right, filling up the disk. So maybe we can work on this together as a team, trying to figure out: when do we want to fill it up? How can we fill it up? What do we expect? What is the failure going to look like, and how should we fix it?
B
Yeah, I think the first thing is to test how long it takes to fill up the disk, because if you need to fill up 10 gigabytes it could take a while. I mean, not too long, but I'm not sure how much you would need to fill up. So the first step is testing, so we know how long it takes to write a 10 gigabyte file, maybe. And then we need to measure how much free space there is and how much we need to fill up to really make it fail.
B
It should be an easy calculation. And then, in the issue, I already pasted a command we can use to do this with dd, to just create a file of the size that we want. Then we should be done and can start the rollback and everything else and see how it fails, hopefully. And after that we should of course make sure to delete the file again.
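A minimal sketch of that manual step, assuming a /tmp target; the file name and the 10 GB size are placeholders, not the numbers the team would actually use:

    # check current free space on the node's relevant filesystem
    df -h /tmp

    # write a large file to eat up space (size is a placeholder)
    dd if=/dev/zero of=/tmp/fill-disk-test.img bs=1M count=10240 status=progress

    # ... run the rollback / deployment here and watch how it fails ...

    # clean up afterwards so the node goes back to normal
    rm /tmp/fill-disk-test.img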
C
Yeah, I was thinking: do we want to make it fail after the warm-up or before? Because, from my point of view, when things fail during or before the warm-up phase it's kind of no big deal, since we know that nothing has started happening yet. So my suggestion here would be to make it fail when it hurts more, right, during the real deployment; so filling it up after passing the warm-up phase. And I...
C
A question for whoever may know the answer: instead of using dd, maybe we can use truncate, where you can provide the size. But I'm not sure if the Linux kernel is smart enough to figure out that there is still real space available, because with truncate you can say "create a file of 10 gigabytes" and it will not actually write it; it just says, okay, there's a 10 gigabyte file on disk. But then, I don't know, that's...
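For what it's worth, a quick way to see the difference on a Linux machine: truncate only sets the apparent size and produces a sparse file, so the filesystem still reports the space as free, while fallocate (or dd) actually reserves the blocks. Paths and sizes here are placeholders:

    # sparse file: apparent size 10G, but almost no blocks used and df is unchanged
    truncate -s 10G /tmp/sparse-test.img
    du -h --apparent-size /tmp/sparse-test.img
    du -h /tmp/sparse-test.img

    # fallocate really reserves the space, so df shows 10G less available
    fallocate -l 10G /tmp/full-test.img
    df -h /tmp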
D
So I do have a question. Could this problem of not having enough disk space lead to not being able to insert database records? I think the answer is going to be that our features just stop working for some reason.
C
It really depends. If we end up with a disk that has no space at all, the type of failure is really unpredictable, because it depends on what is happening at that moment; you may crash the running Puma, and who knows what happens. But I think the point that Henry made, and that Robert was also trying to figure out the numbers around, is that in order to install a package it has to be extracted, so that alone is kind of a big expansion phase.
A
I think also, to your point, Mario: yeah, absolutely, and that's kind of the point of this testing now, to actually find out what might go wrong and how we respond to it. We see this fairly often in production, so there's a decent chance we'd hit this if we rolled back production. So yeah, we should absolutely see how it goes wrong.
C
Just to repeat it once more, sorry: we pick one machine and take a look at the available disk space. We fill it up enough so that we have enough space for the warm-up phase, so downloading the package, but not enough to expand it. Just a trivial example: let's say we have a 200 megabyte package.
C
We leave just 300 megabytes free, because we know that once expanded it is 2 gigabytes, something like that, just random numbers. So we should be able to download it but not be able to expand it, and then we don't have to deal with timing this against the rollback itself, right? We just have to check after warm-up that there really is only a tiny fraction of space left, and it should fail.
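A rough way to leave a fixed amount of free space on the target filesystem, using the made-up numbers from the example above (the mount point is a placeholder):

    # available space, in MB, on the target filesystem
    free_mb=$(df -BM --output=avail /tmp | tail -1 | tr -d ' M')

    # leave ~300 MB free so the download fits but the extraction does not
    fill_mb=$((free_mb - 300))
    dd if=/dev/zero of=/tmp/fill-disk-test.img bs=1M count="$fill_mb" status=progress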
E
If it fails during the warm-up, though, that's not really a meaningful test, right? In reality we would clean the cache, free up disk space, and then it would continue rolling out the package. I think a more interesting failure would be certain hosts failing at actually restarting Unicorn or Puma or whatever, so that we're in an inconsistent state during the deployment. I think that's more interesting.
C
The asset is no longer there, and the job is removed as well. Okay.
C
The thing that Robert was suggesting... so, last time we identified kind of two main areas. One is something that is recoverable, regardless of what the error is: a machine that goes offline. If we can spin it up again, it's recoverable; if we can't, because it was deleted and there's no way we can regenerate it, then it's an unrecoverable failure, and we have another scenario for that one. And so we can fill up the disk, or (Eric proposed it) another thing that happened today.
C
Basically, we downloaded a corrupted package, so it was impossible to expand it. All of this involves manual interaction from, I would say, an SRE; because, yeah, we have access to staging machines, but let's say SREs in general, to fix the machine, because it's recoverable. But that doesn't mean we are targeting self-recovery here. We are not doing it for regular deployments, so why should we start with rollbacks by addressing self-recovery?
C
So the point here is: if something fails on a machine during the rollout, when a machine fails, what happens? My assumption is that we will reach a point where we have two versions running. It's not really clear if it will be just one machine missing out; it really depends on when it fails.
C
So if we just upgrade everything else except that one, can we then fix it and rerun it, and will the end result be a system which is completely rolled back? I think the real question we want to figure out is: can we really retry the job, and will it do the right thing? And this type of... I don't think... I don't think we should.
B
So, in just one sentence: we want to make sure that everything, every step we are doing, is idempotent, right, so that you can repeat it with the same result. I think that's the thing we need to figure out and want to verify by testing. And if we identify things which are not idempotent, then we need to be extra careful, and we need runbooks or something else to know how to deal with that, like restarting the whole deployment or something else, right?
C
Yeah, there's also another thing related to this, which I think is about being aware that things can go wrong and not panicking when it really happens. It's just a matter of: things can go wrong, so let's try to figure it out. Let's show that things can go wrong and that we can recover, instead of having to deal with this for the first time during a real incident, during a real production rollback.
A
Yeah, that makes sense. So, based on Henry's summary of that: is this filling up of the disk the best first test case to test this out, to test that we can just rerun rollbacks?
B
I think this failing during the install, after the pre-warm phase, sounds very much like a condition where we can't really fix it by repeating anything, right? So it needs manual interaction.
B
And after that, I guess we can rerun the job, which would be fine. So that would maybe be a good test, and we have some experience with that. And I don't know what else could break if we fill up the disk; I mean, as you said, we could have issues with processes just stopping because they can't write to temp or something like that. So it's very unpredictable.
B
As for what can happen: we can't have a solution for everything here, but if we just need to be able to roll back, then it's not that important that the single host is working perfectly, just that we can bring it back into a state that is expected, right? So if we try to cover this, that just means bringing it into a state where we can retry the same job again. Quite important.
C
Okay, something that came out of the conversation is QA failures. When we were discussing QA failures, we realized that, wait, we are not running QA triggers at the end of a production deployment. We always saw QA in our rollbacks because we were rolling back staging, but this would not happen during a production rollback. Now, do we want to run it only for rollbacks?
D
I think we should. It will give us some confidence in this package, even though it was already tested on all of our environments. And if something goes wrong and it is a legit failure, well, then we need to investigate it like any other failure in our environment; or perhaps it's a flaky one and we just need to retry.
E
I can only guess. I believe the canary deploy runs both a canary and a production QA job. I don't know why we don't do it after the production promotion.
C
I just have some guesses here. I think the deployment pipeline is already really long, and because we tested it on canary we kind of expect the same behavior, and we already run production QA smoke tests, I think once every hour, something like that. So it's probably also a matter of not putting too much load on the system, and things like that. But I mean, I'm just guessing here, so I...
A
I also wonder, if we do run them on canary, whether we've left anything behind in the database that might conflict. I mean, I hope they clean up, but they may not, right? Not all tests clean up everything, so that might be a good thing to answer. I agree with Myra that it would be nice if we can run these in production, but it might be worth trying to find out why we don't do that already.
C
So far we have always been discussing this kind of synthetic example where we decide to roll back for no reason. We are taking a perfectly working system and attempting to roll it back, which is exactly like rolling forward. I mean, there are complexities around this, but basically you have a system that behaves and you install something else.
C
So, a question. Oh, I just had some questions, now I remember. Do we know when it's really safe to stop a deployment? And if we have to make this call (and I mean the company now, not just our team), who should make it? Is it the SRE on call, who just says "kill the deployment, we are going to make a rollback", or is it us release managers who decide "no, things are going bad, we want to stop it now"?
D
Yeah, I think it should be a combination of both. In particular, I'm not quite sure when it is safe to stop a production deployment. I think it might be safe to stop it before it updates the fleet, but I'm not sure about the state of the machines when we do that. So probably an SRE can help with that.
C
Yeah, this is exactly the right answer, and I think everyone in our team has their own view on when they know it's safe to stop something; let's say prior to warming up, or between warming up and before it really starts upgrading things. But then, when you are in the middle of something, can you really hit that cancel button? Wow, this is... I'm...
C
Maybe this (I don't know what you think about it), maybe this could remove a lot of the stress around us hitting that cancel button, because it's really hard to parse what is happening in the Ansible output, so: is now the right time? Well, I don't know. So yeah, maybe this is a good action item, I think.
A
That's a great option. One question I have: in terms of your question of when we would want to trigger the halt, I think maybe the most common scenario I could think of is almost the question mark of "something doesn't look right", either from our metrics or from another alert.
C
So something like: "I'm not sure, something is going wrong here, so whenever you run the next production check, make it fail", so that it will stop what it's doing and we can investigate and decide if we want to roll back or roll forward. And this is good. But then we have the other option, which is that something failed.
C
So, if something fails during a deployment, we are in a broken state, and this is the conversation that Scarborough and I are having in that issue about our HAProxy. I think Scarborough's point is really good. No, he is not on the call, but I will just try to verbalize his idea, because I was learning a lot following that conversation. It's not exactly my area, so I'm not really comfortable discussing this, because I don't know exactly how it works. So, Henry, you can correct me.
C
So, tell me if I'm saying things that make no sense. Basically, all of our fleet is behind HAProxy, and every machine behind HAProxy has three states: it could be up (I think it's called "ready", but let's say up), maintenance, or drain. The happy path is that we drain machines (drain means that they receive no traffic), we upgrade them, and at the end we put them back in ready.
C
Maintenance means a human operator decided to put this machine offline, and so tooling should not touch it. Well, kind of: tooling can touch it, but it should not put it back into rotation. While drain means an automated system is handling it; it's in the middle of a deployment. So if it's still in drain when I'm starting a new deployment, something went wrong, and the deployment basically bails out and says: no, I'm not going to touch this, because something is wrong.
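For context, a minimal sketch of how those states are typically inspected and changed through the HAProxy admin socket; the socket path and the backend/server names here are made up, not the team's real ones:

    # list the current state of all backend servers
    echo "show servers state" | socat stdio /run/haproxy/admin.sock

    # an automated deployment takes a server out of traffic temporarily
    echo "set server be_web/web-01 state drain" | socat stdio /run/haproxy/admin.sock

    # a human operator deliberately taking it offline would use maint instead
    echo "set server be_web/web-01 state maint" | socat stdio /run/haproxy/admin.sock

    # put it back into rotation when the work is done
    echo "set server be_web/web-01 state ready" | socat stdio /run/haproxy/admin.sock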
C
So I was proposing complex solutions to this problem, but I think Scarborough's idea is very good here, which is: maybe we should just skip that check in a rollback, because when we are issuing a rollback we already assume that we are in a broken state. There will be an incident; so, for instance, the production pre-check will fail because there is an incident open.
B
To me that totally makes sense. I mean, it's an operator error if you set a machine to drain and don't move it to maintenance afterwards, if you intended to really keep it out of traffic for a longer time. And if it's only temporary, then I think the risk isn't that big that a rollback of ours pulls it up again.
C
Well, consider that this, let's call it bad behavior, is if anything an SRE thing, because I think I'm the only one that can put a machine in the drain state, but maybe I'm wrong. And if you are just draining something because you want to put it in maintenance, but you don't, you just drain it, it will also fail a regular deployment.
C
This is the question I have left in that issue: I think we need to check what is actually happening in the job, and if it's just checking for that state, then we can skip it. If not, we have variables in the deployer that tell us that we are rolling back, so we should act on that variable instead. So when we do the Ansible check, say "check if everything is maintenance or up", we just skip it if this variable is set.
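A rough sketch of that idea as a shell pre-check, assuming the deployer exposes something like a DEPLOY_IS_ROLLBACK variable; the variable name, socket path, and the grep on the stats output are all illustrative, not the real implementation:

    # during a rollback we already assume the fleet is in a broken state,
    # so skip the "everything must be up or maint" assertion entirely
    if [ "${DEPLOY_IS_ROLLBACK:-false}" = "true" ]; then
      echo "rollback in progress, skipping HAProxy state pre-check"
      exit 0
    fi

    # otherwise, bail out if any server was left in drain by a previous run
    if echo "show stat" | socat stdio /run/haproxy/admin.sock | grep -q DRAIN; then
      echo "found servers still in drain state, refusing to deploy" >&2
      exit 1
    fi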
A
And on your point, Henry, about educating people about maint versus drain: do you mean across all of infra, like, would other SREs also need to make sure they are using the right state for this?
B
Yeah, I think that's good education especially for SREs, because I know for myself I didn't know this for a long time. I always assumed: okay, I want to drain a node, so I put it into drain and then I do something with it. Maybe there's some issue with it and I leave it until tomorrow, and then I get reminded by somebody else: "oh, please put it into maintenance", because, you know, that's the right state for it. It's just a matter of knowledge; it's hard to...
B
Maybe we could have some kind of check for that, or an alert which needs to be silenced if you're only doing something with it temporarily. I'm not sure if there's a way to prevent this, but education is maybe the best way; and assuming that if you put something into drain manually, then it's mostly because of a change issue.
C
It's a bit hard to parse that error, so probably the first time you see it you just say "what is happening here?". But yeah, basically there's a list, and it involves the fact that you have this concept of frontends; there's weird math involved, and every time I'm back in release management, one of these checks fails.
C
I can't really understand how it works, but yeah: we have multiple frontends, which are HAProxy, then we have multiple backends, and there's basically a Cartesian product between the two. So you ask how many servers we have here, because it's kind of... but yeah, you see the error.
A
The halting of a deployment, as well as the skipping of the check; and do we have other stuff in here as well?
C
How can we safely stop an ongoing deployment (which deserves its own issue), and the thing we discussed about a kill switch, shut-off, command, whatever, so that when it's safe the deployment script, the deployer, will bail out and fail the ongoing deployment, instead of just us clicking cancel and hoping for the best.
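A minimal sketch of the kill-switch idea, assuming the deployer polls for a flag at its safe checkpoints; the flag path is entirely made up:

    # hypothetical flag location an operator would create to request a halt
    KILL_SWITCH=/etc/gitlab-deployer/halt

    # someone asks for a halt
    touch "$KILL_SWITCH"

    # the deployer checks the flag at each safe point and fails the run cleanly
    if [ -e "$KILL_SWITCH" ]; then
      echo "halt requested, stopping the deployment at a safe checkpoint" >&2
      exit 1
    fi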
C
The other one is about something going wrong, so we are in a broken state; broken in terms of HAProxy's view of our infrastructure, so HAProxy thinks that something is broken. How can we issue a rollback in this state, where usually we would do manual cleanup? How can we remove this manual cleanup in this specific case?
D
So, just to understand the problem for the first case: why would we need to stop an ongoing deployment to roll back? Is it because GitLab.com is down or something, and we really want to roll back as soon as possible?
C
Yeah, this is a good question. The idea is that you're rolling out a package that is broken. Let's say merge request search doesn't work at all, so a big P1/S1 issue, and instead of generating a complete outage, which means we roll it out everywhere and now it doesn't work no matter what, if we are lucky enough to spot this early we can stop the ongoing deployment, so only 10% of the traffic, for instance, is affected and rolled back.
D
Yeah, yeah, it makes sense. It would be a situation in which we, for some reason, didn't notice this in staging nor canary, and it is currently being deployed to production; which actually happened to me once, though it wasn't an S1/P1, it was an S2/P2. But yeah, I only noticed the failure on production, not on...
C
Canary, yeah. I mean, it could also be... yeah, it has happened that we figured out that something was broken during the deployment and it was not caught by QA or anything else before, and right now we have no good answers. I had to cancel some deployments, I don't remember if it was this shift or the previous one, and it's really stressful to figure out whether you can do it right now or not. So yeah, that's it.
C
We are missing a couple of checks and concepts at the beginning of it, so kind of extra cases, special cases, what to do if... Because we are always assuming that we are in a P1 incident, something bad is happening, but there's no ongoing deployment or anything like that; we just install the package, which is, as I said, and as you see it as well, specific.
C
I have another item here which is... yeah, it's more of a statement, but I am happy to discuss it. We know that we cannot start a rollback if GitLab.com goes down. We kind of discussed this briefly last week and I opened a couple of issues about it, and I was thinking I would deprioritize this work, basically. The reason being: there is a lot of ongoing work right now that we discussed, and this is a known limitation, but it is not worse than what we have today.
C
So let me put it this way: all the other failures are "we start a rollback and something goes wrong", so we need to make sure that we can get to the end of the rollback. But here it's at the very beginning of the rollback; we don't even have the option to roll back. So yes, I would love to be able to roll back also in the case of a complete GitLab.com outage, but it's not worse than what we have today.
A
One of the chances, though, is that this would have rolled through staging and, well, up to canary, right? And then taking it out well beyond canary... right, so we can't drain canary. It's scary, but it doesn't feel like the sort of thing that will happen. Like, has it... I hope it's never happened before. I say this: has it ever happened?
C
So basically we rely on that information, and if we can't get it out, there's no way we can roll back, kind of. I mean, we can try to figure out the packages by really going through the history of Slack and things like that, but the runbook will not help you; you just have to do something with what you have at hand.
A
Yeah, I feel like we should think about it along with the various... like, we've got a few gaps like this, right? So maybe we should take a look at those, but it seems like we have enough rollback stuff to be testing on whilst we think about that stuff. It'd be good.
B
I have a question on the worst-case scenario. If we would come into a very bad state by deploying something, the deployment goes wrong, and we can't reach GitLab.com anymore for rolling back, would we at least be able to manually go onto each node and try to reinstall the latest package, which is maybe still around there, or something stupid like that, to get us going again?
C
Maybe, Martin, you know better than me, but if I'm thinking about big incidents where we were offline or something similar, or where something was broken at the very core, so basically every page fails: either we patched it with a hot patch, or it was something, let's say, external to code changes, like load on the database, things that we had to handle externally.
F
The only times when we were hard down were the database (this is our recurring topic; apparently every two years we have the same thing) or when we wiped out our load balancers. We have never, ever had an issue where code broke something fundamental in the product, at least not in production. How did we catch it? We caught it on dev.gitlab.org, because it was running nightly packages and still runs nightly packages.
F
We caught it in QA, whether manual QA or automated QA, and we caught it in staging before it went out to production. The only time that something resembled core breakage was when we were fixing a security problem, and that security problem changed the behavior of the feature in question and fundamentally broke how the feature works for everyone involved; at that point in time we had to hot patch.
B
I mean, that would take a little bit more disk space, but then everything would be in place for a rollback, right, if you figure out later that we broke something with this deploy; because normally it takes some time after a deployment before you figure out that we need to roll back, I guess. I don't know if it would be worth it, but that would be an extra safety measure.
C
I don't know how we're doing this, because I'm quite sure that the cleanup is of the apt cache. So I don't know if we first download the package into some directory, or if we just pull it, basically, so we just download it from apt. And if that's the case, I don't know if there is an option in apt to say "clean up everything except this one", or something. So yeah, I don't know, but this is a good point.
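For reference, a rough sketch of the apt plumbing being discussed; whether the deployer actually works this way is exactly the open question, and the package version shown is a placeholder:

    # download a specific package version into the apt cache without installing it
    apt-get install --download-only gitlab-ee=13.10.0-ee.0

    # the cached .deb files live here; this is what a full cleanup would remove
    ls /var/cache/apt/archives/

    # "apt-get clean" wipes the whole cache, while "autoclean" only removes
    # package files that can no longer be downloaded, keeping current versions
    apt-get autoclean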
A
So what are the... what's the action from the maint and drain discussion?
A
One. And then, Myra, I think that leaves you with the...
A
Have a crack at writing out the test scenario for this: blocking a deployment and attempting a failure. I think you can probably assume you have the drain thing resolved and assume you have a halt-deployment mechanism, but it'd be good to at least think about what we would do on staging and what steps we would go through to actually test it.
C
Well, the investigation that was proposed about leaving the package in place: at the very least we should report it, so that we don't lose track of that item.