From YouTube: 2021-03-30 Delivery team weekly rollbacks demo
Description
Testing out a staging deployment when we have nodes in DRAIN state
B
Yeah, I'm still waiting for the arm to arrive. It should be here tonight, so for now it's sitting on this thing, which feels kind of voyeuristic and creepy, yeah.
B
It's a bit tricky, because the cable is bent in such a way that if I put it more to the right, it will sort of yank the camera down. Let me see if I can maybe fix it.
C
So, do you want to kick us off, Robert?
C
So, over to... over to the...
B
Demo... hold on, give me one sec. Actually, I'll just turn it off, because I don't think I'll need it with the screen showing anyway.
B
Get out of the way so I can go. I will share my screen and just stop this one. There you go.
B
Did I close the rollback demo? Pretty sure I had it in my history. Apparently not; let me get the calendar.
B
All right... first engineer... that's not my username. Let me fix that.
B
And let's see. First steps: read through some of the rollback scenarios. Today there are... so, yep, that I read. Okay. Testing steps: engineer, communicate to Infrastructure and Quality that we're going to test a rollback in staging.
B
Let's do that up front.
E
So I will just let you know when it's ready. Do we want to do more than one server, or is one fine? I think one is fine, because we bailed pretty quickly, yeah. That's what I thought. Okay, so if I do a get server state in staging for api...
B
So I guess that's the previous SHA, or say, the previous package name. Copying it.
B
Perform a rollback... I'll run that in delivery this time.
B
I mean, in theory that applies to many things; I mean, it should, yeah. I kind of think, looking at this output from the check command: it shows things like upcoming, current, previous, which makes sense. Previous package also makes sense. I'm just kind of curious if there's anything we could do to make it more obvious that the previous package line is what we want to go in that rollback command.
B
Maybe one approach would be to add a snippet of the command there, so it's like: previous package, version number, and then, like, /chatops run deploy rollback with that version number in it. That way you can just copy that entire command and run it, instead of copying this sort of template, then replacing the placeholder with the actual version, and then running it.
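A minimal sketch of that idea, in Python; the field names, version strings, and the exact rollback command flags are assumptions for illustration, not the actual tooling:

```python
# Sketch: render the "previous package" line of the check output so it
# already embeds a ready-to-run ChatOps command. Field names and the
# --version flag are hypothetical.

def render_check_output(state: dict) -> str:
    return "\n".join([
        f"Upcoming version:  {state['upcoming']}",
        f"Current version:   {state['current']}",
        f"Previous version:  {state['previous']}",
        # Print the whole command, not just the package, so responders can
        # copy-paste it without editing a placeholder template first.
        f"Previous package:  {state['previous_package']}",
        f"  Rollback with:   /chatops run deploy rollback --version {state['previous_package']}",
    ])

# Example with made-up version strings:
print(render_check_output({
    "upcoming": "13.11.202103300620",
    "current": "13.11.202103292220",
    "previous": "13.11.202103291620",
    "previous_package": "13.11.202103291620-a1b2c3d.4e5f6a7",
}))
```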
E
Decent so far, yeah. Would there be a way... I don't know what Slack's blocks allow us to do, but I'm wondering if we can make that like a little hidden option that we only see when we click on some link to show it, or something; kind of like an arrow that hides it.
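Slack's Block Kit has no native collapse/expand, but a common workaround is a button whose action handler updates the message to reveal the extra text. A rough sketch of such a payload; the action ID, wording, and version string are hypothetical:

```python
# Sketch: keep the rollback command hidden behind a button. Block Kit has no
# built-in collapsible section, so the handler for the button would call
# chat.update on the original message and append the revealed section.

collapsed_message = {
    "blocks": [
        {
            "type": "section",
            "text": {"type": "mrkdwn", "text": "*Previous package:* `13.11.202103291620`"},
        },
        {
            "type": "actions",
            "elements": [
                {
                    "type": "button",
                    "text": {"type": "plain_text", "text": "Show rollback command"},
                    "action_id": "show_rollback_command",  # hypothetical ID
                }
            ],
        },
    ]
}

# Block the action handler would splice in via chat.update:
revealed_block = {
    "type": "section",
    "text": {
        "type": "mrkdwn",
        "text": "`/chatops run deploy rollback --version 13.11.202103291620`",
    },
}
```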
C
We do. So does DRAIN basically mean that there's just no traffic, right? Like it's just out of use?
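For reference, DRAIN is one of HAProxy's administrative server states (ready, drain, maint): a drained server gets no new traffic but stays registered in the backend, while maint takes it out of use entirely. A minimal sketch of inspecting and changing those states over the admin socket; the socket path and backend/server names are assumptions:

```python
import socket

# Path to HAProxy's admin socket; an assumption for illustration.
HAPROXY_SOCKET = "/run/haproxy/admin.sock"

def haproxy_cmd(command: str) -> str:
    """Send one runtime API command and return the full response."""
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as sock:
        sock.connect(HAPROXY_SOCKET)
        sock.sendall(command.encode() + b"\n")
        # HAProxy closes the connection after answering; read until EOF.
        chunks = []
        while chunk := sock.recv(4096):
            chunks.append(chunk)
    return b"".join(chunks).decode()

# Dump the administrative state of every backend server.
print(haproxy_cmd("show servers state"))

# State transitions: drain (no new traffic), maint (fully out of use),
# ready (back into rotation). Backend/server names are hypothetical.
# haproxy_cmd("set server api_backend/api-01 state drain")
# haproxy_cmd("set server api_backend/api-01 state ready")
```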
C
I had a discussion item that maybe we could jump to. We've got the epic for 11, and this is the pipeline for code rollback we've been working towards. Now, I think, sure, we haven't got an SEI, but I think we can see how far we can get anyway. But the exit criteria for this epic are sort of complete as they're written, and the due date is today; and actually I don't think the epic is complete, based on other issues and other things we've got.
E
I don't... I really don't want to wait till an incident occurs for us to test it. I want to test it when we're all calm and collected, there are no incidents, we're all expecting this, it's okay! If we want to create a test package to be safer, sure, I don't care. I just don't want to wait for an incident to test.
B
Yeah, I kind of agree there. Like, saying "oh, we can do rollbacks" just because we can in staging might be a little too optimistic.
B
I do think, if we do a production rollback the first time, we might want to pair it up with a production change lock, just so that we can actually pick a day knowing there are not going to be, sort of, self-induced incidents. There could always be the sporadic random ones, but at least we'd have a day where we know: oh, we're not going to introduce a new database migration that's going to take two hours, for example. Beyond that, I think, so far, yeah: so far so good.
C
So the question mark we had previously, when we started thinking about whether we can go to production, is: what are all the ways that this can fail, and how will we be able to recover from it, so that when it does fail in production it's not the first time we've ever seen it? So from the stuff that you've been looking at over the last week or so, Mira: any other failure cases that you came across whilst you were going through that stuff?
F
Yep, sorry. So basically, one that I have been thinking about is that we need to roll back and there is a deployment in progress. So in that case, what do we do? Like, so far our option is to basically cancel the ongoing deployment, but we don't really have a way to tell when it is safe to cancel. Like, we can click that cancel button, but at least I have no idea what is going on with all those logs.
F
Those are not really readable for me, so it would be nice to have some sort of tool, like we were talking about last week, to safely trigger that on Slack, and then the tool would analyze when it is the safe time to cancel. But, that aside: once we cancel a deployment in progress, we need to know when it is a safe time to do it, and then what are we supposed to do? Are we supposed to roll back immediately?
F
Do we need to do something with the servers? That would be another scenario that would be nice to have before actually testing this in production.
C
Yeah, okay. So we have an issue for, like, the automated bit, but I think there's a step before that, which is... like, we can't automate something right now, right, because we don't know what we're automating. We need to have a safe way of knowing when to cancel.
B
One issue unrelated to that: we had, look, api-01 marked as drained, and it seems to have rolled back just fine. Double-checking: was that the expected behavior, or were we expecting it to fail?
E
Yeah, either the prepare job, or Ansible when it started the deploy, would have been like: it's already in DRAIN, this is a problem. But we didn't see that, so that's a good thing. And I just did a get server state, and it looks like that node is in the rotation again; so it looks like the rollback indeed completed, just like we wanted it to.
C
Very nice, that's great news, great news. When deployments fail, or are cancelled... I don't know if those end up the same... do things get left in DRAIN state?
E
Say a server ran out of disk space when it was installing the package: that one server would have been marked as failed in Ansible, and the run as a whole is going to stop at some checkpoint. But I don't know what that checkpoint is; maybe when that task completes.
E
We could retry the job at that point safely. If the failure occurs outside of that, like if it fails trying to push a node or take a node out of rotation, then nothing bad technically happened to the node itself; the deploy wouldn't have continued, because HAProxy was the failure case.
C
Okay, that's good news, right? So, from this rough sort of outcome, it looks like the rollback pipeline handles DRAIN state very nicely.
C
Which should, in theory... like, we don't know how to cancel a deployment, but you know, that's not really a blocker for rollbacks, right? We have a slightly annoying state where, if we see a deployment going through and problems start to happen, we don't quite know what to do; we should find a solution for that, but it's not really a rollback problem.
C
What about if we thought about this in the... so, the goal we're going against is improving mean time to resolution, right? And I think Euric has been painfully aware this week, and I know Robert has in recent weeks, that this number is quite high, right? So actually, what if, for now, we just made sure that if we ever had to cancel, like in the theory we're talking about here, we just cut it before the post-deployment migrations? Right? So: oh wow, everything looks terrible, and we're deploying on the fleet at the moment...
C
Yes, it means we are potentially rolling out bad things to the whole fleet, when maybe we could have only put them on half the fleet or a third of the fleet; but it's certainly not making it worse than it currently is.
E
I think that might be a little naive, because if we allowed it to continue up until post-deployment migrations, and things are getting worse even after, say, the first two nodes deployed, we're going to make it worse until we get to the step where we could actually watch the current set of jobs complete and then start the rollback process.
E
We've stopped it, just like you said, at post-deploy migrations, just because, you know, an incident started and we wanted to have some checkpoint, so that we could have a place where we understood where our environment stood.
E
I think the one time I was involved in that, we ended up just allowing the deployment to continue, because it was discovered it was something completely unrelated to the deployment, but...
B
There was actually a case, I think a day or two ago, or yesterday maybe even... maybe it was Friday, I don't remember... a deploy was going on, or at least it felt so; it later turned out to be a dry-run pipeline.
B
An incident popped up and people, I think it was Skyward actually, said: hey, can we, you know, pause the deploy while we're dealing with the incident? And my response there was basically: pause how, and when? Like, can I just cancel the pipeline, or wait? And then the result is: okay, we have to wait until, I think, post-deployment migrations are done, or until Gitaly is done. But because it was a dry run, it happened so fast I wasn't able to respond in time.
B
So there is a need for this every now and then, I think, but it is quite rare.
B
Back... let's see: rollbacks are done, QA is now running. Do we want to await the QA results?
C
Cool, awesome. I'd add... I asked as well... this is completely disconnected from what we were just talking about, but it's just what's in my mind; I'll add it to the epic. I checked up on the QA tests, and we don't run them as part of the production deployment pipeline, but there's no reason not to have these on the rollback pipeline. So it feels to me, at least, that a good use of a bit of extra time following a rollback is to run these tests.
B
Yeah, it seems QA was just skipped here. At least... QA orchestrating, QA full... see, QA smoke: is that running? Yes, that does actually run.
B
Yeah: pipelines, QA, trigger smoke. Let's look for...
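For context, a downstream QA run like this smoke suite is typically started through GitLab's pipeline trigger API. A minimal sketch; the host, project ID, trigger token, and variable name are placeholder assumptions:

```python
import requests

GITLAB_API = "https://gitlab.example.com/api/v4"  # placeholder host
QA_PROJECT_ID = 1234                              # placeholder project

# POST /projects/:id/trigger/pipeline is GitLab's trigger endpoint; the
# SMOKE_ONLY variable is hypothetical and would depend on the QA project.
response = requests.post(
    f"{GITLAB_API}/projects/{QA_PROJECT_ID}/trigger/pipeline",
    data={
        "token": "TRIGGER_TOKEN",
        "ref": "master",
        "variables[SMOKE_ONLY]": "true",
    },
    timeout=30,
)
response.raise_for_status()
print("Triggered QA pipeline:", response.json()["web_url"])
```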
B
Yeah, it seems to vary between 15 and 30 minutes, but it's a bit inconsistent. So yeah, let's see what the next steps are in the meantime.
C
So, just on the testing on production: from all the other failure scenarios that we've already outlined, and any we haven't yet outlined, what else do we need to test before we try and set up something for...?
C
So you should get it set up; the first step... we can do this as a dry run, so that would be a good first step.
B
Yeah, I think the proper approach for doing this is probably going to take a bit of effort, because there's a discussion about kill switches, basically, for this, and some different suggestions there. I think it was suggested last year that we use Consul, and I actually agree with that, because we're now sort of running into more and more cases where some sort of central authority of state would be useful, and I think Consul is a decent solution for that.
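A minimal sketch of what a Consul-backed kill switch could look like; the key name and Consul address are assumptions, while the /v1/kv endpoints are Consul's standard HTTP KV API:

```python
import requests

CONSUL = "http://127.0.0.1:8500"   # placeholder Consul agent address
KILL_SWITCH_KEY = "deploys/halt"   # hypothetical KV key

def deploys_halted() -> bool:
    """Deploy jobs would poll this before starting or continuing."""
    resp = requests.get(f"{CONSUL}/v1/kv/{KILL_SWITCH_KEY}?raw", timeout=5)
    if resp.status_code == 404:    # key absent: deploys allowed
        return False
    resp.raise_for_status()
    return resp.text.strip() == "true"

def set_kill_switch(halted: bool) -> None:
    """Flip the switch; every node reading Consul sees the same state."""
    requests.put(
        f"{CONSUL}/v1/kv/{KILL_SWITCH_KEY}",
        data=b"true" if halted else b"false",
        timeout=5,
    ).raise_for_status()
```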
B
...that's the reason we can't stop it halfway through.
C
Yeah, I think that makes sense. Being able to halt a deployment would give us more rollback use cases, but I don't think it blocks rollbacks.
C
Is everybody confident that they understand... I think it's in the runbooks, but it's a good thing to just have in your head as well. So, you know, quickly: how do we roll back production?
E
We wouldn't touch canary, and I think we wouldn't touch staging. We would probably make sure that whatever is on staging does not get promoted, because at that point we probably need to identify what needs to be fixed, what caused the problem, get that merged in place, and prioritize getting it into staging and proceeding forward.
F
I guess we can treat rolling back production as any other S1/P1 incident, in which we block auto-deploy or pause the tasks, because we need to have our environments frozen, so to speak.
C
We may not need to, but yeah, I mean, sure, I think if that makes it easier, then yeah. We definitely have to drain canary; that's a really key bit, because otherwise canary will end up potentially ahead of... sorry, behind production.
C
Cool, okay. It is in the runbook, but I just wanted to make sure... like, I think in the heat of the rollback moment, that would just be a really good one to be very clear on, where these things sit together.
C
We should see how that goes, whether it's almost worth... it might make it just too complex, but with a production rollback, almost having a reminder or something that's like: hey, don't forget canary. But we can see how that plays out. For now, I think we just have to follow the runbook super closely, right?
C
Fantastic. And then the other thing that we have on this pipeline epic, which I'm going to propose moving out, is that we have quite a few issues on here around, kind of, information and the posting of information. So there's the issue, which I need to do something with, of putting together how we bring all this stuff together into a human-readable proposal; and there are various things around information to set up a rollback, and information that we've done a rollback.
C
Just whether those extra tasks should be part of the exit criteria for this epic, or whether we're actually saying we feel like production rollbacks will be... well, sorry: we'll feel like the rollback pipeline task is kind of complete when we actually feel happy with the production rollback, or whether we actually want to go through and call it complete once we have all the extra pieces. So we have, for example, mirror deployment tracking on ops.
E
I'm going to vote yes. In my opinion, if there's anything that we could expose to make our lives easier during a high-stress event... and, I feel like... I haven't looked at the issues, so I don't know what these are, but just based on your description so far, they sound important enough that I feel like we should tackle them.
C
Cool. So we have one issue, one action: to create an issue so that we capture when and how we cancel deployments. Who would like to take the action to open it? Good, thank you.
C
And then the other thing that would be worth doing is having a write-up on the... I'm going to propose that we separate out these two failure scenarios that you put together, Byron: have the one we've just done and the one that we haven't tested yet on separate issues, and then have a summary of what happened today on the one we've actually just tested.
C
Does that sound like a sensible way? Cool. I will create a separate issue to separate out the two scenarios, so that we have just this one, and then I'll create a new issue with the one we didn't test. Eric, are you okay to put together a summary, just of what we tested and how it went?
F
We discussed something about having nodes in maintenance mode. I think I missed what happened with those: when we are rolling back, are those ignored, or do we need to do something with them?
E
They're probably going to be treated just like we treat them when we roll forward: if they're in maintenance, they'll just stay in maintenance mode during the rollback procedure. If we want to change that behavior, then we need to address it in some way; but there could be cases outside of deployments that would lead nodes to being in maintenance mode, so it's probably wise to just leave it as is.
C
It might be worth us just checking that. Yeah, that's not a bad idea, because they do need to be left, right? There are many good reasons they're in maintenance. So yeah, that could be another... it could be another useful one.
C
Well, it's a good test for us to pose, right? Like, I don't know if we should try and get coordinated so we can get some of the production tests going, but it's a good test scenario to have if we have any delays, for whatever reason. So, cool; Mario, would you mind adding that to the failure discussion issue so that we don't lose it? Yep, sure, I will do it. Thanks very much.
C
We should keep an eye on the tests, though, and just make sure those do pass.
C
Cool, great, nice work. Very nice, very nice scenario, nicely executed, and I feel like we've learned some more stuff, so that's super good. Great, all right, thanks everyone; enjoy the rest of your day.