From YouTube: RCA Retro gitlab-com/gl-infra/production/issues/1498
Description
Incident report details in https://gitlab.com/gitlab-com/gl-infra/production/issues/1498
RCA issue https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/8729
Feature issue https://gitlab.com/gitlab-org/gitlab/issues/36709
A: Repeat that? I don't know who's... who's running the meeting. Yep, it's in the agenda, so I'll be driving, the moderation part at least, and Cameron can provide some timeline details, given that he was part of the original incident report. I'm just giving everyone some time to join, because we have a large audience, and that's why we are not starting right away.
A: All right, one minute past the scheduled time; I think we are ready to start. I'll be driving this meeting, like I said earlier. I'll just give a very short expectation setting, so that everyone knows what to expect in the next 25 minutes... I'm getting distracted by [Zoom chat] things. All right. So the goal, as with any RCA, is making sure that we understand what happened and what we can improve. We're not going to be pointing fingers at any of the teams or people that were participating in this.

I'm not going to read the whole timeline; the timeline is there for async reading, and I want to make sure that we have enough time for the discussion points. So I'll first hand it over to Cameron, to see whether he wants to specifically point out anything from the timeline or the incident report before I open it up for discussion. Cameron?
B: The intention initially was a refactor, right, of the load balancing, I think it was. This was one of those things where a lot of unfortunate events ended up tying together, and it wasn't strictly related to load balancing. To put it in one sentence: what we had wanted to do was to provide an API to allow the runtime to identify itself, as Sidekiq, or the Rails web app, etc.
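[Editor's note: a minimal sketch of the kind of runtime-identification API described here. This is illustrative only, not GitLab's actual implementation; the module and method names are invented for the example.]

    # One central place that answers "which runtime am I?", replacing
    # ad-hoc checks scattered across the code base.
    module Runtime
      class UnknownRuntimeError < StandardError; end

      def self.sidekiq?
        # Sidekiq.server? is only true inside a Sidekiq server process.
        !!(defined?(::Sidekiq) && ::Sidekiq.server?)
      end

      def self.web_server?
        !!(defined?(::Puma) || defined?(::Unicorn))
      end

      def self.identify
        return :sidekiq if sidekiq?
        return :web_server if web_server?

        # A process that cannot identify itself ends up on code paths
        # that were never meant to run for it.
        raise UnknownRuntimeError, 'process could not identify itself'
      end
    end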
B: So it was really just meant as a refactoring, basically. And the load balancing code, which I believe only executes as part of EE installations (that's the part I'm really fuzzy on), was affected by this, and that's where it then tied into this change: a certain runtime should have identified itself, in this case Sidekiq should have identified itself as Sidekiq, but it did not identify itself as Sidekiq, which then led to a code path being triggered that was not meant to be triggered. That then led to all kinds of weird errors that looked nonsensical at first, and that's where Stan then stepped in and kind of connected these dots.

Yes, so specifically to the point of how the change was tested and confirmed: I do remember struggling with it, but I did all my testing on the GDK. So again, it's local instances that we ran for Puma, Unicorn and Sidekiq. So what I...
B: So I think part of the problem was that it was not really tested against Omnibus, and it specifically broke under Omnibus. Part of this check for the runtime to identify itself kind of snuck into what should have just been a refactor, and maybe I can talk about this later, but what it was: for the current component that runs, we tested for a specific script name that should execute for Sidekiq, which worked when we tested it locally against the GDK while pairing. Those script names change in Omnibus, so that's kind of where it started to break, and then that just slipped past us.
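[Editor's note: a sketch of the failure mode being described: detecting the runtime from the name of the script that started the process. In Ruby, $0 holds that script name. The helper below is invented for illustration; the check depends on how the process is packaged and launched, which is exactly why it passed locally and only failed under Omnibus.]

    # Fragile: passes under GDK, where Sidekiq is started via a script
    # literally named "sidekiq", but under Omnibus the same process is
    # launched through a differently named wrapper script, so this
    # returns false and the process goes unidentified.
    def sidekiq_by_script_name?
      File.basename($0) == 'sidekiq'
    end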
B: Had I known, I... I might have not considered this. This was one of the bigger stories I've worked on as well, so I'm still kind of learning the ropes, as you could say, so I might have not considered that. Also, yeah, I'm not sure: does that run Sidekiq alongside the Rails server as well, or does it just start the Rails app?
B: Yeah, I suspect I did not look at that, yeah. But thanks for the point; that's a really good point. I think you're right that it might not have caught it, because I think this particular code path that was affected required a particular license. But this is the bit I'm a bit fuzzy on; I think it was something EE-specific, at least, so I'm not sure, but I believe you're right.
A: Okay, so are there any points that we want to discuss here? I think it's kind of clear how this went through. It was not obvious, number one; it was not obvious how this works under Omnibus, number two; there was no testing in the package-and-QA step in the pipelines; and then we went all the way to staging, where it actually encountered the first EE instance.
B: ...to do this head-first, I guess, because it wasn't super well groomed, and I think we just looked, as we went along, at how we could solve this problem. And I do remember looking at... or, I don't remember exactly, but we were thinking: well, this sounds like something someone has probably solved before. So we looked at the New Relic client library, at how they do it, and they were actually doing the same kind of checks.
E: Is Mike here? So, not to point fingers at anybody, but had they considered using the feature proposal template? Because that template has a specific section for test planning. And if not, was it because this is backlog tech debt, a refactor, for which we use a freeform issue format? I just want to understand: how can we facilitate raising these test planning requirements earlier on?
E: I do want to highlight that we do have a refactor template now, and I think that's worth a call-out; I'll be signal-boosting this so teams can use it. Because if you look at the refactoring template, it's actually aimed more at testing than the feature delivery template is: there are sections like, what's the blast radius, what are the unintended side effects, and what are the test levels that need to be considered, starting from the unit tests.
E: And with the refactor template, I think we should make package-and-QA more of a mandatory thing: if you are doing a refactor, please ensure that we are testing everything end to end, so nothing is missed. So, thanks for facilitating the discussion, Marin. And I want to give kudos: I think you were being very humble and transparent, given you're the engineer who worked on this. Thank you for the insights. Back to you, Marin.
B: Sure.
A: Yeah, so I think that brings me to the next question: how come this was not caught by unit or integration testing? I specifically want to highlight that in the merge request we say, in the availability and testing section, that this was impossible to feature flag. That is how we usually mitigate these big changes and try to control the impact. And there is a mention that semantics shouldn't have changed, that the checks just moved behind the interface. So yeah, could you tell me, tell us, a bit more about that?
B: I can just quickly give my point of view, and then... So yeah, I already mentioned it was meant as a refactor, and some changes to semantics kind of snuck in, I guess, which I didn't notice, or which weren't really called out. We should have probably been more rigorous about saying: if there is a semantic change, make that a separate MR, and make sure that this one is purely a refactor. So that's one aspect.

The other one was feature flagging. For this change, I really don't know how we would have done that, because it touches, well, not the entire code base, but it was used in completely disconnected parts across the code base: wherever we need an environment-specific check, like "am I running under Sidekiq", or "am I running as a web server", and so forth, or "am I executing in a multi-threaded environment". For all these things, you would have had to feature flag, like, every single call, basically. I guess it's possible, but...
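[Editor's note: a hypothetical sketch of what per-call-site feature flagging would have looked like. The flag name and the Feature/Runtime helpers are invented stand-ins, only to illustrate why gating every scattered check was impractical.]

    # Every one of the disconnected call sites would have to carry both
    # the legacy ad-hoc check and the new centralized one:
    def running_in_sidekiq?
      if Feature.enabled?(:use_runtime_identification) # hypothetical flag
        Runtime.sidekiq?                          # new centralized check
      else
        defined?(::Sidekiq) && ::Sidekiq.server?  # legacy inline check
      end
    end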
B: Not sure, yeah. Again, as events went, I guess: what I did was, yeah, I went to Kibana to look at the preprod logs, where the staging logs appear. And I guess what I was expecting to see at this point... So at this time (I think I have meanwhile pushed another change that actually removes these log lines), but at the time, we were using the application logger to emit a one-liner which would print out the runtime, as it had been identified by this new class, to the logs.
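[Editor's note: a sketch of the verification step described: one structured log line emitted at boot, recording which runtime the new class identified, so the result can be checked in Kibana after a deploy. Field names are invented; Runtime refers to the sketch above. Note that if identification raises, no line is emitted at all, which is why only the working fraction showed up in the logs.]

    require 'logger'
    require 'json'

    logger = Logger.new($stdout)
    # Identify the process once at boot and leave a searchable trace.
    logger.info(JSON.generate(event: 'runtime_identified',
                              runtime: Runtime.identify.to_s))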
B: It's just that that was not enough, because again, there were certain cases, specific to certain deployments, where this led to errors. What I should have done as well is the opposite check: is anything that I'm not expecting to break actually breaking? Because I basically only saw the fraction of it that was working; I didn't see the fraction of it that was failing. So I guess I wasn't thorough enough there, then.
A: That was one of the points I have a bit below, and that is that staging has a limited set of eyes on it. To me, the only thing we depend on on staging is automation: we have QA tests running automatically after every deployment, and only when they pass do we progress further to other environments. Anything that is in between, anything that is not a failing test, becomes secondary, right? If we depend on manual action, that is not reliable; we don't catch it.
E: So, being fully transparent: this failure was raised in staging, I believe, by three of our tests. Not in the smoke tests; we caught it in three tests from the normal part of the full suite, which were either failing at the start of the test, in the middle, or sometimes at the end. We failed to escalate this to the delivery team. We do have a triage rotation, which is soon to be called the on-call, from the Quality department.
E: If you know this is a refactor, everybody working on it should be monitoring the test results when their change hits staging. And maybe we could do some sort of automation around this, where we notify the MR author: okay, your change has passed or failed the regression suite on staging. At least then there's some prompt of: hey, let's check the staging results when the MR you're working on has the staging environment, or workflow::staging, label.
A: Is it possible to not depend on humans here, and to consider having another level of QA tests? So we have the smoke tests and the rest of the suite; is it possible to have the smoke tests, then reliably passing tests, and then the rest of the suite as well? Because I would rather plug the deployment system into an automated system and have that do the gating than depend on an on-call PagerDuty rotation that still depends on humans, I believe.
E: That is something we're looking into: having another annotation of reliable tests, and then that's the next cohort that you can plug into. I can put that into the corrective actions. We're actually working hard on that now; we have only six flaky tests left in total, and we're making great progress on it. I'll make sure that we tackle that from the quality side, and then you have another level of cohort for you to select from.
D: That made it difficult to comprehend, or to narrow down quickly where to look to see what might have changed. So we were relying really heavily on Kibana and Sentry, and I'll say that, as I recall, Stan narrowed in very quickly, in Sentry, on where he felt the problem was; Kibana led me in a different direction.
D: Stan just happened to be standing by, but I don't want to... And I'll be honest, I don't really remember how well set up the dev escalation channel was when this happened. But I feel like the surrogate for what Stan did would be that we would go right into that. So I felt confident about that: that, as the EOC, I could get that help from someone from dev, to take a look at this and help where I didn't have that expertise. Yeah.
A: Theoretically, if we had had smaller deploy bundles, we could have caught it earlier, and we are driving in that direction. But we have a bit of a chicken-and-egg problem there, with, you know, depending on tests that are not reliable. Can we do it quicker? No, because we need to have reliable tests, and so on.
D: I want to say, in this case, Marin, it was not a normal bundle. This one, I believe, was very large, because we had stopped deploys, for some reason, for a longer-than-normal period. I'm not complaining about the current deploy schedule or the volume of each deploy; I'm just saying that this one seemed, as I recall, much larger, and so it made it very difficult to know what changes were in it. It was overwhelming how many issues had been committed into that deploy. Alright.
F: We're at time; you're all welcome to go over, but I'm sure I need to head to another meeting. I just want to say this is one of the best RCAs we've done, so thanks, Marin, for organizing and running it. This is really good, and I echo what Mac said: a big part of this, in maintaining the no-blame culture, is the accountability and transparency that Mateus, Mike, and others showed as part of this.
F: That's really, really important, and I know it's a chicken-and-egg sort of situation. Some people, and it's very human, might say: well, I'm hesitant to speak up, because I'm worried about getting blamed. I promise you, nothing increases the likelihood of blame more than not speaking up and not being accountable. But it's a chicken-and-egg thing, so I just want to say: let's use this as an example, and hopefully it's a signal to people in future situations that it is okay to speak up.
A: Eric... yeah, Eric is right, we are over time. But if others are willing, I'm willing to go over by five minutes to kind of wrap it up. I'm seeing some nods and thumbs up, so folks who can't stay, feel free to head out. I want to address something that Cameron said before. Well, feel free to leave, but:
A: We should consider making sure that any stoppage of deployments is treated as a possible high-severity issue. I know that it's a natural reaction, when something happens, to stop and back off, but this is exactly what happens with the large volume of changes that we have: if we slow down a bit, we will just accumulate more changes, which brings the possibility of more angles of problems coming in.
A: So what we should be doing, all of us, right, development, ops, quality, is making sure that the number of changes we ship at once is small, but also that the changes themselves are small. This is why iteration is also really valuable: you can actually commit five or ten merge requests for the same thing, and the point is that five or ten different merge requests over a span of time are easier to handle than one big change that could cause an issue. So that's just a general comment there.
A: All right, we have corrective actions written out; I'll encourage everyone to read them. And I will take another action item on me, as the representative of the delivery team here, and that is looking into why the deployment bundle was bigger this time, according to Cameron. And then another thing is whether it's possible to safely increase the cadence of deployments while still depending on the tests: finding that middle ground where we feel comfortable, from both the quality and the infrastructure side of things, as well.