From YouTube: Govern: Security Policies - FCL for Incident #8159
Description
This is a pre-recorded summary of work related to the FCL for the Govern: Security Policies team.
Incident: https://gitlab.com/gitlab-com/gl-infra/production/-/issues/8159
RCA: https://gitlab.com/gitlab-org/gitlab/-/issues/387556
FCL: https://gitlab.com/gitlab-com/feature-change-locks/-/issues/34
Hello everyone, my name is Alan Versevsky and I'm an engineering manager for the Govern: Security Policies team. Today I would like to talk to you about a feature change that our team was involved with. During this video I will cover what happened, what we identified during the root cause analysis process, what we planned and what the goal of the Feature Change Lock process was, what we actually did during this phase, and what the plan for our team is going forward. So, let's start with what happened.
We were working on one of the issues to improve scan result policies with the ability to compare results from all pipelines related to a change when evaluating them. The main problem we identified is that currently, when scan result policies run, they only take the latest pipeline into account, not all pipelines created for a given MR. So, for example, you could have merge request (detached) pipelines that actually run alongside the main pipeline for your MR, and we would not be able to take them into account when evaluating the scan result policy.
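To make the gap concrete, here is a minimal sketch of the two evaluation strategies, in Python rather than the actual Ruby code; `all_pipelines` and `findings` are hypothetical stand-ins for the real GitLab models:

```python
from itertools import chain

def findings_for_merge_request(merge_request):
    """Sketch: latest-pipeline-only vs. all-related-pipelines evaluation.

    `merge_request.all_pipelines` and `pipeline.findings` are invented
    names; only the general idea is taken from the video.
    """
    # Current behaviour: only the latest pipeline is consulted, so
    # findings from detached MR pipelines can be missed.
    latest_only = set(merge_request.all_pipelines[-1].findings)

    # Improved behaviour: union the findings across every pipeline
    # related to the change, including detached MR pipelines.
    all_related = set(chain.from_iterable(
        p.findings for p in merge_request.all_pipelines
    ))
    return latest_only, all_related
```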
We were not able to remove the approval and evaluate this properly, and that is why we created this issue, so it would help us with doing this. We decided to work on this feature behind a feature flag, because we knew that at some point we were going to encounter some performance issues and wanted to make sure that the change was easily revertible.
We started enabling the feature flag: first we enabled it on staging, then, after nine days, we decided to enable it on production, and that's when the incident happened. The incident was declared as an S2 incident and was communicated to customers: background job processing degraded and, for 35 minutes, it impacted both web and API. This is the incident itself, and you can read more there about what exactly happened and what kind of slowdown we saw.
After the incident, we decided to go through a full root cause analysis to understand what happened and how we could avoid it in the future. We created an issue to help us pin down the exact root cause, the problems, and the things we can improve. So let's get started with what actually happened and what went well. You can read the timeline in the issue; I will link the issue in the description.
A
However,
you
can
see
what
actually
was
happening
during
few
days
and
and
the
most
important
thing
is
that
it
was
enabled
on
staging
on
5th
of
December
and
then,
after
nine
days
on
13th
of
it
was
starting
to
be
rolling
out
on
production
for
like
percentage
of
actors,
and
then
we
saw
the
incident,
the
aptex
dropped
to
92
percent
and
we
saw
the
alert
from
alert
manager
on
our
Channel.
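For context, Apdex is a standard score: satisfied requests count fully, tolerating requests count half, frustrated requests not at all. A minimal sketch (the 500 ms threshold here is illustrative, not GitLab's actual target):

```python
def apdex(response_times_ms, threshold_ms=500):
    """Apdex = (satisfied + tolerating / 2) / total requests.

    Satisfied: at or under the threshold. Tolerating: up to 4x the
    threshold. Anything slower is frustrated and counts as zero.
    """
    satisfied = sum(1 for t in response_times_ms if t <= threshold_ms)
    tolerating = sum(1 for t in response_times_ms
                     if threshold_ms < t <= 4 * threshold_ms)
    return (satisfied + tolerating / 2) / len(response_times_ms)

# A sustained drop below the SLO, like the 92 percent in the incident,
# is what trips the Alertmanager alert (values here are made up).
samples = [120, 300, 2600, 80, 450] * 20 + [5000] * 8
if apdex(samples) < 0.95:
    print("Apdex below SLO -- page the on-call engineer")
```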
We quickly got on a call, identified the root cause, and disabled the feature flag; after a few minutes everything went back to normal. The most important thing is that we tried to identify what actually went well. The good news is that the feature was introduced behind a feature flag. That's a clear reason to use feature flags, especially if you know that you're experimenting with new improvements that could cause performance problems.
The other thing is that the monitoring of production worked perfectly. We were able to declare a new incident, get on the call, establish that this change could have been the cause and, with the support of other teams, disable the feature flag, after which we saw performance improve. It was great to see the cooperation between multiple teams to quickly get things back to normal. Now, what could we improve?
That's what we looked at during the root cause analysis: we were trying to understand what we could improve. We noticed during this analysis that, while enabling this feature flag, we were not waiting long enough between successive portions of actors. So we thought that encouraging engineers to wait for at least 15 minutes before enabling the next phase of the feature flag would help us reduce this problem.
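A minimal sketch of what that process change could look like; the helpers passed in are hypothetical stand-ins, not real GitLab ChatOps commands:

```python
import time

ROLLOUT_STEPS = [5, 10, 25, 50, 100]  # percentage of actors per phase
MIN_WAIT_SECONDS = 15 * 60            # the 15-minute pause from the RCA

def gradual_rollout(flag, enable_percentage_of_actors, apdex_is_healthy):
    """Roll a flag out step by step, pausing and checking metrics.

    `enable_percentage_of_actors` and `apdex_is_healthy` are invented
    callables standing in for ChatOps and dashboard checks.
    """
    for step in ROLLOUT_STEPS:
        enable_percentage_of_actors(flag, step)
        time.sleep(MIN_WAIT_SECONDS)  # give the metrics time to react
        if not apdex_is_healthy():
            enable_percentage_of_actors(flag, 0)  # roll back immediately
            raise RuntimeError(f"{flag}: Apdex degraded at {step}%")
```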
A
The
other
things
that
we
not
really
have
information
about
this
and
feature
of
like
roll
up
template.
We
have
the
information
about.
We
should
wait
at
least
few
minutes,
but
it
could
be
easily
overlooked.
The
other
thing
is
that
we
are
not
really
actively
encouraging
Engineers
to
observe
dashboards
after
feature
flag
is
enabled.
We
have
one
sentence
about
that
in
the
documentation.
However, that also might be overlooked, so we decided that having it as an explicit point in the feature flag template, together with enhanced documentation around enabling feature flags, would help us avoid similar problems in the future. The other thing is that, even though the flag was enabled on staging, we identified that we had made a bad assumption that we were free to go ahead and enable the feature flag on production.
We did not have enough data on staging for our group, so we were not able to properly identify the problem beforehand. We should also investigate the potential impact on performance during the development and testing of a feature. During review, whenever we're doing something that might affect multiple parts of GitLab, we should write down what could happen and what its performance impact could be.
We were not working with the database, so we were not able to identify the problem at the database level; we were not able to inspect things like query time, because we're doing everything in Ruby to read the report.
As this was an S2 incident declared for customers and it was visible, we decided to go with the FCL route. If you want to read more about the FCL, the Feature Change Lock, you can go to the documentation in the handbook and read about the process, why and when it applies, and what we actually did. You can read more about this there.
After this root cause analysis, we created the FCL issue; let me open it for you.
We set the due date for it, and it's actually finishing today. We laid out the work plan we wanted to follow, we wanted to update stakeholders about it, and we wanted to make sure that we could focus on this one, so we stopped all other development on the backend side and started working. The first thing we identified in the work plan for this phase was that we needed to revert the MR.
We wanted to make sure that we would not reintroduce a similar issue in the future by accident or anything like that. We reverted it because we knew that we needed to work more on improvements and that this would not happen during the FCL phase. After that, the feature flag itself was removed from GitLab, so it is not available anymore, and the source code behind it is no longer exercised.
Then we decided to take a look at the recent improvements that the Threat Insights team made, which we could use: storing findings in the database and using them to synchronize approval rules for scan result policies. Right now we are using the JSON security reports, so every single time we are parsing the security reports, extracting the data, and calculating the proper value,
that is, whether the MR approval is required or not. So we're actually parsing the JSON files twice: once when we're creating security findings in the database, and again when we're updating the approval rules for scan result policies. We want to build on the work of Threat Insights, the great improvement they made, and change the way we calculate whether a given approval is needed or not.
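To illustrate the difference, here is a hedged sketch in Python; the table, column names, and the severity rule are invented for illustration, not the actual schema:

```python
import json

def approval_required_from_json(report_path,
                                severities=("critical", "high")):
    """Current approach: re-read and re-parse the full security report
    every time the approval rules are evaluated."""
    with open(report_path) as f:
        report = json.load(f)
    return any(v["severity"] in severities
               for v in report.get("vulnerabilities", []))

def approval_required_from_db(db, pipeline_ids,
                              severities=("critical", "high")):
    """Proposed approach: query the findings Threat Insights already
    ingested, instead of parsing the JSON a second time."""
    placeholders_p = ",".join("?" * len(pipeline_ids))
    placeholders_s = ",".join("?" * len(severities))
    row = db.execute(
        f"SELECT 1 FROM security_findings "
        f"WHERE pipeline_id IN ({placeholders_p}) "
        f"AND severity IN ({placeholders_s}) LIMIT 1",
        (*pipeline_ids, *severities),
    ).fetchone()
    return row is not None
```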
That's why I created this epic, where we laid out the implementation plan and what we want to do to improve this. If you want to read more, go to the epic; you can read about all the considerations we had during this phase. The other thing we wanted was to make sure that we can catch similar issues in the future, so we created a spike to investigate how we can actually check whether performance is acceptable
A
For
this
type
of
feature,
so
we
did
the
whole
investigation
and
in
this
investigation
we
we
wanted
to
to
test
and
evaluate
different
options
that
we
could
take
and
the
most
important
thing
as
a
result
and
the
outcome
of
it
is
that
we
should
work
on
staging
more
and
we
should
add
more
data
and
stating
so
you
can
identify
this
issues
in
the
future
and
we
should
observe
the
metrics
that
we
see
from
staging
to
identified
issues
in
the
future.
We created a separate issue to use the security reports data seeder to improve the data and actually have that data on staging.
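As a rough illustration of what such a seeder does (all names and volumes here are assumptions, not the actual tool):

```python
import random
import uuid

SEVERITIES = ["critical", "high", "medium", "low"]

def seed_findings(db, project_id, count=50_000):
    """Populate staging with enough synthetic security findings that
    performance regressions show up before a production rollout.

    `security_findings` and its columns are invented for the sketch.
    """
    rows = [
        (str(uuid.uuid4()), project_id, random.choice(SEVERITIES))
        for _ in range(count)
    ]
    db.executemany(
        "INSERT INTO security_findings (uuid, project_id, severity) "
        "VALUES (?, ?, ?)",
        rows,
    )
```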
That was the work we wanted to do in terms of source code, what we can do in the code itself. But then we realized we also needed to work on some improvements in terms of processes.
First, we wanted to include a step to verify dashboards after enabling a feature flag, so we updated the feature flag rollout template. Checking the dashboard is now part of the process of enabling a feature flag.
The other thing we were working on, in MRs that are ready to be merged, is adding sections to the feature flag documentation about metrics verification. You can see more about it in those MRs: what you can do, what kind of steps you need to take, and how to identify
whether your feature flag actually caused an incident or not.
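The sort of check those sections describe could look roughly like this; it is an illustrative sketch, not a documented GitLab procedure:

```python
def flag_likely_caused_incident(metric_points, enabled_at, tolerance=1.5):
    """Compare an error-rate metric before and after the flag toggle.

    metric_points: list of (timestamp, error_rate) samples; enabled_at
    is the toggle time. Names and the 1.5x tolerance are assumptions.
    """
    before = [v for t, v in metric_points if t < enabled_at]
    after = [v for t, v in metric_points if t >= enabled_at]
    if not before or not after:
        return False  # not enough data on one side of the toggle
    baseline = sum(before) / len(before)
    observed = sum(after) / len(after)
    # A sustained jump well beyond the baseline points at the flag.
    return observed > baseline * tolerance
```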
The other thing is that we want to talk with other team members at GitLab to understand whether there is a need to add a link to the dashboards when asking ChatOps to enable a feature flag. We wanted to encourage people to check metrics after enabling a feature flag, so maybe adding a small sentence like "verify metrics", with a link to the dashboard, could help us with that.
The other thing, something we should consider and are still discussing (this was shared with others), is that maybe we should add a production check whenever we're enabling a feature flag, requiring us to wait at least some amount of time before enabling the next portion of actors.
So that's what we wanted to do, and that's what we did. As for the future, what we're going to work on is this epic and, as a result of the spike, one issue that we wanted to pick up.
In terms of what we're going to do next: we want to work on the data seeder together with the Quality team, to improve it and to help them build similar tools that will help us with this. We actually already have an MR with a POC of how it could look. The other thing is
that we have created this epic, scheduled for 15.10, in which we're going to work on improvements to the worker itself so it synchronizes the approval rules using the database instead of reading JSON files, which is quite expensive in terms of resource usage. We also confirmed that we can already do that, with the small POC we built, and you can read more about it in the issue and in the MR itself.
So, I encourage you to check the links that will be added to the video description. We're going to talk about the retrospective during the synchronous meeting, where I'm going to share all the lessons learned with you. If you have any questions, add them to the meeting agenda. Thank you. Goodbye.