Description
Closing ceremony for FCL for Govern: Security Policies team.
Pre-recording: https://youtu.be/ZpOxrCIPguY
Incident: https://gitlab.com/gitlab-com/gl-infra/production/-/issues/8159
RCA: https://gitlab.com/gitlab-org/gitlab/-/issues/387556
FCL: https://gitlab.com/gitlab-com/feature-change-locks/-/issues/34
A
Hello everyone, welcome to the Security Policies FCL closing ceremony. We're here today to celebrate the Feature Change Lock process and that we finished the work for this one. We already prepared a pre-recorded session with a general update about the FCL, why it happened, and what we did during this phase. Today is all about celebrating, talking about possible outcomes for the future and what kind of plans we have, and answering questions. So yeah, I see Wayne, you have the first item.
B
I reviewed the FCL. I've reviewed it over time, but I reviewed it again today. Many issues in the work plan are closed and many MRs are merged, which is great, but not all, and that's not expected anyway. My question is: for the items still open, which ones give us a high risk of a repeated incident? Should we hold the FCL open to complete that subset?
B
In particular, the one that jumped out at me, though you and Dominic know the details better than I do, is this one, and I'm just reading from the RCA: as the issue was not reproducible on the staging environment, we might look for ways to increase traffic in staging to be able to properly evaluate how a given change affects the environment. That's from the RCA. From reading the FCL, it sounds like we started some spike work to investigate this but haven't completed it as part of the FCL.
A
Oh yeah. Among the items you mentioned, we have some suggestions that we will leave open because they do require discussion with other people at GitLab. So that's the first thing: we're waiting for feedback on that one.
A
The good thing about the whole process and the incident is that the feature was introduced behind a feature flag, so that process worked, and the main cause, the code change that caused this incident, was reverted. First it was disabled by the feature flag, then it was reverted, and we will not make any further improvements to scan result policies until we have finished the epic that tracks the improvements for approvals.
A
So that's the second thing. The first thing, on staging: we started manually creating a group with projects and MRs to get more and more data, but it will require some work with the quality team. That's why we created this spike, and after the spike we'll create an implementation issue and track things with them. If we are able to help, we would love to help, but this will require some work with the engineers in test.
A
So those are the biggest three. The last item, but in my opinion, from my perspective, the most important one: the main root cause of this problem was not paying attention to dashboards after enabling the feature flag and not keeping a proper distance between enabling the feature flag for the next portion of actors. So we made two improvements during the process.
A
First of all, in the feature flag rollout template we've added a step that you should actually verify this, and then in the documentation we added a section that explains what to look for: which dashboards to look at, what kind of data to look for, and so on.
A
During this RCA, talking with SREs and other engineers, there was also one piece of feedback in the chat, a suggestion that we do not have good materials for engineers, especially for someone who just joined GitLab, such as a training in Level Up or any other place to help them get familiar with dashboards and know what kind of metrics to look for. We were learning that during this process, and also before, but it would be great to have this.
A
This kind of training, I actually wanted to ask you about this: is this something that we should discuss with the Learning and Development team, or something that we should initiate ourselves? I'm not sure about the process for requesting something like that.
B
I really appreciate all the analysis on this, and it sounds good to me. It's somewhat low priority, but I would have left that low-priority feedback on the FCL. Oh, sorry, it's read-only.
B
And I always like providing links to make things easy to find, so I'll add that to the notes here too. That will be it from me. Steph, Dominic, any thoughts or feedback?
C
I think the most important part will be to enable staging to alert us on performance regressions, but this takes a lot of work. There's an epic to track this, and there's also a separate issue that Quality is tracking for automatically seeding the staging database, and I think this is the most promising way forward. I was also looking into ways to get alerts, and we have two options: either we increase the traffic in staging so that the production alert thresholds would work there.
C
This would mean we create a very large number of these repositories. Or we change the production rules so that the alerting works in staging. As far as I can tell from this epic, they chose the second option, to update the production rules, and I think until then the only option is that we roll a custom, homebrew metric alerter. But then you also run into questions like: where do you actually deploy this thing?
C
Then you have one single bespoke service, which maybe runs in some scheduled CI pipeline, but it's still kind of wonky to have this specific thing handle the metrics reporting. So I would also suggest waiting until this is sorted out and staging alerting works as desired.
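As a rough illustration of the second option described above (environment-aware production rules rather than more staging traffic), an alerting rule could simply carry a lower threshold for staging. This is only a sketch; the metric name, environment labels, and thresholds are hypothetical, not the actual GitLab.com rules:

```yaml
# Hypothetical Prometheus-style alerting rule: one definition, with a lower
# threshold for staging ("gstg") so the alert can fire on staging's much
# smaller traffic while keeping the production ("gprd") threshold unchanged.
groups:
  - name: security-policies-latency
    rules:
      - alert: PolicySyncQueueLatencyHigh
        # sidekiq_queue_latency_seconds and the environment label are assumed names
        expr: |
          max by (environment) (sidekiq_queue_latency_seconds{environment="gstg"}) > 30
          or
          max by (environment) (sidekiq_queue_latency_seconds{environment="gprd"}) > 300
        for: 10m
        labels:
          severity: s3
        annotations:
          summary: "Sidekiq queue latency is elevated in {{ $labels.environment }}"
```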
A
Thank you, Dominic. Will you update the document with the links to those two epics?
A
All right, so that would be it. Thank you for your questions. The last item is that we should do the retrospective. I'll ask you for input in a moment, but I've already mentioned a few things, so let me share my screen quickly.
A
I've mentioned two items under what went well and what was great. The most important thing is that most of the features, let's say all features related to enabling anything in these kinds of policies, were implemented behind feature flags. We were aware that at some point our experiments would end up in something we'd need to revert quickly, so we followed the process to actually be able to enable and disable it quickly.
A
That's a great thing we did: we were able to revert the change within a few minutes of identifying the issue, and the whole process in terms of communication with Support, with the SRE team, and with the many people involved was great. It was really quick and a great experience for us.
A
The other thing is that we started learning more about metrics, not only dashboards but also other, more specific metrics with which we can track a particular worker or a particular class that is not working properly. We would like to continue doing that and will monitor those as we implement new improvements. This is really important.
C
The whole application suffered: notes, merge requests and everything were impacted. I looked into how the GitLab Sidekiq setup works. We have these workers, a number of workers, and this is metadata, and then you have a number of routing rules, and these routing rules map the workers onto a number of shards. In total we currently have, I think, eight or nine, and you can think of these Sidekiq shards as the unit of isolation.
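For readers unfamiliar with the setup described here: routing rules in GitLab's Sidekiq configuration map workers, by attributes from their metadata, onto named queues, and each shard is a separate pool of Sidekiq processes serving one of those queues. The sketch below only illustrates the shape of such rules; the queries, queue names, and the dedicated shard are made up, not the real GitLab.com configuration:

```yaml
# Illustrative only: routing rules as they might appear in gitlab.yml.
# Each rule is a [worker-matching query, destination queue] pair; the first
# matching rule wins, and each destination queue is served by its own shard
# (a separate pool of Sidekiq processes), which is the unit of isolation.
sidekiq:
  routing_rules:
    - ["urgency=high", "urgent-other"]                              # high-urgency workers
    - ["resource_boundary=cpu", "cpu-bound"]                        # CPU-heavy workers
    - ["feature_category=security_policies", "security-policies"]   # hypothetical dedicated shard
    - ["*", "default"]                                              # catch-all shard
```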
C
So whenever one of the workers in a shard gets impacted, the whole shard potentially suffers as a consequence. I was wondering, and asking, if we cannot move our worker to a separate shard, because ideally, if you have one worker per shard, the workers are actually isolated. The reason it's not configured this way is performance: the way Sidekiq is set up contains a trade-off where reliability is traded for performance to some degree.
C
Now, there's also a quarantine shard which was configured a year ago, and there are two workers currently in that shard, and apparently there was also a problem with this. I was asking if we cannot move our worker to the quarantine shard as well, because it's still better if these two workers get impacted than the whole application suffering, as it currently does. The response I got was: if you have three workers that are potentially vulnerable, why would you put them in one shard?
C
I'm not super convinced that this is the best argument, because the consequences are still better than when you cannot access notes, merge requests and everything, right? So I was thinking maybe I just open a draft there, or assign it to the reliability folks and see what they say, because it seems worth a try to me. Either just one worker falls over, which really only limits functionality for security policies, or you cannot access your notes or your merge requests.
A
Thank you, thank you for the feedback. Yeah, we definitely learned a lot about things related to performance and reliability, we improved our relationship with those teams, and hopefully we'll work more with the Quality team in the future.
A
The good thing is that the timing for it was great, because we were starting the FCL right after the Threat Insights team finished working on moving all security findings for all branches to the database. They just finished and turned on the feature flag by default, and now we can use that data. So it will help us with moving the whole thing to the database.
A
The main problem currently is that we're not using the database; we're reading blobs, and those blobs are stored in object storage. So you need network traffic to fetch them, then some time to parse them, and so on. Now we're going to cut the time spent parsing this data and work directly on the data in the database. So this will help us as well.
A
All right, so thank you, thank you for your work, Dominic. Sashi is not here, but yeah, let's thank him as well. It was great seeing everyone collaborating on those things and being involved in the process, so that was amazing work. Now we can get back to normal and make sure that we deliver those improvements in the future.