From YouTube: SRE Team Introduction at GitLab - Incident Management
Description
If you haven't already, watch the first video in this series, about the Infrastructure team structure at GitLab: https://youtu.be/SLTZzFT4mTs
Topics Covered:
- Definition of an Incident
- Who attends the incidents?
- What causes an incident?
- Who declares an incident?
- The Incident Room.
- The Incident Lifecycle.
Slack name to engage the SRE is: (at)sre-oncall
A
First, the engineer on call. We know for sure that this person is responsible for mitigating the cause of the incident. It doesn't mean they're responsible for doing that on their own. If they think that the incident is bad enough, they can involve other people. As we said, the reliability team is the team that participates in this on-call rotation.
A
The IMOC, the incident manager on call, would also involve the communication manager on call, which is the CMOC, and the CMOC would be responsible for relaying messages about the incident to the end users. So if the end users are affected and we need to relay messages to them, we basically page the CMOC, the communication manager on call. The people who are responsible for doing that at the moment are from the support team, more specifically the GitLab.com support team.
A
So these are the main three people who are involved in the incident. But of course, if the incident is severe enough, you will find people from different teams. Let's say, for example, that deployment is blocked because of an incident. You might find someone from the delivery team in the incident room.
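As a rough illustration of the three roles described so far, here is a minimal Python sketch of who gets engaged for a given incident. The `Incident` class, its two fields, and the `roles_to_engage` function are hypothetical names made up for this example. The rule that the IMOC joins whenever the incident is severe or users are affected is an assumption drawn from the talk, not a documented GitLab policy.

```python
from dataclasses import dataclass


@dataclass
class Incident:
    # Both fields are hypothetical inputs for this sketch.
    severe: bool          # the engineer on call judges it "bad enough"
    users_affected: bool  # end users need to be kept informed


def roles_to_engage(incident: Incident) -> list[str]:
    """Return the on-call roles involved, following the talk's description."""
    # The engineer on call (EOC) always owns mitigation.
    roles = ["EOC"]
    # Assumption: the incident manager on call (IMOC) joins when the
    # incident is severe or when users are affected.
    if incident.severe or incident.users_affected:
        roles.append("IMOC")
    # The communication manager on call (CMOC) is paged only when
    # messages have to be relayed to end users.
    if incident.users_affected:
        roles.append("CMOC")
    return roles


if __name__ == "__main__":
    print(roles_to_engage(Incident(severe=True, users_affected=True)))
```

People from other teams, such as delivery, are left out on purpose, since the talk describes them as joining case by case.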
A
Let's say, for example, the incident is affecting a part of the infrastructure that is rarely modified, and someone who's been on the team for a long time knows more about it, and they happen to be from the delivery team. So we are going to call for help: we're going to call them on Slack, saying, "Hi, please join the incident room." We're going to talk about what the incident room is, so keep watching.
A
Let's talk about examples to narrow it down a little bit. First, internal factors. When, for example, the delivery team starts a deployment to GitLab.com, they watch the services and they look, for example, at the error rate for some of the services.
A
Let's say someone from the reliability team is making a change to the infrastructure itself, for example, adding pods, removing pods, or changing some sort of configuration somewhere. If they notice an elevation in errors or a change in metrics after a change they made, then they know that they have caused that issue and they can revert it. So you've got, for example, deployments and an internal change by the infrastructure team.
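The "watch the error rate after a change" idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the `tolerance` factor of 1.5, and the `floor` of 0.001 are all hypothetical, and the talk does not say how GitLab actually measures or thresholds error rates.

```python
def error_rate(errors: int, requests: int) -> float:
    """Fraction of requests that failed; 0.0 when there was no traffic."""
    return errors / requests if requests else 0.0


def should_revert(before: float, after: float,
                  tolerance: float = 1.5, floor: float = 0.001) -> bool:
    """Return True when the error rate after a change is notably elevated.

    `tolerance` and `floor` are made-up values for this sketch: the rate
    must exceed both 1.5x the earlier rate and an absolute minimum, so
    tiny fluctuations around zero do not count as an elevation.
    """
    return after > max(before * tolerance, floor)


if __name__ == "__main__":
    before = error_rate(errors=10, requests=1000)  # rate before the change
    after = error_rate(errors=50, requests=1000)   # rate after the change
    print(should_revert(before, after))
```

Comparing against the rate just before the change mirrors the reasoning in the talk: the person who made the change already knows the likely cause, so reverting is the quickest mitigation.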
A
An external factor is, for example, a sudden surge in traffic. In this case it could be a normal surge in traffic, just an increase for some reason that's not malicious, or in other cases it's an attack that our automated systems didn't catch and didn't mitigate automatically.
A
Let's say, for example, there's a service that's not available. Say you work for GitLab and you notice that GitLab.com, which is the biggest service, is not working.
A
Let's say it's not working well. Before you create an incident, you might want to check with the engineer on call, just to make sure that the issue is not local to your computer. Maybe it's something local, so you don't really need to page or create an incident for that. You can always call the engineer on call by mentioning a specific Slack name, an at-something handle.
A
So let's talk about the incident room. The incident room is a very special room. It's a permanent room. You can find its title in the incident management Slack channel, and there's a permanent link to a Zoom room, and that room is called the incident room. The incident room is a very interesting place, because this is where all the fun happens. This is a room that is accessible to everyone at GitLab, so you can always attend any incident that is happening.
A
Of course, if the incident is not severe enough, you're not going to find anyone in the incident room. But if the incident is severe enough, is affecting enough users, and needs sync communication, you're going to find people in the room. Most of the people in the room are going to be the engineer on call, the incident manager on call, and sometimes the CMOC, the communication manager on call. A lot of the time you also find other team members from across the company, from different departments: people whose specialization is needed.
A
What is an incident? What is it like, really, in real life? How hard does the infrastructure team work to mitigate incidents? You can see the amount of stress in the room, and it's going to manifest in different ways. You might not detect it right away, but everyone is, of course, stressed from time to time, depending on the incident and on how their day is going. So some people are going to be laughing. Some people are going to be really surprised: how could this happen?
A
How did we set up our infrastructure to handle something like that, or not handle something like that? So I find it very interesting. I find the incidents really, really informative, particularly for support, because in the self-managed department of support we also participate in a customer emergency rotation. It is very similar to the incident room, except that we are not responsible for the customer's infrastructure, so the customer basically handles that part in the customer emergency calls in the support team.
A
So we sort of see the end results of what happens in the incident rooms. We think of infrastructure as a black box. We don't really know what's happening there, and all we care about is the product, GitLab, and how it's running on that infrastructure. Sometimes the setup is very easy, like a one-node setup, and it's just simple, and we know how to navigate that easily.
A
Sometimes it's a more complicated setup, an HA setup, where the setup is scattered among different nodes, and that makes it a little bit more complex. But if you're in the incident room, you're going to get to see the infrastructure side of the incidents.
A
So I advise you, if you're in support, to perhaps subscribe to the incident management channel. If you see S1 (severity one) or P1 (priority one) incidents, perhaps join the incident room and have a look at how the team is handling the incident.
A
So that's the interesting part. They uncover a layer that support doesn't get to see in their own on-call rotation. In the incident room, as I said, you get a lot of reactions. Some people are very surprised. Some people are stressed, so they laugh.
A
But, more importantly, you find that spirit where everyone is trying to help and everyone has got the same goal, which is to get the service up. So that's the focus. I'm going to compare what I saw in the incident room to something that I've learned recently as a parent.
A
I downloaded this application on my phone, and it gives you a sort of road map for understanding how to parent your child in a better way. At some point they talk about bonding and, to my surprise, bonding with your little child happens during the hard times: the times when your child is struggling to understand their feelings, when they're crying, or when they're experiencing a negative emotion.
A
That's when real bonding happens between you and your child, and that was really interesting to me, because it did actually make sense. Looking back at all my relationships with everyone in my life, I'm like, that's right: it's during the hard times that you really get together, get close, and sort of bond. You establish this shared experience where you trust each other more, like, I'm going to be there for you, you're going to be there for me. So I think it is an example.
A
I think what happens in the incident room is sort of similar, because it is a stressful time. Sometimes we have no idea what's happening and, because of that, we sometimes experience a lot of negative emotions. But with the help of everyone, you get through every incident. We have to, right?
A
So I think what happens is that it creates bonds. I think you see some sides of everyone that you don't really see outside of the incident room, and I think it's mainly because of the stress. So I find it a very interesting place, not just on the technical level but also on the team level, in how the team members interact with each other.
A
So yeah, that's the incident room. During its lifetime, an incident goes through a few phases. The first phase is when an incident is triggered.
A
After an incident is triggered, the engineer on call gives it immediate attention and tries to identify or observe as much data as possible to understand the impact of the incident on our users or on different services. Once they establish what the impact is, they decide whether to involve other people in the call or to try to mitigate it by themselves. If it's severe enough, it's going to get the S1 label, it has to be addressed immediately, and a lot of people are going to get paged once these things are identified.
A
So, if you're lucky enough, you're going to do that in under an hour. If it's an internal cause, it's going to be easy to identify, and we're going to get to revert the change quickly because we're already aware of the change that caused the issue. But if it is an external factor, a lot of the time it takes a long time just to identify the root cause or how we are going to mitigate the issue. Once we have identified the root cause, a lot of the time it's easy and quick to mitigate the issue.
A
Sometimes it's not that easy and direct, but a lot of the time it is, once you identify the root cause. After identifying the root cause, you mitigate the issue. You apply something, a fix, a hot patch, whatever, just to mitigate the issue. Once the issue is mitigated, it's marked as mitigated and it's not as urgent anymore, but we are collecting all the corrective actions that we can take to prevent this issue from happening again in the future. Most of the time this work happens async.
A
There is a brief discussion in the incident room about possible corrective actions. Here and there during the incident, someone will say, "Oh, perhaps we can do that to permanently solve this issue in the future." These corrective ideas are collected and added to the issue at the end of the incident, and someone then takes care of them after the incident is over.
A
Sometimes the incidents are not severe enough, so we don't really have to spend time on trying to find the root cause. But sometimes, if it's severe enough, we will have to perform the root cause analysis, and the corrective actions are going to be implemented in the following weeks or months, or however long it takes until we can find the resources to do so.
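The phases described above can be pictured as a small state machine. The following Python sketch is only an illustration: the phase names, the `ALLOWED` table, and the `advance` function are hypothetical labels chosen for this example, and the strictly forward ordering is an assumption based on the talk rather than GitLab's actual incident tooling.

```python
from enum import Enum


class Phase(Enum):
    # Hypothetical names for the phases mentioned in the talk.
    TRIGGERED = "triggered"                    # incident is declared
    IMPACT_ASSESSED = "impact assessed"        # impact and severity known
    MITIGATED = "mitigated"                    # fix or hot patch applied
    CORRECTIVE_ACTIONS = "corrective actions"  # async follow-up work


# Assumption: phases only move forward, one step at a time.
ALLOWED = {
    Phase.TRIGGERED: {Phase.IMPACT_ASSESSED},
    Phase.IMPACT_ASSESSED: {Phase.MITIGATED},
    Phase.MITIGATED: {Phase.CORRECTIVE_ACTIONS},
    Phase.CORRECTIVE_ACTIONS: set(),
}


def advance(current: Phase, target: Phase) -> Phase:
    """Move to `target` if the transition is allowed, else raise ValueError."""
    if target not in ALLOWED[current]:
        raise ValueError(f"cannot go from {current.value} to {target.value}")
    return target


if __name__ == "__main__":
    phase = Phase.TRIGGERED
    for nxt in (Phase.IMPACT_ASSESSED, Phase.MITIGATED, Phase.CORRECTIVE_ACTIONS):
        phase = advance(phase, nxt)
        print(phase.value)
```

Keeping mitigation and corrective actions as separate phases reflects the point made in the talk: once an incident is marked as mitigated it is no longer urgent, and the remaining work can happen asynchronously.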
A
So that's the life cycle of the incident. I hope you found this video interesting and helpful. Please let me know if you have any questions. Reach out to me: I'm Rahab Hassanin from the support team. Feel free to leave comments below; I'm going to check them every now and then. Thank you.