From YouTube: Incident Management Walkthrough - Jan 2020
Description
Incident Management Walkthrough Jan 2020
https://gitlab.com/gitlab-com/Product/issues/717
A
Hi, my name is Sarah Waldner, and I'm a senior product manager for the Monitor stage at GitLab. I lead the Health group, and today I'm going to be walking us through our product category, Incident Management. Incident Management was added to GitLab last year. The product category is currently at the viable maturity level, and this set of features serves users who are responsible for monitoring and maintaining the availability and reliability of IT services.

They would leverage this set of features to respond to IT incidents that are triggered when there are outages or lapses in the availability or reliability of the services that they are responsible for. So today this walkthrough is going to cover the configuration of alerts, setting up issue templates to be used for the auto-generation of incidents, and enabling auto issue creation.

So what I'm looking at here is a set of default NGINX metrics that are automatically added to a project when you deploy Ingress or Prometheus to a cluster that you're running your applications on. You'll get a set of NGINX Ingress metrics, as well as system metrics for Kubernetes, that automatically come out of the box. For the purposes of this demo, I've gone ahead and set an alert on my memory usage across my cluster.

I did that by going to the drop-down in the upper right-hand corner here, clicking Alerts, and then setting it here: selecting my query and the value that I want to set the threshold at. If I wanted to remove this alert, it looks a little bit challenging to delete: I have to select the alert and then click Delete. So I'm going to go ahead and make a note on that.

And then the incident section. I've got a really simple enable/disable checkbox here. When I set up alerts from the metrics dashboard, which I can do for instances of Prometheus that are deployed to GitLab-managed clusters (or I can also integrate external instances of Prometheus), I can select to create GitLab issues automatically for each alert triggered. I also have the option of customizing what that incident looks like. So in this project, I've created an issue template called incident.

I've named this one incident for ease of identifying it for purposes of incident management, and within it I've added a couple of different sections that are important to my team when we're firefighting. I've embedded a metrics chart and a section where I'm going to populate the timeline, and then I've also added the Zoom call that we always have open for firefighting. I've indicated the Slack channel that this project is integrated with, and then I have used quick actions for auto-assignment and auto-labeling.

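As a rough illustration of the template just described, here is a minimal Python sketch that assembles an issue template with those sections and ends with the `/assign` and `/label` quick actions. The function name, section headings, and every argument value are hypothetical; the actual template in the demo project is not shown in this transcript.

```python
def build_incident_template(assignee, labels, chart_url, zoom_url, slack_channel):
    """Return Markdown for a hypothetical incident issue template.

    The section names are assumptions; only the quick actions at the end
    (/assign and /label) are GitLab syntax.
    """
    # Quick actions reference labels as ~"name"; quoting allows spaces.
    label_refs = " ".join(f'~"{name}"' for name in labels)
    lines = [
        "## Metrics",
        "",
        chart_url,  # a metrics chart link pasted on its own line
        "",
        "## Timeline",
        "",
        "## Zoom call",
        "",
        zoom_url,
        "",
        "## Slack channel",
        "",
        slack_channel,
        "",
        f"/assign @{assignee}",  # auto-assignment
        f"/label {label_refs}",  # auto-labeling
    ]
    return "\n".join(lines) + "\n"
```

Keeping the quick actions at the bottom, each on its own line, keeps them separate from the sections a responder fills in by hand.
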
So when we go back to look at the parts of the issue for the alert that was triggered, I've got the environment in which the system that triggered the alert lives, the name of the metric, and then the threshold that triggered it. It was greater than 0.2 gigabytes for five minutes, and anything over that I wanted to receive an alert on.

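To make the "greater than 0.2 gigabytes for five minutes" condition concrete, here is a minimal Python sketch of a threshold-with-duration check. It is an illustration under assumptions, not how GitLab or Prometheus evaluates alerts: the `Sample` type, the function name, and the sample spacing are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    timestamp: float  # seconds since some reference point
    value_gb: float   # memory usage in gigabytes


def alert_fires(samples, threshold_gb=0.2, duration_s=300):
    """Return True if usage stays above the threshold for the full duration.

    Assumption: a single sample at or below the threshold resets the clock.
    """
    breach_start = None
    for sample in sorted(samples, key=lambda s: s.timestamp):
        if sample.value_gb > threshold_gb:
            if breach_start is None:
                breach_start = sample.timestamp  # breach begins here
            if sample.timestamp - breach_start >= duration_s:
                return True
        else:
            breach_start = None  # usage recovered, start over
    return False
```

With samples every 60 seconds, six minutes of readings at 0.25 GB would fire, while the same series with one dip to 0.1 GB in the middle would not.
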
So when the alert triggered, it automatically created this incident, and I have this section at the top, which is the summary: a set of auto-populated fields that gives me information that comes from the alert payload. There are actually five different fields that could show up here, but we only surface those that have values in the alert payload.

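The "only surface fields that have values" behavior can be sketched in a few lines of Python. The five field names below are hypothetical placeholders, since the transcript does not name them, and the output format is likewise an assumption.

```python
# Hypothetical field names; the walkthrough only says there are five.
SUMMARY_FIELDS = ["start_time", "service", "monitoring_tool", "hosts", "full_query"]


def render_summary(payload):
    """Render summary lines for the fields that carry a value in the payload."""
    lines = []
    for field in SUMMARY_FIELDS:
        value = payload.get(field)
        if value in (None, "", [], {}):
            continue  # skip fields that are missing or empty
        lines.append(f"**{field}:** {value}")
    return "\n".join(lines)
```

A payload that only carries a start time would therefore produce a one-line summary rather than five lines with blanks.
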
Below the alert details is going to be the rest of the custom issue template that we just looked at. So the embedded metrics is a link that I've generated from the metrics dashboard. So I can do that. Let's go look at that.

I can either copy and paste the link to the entire metrics dashboard into a Markdown field, or I can generate links to specific charts if I only want to show that chart in the issue. Both of those things are really helpful in the initial triage process. When you're trying to figure out what's happened and why it was triggered, having that visual immediately available shortens your time to action. Below that I've got my standing Zoom meeting.

So if I was paged and I started the initial investigation, and I realized it's going to take more people than me to fix this, and I started the Zoom meeting, I've got a really easy way to link this to the incident. I get a system note that it was successful and, oh, I need to refresh the page there. That doesn't make a lot of sense, so I'm going to note: remove the need to refresh the incident to see the link.

That's not a great experience if you're trying to move quickly and you've added something and you don't see it immediately. So now there is this button. Anyone else that I've sent this to, whether it's in Slack or anyone else who has been paged, has quick access to that conference bridge where we can then synchronously collaborate.

A couple of other pieces: it's assigned to me. I did that with a quick action in the issue template. I've got a couple of labels that were added automatically.

The incident label gets added for all issues that are created for triggered alerts; that doesn't need to be configured in the issue template. This makes it easy to have an issue board that you're using to triage, so you can just filter that for issues with the label incident and always have access to those. And then I've added a label for the service that's running in this GitLab project, so if I've got some sort of overview within a group, I can filter by those service labels as well.

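The same label filtering can be done outside the issue board through the GitLab Issues API, which accepts a comma-separated `labels` parameter. The sketch below only builds the request URL and makes no network call; the base URL, project ID, and the `service-a` label are hypothetical.

```python
from urllib.parse import urlencode


def incident_issues_url(base_url, project_id, extra_labels=()):
    """Build a GitLab Issues API URL for open issues labeled `incident`.

    Any extra labels (for example, a service label) narrow the result,
    because the API returns only issues that carry all listed labels.
    """
    labels = ["incident", *extra_labels]
    query = urlencode({"labels": ",".join(labels), "state": "opened"})
    return f"{base_url}/api/v4/projects/{project_id}/issues?{query}"
```

For a group-level overview, the same query string should work against the group issues endpoint (`/api/v4/groups/<group id>/issues`).
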
So another way that people tend to collaborate during firefights is via ChatOps. We support slash commands both for Slack and Mattermost, and I'm going to demo the Slack slash commands. So this project is integrated with Slack. We've got two different services that allow you to either send things to Slack or change GitLab issues from Slack. So if I tag Clemente and say...

So, within the Slack notifications service, once I've set up the webhooks for integration, I've got an entire list of actions that I can take to affect different things, either on issues or changes in the pipeline or delivery pipeline actions. So these are all going to show up in Slack. So here, any events that happen to an issue (created, updated, or closed) are going to go to this Slack channel.

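For a sense of what ends up in the channel, here is a minimal Python sketch that formats an issue event as the JSON body a Slack incoming webhook accepts. This is not GitLab's own integration code; the function name, message wording, and example values are assumptions, and nothing is sent over the network.

```python
import json

# The three issue events mentioned in the walkthrough.
ISSUE_EVENTS = {"created", "updated", "closed"}


def format_issue_event(event, title, issue_url):
    """Return the JSON body for a Slack incoming webhook message."""
    if event not in ISSUE_EVENTS:
        raise ValueError(f"unsupported issue event: {event}")
    # <url|text> is Slack's link syntax inside message text.
    text = f"Issue {event}: <{issue_url}|{title}>"
    return json.dumps({"text": text})
```

Posting that body to the webhook URL configured for the channel would produce one message per issue event.
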
Awesome. So that was an overview of Incident Management as it is today. As I said before, it's at viable, and we're looking to get this dogfooded by the internal infrastructure team, as well as recruiting externally to build a special interest group for incident management to help us determine improvements moving forward. Thank you.