From YouTube: Miguel talks about his SRE shadow experience
Description
The monitor team joins the SRE team about once per milestone to learn more about their work. Miguel from APM spent a week shadowing during the 13.1 milestone.
Learn more about the SRE shadow program:
https://about.gitlab.com/handbook/engineering/development/ops/monitor/#sre-shadow-program
A
As you can see, I tried not to delete them, though I deleted a few. In one day I can receive around 10 to 20 low-priority alerts; this one seems to be a kind of flaky metric. High-priority alerts I receive around one per day. Usually, when a system is failing, you don't get just one alert. You get several alerts, several things that are failing, but they are mostly all symptoms of the same underlying issue. So this is how an alert looks in PagerDuty.
A
There are a few that come in together, because here there are three triggered incidents, so these are three different, probably related incidents that happened at the same time. Often it's part of the SRE's job to know whether things could be related or not. These two are related: there's a Pingdom check for these, and this Pingdom one is basically the same incident. But there's another one that is something else, and it might be related.
B
A
It's automatic. This is something that I received from PagerDuty, but I think they simply do it so that it doesn't trigger too many emails. I don't think there is anything super clever about grouping different alerts: they just happened to fire at the same time, so they are mixed into the same email.
B
A
If it happens for more than a few minutes, then they will notify you. Other than that, I don't think it is at the level where they can group the alerts into symptoms and causes, because this is still evolving. There is also the fact that many alerts are already being ignored by the system. There are many more incidents that are not reported here, such as these no-urgency alerts, so you don't get a call all the time. It's already an improvement from...
A
Yeah, so when an incident comes in, if it was triggered by Prometheus Alertmanager, which is the main way in which we receive alerts, then Alertmanager will send you some information about what metric was acting up, and then you will also get... well, what I saw being used most was the Prometheus graph. This alert is not firing anymore, so we might not see data, but let's try.
A
This might take a while; I might want to pick another alert, another incident. Oh yeah, here you go. So, the chart: we don't see the lines. We should see some lines, which are typically the data that is surpassing the alert level. There might be a better example. Yeah, this one is interesting.
A
Yeah, this is still evolving very fast, so that's fine. I'll look at the latest data. Okay, that was probably the trigger of the alert: this line, as you can see, suddenly jumped during the incident. It was probably a smaller time window than the one you can see here.
A
Yeah, so here I'm starting to find the peak. During the incident we would probably see something like this: suddenly this thing is firing, we see it at the end of the chart, and then we have to find the resolution. This one happened to be that the database was very busy, so what the team does is go and check what's going on with the database. They might use Grafana, though not all team members do; I worked with three of them.
A
B
A
The feedback was a little bit that Grafana is changing so often, and it seems to be getting too complicated to use during an incident. It's not really Grafana's fault; it's just the way that it is being used right now. There are too many dashboards to make things easy to find and navigate to during an incident: you would have to scroll through several dashboards, trying to find your way.
A
Tell us something that was difficult. I think some alerting rules have more; for example, this one has about this much information. Some of them have more graphs and related information, and that depends on how they are configured in the Alertmanager setup, I suppose. I could try to find it, but it's not something that I... So, people were... These are the dashboards.
A
A
B
A
So, the logging. If you are really good with Elasticsearch, from here you can pinpoint which kinds of events are being triggered. There's a charting section, I think it's called Chart or something like that, that lets you visualize the entries according to some filters and some dimensions, and then you can chart exactly what's going on. This helps you drill down to the specific kind of event, like maybe which IPs are creating these kinds of jobs, and then you're able to say maybe it's time to block these IPs, because they could be malicious or they could be overusing our resources. These things happened a lot during the incidents: finding the IPs of the attacker or of the person that was overreaching.
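The drill-down described here, counting log entries per client IP for one kind of event, can be sketched in a few lines. This is only an illustration over an in-memory list: the field names `remote_ip` and `job` and the sample addresses are assumptions, not the real log schema, and in practice the same grouping would be done inside the logging tool itself.

```python
from collections import Counter

# Hypothetical log entries; the field names and values are made up
# for illustration and do not reflect the real log schema.
log_entries = [
    {"remote_ip": "203.0.113.7", "job": "pipeline"},
    {"remote_ip": "203.0.113.7", "job": "pipeline"},
    {"remote_ip": "198.51.100.2", "job": "pipeline"},
    {"remote_ip": "203.0.113.7", "job": "import"},
]


def top_ips(entries, job=None, limit=3):
    """Count entries per client IP, optionally filtered by job type,
    and return the most frequent IPs first."""
    counts = Counter(
        e["remote_ip"] for e in entries if job is None or e["job"] == job
    )
    return counts.most_common(limit)


print(top_ips(log_entries, job="pipeline"))
```

The IPs at the top of the list are the candidates to look at when deciding whether someone is overusing resources.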
A
I started understanding a lot more about what the query results look like. So then I started reading a little more of the Prometheus documentation to check what kind of information it is giving us back. I found out that in some charts we are ignoring data. In every chart we can have any number of metrics, and in any chart we can have any number of rows of data, because any of them could have a matrix result.
A
So, for this reason, in any chart we could have many occurrences of multiple series that are accompanied by multiple metrics. We should consider this kind of expanding of the data in our charts. I checked the code and saw that on many occasions we are very lazy and say, oh, just fetch the first thing you find and then display that. I think that's not enough if we want to create a more robust solution. So that was one finding.
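To make the point about matrix results concrete, here is a minimal sketch assuming the standard response shape of a Prometheus range query, where `data.result` is a list with one entry per series. The metric name, labels and values are invented, and `series_label` and `all_series` are hypothetical helpers, not the code that was reviewed in the talk.

```python
# Shape of a Prometheus range-query response ("matrix" result type).
# The metric names, labels and values below are made up.
response = {
    "status": "success",
    "data": {
        "resultType": "matrix",
        "result": [
            {"metric": {"__name__": "up", "job": "api"},
             "values": [[1590000000, "1"], [1590000060, "0"]]},
            {"metric": {"__name__": "up", "job": "db"},
             "values": [[1590000000, "1"], [1590000060, "1"]]},
        ],
    },
}


def series_label(metric):
    """Build a readable legend entry such as up{job="api"}."""
    name = metric.get("__name__", "")
    labels = ",".join(
        f'{k}="{v}"' for k, v in sorted(metric.items()) if k != "__name__"
    )
    return f"{name}{{{labels}}}"


def all_series(resp):
    """Return one (label, points) pair per series in the result,
    instead of taking only resp["data"]["result"][0]."""
    return [
        (series_label(s["metric"]), [(ts, float(v)) for ts, v in s["values"]])
        for s in resp["data"]["result"]
    ]


for label, points in all_series(response):
    print(label, points)
```

A chart that reads only the first entry would silently drop the `db` series in this example.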
A
So that's why I put this together. Like the heatmap; the best one is the time series, because I think the time series shows more stuff, but I still want to check whether we are showing both multiple metrics and multiple rows of each series, and the same for the other charts. I think we're already creating issues about that, which is quite cool, and I hope we can start work.
A
Oh yeah, another aspect of this is that when you are dealing with an alert or with a specific metric, you might often have a hole in the data. This is important to know, because you could face some kind of outage of Prometheus scraping or of the source of the metrics, and then you might be missing data points. This is important to see, and I think right now what we do when some data is missing is...
A
We simply complete the line, so we go from here to here. Let's take a look; maybe I can be more specific. If we have this kind of data, what we would do with our charts today is simply complete the line up to this point and assume that it doesn't matter, because it looks better. We shouldn't do that.
B
A
In the response, if the value was 0, we would see something like this: we would simply see the metric drop to zero, so that's not a problem right now. But it might be that, for a series of values, we have long intervals between two values that have no data, and we should be able to detect that so that we can display a hole.
B
A
Right, so we have to consider the rate, the sample rate, and understand when the interval between samples is bigger than a threshold, at which point we should start adding maybe some empty values or something to make it very clear that information is missing. So that's something else that we should try to address. Initially I suspected that we were removing these values, so I was complaining about all these lines where we kind of remove undefined values, but thinking about it now, that could be something where we actually have to fill in values.
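The idea of comparing the spacing between samples against a threshold and adding empty values can be sketched as follows. This is one possible approach under stated assumptions, not what the charts currently do: `mark_gaps`, `expected_step` and `tolerance` are hypothetical names, and the sketch assumes the charting layer breaks the line when it meets an empty (`None`) value.

```python
def mark_gaps(points, expected_step, tolerance=1.5):
    """Insert a (timestamp, None) marker wherever two consecutive samples
    are further apart than tolerance * expected_step seconds.

    points is a list of (timestamp_seconds, value) sorted by timestamp.
    """
    marked = []
    for i, (ts, value) in enumerate(points):
        if i > 0:
            prev_ts = points[i - 1][0]
            if ts - prev_ts > tolerance * expected_step:
                # One empty point is enough to signal the hole.
                marked.append((prev_ts + expected_step, None))
        marked.append((ts, value))
    return marked


# Samples every 60 seconds, with nothing scraped between t=60 and t=300.
samples = [(0, 1.0), (60, 1.2), (300, 0.9), (360, 1.1)]
print(mark_gaps(samples, expected_step=60))
```

With the sample data, a single empty point is added at t=120, so the hole between 60 and 300 is visible instead of being drawn as a straight line. A value of 0 is left untouched, since it is real data and not a gap.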
A
Something else that was super useful during incidents was being able to create a quick preview of a metric. We could send users from Alertmanager, from these incidents, directly to our dashboards, but without the need to create a new dashboard or to immediately create and save a new visualization. So it could be something like a metrics preview, similar to what Prometheus is doing: you can throw in any query you want.
A
Maybe we even help the person type the right query, or the query comes from the alert, and then this immediately displays a chart. We could do that inside GitLab. I think it is not that much effort: for the dashboard panel that we already have, we could add a text field or something of the sort and let the user type the queries, and then we would be able to display something, and maybe let them select: okay, do you want a time series?
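As a rough sketch of "the query comes from the alert": alerts delivered by Alertmanager carry a `generatorURL` that points back at the Prometheus graph page, with the expression in the `g0.expr` query parameter. The code below pulls the expression out and builds a link that keeps the query in the URL. The `/metrics/preview` path, its `query` and `range` parameters, the hostnames and the metric name are all hypothetical; no such GitLab page is implied to exist.

```python
from urllib.parse import parse_qs, urlencode, urlparse


def query_from_alert(alert):
    """Extract the PromQL expression from an alert's generatorURL."""
    params = parse_qs(urlparse(alert["generatorURL"]).query)
    return params["g0.expr"][0]


def preview_url(base, query, range_minutes=60):
    """Build a shareable link that keeps the query state in the URL.
    The path and parameter names are hypothetical."""
    state = urlencode({"query": query, "range": f"{range_minutes}m"})
    return f"{base}/metrics/preview?{state}"


# A made-up alert in the shape of an Alertmanager webhook entry.
alert = {
    "labels": {"alertname": "HighErrorRate"},
    "generatorURL": (
        "http://prometheus.example.com/graph"
        "?g0.expr=rate%28http_errors_total%5B5m%5D%29+%3E+1&g0.tab=1"
    ),
}

query = query_from_alert(alert)
print(query)
print(preview_url("https://gitlab.example.com/group/project", query))
```

Because the whole state lives in the URL, the link can be pasted into an incident channel and opened by anyone without saving a dashboard first.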
A
B
B
A
One of the team members showed me how it was to add a new chart and, in his opinion and from what I saw, it was quite difficult to create a new chart during an incident or when you're in a rush. So for me, this simpler UI in Prometheus does a better job at that. This is something where we could be a little bit more agile.
A
B
B
A
If you want to use the logarithmic scale or the regular scale, you can set that. There are very many knobs, and they are probably very useful for the Grafana use case, because not all the data is coming from Prometheus. But here, with Prometheus, it is something much simpler: just show the data, set the time, and show the series. I think that becomes more useful for quickly creating a visualization and exploring.
B
A
B
But I guess when you are troubleshooting an incident and you just want to quickly change the query and see how it looks, you don't care about that. I think it's important to emphasize the use case, because some would expect, yes, I have to have a preview, but I'm like, no, I'm going to select the chart and select the lines and colors.
A
I think I mentioned this, yeah, that there is a very similar issue for a chart builder, and I think that is already going to happen soon. So the features are very similar. I think there is a lot of overlap, but there is a line where it stops being about exploring the data of the last hour and becomes: okay, let's make a really complete solution to build new charts and see how it goes.
B
A
B
I think the difference between a preview and a metrics explorer is important. This is like a metrics explorer: you just go and look at a metric. There is a lot of overlap, but this is where, as I mentioned, the use case is important to better understand. I'm pretty sure that if we build one, we can leverage the work...
B
The UI would already need to be different. The way I see it, the preview for the chart builder will help our users create a chart for their dashboard, while this issue, which is more of a metrics explorer kind of thing, is not for preview; it is for exploring metrics. So maybe, yeah, maybe it's important.
A
That's something a metric builder would not do, I think, as you are adding more and more things. It's not like we are keeping the state in the URL there, but in this use case it becomes very useful to be able to paste it and say exactly: this is part of the incident, right? I think we are going in the same direction, we're converging on that, but it's a little bit different. So that's why my proposal is here.
A
So we can just say a bar chart may be more useful, or a heatmap, or something else. After that, once you have explored, you maybe want to add a new metric, and the next step is a button to say, okay, add this metric to that dashboard, and then you start that more complex flow where you edit things. One thing that was mentioned to me was that seeing the raw Prometheus response was important, because during an incident you probably want to see exactly what the API is throwing back at you.
A
And yeah, another idea, or misconception, that I had was that I assumed that all Grafana charts, or all charts, always correspond to alert rules. I think that's the assumption that we have made in our UI, but that's not necessarily true, as you can see in this case. Here Alertmanager is sending us something that is a chart that most of the time does not generate any data points.
A
B
B
B
A
B
A
This gets the job done really nicely. If you want to increase the time window, you can do so. If you want to go back and forward, you can do so, and then you can change the resolution. So in these three fields, Prometheus gives SREs a lot of things that are very useful and that they would not want to trade for a more complex UI. Interesting.