From YouTube: Scalability Team Demo - 2021-06-24
A: I've got the first thing, so I'll start. This is related to what was on the demo the other week, which I missed, about how, when we have low-frequency metric counters in Prometheus, they will effectively start at one as far as Prometheus is concerned. What that means is, when you try and take a rate from them, you need two samples: the counter needs to go up to two before there's an actual rate, because Prometheus doesn't treat missing as zero, for good reasons. So if it's low frequency, you might never get to two from a particular process scrape, and it might just look like the rate is zero when actually it's non-zero; it's just that we didn't start the count at zero each time. I think I'm explaining that right. So let me demonstrate the issue, I guess, and then share that.
A: So this is the current state. I only just started this Prometheus and the background process, so we can see a bit clearer here, and I'm just running the source code workers, just to simulate a Sidekiq shard, even though that's not a real shard. So we see the UpdateAllMirrorsWorker just came from nothing to one, and the TrendingProjectsWorker I scheduled manually also came from nothing to one. So if we take rates of those, you know, the rate is the slope, and there's no slope.

A: Even when the first job happens, there's...

A: So let me go fix that. First of all, I'm going to stop Prometheus, and I'll also stop Sidekiq, and then I'm going to enable this feature flag. I think it's possibly overly cautious to put it behind a feature flag, but I set the feature flag to be enabled by default to get around that, because the feature flag that I've added here will only apply on process start. All it does is, on process start, find the potential values of sidekiq_jobs_completion_seconds_count, or sorry, sidekiq_jobs_completion_seconds, the histogram, for the jobs that this process could run.

A: So it doesn't generate a series for every possible Sidekiq job; it just generates one for the Sidekiq jobs that this process can run. What else do I need to do? I need to do something with Docker. That's it.
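A rough Ruby sketch of the idea described here: on process start, walk the workers this process can run and touch every label combination of the sidekiq_jobs_completion_seconds histogram so the series exist from the first scrape. The helper names, the label set, and the use of the metric's `get` method to register a series are assumptions for illustration, not the actual GitLab implementation.

```ruby
# Illustrative sketch only; not the real GitLab code. Assumes a
# Gitlab::Metrics-style histogram and that touching a label set (via
# `get`) is enough to expose a zero-valued series on the next scrape.
JOB_STATUSES = %w[ok fail].freeze # assumed label values

def initialize_sidekiq_job_metrics(worker_classes)
  histogram = Gitlab::Metrics.histogram(
    :sidekiq_jobs_completion_seconds,
    'Seconds taken to complete Sidekiq jobs'
  )

  worker_classes.each do |worker|
    JOB_STATUSES.each do |status|
      # Register the series for this label combination without
      # observing a value, so rate() sees a starting point of zero.
      histogram.get(worker: worker.name, queue: worker.queue, status: status)
    end
  end
end
```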
A: This is my way of clearing my Prometheus data in the GDK: it's stored in Docker, and I just delete all my Docker volumes. So I've enabled the feature flag, and I'm going to start Prometheus.

A: So I'll see we've got nothing here just for now, and then I'm also going to get ready to schedule a TrendingProjectsWorker, but not do that just now, just to demonstrate that. So we'll need to wait for...

A: Cool, good demo. So here we've got some jobs running... oh no, they're not running, they were dropped as duplicates with the "until executing" strategy, because they've been deduplicated for some reason. There we go, right: so I've got zeros for all the possible combinations of labels that this histogram could have for this process.

A: So now, if I go and schedule a TrendingProjectsWorker and do that same query graph, we'll see that there's a bunch of zeros. Here's the UpdateAllMirrorsWorker, because that's a cron job, so it just runs regularly, and then once the TrendingProjectsWorker runs and gets scraped, that will go from zero to one. So that should solve this issue, but I'm only doing it for one metric, and there's probably a bunch of low-frequency metrics we have. Actually, I found one with the Plan team purely coincidentally.
A: They asked me about something related to service desk emails received, where they emit a metric when Service Desk does something, but Service Desk is again quite low frequency, and we've got a lot of Sidekiq processes that can pick that up, so per Sidekiq process it's very low frequency that that happens. So their rates don't match what they see in Kibana at all, because they're always starting at one. Yeah, there's the TrendingProjectsWorker, I think, so yeah.

A: I don't have a good answer to the general problem there, because to do this registration thing you need to know all the possible combinations of labels that you could have, because each combination of labels is effectively its own series. So, for instance, here we have job status labels on this, and we also have labels for the worker class and the worker queue.

A: So we need all of those to match up correctly for this to do anything, because otherwise we just create two different series, and then that wouldn't help. So I don't know if there's a better way to do this, or if it's just something I think we have to take care of, but yeah.
D: I mean, maybe we could... I don't know if it would work very well, but we have some sort of wrapper class around Prometheus, right, in the Rails codebase. So maybe we could structure the API of that class so that when you create a metric, you must submit the possible label values. But then I guess that assumes that the metrics get defined at boot, or... yes.
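As a hedged sketch of what is being suggested here, with made-up names rather than the existing Gitlab::Metrics API: a wrapper whose constructor takes the possible label values and pre-creates every combination at zero.

```ruby
# Hypothetical sketch of the wrapper idea; the class name, constructor
# signature and the zero-increment trick are assumptions.
class PreRegisteredCounter
  def initialize(name, docstring, label_values:)
    @counter = Gitlab::Metrics.counter(name, docstring)
    @label_names = label_values.keys

    each_combination(label_values) do |labels|
      @counter.increment(labels, 0) # create the series at zero
    end
  end

  def increment(labels, by = 1)
    @counter.increment(labels, by)
  end

  private

  def each_combination(label_values)
    head, *rest = label_values.values
    head.product(*rest).each do |combo|
      yield @label_names.zip(combo).to_h
    end
  end
end

# Usage: all label values must be declared up front.
# PreRegisteredCounter.new(:service_desk_received_emails_total,
#                          'Service Desk emails received',
#                          label_values: { delivered: %w[true false] })
```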
A: Yeah, the only thing I was thinking of was a more heavyweight approach where you create a different type, like we have a registry of metrics that you want to create on boot, and then you use that constructor rather than the other constructor if you want one of those. And then, yeah, I don't know, I haven't really thought it through, but there would still need to be, somewhere, some code that says these are all the label values this metric has, and how...

D: Well, I suppose we could also do something obnoxious where, if you try to define a metric that wasn't defined at boot, then we raise an exception, right. We could have...

E: I mean, if you make it that we encourage people to do it through some sort of class method, kind of like the worker attributes, where it's almost defined on the class, then that encourages it. Obviously people can get around it, but you know, if you make the API sort of like "these are the metrics for this controller" or whatever, then that...
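A sketch of the class-method approach being floated here, combined with the "raise if it wasn't declared at boot" variant. The module, method names and host class are hypothetical; this is not an existing GitLab interface.

```ruby
# Hypothetical sketch only. Metrics are declared on the class, a boot
# pass creates them (and could pre-register label combinations as in
# the earlier sketch), and fetching an undeclared metric raises.
module MetricAttributes
  def declare_counter(name, docstring, labels:)
    declared_counters[name] = { docstring: docstring, labels: labels }
  end

  def declared_counters
    @declared_counters ||= {}
  end

  def fetch_counter(name)
    raise ArgumentError, "#{name} was not declared at boot" unless declared_counters.key?(name)

    Gitlab::Metrics.counter(name, declared_counters[name][:docstring])
  end

  # Run once on boot; pre-initializing each label combination to zero
  # would happen here, as sketched above.
  def initialize_declared_counters!
    declared_counters.each do |name, spec|
      Gitlab::Metrics.counter(name, spec[:docstring])
    end
  end
end

class Projects::IssuesController # illustrative host class
  extend MetricAttributes

  declare_counter :issues_created_total, 'Issues created',
                  labels: { source: %w[web api email] }
end
```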
A: And now we do it for... but we almost do need to do it for that one, right? When are we doing it... because we're... I don't know, some combinations, I mean...

A: Yeah, let me share. Wait.

A: New feature categories: because we did it for every feature category, that caused a big problem, and because we have this product-of-a-product thing, where we have per method, per status, per feature category, it gets real big real quick.

B: And Andrew, for the one with... because we use this for errors, we get around it by recording a zero if it's missing, in the intermediate recording rules.

E: So does it do a division by zero, because the one rate, the error rate, goes up, but the total rate doesn't go up, or... what's the...?

B: Yeah, I don't think we had a division by zero. It was...
E: So just one kind of broader question is: would it be better if we just said we're going to do this properly, we can have all the combinations, we're going to preload them, and we just accept that soon we're going to have to start sharding Prometheus much more aggressively than we're doing at the moment and having lots of Prometheus instances? Then we can do things properly and we don't have to do that sort of slightly strange combination code.

D: Well, what about... I'm curious about this recording rule idea, because it is kind of silly that we create this combinatorial explosion, where the pods and the processes that we really don't care about also push up the cardinality, and the recording rules filter out stuff we don't want. So is there a way we can inject the zeros at the recording rule level and only work from there?
A: That would be nice. That's something that's really frustrated me as well: the metrics generated by the application can have quite high cardinality, and then, when you apply them across our fleet, we get a multiplication again of something that we don't end up caring about a lot of the time once it's...

B: Isn't it that, if I remember correctly, when we wrote some of those rules, if there's an operation rate that is higher than zero, we'll record a zero for the error rate? It's something like that we had, I think, and that was for stuff like the GitLab component aggregating rules.

A: ...value, and then there's no rate at any point, because there's no numeric change.

A: Maybe, but I didn't read the full GitHub issue about this, the closed one about having it assume zero. But as far as I understand, there were good reasons for not doing that, so it would probably give us other problems, yeah. And then the other thing is the solution that seems obvious but that you can't do: if you summed them before taking the rate, then this wouldn't be a problem, because you'd be aggregating across the fleet into one place, but that's also a thing that doesn't work, yeah.
E: Yeah, and the other way to consider it, and this is something we've discussed before, is that maybe the way to get around all these cardinality problems is just to have a really boring metric in the application: SLI successes and SLI total requests, and it's got no other dimension. Well, the only dimension it's got is basically the service level indicator component.

E: Debugging? Well, I mean, from a getting-it-in-there point of view it's fairly... it's not, you know... you'd have to have a middle... It was discussed in that service level monitoring v2 proposal, at the bottom, and there was a sort of thing in it where you could just have a yield block, if you want, and you just said, you know, this is my success criteria, and then, when it comes out of that block, it just counts the counter, and that's it, and it just has no labels.
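A minimal sketch of the yield-block shape being described, assuming made-up metric and helper names (the actual proposal lives in the service level monitoring v2 document): the block returns whether the operation succeeded, and the helper increments a total counter and, on success, a success counter, with the SLI name as the only label.

```ruby
# Minimal sketch; metric names, labels and the helper are assumptions.
def observe_sli(sli_name)
  succeeded = yield

  Gitlab::Metrics.counter(:application_sli_total, 'SLI operations')
    .increment({ sli: sli_name })

  if succeeded
    Gitlab::Metrics.counter(:application_sli_success_total, 'SLI successes')
      .increment({ sli: sli_name })
  end

  succeeded
end

# Usage: the caller states its own success criteria inside the block.
observe_sli(:service_desk_email_ingestion) do
  # `ingest_email` is hypothetical; any boolean success check works.
  ingest_email(email)
end
```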
E: It's really nice for us and for error budgeting, but from an application developer's point of view it's not. You know, they want to go and start looking into the dimensions and understanding, you know, what was it, a 302 or a three or a four, but...

A: I think there's definitely something there. I think for the service desk thing, for anything that we're alerting on, that probably is the way to go. For the Service Desk one it's more arguable, because it could become a metric that they're alerting on, but at the moment they're trying to use it as an informational metric, and they can't because it's so infrequent. So, I mean, maybe the answer is to make that an SLI, but...

D: I think what I'm trying to say, and also what I'm hearing here, is that we have different use cases for Prometheus. One is alerting, one is error budgets, and one is "I'm a developer and I want to see if a random thing happened in production" with a counter. So we can take different strategies for these different use cases, and for error budgets and alerting...
E: It's very easy to say, you know... at the moment there's kind of a little bit of magic where we look at the thing and then we know where to go in the logs. But with this it's just like: this SLI is tanking, and we know that when it's tanking it starts writing logs that have got the exact same labels on them, and then we can search for those logs and help with the diagnosis of the problem.

D: I'm just wondering, like... well, I think we started off with you saying, Andrew, that it's bad for debugging, and now you're saying this is good for debugging.

A: My feeling is... you know, we've got this MR, and I'm going to see how that goes, cardinality-wise and ease-of-use-wise. Then I might speak to the product planning team about their service desk metrics and see, if we do the same thing there, what does that look like? Does this look like a thing that we could reasonably generalize, or does this look like something where we just want to do it differently and have a different approach after all? Sorry.
A: Sorry, the metrics that the product planning team have created for Service Desk, that they asked me about, had the same issue where they were infrequent and the rates didn't match Kibana.

D: Right, okay, so that is... but that's not for alerting purposes, that's for informational purposes, like: how do people use our feature?

D: Right, but I think... okay, what sounds to me like the best idea right now, and maybe that's just the way I read the conversation, but the best idea sounds like: yes, initialize everything at zero, but at the same time do the best we can to push down the cardinality by going to the most Prometheus-friendly model for the case of alerting.

D: Yeah, and so eventually it'd be good if we could just... or maybe it would be good if we decide that we want to get to some point where we just have the lowest possible cardinality. And the other benefit that's in the back of my head is that once we count things in the application, that also means you can put limits in the application, and that's...
D: Yeah, and I think, from my limited view, this has been a pain point already with error budgets: the stage groups say "well, this thing is allowed to be slow because of X", and then we have to say "well, but the recording rule says it's not allowed to be slower than that, so go fix it", which is slightly inflexible, and it doesn't quite feel right.

B: There's an issue for that, on switching the way we do that now, like the whole thing for Apdex, to switch that from a histogram to two counters, which...

E: And if we can do that, we can start doing it with, you know, correlating the logs and the traces soon as well, hopefully. But the one thing I thought you were going to say, but actually you meant a different thing, was: if we start pre-initializing the values, we can also have CI checks to give people limits on the cardinality that we allow on metrics, because we can say, you know, pre-initialize and give us all the values that you expect for this thing, and then, you know, multiply those all out.
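The CI-check idea reduces to simple arithmetic once label values are declared up front: the series count is the product of the value counts per label. A toy sketch, with an arbitrary limit and made-up label values:

```ruby
# Toy sketch of a cardinality budget check; the limit and label values
# are made up for illustration.
MAX_SERIES_PER_METRIC = 500

def series_count(label_values)
  label_values.values.map(&:length).inject(1, :*)
end

declared = {
  method: %w[GET POST PUT DELETE],
  status: %w[1xx 2xx 3xx 4xx 5xx],
  feature_category: %w[source_code_management service_desk issue_tracking]
}

count = series_count(declared) # 4 * 5 * 3 = 60 series
raise "metric declares #{count} series (limit #{MAX_SERIES_PER_METRIC})" if count > MAX_SERIES_PER_METRIC
```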
D: It's an interesting capability, but is this a problem we run into in practice already, that we...?

D: I think what I like generally about this idea is that it just feels like using Prometheus the correct way, by pre-aggregating more and not trying to hold on to your precious data on the Prometheus server. You just need to count things and let go of detail; that's the philosophy of Prometheus.

E: But yeah, I mean, maybe we just skip the whole thing of changing the developer interface for adding metrics so that we can push them to pre-initialize, and just leapfrog that to the point where, on a class, you can pre-initialize your service level indicators for that class. So we kind of go next level.

D: We don't need to straight away force developers to do anything, because we just realized that we are the developers who do the worst cardinality.
E: Okay, so this came out of several different conversations with different people, and I felt, while they were being had, that Scalability should kind of be included in the conversation. So now you are. But it came out of several things. The first one that I think really kicked it off for me was that Craig Gomes opened up an issue about the value of the database peak performance issues, and there was a lot of discussion on that, and kind of...

E: The best thing for infrastructure to do is to do the investigation and get to an endpoint and a team in the stage groups where we can say, "hey, this is a problem". And actually, when you do that, often you'll find that there's already an infradev issue for that thing, and then, instead of having two issues that everyone's looking at, you've got it down to one issue, and you can say there are various different teams in the company who are now reporting...

E: ..."this is a problem", and you can get them to really focus on the prioritization. So that was where that came from. But then what came out of it was: well, it's really, really difficult to do this, and I tend to agree. If you see CPU, then... and this is all tribal knowledge... the next thing you'll do is you go to pg_stat_statements in Thanos, and you'll...
E: ...look for some patterns in there, and then you might go to the slow log, and then you might go and select some stuff from pg_stat_statements with queries to try and find the right query, and there's all of this magic happening along the way, and then you go to other Rails logs, and eventually you get there... and it's very difficult, and it's very difficult for people to grok all these different sources.

E: So that was the first thing. The second thing was that Jerry was saying it would be really nice if there was more correlation between the application and the database and what's going on in the database, which is actually a different side of the same coin. And then the third part, which Andreas was really talking about, is the attribution on the tables and having feature categories for tables, which I think we spoke about in this meeting as well.

E: Last week, maybe, yeah. So there are different sides to it, and so Craig opened up an issue, and because I've been thinking about it for a while, I did this really rushed sketch, and when I wrote it down it reminded me of that meme of the investigator with the board, like, yes, there are 20 different things and the wires between them, because that's what that thing I wrote in there was like. But basically, what I've done... let me share my screen.
E: Maybe... we've got two parts of that already done since yesterday, which is quite cool, I think, and I'll just explain the issue quickly as well.

E: So really, one of the big problems that I see is that we all talk about the Postgres query IDs, because we have those in Thanos, and so people say, "oh, we know that this query ID is problematic", but the only people that can translate those query IDs into SQL statements are DBREs and people who can access the primary database. So it's not a very friendly interface, it's not very democratized, and people can't get there.

E: So what I propose is that we provide people with a way to map from a query ID to a query. I played around with various different things, and then I realized that the easiest way to do this would be with a Fluentd plugin that occasionally polls Postgres and says: what are all your query IDs, and what are the queries for those query IDs? And then also, because we can do it now, we can say what the fingerprints are for those queries.
E: So we use this library called pg_query, and it's got a thing called fingerprint: it looks at the AST of a query and generates a fingerprint, and we're starting to use that as our unique ID for a query. It's not that good, because if you have queries with a different number of question marks in an IN clause they're different queries, but it's good enough. So what that will do is we'll poll the database and then we'll put that into ELK, and you can basically log into ELK and look up a query ID.
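A hedged sketch of the core of that plugin, outside any Fluentd scaffolding: read the query ID and query text from pg_stat_statements and attach the pg_query fingerprint, so the query ID can be looked up later in ELK. Connection handling and output are placeholders, not the merge request's actual code.

```ruby
require 'pg'
require 'pg_query'

# Sketch only: map pg_stat_statements query IDs to raw SQL and a
# pg_query fingerprint. Error handling and the Fluentd plugin wrapper
# are omitted.
def query_id_mappings(conn)
  conn.exec('SELECT queryid, query FROM pg_stat_statements').map do |row|
    fingerprint = begin
      PgQuery.fingerprint(row['query'])
    rescue PgQuery::ParseError
      nil # some statements may not be fingerprintable
    end

    { 'query_id' => row['queryid'], 'query' => row['query'], 'fingerprint' => fingerprint }
  end
end

# e.g. query_id_mappings(PG.connect(dbname: 'gitlabhq_production'))
#   => [{ 'query_id' => '-123...', 'query' => 'SELECT ...', 'fingerprint' => '02da7...' }, ...]
```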
E: You'll be able to get the query for it, and so developers will be able to do this off the bat. I put together a little merge request for that. It's pretty straightforward, it's just a little plugin, and it just polls... I think I've got it to hit every five minutes, because there's 5,000 queries, that's quite big, and I don't want to do it too often, and it doesn't change that often, so once every five minutes I'm sure is fine, but yeah.

E: If anyone wants to go and mock my terrible Ruby, this is a good place to do it, but Stan's already had a go at it, so that's cool. So that will give us a way... so instead of always just talking about query IDs and struggling to map them to actual queries, we've got that. Then there's the second part of it, which I think is going to be really interesting.

E: pg_stat_activity is a view of the currently running things in Postgres, so it says "right now, at this moment", and obviously we can only sample it. Really, it's not any good for constantly understanding things, but it's a useful sampling data source. We've had this thing called the marginalia sampler for quite some time, and effectively what it does is it collects... maybe if I just go back to here, and I go into here, this might explain it a little bit.
E: Deep links in GitLab. So what it does is, the marginalia sampler looks at pg_stat_activity, and because we have a Marginalia comment at the beginning of every query... this is the query here, I can see it.

E: Now we can do really horrible regular expressions which parse the marginalia and then tell us that right now there are 10 queries running that are for the API v4 jobs request endpoint, and there are five queries running at this moment that are for this other endpoint, and of the ones that are for the API v4 jobs request, they're all in lock contention states. So we kind of get the sample, and because it's just a postgres_exporter query, it just runs every 15 seconds, but I've found this view to be pretty useful.

E: This was a time series, and this was obviously during an incident. You can see that the number of active queries for the projects ID endpoint spiked up to 114 active connections on the primary database that were for that data, and then I just break it down here; you can see it was actually 180 at one point, so that's not good, and you can see what they're doing; you can see that a lot of them were idle in transaction.
E: So that's kind of a bad thing, and it's just this sort of view where you can break things down. How that ties into this is: what we can do is take that technique of sampling every 15 seconds, and we can extend it. At the moment we're just using the marginalia, but we will also take the query ID from the query, and we will fingerprint it with the pg_query fingerprinting technique, and then we can say that this query, with this query ID, is called by these endpoints. Or at least we've sampled it and we see that it happens a lot, because we'll record that in the logs as well, so we can say, you know, we saw this query a lot, and 95% of the time it was the API v4 jobs request.
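A rough sketch of that extension, with a simplified comment format and regexes as assumptions: for each sampled pg_stat_activity row, pull the endpoint out of the Marginalia comment and fingerprint the statement with pg_query, so the same fingerprint seen in pg_stat_statements can be tied back to endpoints. GitLab's real Marginalia comments carry more fields, and the real sampler lives in the exporter and logging pipeline.

```ruby
require 'pg_query'

# Sketch only; comment format, regexes and field names are simplified.
MARGINALIA = %r{/\*(?<comment>.*?)\*/}m

def sample_activity_row(row)
  comment  = row['query'][MARGINALIA, :comment].to_s
  endpoint = comment[/endpoint_id:([^,*]+)/, 1]&.strip

  {
    'endpoint_id' => endpoint,
    'state'       => row['state'],            # e.g. "active", "idle in transaction"
    'fingerprint' => (PgQuery.fingerprint(row['query']) rescue nil)
  }
end

# A row whose query starts with
#   /*application:web,endpoint_id:GET /api/:version/jobs/request*/ SELECT ...
# yields that endpoint_id plus the fingerprint of the SELECT, so active
# queries can be counted per endpoint and per fingerprint.
```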
E: And you know, we don't always know that. Where this will come in really, really helpful: for example, there was an issue where someone was saying that we were making too many calls to this query, and no one could decide whose problem it was.

E: It was kind of like a users table, and so people were like, "well, it could be a cookie", and there was a lot of fighting about it, and the theory was that there was one endpoint that was coming in and really smashing that query, but there was no way to do it. So we thought about extending this marginalia sampler, looking for that exact query, and then reporting on it.

E: But that was for slow logs... okay, right, thanks, yeah. It's actually in the same project. So, the difference: slow logs are great, but the problem with slow logs, as you know, is that there's lots of stuff that is very fast but very high traffic, which doesn't come out in the slow logs, and we don't get any attribution on that. So this is kind of a way of saying that this query belongs to this team, and we can...
E: ...we can do that. And then the third bit.

E: Basically, this is just my wish list of all time, which is: if we just enabled distributed tracing in production, so much of this stuff would be easier. And actually, weirdly, Igor's just enabled tracing for Gitaly, but using Google's Stackdriver tracing, which isn't very good, but it's better than nothing, so we're slowly making progress, because we have the request in there. But then the last bit, which Stan has just done, because it wasn't very hard...

E: ...you can go from something in pg_stat_statements to the traces that are generating it, and you can go to endpoints, because you know what the endpoints are, and you've really got full connectivity, or much, much better connectivity, between what's happening in the database and what's happening in the application than we've ever had before. Because we can just say: we know that this query was really bad, we know that this query is equal to this fingerprint, we know that this is the SQL...

E: ...so that's the SQL that was issued to the database, and obviously this is a bit difficult to grok, but we'll be able to add to it. What Stan's doing is he's adding the fingerprint here: he takes this and generates a fingerprint from it, and then we will be able to search traces by fingerprint. So if we have a database problem and we see a major spike in CPU, we should be able to tie that back to traces and endpoints very quickly.
A: I think the exciting part is the last part, about the distributed tracing. I was talking to Igor a bit about this yesterday, and even if it's just the...

E: The thing that's difficult is finding an infrastructure person, but actually, I think, you know, I think we should just do the Google thing, and then people will get the understanding of it and be able to... you know. I think we should just use Stackdriver and accept that we don't have people to focus on these things, and you know, we haven't got a... yeah.

F: It's been on the backlog for three years, as far as I remember, because just as we get to it, something more urgent pops up.

D: It sounds like it would be very worthwhile to somehow get it out there.
E: Yeah, so Igor was busy replacing... he has already started working on replacing LabKit's OpenTracing with Stackdriver. I forgot what it's called now... the Cloud one, OpenCensus, is the name of the API that Stackdriver tracing uses; it was Google's alternative to OpenTracing.

E: The good news is that, whenever it adopts OpenTelemetry, that is a much nicer API. The OpenTracing API is one of the nastiest APIs that I've had the experience of using as a developer.

E: Yeah, but having LabKit means that we can switch to that and drop it in, yeah. So that's the only part, and I asked Igor about this this morning, was whether Ruby has an OpenCensus library, and he said he thinks it does. Otherwise, that was the only thing I wasn't sure about, because obviously, if it doesn't, we lose a lot of the benefit, because there's a big black hole of...
E: I think, I mean, check in with him. You know, Igor's quite good at just competently getting through things and asking when he needs help, but it's worth asking whether he needs some assistance with that. Luckily, it's not that hard to add different endpoint APIs to the LabKit side of things, because it's a very high-level abstraction.

E: So it sort of, you know, instruments an HTTP request, and then the inside... you know, it's not a low-level API like OpenTelemetry or OpenTracing or anything like that, so you can replace them fairly easily.

C: I think it's also one of those things where, as much as we'd like to help with doing this, we do also have an awful lot in progress at the moment, and I'm always nervous about helping with another thing. I think if this was something we were going to contribute to, we'd need more space before we can say yes.
D: Yeah, I'm not suggesting we jump on this right now, but the fact that this has been stuck in development hell for three years, since Cape Town, and it's not happening, is not good. And it feels to me like it's close to what we're also doing with error budgets: it's about helping the organization scale by helping developers make better choices about how their software runs, and making it easier for SREs to understand what is going on.

D: And getting these tools to work right is a lot of work. I mean, just having something you can click on is step one, but then you find out that the data has holes or gaps, you can't correlate things, you can't find things. You get things like these fingerprints that Andrew was talking about. So once you have it, you need to make it good, and you need to keep it good. And right now...

B: This kind of tracing is something that developers have wanted since we started removing the thing that we already had, that I forget the name of now, but yeah.
F: Okay, so just to be clear: our mission is not to make developers happy. Our mission is to scale GitLab.com. If what we do on the path to scaling GitLab.com makes developers happy, that's great, but that's a secondary, tertiary, or whatever concern.

F: Okay, the second point is: we also have the observability team, which is supposed to be driving some of these items. Specifically, tracing was supposed to be one of the items they'd achieve, and last I checked, they also had site reliability engineering in their title. Now, who am I to say... I understand that there are many other, more important things that they need to handle. So what I would suggest here is figuring out with the observability team leadership, and they have leadership now, what their mission is, how...

F: ...how does that align with, you know, the project that you're mentioning here, and whether this is something in their remit. If it is, then great, they're going to have to prioritize it. If it's not, if there is no owner of this, then we can discuss how this fits into our own mission. And now, I understand this sounds like we're not prioritizing globally.
F: But if you check out all the projects that the team has been doing, we've maybe even been over-optimizing globally, right? We have been doing a lot of things that don't really move the dial as much as we want it to move, even though the team has been doing a great job.

F: So I would rather we go through being purposeful about this and making this a discussion between observability and scalability, and then, if we take ownership, we know why we took it, rather than because they didn't manage to get it done in three years. That would be my suggestion.

F: Again, it's absolutely fine; we can then do a combined project, none of that is a problem. It's more that, if we continue going into a habit of taking over items from the SREs because they are drowned by other work, I would rather figure out how Scalability can make the platform better for them, so they are not drowned in other work that is obligation-related.

F: Then... taking on the items that, you know, and I know, there is a dependency on, right? How are we going to do this if we don't know...

F: ...and so on and so on. And this is the part where we can discuss this a bit more; not discuss, but make a purposeful decision about it.
F: I'd be, for example, much happier if this team took on the task of Gitaly Cluster on GitLab.com running at scale. That would be a much more interesting thing to me. Whether it can run at scale or not is a different question, but that's the question I would like answered, rather than the tracing work, right, because this has direct impact on the application, direct impact on our capabilities of scaling, and has impacts all across not only our platform but other scaled platforms as well.

F: It's more focused on the application scaling side of things, right? And I'm not saying we shouldn't discuss this, so, Rachel, maybe you want to take an action item here to figure this out with Ken, but at the same time I don't want us to go in and just take on an item that they've been working on for a while.
B: With all the work that we've been doing... no, no, it's that it won't even take five minutes. It's just that it took longer than I thought it should have to write a line, a column, and that was because there are two kinds of tables in Grafana and Grafonnet doesn't have the new one yet. But since it's just JSON, you can do whatever you want. That made me think that maybe we could upgrade Grafonnet, but I didn't start that yet.

E: Somebody, actually a friend of mine, sent me a link: Grafana had a conference recently and they had a talk on a new way to do declarative things, because I think they're realizing that lots of people are using this now, sort of as a bit of a side project. I'll try and find the talk; I haven't had a chance to look at it yet myself.

B: Yeah, I think the biggest thing for us is going to be to switch over when they... I don't know where it's at right now, but they were going to start doing the generated thing, right, for performance, so it would always be in sync with what Grafana can do.