From YouTube: Scalability team demo 2020-07-17
A
Cool, thanks everyone for joining the scalability demo. We're a bit of a small crowd today, but Jakub, you have an item on the agenda.
B
Yes. It's unfinished, but I think there's already lots of food for thought and things I want to share. Let me share my browser, so it's easier to know what we're talking about. I'm working on an issue where we're trying to add latency metrics to all the Redis servers, because we're focused on Redis observability right now, so I really only care about this top row of the dashboard.
B
We have three distinct Redis services and they each have their own dashboard, but they look almost the same, and up until last week there wasn't even an error ratios panel here. For context, what we expect to see is four panels: latency (or apdex), error ratio, requests per second, and saturation. The error ratios are super boring: there are no errors.
B
This was already discussed last week, which is awesome, but I still think it's useful to have the panel. Because if you know how Redis works, or how we use Redis, then maybe it's obvious that there are no errors. But if you don't know that, it's not obvious, and this makes it obvious, and it makes it uniform. So I still think there's value in knowing this, but yeah.
B
So now I'm working towards the fourth panel, and that is latency, and that one is less rosy than the errors, because it turns out that latency isn't all that great, or it's not what I expected it to be. And now I need to digress into another specific technical aspect here, which is that we need to measure these things on the client side.
B
Redis is meant to be very fast, and if it logged everything it's doing, it would slow down just by having to do all that logging. So you get only very limited information out of a running Redis server about what's going on. It's great that it's fast, but that limited information is a challenge.
B
So in this particular case we decided to measure on the client side, and that is a fairly sane approach, because GitLab is mostly a monolithic code base and almost all the Redis use is in the monolithic part of the code base. So if we measure Redis there, then we probably see 99% of what we do with Redis anyway, so we might as well say that this is how the Redis server is doing.
B
That's a practical solution, but yeah, some things look slightly different. One thing I found surprising is that, in my mind, Redis is supposed to be fast, and it is, if you think about what the Redis server has to do. But if you think of the bigger picture, of the experience from the side of a Redis client, it turns out that Redis is slower than I thought.
B
So I was thinking a Redis request should take like one millisecond, and five milliseconds would be really slow for Redis. That was my expectation. So what I did is we created a histogram in the Rails code base. Actually, this code is new and wasn't there yet. The slowest type of request we could observe was 100 milliseconds (these are seconds), so anything slower than 100 milliseconds...
B
...I thought was so terrible that we don't even want to think about it. But it turns out that if you look at 99th percentile latency, it's just constantly above 10 milliseconds. And another thing I had to learn here, which makes sense in retrospect, is that the Prometheus histogram can only come back with the value of the largest bucket: any observations that are higher than the largest bucket value get reported back as that largest bucket value.
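(A minimal sketch of that clamping behaviour, using the Python prometheus_client; the metric name and buckets here are made up for illustration, not the real GitLab ones.)

```python
from prometheus_client import Histogram

# Hypothetical client-side Redis latency histogram; buckets are illustrative.
REDIS_CALL_SECONDS = Histogram(
    'redis_client_requests_duration_seconds',
    'Time spent on Redis calls, measured from the client side',
    buckets=[0.001, 0.005, 0.01, 0.1],  # largest finite bucket: 100ms
)

REDIS_CALL_SECONDS.observe(0.002)  # lands in the 0.005 bucket
REDIS_CALL_SECONDS.observe(0.750)  # only lands in the implicit +Inf bucket

# When the p99 falls into the +Inf bucket, PromQL's
#   histogram_quantile(0.99, rate(redis_client_requests_duration_seconds_bucket[5m]))
# returns the upper bound of the largest finite bucket (0.1 here), so every
# observation slower than 100ms reads back as exactly 100ms.
```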
B
So all this tells us is that there's a bunch of calls that took more than 10 milliseconds, and yeah, the 99th percentile is always above that. Except for this one thing, which is the shared state Redis. So sometimes this is fast, and then you get 0.06, so that is six-ish milliseconds... that's not really... yeah.
B
Right, this is again at 99 here, but it's dipping to values like 97, and then this cache instance is more like 98.6, and then here it dips again to 97.4. And this is the Sidekiq Redis, and this dips to 85, meaning that 15% of requests at this point took longer than 10 milliseconds from the client side. So I've been having a discussion with Andrew about this: this is not usable for our latency monitoring, and we just need to own up to the reality that, from the client side, these things are slow, and we need to have buckets that allow us to reflect that.
C
Yeah, I mean, the thing to keep in mind as well is that that's obviously including the network round trip and the data transfer, and we do some rather abusive things with our Redis, with very big objects that we shouldn't really be storing in Redis. Traces come to mind here.
B
Yeah, exactly. This is one of the things I learned as we worked on Redis: people always say Redis is single threaded, but it's really the part of Redis that looks at the request and writes the response somewhere in memory that is single threaded. It then tells Linux to send that response back to the client, and that then gets done by the kernel in different kernel threads. So yeah, it can take much longer because of the data transfer as well.
C
They say you'll get much higher throughput because of that as well. So there's more to come. That's great.
B
What we're discovering here is exactly how bad things are on the client side. And what I would really have wished for today is that I could show you this dashboard with a fourth panel here, and I could say: look, this is how we're doing. And I can't, because I had to go back and look at the buckets. Which dovetails with another interesting topic that I'm just going to mention for a moment. This came up...
B
Sean brought this up a while ago, and then Andrew also brought it up, and it's sort of going around that we have too many latency buckets on our histograms, and this is flooding our Prometheus servers. And I took this as an exercise to say: let's have very few buckets in the first merge request. There are four buckets here, and this is a very small number by GitLab standards.
C
And I wonder whether this would be something that would generally be a better way to go. Because you can see the pain that this, you know...
C
I think you're right here. Also because we've kind of decided we're not going to use latency histograms in this way, because they don't work for us, because we have other cardinality problems that are coming into play, like the number of servers. We're saying we want three buckets, but a lot of clients are going to say: what kind of buckets are these? We're not going to get any resolution.
C
Our histogram estimations are going to be terrible, and so maybe it makes sense to extend out what we did on Gitaly to a broader audience.
C
Also, if anyone else is using the latency apdex scoring, every time we change these buckets we'll probably break theirs, which won't please them very much. No.
C
Yeah, I mean, we should never say that the metrics are stable, because that would really hamstring ourselves. But it is a thing to think about.
B
Yeah, but this is a mistake waiting to happen. And just to spell it out for those of us who are not super familiar with this: the apdex panels we have in the metrics catalog are standardized. This is the system Andrew has been building, and they all use the same formula, the same Prometheus query, and these do counts on exact bucket values.
B
So if you move the bucket boundaries... for computing percentiles from histograms, moving bucket boundaries is irrelevant, because the quantile calculation just takes whatever buckets exist and computes the value out of that.
B
But the way we do apdex, we care about actual bucket values, and with this slow process, where we change the buckets, merge it, wait for it to be deployed, it can be that we find out the next day we broke our alerting, and then we need to go back and change it again. And if this was in configuration, then we'd be able to fix a mistake like that much quicker.
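(To make that coupling concrete, a minimal sketch of an apdex-style query of the kind being described; the metric name and the 0.1s boundary are hypothetical, not the actual metrics-catalog definitions.)

```python
# The query hard-codes the literal bucket boundary in the `le` label, so it
# only works while a bucket with exactly that boundary is being exported.
APDEX_QUERY = (
    'sum(rate(redis_client_requests_duration_seconds_bucket{le="0.1"}[5m]))'
    ' / '
    'sum(rate(redis_client_requests_duration_seconds_bucket{le="+Inf"}[5m]))'
)

# If a merge request changes the buckets to, say, [0.05, 0.25, 1], the
# le="0.1" series silently disappears, the numerator matches nothing, and
# any alerting built on this query breaks -- noticed only after the deploy.
print(APDEX_QUERY)
```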
C
Yeah, there's another sort of related issue that I ran into this week, which is that I realized we can treat the Google load balancers that we have in front of a lot of our services in the same way that we treat HAProxy, as a kind of external ingress, but we kind of count it as part of another service. So we do that for registry and stuff, and one of the nice things about the Google load balancer...
C
...but the problem is that they use a purely logarithmic calculation for the bucket sizes, and so the bucket sizes have got these really ridiculous names, like 10.000423895, ten-digit-long latency values. And so I was trying to figure out: is there a way that we can do it without those exact values? And the best I could come up with was a regular expression, which is probably worse than nothing.
C
No, but in jsonnet, the satisfied threshold is a float, but we can change that.
C
So there's one more little extension to that, which is one of the interesting things we could potentially do, but we'd have to change the Prometheus client that we use in Rails a tiny bit. Because one of the things I'd really like to do is put those thresholds into the application, and not into, you know, a separate place, so we only have a single source of truth, right? And one of the things we could do is...
B
No, I want to come back to apdex, but finish your thought first. Okay.
C
So because then what you can do is you can expose that on the latency histogram, but only on the special thresholds, and then you can basically sum those and divide them by the infinity bucket, and you don't have to know what the values are. You just say: sum the thresholds and divide that, and then you get the apdex, and then it's all managed in one place. Yeah.
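(If I follow, a sketch of that idea might look like the following; the metric name and buckets are hypothetical, and the real change would live in the Ruby Prometheus client rather than Python.)

```python
from prometheus_client import Histogram

# Hypothetical: the application declares its apdex thresholds as the only
# bucket boundaries, so the query side never needs to know their values.
REQUEST_SECONDS = Histogram(
    'hypothetical_request_duration_seconds',
    'Request duration with apdex thresholds as the bucket boundaries',
    buckets=[1.0, 5.0],  # satisfied and tolerated thresholds
)

# A dashboard can then sum *all* finite buckets and divide by the +Inf
# bucket, without hard-coding any `le` values, e.g.:
#   sum(rate(hypothetical_request_duration_seconds_bucket{le!="+Inf"}[5m])) / 2
#     / sum(rate(hypothetical_request_duration_seconds_bucket{le="+Inf"}[5m]))
# Dividing by 2 works out because summing both cumulative buckets counts
# satisfied requests twice and tolerated-only requests once -- exactly the
# 100% / 50% apdex weighting.
```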
B
That makes more sense, because now, even if we push the bucket definition into config, then we still have Chef somewhere, and then we have the runbooks somewhere else, and we need to make sure we use the same numbers. So yeah, that would be smarter. What I wanted to ask about apdex is...
B
I've also been looking at the second Google SRE book, the SRE workbook, where they talk about alerting and these sorts of things, and how to define SLOs. And it seems like apdex... intuitively, I'd say the simplest thing is the number of requests that are below a threshold, but it's actually slightly more complicated than that. The way I understand it is... what does it actually mean? Because it's the average of the satisfying...
C
If it comes in less than or equal to tolerated, you give it fifty percent, and that's the weird thing, but other than that you treat it as a latency error. But a lot of our apdex thresholds don't have the tolerated, so they've only got a satisfied, and there are various technical reasons why it's like that, and for those requests it's effectively a straight error budget, you know.
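(A worked example of that scoring as described, with invented numbers.)

```python
# Hypothetical one-minute window of requests against a 1s satisfied and a
# 5s tolerated threshold (cumulative counts, as histogram buckets report them).
total = 1000       # all requests
satisfied = 900    # completed within 1s
tolerated = 980    # completed within 5s (includes the satisfied ones)

# Apdex: satisfied requests count fully, tolerated-but-not-satisfied
# requests count half, anything slower counts zero.
apdex = (satisfied + (tolerated - satisfied) / 2) / total
print(apdex)  # 0.94

# With only a satisfied threshold it degenerates into a straight error
# budget: the share of requests over the threshold are "latency errors".
print(satisfied / total)  # 0.9
```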
B
So basically... and it's much easier to do. So exactly, my question is... I think the original apdex was even more complicated.
B
...requests. But thanks for that explanation. Why do we need the more complicated version, instead of just saying satisfied or not?
C
So really what it is, is just kind of sticking to the standard. But I tend to leave out the tolerated, because of exactly the same thing that you see here: it's much easier to cognitively process. It's like: this is a pure error budget, in the same way the errors are. But what you'll find is that, for a lot of our services...
C
...at the moment, we kind of need that half score in order for them not to look really bad.
B
99... But this is very interesting, because we had the same problem in the discussion about how to set these Redis buckets. The status quo is that things are as fast as, however, they are, and we're trying to come up with SLOs. And I think the way I look at it now is that we probably want to have SLOs that capture what is going on now, and say: this is reasonable, even if it's not. And...
B
...at least then, if something slows down and becomes even worse than it is now, we know about it. And then, exactly, the future step is that we tighten those SLOs and say: actually, we want Redis to be twice as fast, and we're going to start measuring whether we're there, and once we get there, we lower the alerting threshold.
C
That's exactly it. And all of the thresholds that we've got are based on what was available, right? What you're doing now is manipulating the buckets, but almost everything else is like: this is the best bucket we've got, so we're going to use that, and over time we will start improving that. Sorry, you had something to say, Marin?
A
Yeah, I wanted to actually ask, because in my head I thought that the reason why we wouldn't want to use tolerated in the apdex calculation is because, like you said, it's cognitively easier to grasp, but I also don't know...
C
Okay, okay, just checking.
A
But what I was trying to understand is: doesn't that also mean that you are forcing yourself towards a higher standard, because you don't have that in-between state? So why is that bad?
C
So it's not bad, and that's something we would like to get to. So say at the moment we have a one second and a five second threshold for Workhorse (I'm just going to use hypotheticals), so the choices are: okay, we're going to drop the five second and just use one second. Then you might find, and this is all hypothetical...
C
And, you know, the thing is, if you go back six months, when I started writing all of this stuff, I was kind of like: if I go through and change every exporter, every application, every histogram, this is going to be years of work. So let's use what we've got and then iterate on that. But I don't think that those are the, you know...
C
But the thing is, there's one part where they're still the same, and that is they still use the same thresholds for the apdex scores. And if we want to do this properly, we've got to break those apart as well, because if we go and tighten all these thresholds up, it's going to screw up the contractual ones without any change in the actual latency.
B
So that makes a lot of sense. But especially when you're talking about SLOs and SLAs, and maintaining a more complex set of thresholds with higher business consequences if we mess them up, it seems like it's even more important to have a system that is sane to work with, something that's maybe a bit more friendly to work with than what we have now. Because right now we have buckets that are hard coded or that don't make sense, and we have runbooks that need to match things that exist in the buckets, and we use apdex to create in-between thresholds.
C
So there are several approaches we can take. The first thing is, I think, we've got to split this. It's actually not as difficult as it sounds, because we only need this now because the SLA metrics are only recorded on the key services, and that's git, web, api, registry and ci, I think. Are there other key services?
C
A bigger part, yeah. But what I'm saying is, it's actually a smaller subset of the services, so it doesn't really matter as much for those ones, but we could go with those and define separate thresholds for them. The other way that you could do it, which is a little bit more... I think it might be too confusing for people, but it's almost on a per-request basis: in the application, we have a counter, so effectively...
C
...the counter is: did this request complete within the tolerated... within a reasonable amount of time? And then effectively you divide that by the total number of requests and you get the same thing, but then you can vary it per endpoint, and again you're pushing it back into the application, and you're not relying on the histogram infrastructure in order to build up that score. Effectively, it's just an error counter. It's an error counter!
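(A minimal sketch of that per-request counter idea; the metric names, endpoints, and thresholds are hypothetical.)

```python
import time
from prometheus_client import Counter

REQUESTS_TOTAL = Counter(
    'hypothetical_requests_total', 'All requests', ['endpoint'])
SLOW_REQUESTS_TOTAL = Counter(
    'hypothetical_slow_requests_total',
    'Requests that exceeded their latency threshold', ['endpoint'])

# Per-endpoint thresholds live in the application: a single source of truth.
THRESHOLDS_SECONDS = {'projects#show': 0.5, 'api/v4/jobs': 2.0}

def record(endpoint, started_at):
    """Count the request, and count it again as a latency error if slow."""
    REQUESTS_TOTAL.labels(endpoint=endpoint).inc()
    if time.monotonic() - started_at > THRESHOLDS_SECONDS.get(endpoint, 1.0):
        SLOW_REQUESTS_TOTAL.labels(endpoint=endpoint).inc()

# The latency "error rate" is then simply
#   rate(hypothetical_slow_requests_total[5m])
#     / rate(hypothetical_requests_total[5m])
# with no histogram infrastructure involved, and the threshold can vary
# per endpoint.
```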
B
That might have the advantage that, if we want to head towards a future where we can break things down by feature category, you might be able to see... if you...
B
But like, the current breakdown of web and API requests is quite coarse, and if we want to have a situation where we can say (I don't want to pick on any feature category, but some feature), we want to be able to say: hey, compared to your goals, your feature is often quite slow in production, and maybe you need to spend time looking into that.
B
No, I think, from an optimization point of view, it makes some sense, because you're only counting the things that you need to count, and it will probably scale better in the end. But it's highly specialized to what we're doing, and it might be, like you say, more confusing to people, if it's not histograms but our own "the one part about histograms we care about" counter.
C
Yeah, I mean, you could just say that we make it like Gitaly, where you can turn off the histograms, or you can turn them on and use your own buckets, but effectively, for the service level metrics, we're just counting an error rate. Yeah.
B
Yeah, it's interesting. My main takeaway is that the current system is too fragile, and we need to do something, and this would be one way of dealing with it. And your other idea, of configuring buckets and saying these are the...
C
The only risk I have with that is, you know... for me, and I don't know if this is a real risk or not, we just have to test it: Prometheus is quite delicate about label matching. And I don't know, if you had a histogram made out of, say, 10 labels, and eight of those labels matched on everything except le, and then two of them, you know, say they have the is-apdex-threshold value...
B
For the benefit of the people who don't have this in their heads, let me spell out what I think we're talking about. Prometheus, in its internal data model, has histograms, which are things you just set up with buckets, and you say: I saw a number, and it puts it in the right bucket. And these get exported as a series of individual counters; that is sort of how the wire format works. When you have a Prometheus client, you define a histogram...
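(For reference, a sketch of what that wire format looks like, with a hypothetical metric and the Python client.)

```python
from prometheus_client import CollectorRegistry, Histogram, generate_latest

registry = CollectorRegistry()
h = Histogram('demo_seconds', 'Example histogram',
              buckets=[0.01, 0.1], registry=registry)
h.observe(0.05)

print(generate_latest(registry).decode())
# The one histogram is exposed as individual cumulative counters keyed by
# the `le` label, plus a sum and a count (alongside HELP/TYPE lines and,
# in newer client versions, a _created timestamp):
#   demo_seconds_bucket{le="0.01"} 0.0
#   demo_seconds_bucket{le="0.1"} 1.0
#   demo_seconds_bucket{le="+Inf"} 1.0
#   demo_seconds_count 1.0
#   demo_seconds_sum 0.05
```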
B
The other thing I...
C
Just... yeah, but...
C
Yeah, what do you mean, you don't support... weird. Yeah, so I agree. And so maybe then going with the... effectively, the latency error counter is the other thing that's kind of interesting about that approach...
C
...which is that, at the moment, and I don't want to jump too many steps ahead, but it's something to think about: at the moment with apdex, we always score it so that good is 100% and bad is zero percent, and then with error rates, good is zero percent and bad is a hundred percent. And if we just started treating a latency error as a different type of error (I'd still keep them as two separate dimensions on the dashboards, for example), then effectively good is zero on both of them, right? And then, if you...
C
But in fairness, the system that we've been using kind of predates the metrics catalog by at least a year and a half, maybe two years. We've been using apdexes for all of this stuff, and it doesn't break very often, if at all, and we do have alerts to make sure that if one of those series disappears, we get an alert on it. And the only one we get that for is the errors, but that's a different thing. So it doesn't...
B
It's hard to make changes to, and we were...
B
...saying that we are taking the average of two types of latency error rates, or latency non-error rates, to compensate for not having the right buckets, because editing all these things is such a pain. So yeah, the real problem is that fine-tuning the system has a lot of friction.
C
Yeah. And then, you know, using a separate counter, where you can just plug the threshold in on that concept... the other thing that's interesting about it is that you could then potentially (and this is just ideas, ideas are free)... one of the things you could potentially do is, you could say the threshold is one second, and then in the application they could say...
B
Ideally... The other thought I had, in favor of your latency violation counter idea, is that clearly using histograms correctly is hard, and this allows us to move away from histograms, and we can make it clear what these things are for. These things are for alerting, and this is our latency alerting infrastructure, and if you want to know how fast things are, that's not what this is for.
C
I think that was a good first step, but ultimately it's something we want to move away from, towards the violation counter. But, you know, if we'd gone to that as the first step, it would have been way too much, and everyone would be like: what the hell is this thing?
A
I would like to interrupt you, if possible, and ask a question that is going to seem like a very amateurish attempt at being in the conversation, but I'm sorry, you kind of went way too far, for me at least. I want to ask a question about the Redis latency metrics that you're working on, and the fact that you said we're actually not measuring what the Redis server is doing, we're measuring what the Redis client is doing within the GitLab code base, right?
A
What kind of information will we be able to extract when we see that things are going sideways? Like, how are you going to use that information to resolve whether a code change created something, or Redis's network is failing, or whatever?
C
This is what we've been doing for this dashboard for quite a while, like at least nine months, and this value over here actually comes from something called rails-sql. So rails-sql has got an apdex and an RPS, and it doesn't have an error rate, I don't think; we don't do any sort of error tracking there. We probably should, but we don't at the moment. I think it wasn't immediately apparent, and there have definitely been times... I'm just trying to find...
C
...a good... I mean, I'm trying to remember. I think in November we had a bunch of problems with Patroni, and the signal was really useful to have. Obviously the downside of it is, and I had to explain this on a few incident calls, that this thing is reading as Patroni, but it could equally be a bad application...
C
...you know, like a poor application change, right? And so there is a little bit of extra cognitive overhead for the person on call, because it's saying Patroni (let's just call it PgBouncer, because Patroni is a terrible name for the service), this service is running slowly, but actually, because we're proxying the data from a different service...
C
...it could be something in between the two. It could actually be, you know, PgBouncer. It could be the application itself. It could be any number of things that's causing this. So there is a bit of cognitive overhead, but at the same time it's also super useful, because without it we wouldn't have that sort of latency information, and it is a very useful signal, even if it's, like, flawed in a way.
B
Trying to remind myself to also come back to Marin's question: I'm reminded of working on Gitaly, where we would have latency information, and that doesn't tell you so much, and then I would be able to go into ELK and look at access logs, and then you can find out where it's coming from. And we're just not in the same situation with Redis, because there is no access log. And so the question is: the latency number says something is bad, now how do we find out...?
B
We do have some access-log-type information, because the Rails access log prints the total time spent in Redis calls. So if a particular thing is hogging Redis, then you can find it in Kibana, based on that, in the Rails logs. So yeah.
A
The reason why I'm asking this is because, if I think about the work that the CI/CD team is doing right now, to enable traces to go through object storage, or to go to object storage, what they're going to end up doing most likely is leveraging Redis for this: they're going to buffer things and, like, post them there. And I don't know how they're going to make that performant before they upload it to object storage, and theoretically that one can impact us significantly with this number.
A
But how would we find out when they enabled that, or when they turned it on? Like, how will we know that it was that, and not...
B
...five other things? Yeah: the Rails application log. Because of the work we've been doing in this epic, we have the total number of Redis calls, we have time spent in Redis calls, and we have bytes sent to Redis and bytes returned from Redis, per Rails request or per Sidekiq job. So if you have a Sidekiq job that is pumping a ton of data into Redis or out of Redis, that will show up in those logs, and you can aggregate that by controller and action. So you can see: hey, this controller...
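(As a sketch of how those log fields could be used; the file name and field names here are illustrative, not necessarily the exact GitLab log schema.)

```python
import json
from collections import defaultdict

# Aggregate structured Rails/Sidekiq log lines by controller#action,
# summing the per-request Redis fields mentioned above.
totals = defaultdict(lambda: {'calls': 0, 'duration_s': 0.0, 'bytes': 0})

with open('production_json.log') as f:
    for line in f:
        entry = json.loads(line)
        key = f"{entry.get('controller', 'sidekiq')}#{entry.get('action', '')}"
        totals[key]['calls'] += entry.get('redis_calls', 0)
        totals[key]['duration_s'] += entry.get('redis_duration_s', 0.0)
        totals[key]['bytes'] += (entry.get('redis_read_bytes', 0)
                                 + entry.get('redis_write_bytes', 0))

# The heaviest Redis users, by total time spent in Redis:
for key, t in sorted(totals.items(), key=lambda kv: -kv[1]['duration_s'])[:10]:
    print(key, t)
```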
A
Okay, so I have one ask for you, Jakub. We are at time, so I have one ask for you, now that you explained this so well: could you please add this to the epic, as in, what kind of benefits did we actually get from the work we've been doing? Because I've been digging a bit this week into the work that is being done, and I understand, like, when I read the issue...
A
If nothing else, put it as a comment in the epic discussion section and get Sean to edit it. But, you know, it can be more than that, because I think there is more than what you just said. There are, like, a lot of really useful things that we did; it's just that I can't explain them in, like, one high-level overview.
B
Yes, I feel your pain. But from my point of view, these are the most important ones: that we have the rich data in the Rails access logs, so we can point fingers at requests, and that we are getting some more signal on whether the Redis service is healthy or starting to slow down.
B
What is the most important thing? I have a take on that, but that is very personal, and it pains me to say it, but it might even be that the things I worked on look more important to me, so maybe I'm very biased and wrong. But I can still give my opinion, so I'm going to put my opinion on the epic and say: I think this is awesome, that we can do these things now. And maybe someone else on the team will come in and say: well, actually, this other thing we worked on is also great. But that...
A
...try to explain, from the high level, why I think this is kind of important to actually note down with every epic that we are doing (and Rachel is going to get a task to ensure that we keep doing this): we know that we improved the platform over the past, let's say, year.
A
We know that there were, like, a lot of efforts that pushed us over the line to now say that we can target that 99.95 reliably.
A
All of these small things, under air quotes, matter a lot. Because if you think about us being in the incident calls a year and a half ago, we would spend nine hours in an incident call trying to discover what is happening and not finding it out. When you have information like this, you spend five minutes.
A
Instead of nine hours, right? Yeah, you get that information, which means that you can quickly find the culprits, which means that you can quickly resolve it, which reflects in the SLA in the end, right? So I want to be able to present that the scalability team is doing this type of work, and even if it's not our key performance indicator, this is the performance that the team is putting out there.
B
Yeah, this is exactly how I would... What I think should be the ultimate value that we're providing, or one of the ways we are providing it, is: how long does it take to find out what the problem is in an incident, and, in the case of the Sidekiq queues, if something is wrong with the Sidekiq queues, how long does it take for them to go back to normal?
B
I presume that this is one of the areas where you made a big difference with the Sidekiq work, so that makes total sense to me, and it aligns with how I imagine things work. The one thing I want to say is that, as someone who is not in incidents, I can say I think this helps, but I don't feel like I know this helps. Like, will this type of thing now get resolved faster? Have we communicated this well enough? Can we measure whether it gets resolved faster?
A
That is all really hard to measure as well, because you can give the perfect tool to someone who doesn't know how to turn it on, and it doesn't really matter then, right? But ultimately, if we jump into calls (when I say we, I mean all of us in this call and watching this recording), it means that we can actually achieve something.
B
Yeah, I'll sit down and write a comment with my personal take on what I think are the highlights of what we just gained.
A
Perfect, thanks for sharing, Jakub. I think this was a good discussion, even though there were parts of it that I didn't even follow.
B
I was doing my best to interrupt the discussion sometimes and try to keep it from going out of control, but...
A
Yeah, it's fine, but I got my two questions in that I actually wanted to ask, so that's good. Cool. Well, I think we can end the call here, and we'll chat all later. I hope you have a good weekend.