From YouTube: Scalability Team Demo - 2021-04-22 (First call)
A
Yeah, I've got the one and only item on the agenda right now, and I think everybody on the call already knows about it, so we can do it quickly. Let me share my screen to show you the merge requests.
A
So those SLIs already have a feature category that defines them, and that's just in the JSON definition. They have a feature category, and to include those in the error budgets we want to add that label when the metrics get recorded into Prometheus.

The first attempt I did added that label directly into the recordings that are also used for the key metrics, which are used for alerting. Those are the things you look at on the service overview dashboards and so forth.
A
That didn't go so well, because we were getting alerts on traffic being increased: the metrics would get aggregated together, so we would count the metrics without the feature category label and then add the metrics with the feature category label, and that would count double. It would look weird and would tell SREs to look at things that weren't actually true, because we didn't have more traffic. So we reverted that, and the second iteration was to record this separately from the key metrics.
A
So we have a recording that records exactly the same thing as the key metric, but adds a static feature category label, and we use that for the error budgets. This is obviously a bit annoying because we record the same thing twice just to add an extra static label, but that way I was sure it wouldn't get in the way of anything else.
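A minimal sketch of the double-counting problem described above and why the second iteration avoids it. This is not GitLab's actual recording-rule code; the series and values are made up purely to illustrate the aggregation behaviour.

```python
# Pretend these are per-second request rates Prometheus has recorded.
key_metric_series = [
    # original recording used for alerting, no feature_category label
    {"labels": {"type": "web"}, "value": 100.0},
]
labelled_series = [
    # the same traffic recorded again, with a static feature_category label
    {"labels": {"type": "web", "feature_category": "source_code"}, "value": 100.0},
]

def total(series_list):
    """Sum every matching series, like a naive PromQL sum()."""
    return sum(s["value"] for s in series_list)

# First attempt: the labelled series were added to the SAME recording used for
# alerting, so an unscoped sum() sees both copies of the same traffic.
print(total(key_metric_series + labelled_series))  # 200.0 -> traffic "doubled"

# Second iteration: keep them as separate recordings (separate metric names);
# alerting keeps using the key metric, error budgets use the labelled one.
print(total(key_metric_series))  # 100.0 for alerting
print(total(labelled_series))    # 100.0 for error budgets
```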
B
Bob, it's weird: when I reviewed that first merge request, there was this bug in the back of my head, and I was like, I don't think this is going to work. Then I looked at it and thought, no, it'll work. Then, no, there's something wrong. And as soon as it happened, I was like, oh, obviously you've got to be super careful about adding dimensions to those existing aggregations, for that exact reason.
A
Yeah, but it's no big deal. The thing is, I think we all need to do this once. I remember Sean wrote some documentation about changing the key metrics when using different sources and so on, and that was also triggered by him changing sources.
B
And we've actually got a workaround now that's much safer, because before, the apdex was measured over an hour; there was this weird thing that we did, and so when you changed an apdex you would get this sort of decay of the old one dropping off, which was really irritating. I don't know if you remember that.
C
Yeah, so if you had the new one, it wouldn't have started until we started recording the new one, and if you had the old one, it would stop at the point we started recording the new one, and then it would still count after that.
A
Yeah, this is different. The SLI stuff would probably just remain the same, but the problem was in the things that were using the metrics recorded from an aggregation set, which wouldn't take into account the labels in the final aggregation set. So that's the discussion that Sean brought up: are we going to keep two recordings forever?
A
We don't have to. I think we can figure out a way to get around that, but I'm kind of wondering if it's worth it.
A
C
Would it just add more risk to that, or take advantage of everything being off?
B
I have a different opinion, or rather the same opinion but for a different reason, and that's that, from a replaceability point of view, it's better to keep them apart, because we can rip one out and replace it. If we tie the same metrics into lots of different things, all change becomes riskier, right? And that's at the small cost of having a few extra recording rules.
B
C
A
That is a very good point as well, because in the same merge request, both merge requests actually, I've also added validation for the feature categories defined on those SLIs, to check that they're still feature categories that exist in our handbook stages.yaml file. That means that when product decides to rename something, we would otherwise have this exact problem. Well, yes! So I think the conclusion is: let's keep them separate and keep the extra recordings.
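A rough sketch of the validation idea mentioned above: check that every feature category referenced by an SLI definition still exists in the handbook's stages file. The file paths, the exact YAML layout, and the function names here are assumptions for illustration, not the real implementation.

```python
import json
import yaml  # pip install pyyaml

def known_feature_categories(stages_path="stages.yaml"):
    """Collect every category listed under the stages/groups in the handbook file (assumed layout)."""
    with open(stages_path) as f:
        stages = yaml.safe_load(f)["stages"]
    categories = set()
    for stage in stages.values():
        for group in stage.get("groups", {}).values():
            categories.update(group.get("categories", []))
    return categories

def validate_sli_definitions(sli_paths, categories):
    """Return a list of errors for SLIs whose feature_category is unknown."""
    errors = []
    for path in sli_paths:
        with open(path) as f:
            sli = json.load(f)
        category = sli.get("feature_category")
        if category not in categories:
            errors.append(f"{path}: unknown feature_category {category!r}")
    return errors

# Example: fail CI if product renamed a category that SLIs still reference.
# errors = validate_sli_definitions(["slis/web_requests.json"], known_feature_categories())
```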
A
B
So my take on this is: right now I think we have several hundred machines running Git and we have like five machines running Prometheus. That's a pretty good ratio, and that's just Git, that's not even the rest of our fleet. Okay, this actually comes up in the efficiency-of-Kubernetes discussion, but that's for another day.
B
But my point is that the ratio of the observability stack compared to the rest of our stack is really, really small at the moment, and if we get better value from having these extra recording rules, then adding a few more machines is fine. I guess my point is that we shouldn't be skimpy on these things if we get value from them; let's carry on with it as long as it's technically feasible, and I think it is technically feasible.
C
Yeah, like I said, I might have to drop off when someone comes around to give me a quote for removals. So I'm just going to talk through it first of all, because it's going to be more fiddly to stop screen sharing as well.
C
In case I have to jump off quickly: I'm coupling two things together which aren't directly related, but it seems like a good opportunity to do them both at the same time. When we switch to having a single queue for the catch-all shard, we will have a problem which we don't currently have: if you deploy a new worker to canary, obviously canary doesn't run Sidekiq jobs, but it can schedule Sidekiq jobs.
C
So it can schedule that worker. Currently that's fine, with an asterisk, because it will have its own queue, and Sidekiq on the main stage won't start listening to that queue until we also deploy Sidekiq.
C
So basically those jobs just wait for however long it takes between us deploying canary (and scheduling the job) and us deploying main. That's already kind of a gotcha that I think we should be a bit careful about, because if you add a job that does something when someone does a thing in the app, and that's not behind a feature flag and is just available once it's deployed to canary, and then we deploy to production like 12 hours later and the thing in the background runs, that's kind of weird.
C
So I think we need to document that anyway. With a single queue for the catch-all shard, those workers will go into that queue and start being picked up immediately, but they will fail, because the worker class won't exist on the main shard, sorry, on the main stage, because Sidekiq won't have been deployed there yet.
C
So we need to do something. Another thing I was thinking about: we set the Sidekiq default retries to three a while ago, which is very, very low. That means that if your job fails, it will try again three times, and with the exponential backoff that takes a couple of minutes. And if our database goes down for, say, five minutes, like it did...
C
Some
point
in
the
last
few
months:
I
can't
remember
all
those
jobs
in
that
point
will
fail
and
then
we'll
go
to
the
dead
jobs
queue,
but
we
never
ever
look
at
that
and
that's
always
full.
So
that's
basically
saying
they
failed,
and
so
what
I
want
to
do
is
change
the
default
number
of
retries
back
to
the
psychic
recommended
default,
which
is
25,
which
happens
over
a
course
of
like
three
weeks
on
average,
and
that
seems
to
be
the
sort
of
psychic
recommended
way
of
handling
this
case.
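A quick sanity check of the "25 retries over about three weeks" figure. Sidekiq's default backoff is roughly (retry_count ** 4) + 15 seconds plus some random jitter; the jitter is ignored here, so this is an approximation of the schedule rather than the exact Sidekiq implementation.

```python
def approx_retry_window(max_retries):
    """Approximate total seconds covered by max_retries attempts (jitter ignored)."""
    return sum((count ** 4) + 15 for count in range(max_retries))

seconds = approx_retry_window(25)
print(seconds / 86_400)  # ~20.4 days, i.e. roughly three weeks

# With the old default of 3 retries the whole window is tiny:
print(approx_retry_window(3))  # 62 seconds plus jitter, a couple of minutes at most
```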
C
Where you've tried to run a worker that Sidekiq doesn't know about, it'll just retry, get a NameError, retry, get a NameError, and eventually run. So what I'm going to do is explicitly set all existing workers that we have to use three retries, because I don't want to change the behavior of any of those; some of them might want three, some of them...
C
...which is probably a more sensible default. The original reason we changed it to 3 was that a lot of our workers at the time hit external services, and if that external service is down there's not really much point in retrying over the course of three weeks. Also, some workers are time-sensitive, or might fail in expensive ways. But I think it's much better to keep the default at 25 and have people handle those cases on a case-by-case basis, rather than have the default cater to those sort of exceptional cases.
D
C
No, I think we should still account for it the same way we do now. I think if you have a job that takes 19 retries to work, that should count as 18 failures and one success.
C
I think that's reasonable. I think the...
C
The better way to solve that is to not get into that situation in the first place: if you're deploying some code that can schedule a job, put that behind a feature flag or something, so it's not immediately scheduling stuff that, like I said, wouldn't work anyway. If we deploy to canary and then it takes us two days to move that to production, that's already a two-day delay that has nothing to do with the retries.
C
It's just to do with how we deploy things. So I think you'd already be better off using a feature flag to handle that case and thinking about that, and like I said, I need to document it. But I think we should count the failures as failures, because they are.
A
And I think in the end, for new features, where we'd currently hit this single-queue worker thing when deploying to canary: since they are new features, they aren't immediately going to be heavily used and so on, and hopefully they're even behind a feature flag.
C
Exactly. I had a look back over about a month, and we've had a few, but none of them are really a problem. You can just look for a high scheduling latency, and that's sort of a good proxy, because these will end up with a scheduling latency of like eight hours or whatever it is, because they were scheduled while they were on canary and then they would run once they were on main.
C
B
Just as an idea: what about if all new workers went... because I'm kind of worried about the 25 retries thing, because I think it puts extra work on the developers to know about that. If they are using an external service, they have to know that they have to set it to three, and we're kind of putting extra stuff on people.
B
Yeah, what about... I haven't thought this through fully and it probably doesn't work, but what if all new jobs that got introduced went into, I don't know, a "behind the ears" queue, and then that queue... yeah, that doesn't really work. Because then you don't have to worry about them getting retried, because there's nothing... no, no, I haven't thought it through.
C
A
B
A canary stage for Sidekiq, so that is actually really the problem here, right? There's an impedance mismatch between our workflows here, and there's an issue that's as old as the hills, which is canary Sidekiq. It's actually kind of amazing that we are still running Sidekiq without a canary.
C
Yeah, I think sort of a theme of some of the work we're doing now, though, is to try and work with, rather than against, Sidekiq. So yeah: use fewer queues, use the default retries, do what it's...
C
...asking us to do. I do see what you mean about not running a Sidekiq canary. Basically, this is only a problem that affects GitLab.com; it will affect some...
D
C
...downtime customer deploys, yeah, but they're not likely to have separate... yeah, exactly, they won't have a canary. It will just be because they've deployed the part of their fleet that schedules jobs before they deploy the part of their fleet that processes the jobs. So it's not going to be several days; it might be on the order of an hour or two, so we could go lower than 25. I just think the default...
B
...puts us back in line with the way Sidekiq is intended to be used, so I think that's all good. My other question, about the canary, is: I imagine that when we have zonal Sidekiq clusters, it'll...
D
B
...be much easier for us to run a Sidekiq canary cluster, because it's kind of just another zone, in a way.
B
D
B
C
B
I think so. This is something I discovered yesterday: we've got the three zonal clusters, so git and api run in the zones, but then sidekiq... sorry, not sidekiq, the canary of those things runs on the regional cluster, which to me is very surprising. I was like, wait, what?
B
Why is this? So the canary is spread across everywhere, and people were like, oh, but we don't want to have these node pools, because each canary one is going to be too small.
B
But the thing is, I kind of think that if we just... and then the other problem is that Helm doesn't understand stages, right? That's one of the reasons why we can't deploy the canary stage into the existing clusters. So either we need to do a whole bunch of work around that, or another option would be: we've got three zones at the moment, we make it four zones, and the fourth zone is...
B
Because they're different pods and they're different containers, right? But it's all a bit... yeah.
D
C
B
Whatever it is, the most important thing is that people can grok it. From the last day I've been like, wait, what, really? And that's only going to get worse as we move more workloads over there, so keeping it simple is as important as any technical solution, right?
C
Oh yeah, speaking of keeping it simple, one other thing I decided to do was at least figure out what the current catch-all-on-Kubernetes Sidekiq queue selector means, because it's like a 9,000-character line that just picks a bunch of queues by name.
C
B
Yeah, and someone was like, yeah, it's kind of terrifying. I don't know why; there was some reason. There's a reason, Scobeck knows, but...
B
C
B
C
...reasoning to say: well, okay, this could also fix this other thing that we need to fix now, so I'm just going to do both at the same time. I am still going to play around with this a bit, but that's where I am.
B
C
Exactly. And if you have a high retry-to-success ratio, at the moment that could be masked by just three retries and then failure. Again, this wouldn't apply to existing workers initially, because we'd need to go through them one by one and check them, but for new workers it would be nice to be able to be a bit more confident and say this isn't just failing three times and going dead, because we've added metrics for that.
B
C
B
C
Sorry, I'd say we could already change it. We sort of blocked that on the issue to create the dashboard, which I think Jakub was going to take. So we'll create the dashboard, which will also create the recording rules, and then we'll create the alerts after the recording rules have been in place for a while, because we don't want to add the recording rules and immediately alert on them.
C
B
This is just a silly thing that I was working on immediately before this call, and it's a terrible hack, but I'm working on all the Kubernetes monitoring at the moment, and one of the other unusual things is that we have a lot of node pools, far too many. For the git service, we have separate node pools for the shell and the http, ssh, and https components.
B
But the other thing is that all of these different pools are fixed in size, and some of them are probably maxing out, and we don't have any way of monitoring that; Stackdriver doesn't export any information for this. So the first thing I started doing was: okay, we're just going to manually sync it up, so we can have a manual list. Then I realized they're all different and there's no rhyme or reason to the sizing, because there are three pools in Terraform that are kind of copy and pasted, and I think what's happened is that people have gone and updated the pool sizes.
B
But they've only done it on two of the three, or one of the three, so they're all different, and that's only going to carry on, right? So I decided against manually syncing it up, because there are like 25 different numbers that we'd have to keep synced just for production, and then we've also got staging and everything else. So I just decided it was much easier to do an automatic thing.
B
I ask Terraform to give me all the information about the state that it manages, and then I just use a jq script to pull things out of that and turn it into Prometheus metrics, and I push that to the push gateway, which is like the world's biggest hack. But it lets us get on with more important things.
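A rough Python equivalent of the pipeline described above (the version described in the call uses a jq script): read Terraform's view of the node pools and emit Prometheus exposition-format lines that can be pushed to the push gateway. The resource type and `node_count` field are what the GCP provider uses, but treat the exact JSON paths, the metric name, and the push gateway URL as assumptions for illustration.

```python
import json
import subprocess

def node_pool_metrics():
    # `terraform show -json` dumps the current state as JSON.
    state = json.loads(subprocess.check_output(["terraform", "show", "-json"]))
    lines = ["# TYPE terraform_node_pool_size gauge"]
    for resource in state["values"]["root_module"]["resources"]:
        if resource["type"] != "google_container_node_pool":
            continue
        values = resource["values"]
        lines.append(
            'terraform_node_pool_size{pool="%s"} %d'
            % (values["name"], values.get("node_count") or 0)
        )
    return "\n".join(lines) + "\n"

# The output could then be sent to the push gateway, e.g. via
#   curl --data-binary @- http://pushgateway:9091/metrics/job/terraform-state
print(node_pool_metrics())
```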
B
C
G
B
Yeah, push gateway has a slightly surprising approach; I wasn't sure about it. It has state, right, and then you have a thing called the grouping key. It's arbitrary, but normally people use job and instance as the grouping key. For Tamland, the job is tamland and the grouping key is basically the page that you're showing, and then whenever you push to that job and grouping key, all the existing state is flushed out and replaced with whatever metrics you push in.
B
So you push in a group of metrics, and then you can delete them. I guess the thing that I didn't understand about push gateway was that it has state that stays like that until the next time you push, and then it doesn't just overwrite the individual metrics; it basically removes everything under a particular grouping key and job.
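A minimal illustration of that push gateway behaviour using the Python prometheus_client library; the gateway address, job, grouping key, and metric name are made up for the example.

```python
from prometheus_client import CollectorRegistry, Gauge, push_to_gateway

registry = CollectorRegistry()
g = Gauge("tamland_page_rendered_timestamp", "Example gauge", registry=registry)
g.set_to_current_time()

# push_to_gateway does a PUT: everything previously stored under this
# job + grouping key is dropped and replaced by the metrics in this registry.
push_to_gateway(
    "pushgateway.example.com:9091",
    job="tamland",
    grouping_key={"page": "saturation-overview"},
    registry=registry,
)

# pushadd_to_gateway (a POST) would instead only replace metrics with the same
# name and leave the rest of the group's state alone.
```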
B
But yeah, you can see on that merge request, if you scroll, there are lots of different numbers there, and none of it's very... but hopefully once we get monitoring on this, it'll show up where the problems are in the current setup, because we'll start alerting on node pools that are saturated.
C
Yeah, that would be... yeah. That's an item for the infrastructure Christmas quiz: what's the maximum node pool size of shell-1 versus... well, there's...
C
B
D
B
D
B
Yeah, so there's definitely... I don't know, I don't have a full story here, but I think now that we're starting to look at the monitoring of this, it'll become clearer, because at the moment, at least in my head, it's all quite opaque. I don't really know; Kubernetes is just doing things, but we need to understand whether we're running things right, because it seems like the step up has been pretty drastic.
D
It
just
also
I
mean
if,
if
the
sidekick,
if
our
challenges
with
sidekick
and
raiders
have
have
taught
us
anything,
it's
that
we
need
to
be
making
sure
that
we're
using
the
technology
the
way
it's
supposed
to
be
used
and
yeah.
G
B
So if there are other things we want to stick into Prometheus that we need... at the moment I sometimes do these syncs manually, so there's a comment on the runbooks saying "if you change this, also change this", and then on the other repo, "if you change this, also change this". This is a real hacky way around that, so it's something to keep in mind if you find yourself writing those kinds of cross-references.
B
Yes, I think the Terraform will always run on ops because of the permissions, and...
B
Yeah, so I don't know if it's just me, but I kind of wanted to talk about two things. The first is around expectations of what those minutes are: when we say 20 minutes in a month, how we derive that as opposed to 99.95, because maybe this is just me reading...
A
G
A
Yeah, I think it makes it hard, but yeah, it's something that...
D
Hard? What is it you think makes it hard? What are the two...
D
A
Using minutes to communicate this makes the error budget hard to grasp. It's okay if you are a group like source code management or continuous integration and so on, which have plenty of traffic, but there are also groups that have very little, so one bad request has a much bigger impact, and then they see the minutes go down super fast.
A
B
So in my mind, and this might not be true because I haven't spoken to enough people about it, I imagine that when some of the product managers think about minutes, they think that GitLab.com goes down, somebody starts a stopwatch, for three minutes GitLab isn't available, and then they stop the stopwatch and we assign that time to a stage group. Then...
B
D
That's also very important, yeah. Well, that's what I was going to ask next: thinking about what you said there, they start the stopwatch and they end the stopwatch, because it's the difference between on and off, like GitLab.com is...
G
D
B
D
So I think we need to insert that. Regardless of percentage or minutes, the first thing that we need to do in that presentation is make an adjustment to say: your number is about your stuff. Your stage group number is about your specific stage group; we're not taking anything else into account. And that's why the attribution of feature categories on your Sidekiq workers and on your requests and all of the other pieces is important, because this is how we read this stuff out to you.
D
...is what number do we want to show, and the reason that I've been pushing for having minutes is that I feel the difference between 10 minutes and 15 minutes is more pronounced than the difference between 99.95 and 99.98, a 0.03 percent difference.
B
D
B
A
...is, I think, a very good idea. It goes from 100 down to however negative you are, and 100 means you haven't spent any of the 0.05 percent that your feature category, that your features, have. It goes from 100 percent down to however much you've spent.
D
Well, I think she was talking about it more as the minutes value, rather than translating it into another percentage value. So, for example, if we start with 20...
D
F
A
The thing is, we're going to be showing, I think, three numbers now. We're going to be showing availability; that's the thing that you can compare with the GitLab.com availability, and we want it to be above 99.95. The second number we're going to show is 20 minutes counting down; that's the thing that everybody seems to like, the 20 minutes, and if you have slow requests then that number is going to go down. And the third number starts from 100.
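A quick worked example of how a "roughly 20 minutes" budget relates to the 99.95% target mentioned above; the 28-day reporting window is an assumption for illustration, not a confirmed detail from the call.

```python
target = 0.9995
window_minutes = 28 * 24 * 60          # 40,320 minutes in a 28-day window
budget_minutes = (1 - target) * window_minutes
print(budget_minutes)                  # ~20.2 minutes of error budget

# A 30-day month gives ~21.6 minutes, which is why the headline figure is
# usually quoted as roughly 20 minutes per month.
print((1 - target) * 30 * 24 * 60)
```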
A
G
D
B
Because it feels like we're trying to give them too much. If that's a problem, you say you used 25 minutes out of 20 minutes, or you used 15 minutes out of... so you always keep the "out of ...", right? And then people can, if they want to, work that out as a percentage; it's very easy to do in your head.
A
D
We've had Jackie's feedback and I think one other piece of feedback, and I think perhaps we just need more data points. I'm not saying that Jackie's opinion is wrong, it's...
A
B
Yeah, exactly. But also, what we're talking about here is subtraction of numbers below, hopefully, 40, right? So hopefully most people can do some level of this in their head. My concern is that if we have two things that are percentages... percentages are unitless, and so it becomes very confusing whether you're talking about a percentage of requests or a percentage of error budget. That's the only part that I don't really like.
D
I was going to say, just for the first iteration: we're going to be doing the AMA in the first week of May, so for the first iteration let's just have availability and minutes remaining, and we'll leave the budget-remaining percentage for a future round.
C
Yeah, that makes sense to me. I mean, I don't really have a strong opinion on the middle one. I kind of agree with Andrew that it's weird that...
C
B
So the other thing that I was a little bit concerned about, and maybe again it's difficult to read this out of a single presentation, is that apdex and latency are totally included in that, and it feels like maybe Jackie thinks that's a future iteration, whereas I want to be clear that this is in here now and it is counting towards the error budget.
D
A
Yeah, this has always been there. We always had both, but we started with one component, that component being Puma. Now we're adding, and that's the thing I talked about in the first part of the demo, everything else except for Sidekiq, and then we're going to add Sidekiq too.
D
B
The thing that's important for people to understand about latency is that, the way you measure it in this regard, it's just a type of error, and we treat it as an error. That's what apdex is: if a request is above a certain latency it's an error, and if it's below that latency, and the threshold for Rails is one second, then it's a success. In the same way we say: if it comes back with a 500, it's an error.
B
If
it
comes
back
and
the
200
is
a
success,
it's
just
the
same
thing,
because
I
think
people
get
really
kind
of
confused
as,
like
you
know,
if
you're
talking
1995
percentile
latency
of
like
0.9
seconds,
they
don't
that
there's
no
sort
of
trivial
way
to
take
that
and
turn
it
into
an
error
budget,
because
it's
a
it's
a
different
way
of
measurement
right.
It's
a
it's
a
percentile
value,
and
so
that's
why
I
think
a
lot
of
product
managers
are
like.
B
We
don't
have
a
story
for
for
latencies,
because
they're
still
thinking
in
like
95th
percentile
latency.
How
do
you
convert
that
into
an
error
budget
where
we
we're
not
we
even
not
even
using
that
we
thinking
of
latencies
as
a
as
a
boolean,
effectively
on
every
request,
either
passwords
or
failed
its
latency
test.
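A small sketch of the "latency as a boolean" idea just described: each operation either passes or fails its latency threshold, and those failures are spent from the same error budget as real errors. The thresholds, urgency names, and numbers here are illustrative only, not GitLab's actual values.

```python
URGENCY_THRESHOLDS_S = {"high": 0.25, "default": 1.0}  # e.g. 1s for Rails requests

def apdex_success(duration_s, urgency="default"):
    """True if the operation met its latency target, False otherwise."""
    return duration_s <= URGENCY_THRESHOLDS_S[urgency]

def error_budget_spent(operations):
    """operations: list of (duration_s, had_error). A failure is an error OR a slow operation."""
    failures = sum(
        1 for duration, had_error in operations
        if had_error or not apdex_success(duration)
    )
    return failures / len(operations) if operations else 0.0

ops = [(0.2, False), (1.4, False), (0.3, True), (0.5, False)]
print(error_budget_spent(ops))  # 0.5: one slow operation plus one 500 out of four
```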
A
Yeah, and I think for apdex in particular we might also stop calling things requests, because it's not always a request, it's something that can...
D
B
What we're doing is exactly what everyone's doing, don't get me wrong: the SLI with the threshold and the above and below. There are lots and lots of really good talks about this being the way to do it, and we're following that; it's the industry standard. It's just that we call it apdex, where other people just call it an SLI.
C
Yeah, on the naming thing, our metrics did use to call these transactions, but then that also got confusing.
B
I
I'm
talking
with
alessio
at
the
moment
about
an
aptx
for
deployments
where
a
request
is
a
merge
request.
Actually,
I
suppose
it's
a
request
and
it's
how
long
it
takes
to
get
into
production,
but
we,
you
know
we're
using
the
same
approach
there
as
well.
So
you
know.
C
D
A
Could you also make the changes for what we're going to display? Because we've gone like...