From YouTube: Scalability Team Demo - 2021-05-20
B: Yes, thanks, so, yeah.
B: I'm in the middle of doing a little investigation, and I'm still sort of stumbling around trying to understand what I'm doing, because in my experience that's how investigations work. You can go back and say "we set out to do this experiment, we did it, and this was the outcome", but in reality you don't know what the experiment is when you start. Or you sort of have a rough idea of what the experiment is, but you don't know exactly how to run it.
B: We are doing this work to have fewer queues, because it's better for Redis throughput, and we have another project on the horizon: to have multiple Redis servers, have different deployments of the application, and have Sidekiq talk to different Redis servers. But, like, how soon will we need this? Because the idea is that if we have a more efficient Redis because of having fewer queues, we can process more jobs. But how many more jobs?
B: So, yeah, I should try to talk you through what I'm doing. I have a single Redis server, which is the same specification as what we use for Sidekiq right now. That's a c2-standard eight-core machine on Google Cloud. So that is the Redis server, and I now have two virtual machines.
B: So right now there are two machines that are simultaneously producing and consuming jobs, and the good news is that, with a queue-per-shard configuration, I now have it sort of... I can...
B: At that point I was using one big machine as the workload generator; now I'm using two machines. And I still don't understand well enough why at one moment the limit is 32,000 and at another moment 17,000. But the way I'm observing it now, and this is one of the things I learned:
B: I got Prometheus working on the Redis server, and I'm just using the Redis exporter, which is integrated by default, so that's very nice. The graph I ended up using is the one that shows rates per command, because that gives you a fairly good indication of what's going on. The problem with the load-generating part is that it's a lot of different processes, and if you have to scrape all of them with Prometheus you get a very complicated Prometheus configuration. It's much simpler to have Prometheus scrape the Redis server. And I just restarted the simulator.
B: And so what we're seeing here is something like thirty-four, thirty-five thousand push operations, and around 17,000 operations that correlate with jobs being consumed. I actually modified the workers so that they ping Redis, so the pings correspond to jobs that are actually running.
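A minimal sketch of what that worker-side ping could look like, assuming a standard Sidekiq server middleware and the redis-rb client; `PingMiddleware` and the connection setup are illustrative, not the actual benchmark code:

```ruby
require 'sidekiq'
require 'redis'

# Illustrative middleware: issue one PING per processed job, so the PING
# rate on the Redis exporter's per-command graph tracks real job
# throughput rather than queue bookkeeping operations.
class PingMiddleware
  def initialize
    @redis = Redis.new # assumes the default Redis connection
  end

  def call(worker, job, queue)
    yield # run the actual job
  ensure
    @redis.ping # one PING per job, whether it succeeded or failed
  end
end

Sidekiq.configure_server do |config|
  config.server_middleware do |chain|
    chain.add PingMiddleware
  end
end
```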
B: So this shows a job rate in the order of 17,000. And one weird thing that happened here, because I left this running overnight, was that the whole thing tanked and another command went up. So let me just... like, this was running well, and then all of a sudden job processing tanks and these ZADD and ZRANGEBYSCORE things go up. And I realized that the load generators generate logs, and the disks were full.
B: Yes, yeah. Earlier on I thought I'd found a very good result, but then also all jobs were failing, and that also correlated with the ZADDs and ZRANGEBYSCOREs because, yes, that's the retry set. So I'm still trying to get a good grip on it, but yeah. If I zoom out a bit here, I had a long spell of running at 17k jobs per second, so that's good. And, yeah, here there's a spell where it says PING at 22,000 per second.
B: So I did something right there, but I'm not quite sure what I did right. But even if... yeah, 17,000 and 22,000 is still a lot more than what we process today, because our job rate is more like in the one to two thousand range.
B: So it looks like a single Redis server... this was a very long-winded way of saying that it looks like a single Redis server can process, like, deliver about ten times as many jobs as we're processing right now.
B: Let's see if I have... this is a flame graph I captured just now. It's a little distorted, because you can see the background save in here, and that is not part of the single core; that runs on a different core. But if you ignore that, you see that a very large part is libc write.
B: Well, if you look, so this is 900 samples... 920. libc write is 1,000 samples, libc read is 700 samples, so that means 1,700 samples in I/O and 900 samples... or what is this? This is 586.
B: I think it uses a fairly small payload, so we could make the payload bigger, of course, if we're getting concerned about I/O. I can also try turning on threaded I/O for the Redis server and see what happens there.
B: But, yeah, I guess the other part that I failed to mention is that if we don't do this queue-per-shard thing, then we stall way earlier, around three to four thousand jobs per second, so that...
A: So you're saying that what you've seen in the tests is that without single-queue-per-shard we stall between three and four thousand per second, and with single-queue-per-shard we stall at about 22,000.
B: Yes, 22,000 was the best number; right now I'm stuck at 17,000. Yeah. And another thing to keep in mind, and I was talking about this with Sean, is how to interpret this, because we're getting alerts at way lower levels. Right now we're in the 1.5 to 2 thousand range, and the experiment suggests we could do double the number of jobs.
B: You can push systems harder in an experiment than you can in real life, because your model isn't perfect. And another thing I'm wondering is... I don't know if our concern about Redis for Sidekiq is mainly about CPU saturation, or about things where we see users being impacted, which would mean job delivery taking too long. Because it can be that there's an operating range where our CPU alerts constantly go off but users don't notice. And users don't care if our CPU alerts go off; users care if jobs get delivered fast enough. So I guess another way to say that is: if I observe 22,000 or 17,000 jobs per second as the saturation number, we'll probably never want to run at that, because at that point the CPU would be at 100%, right?
A: Yeah, but I think what I was struggling with when I commented on the issue yesterday was that it was hard to understand whether 22,000 jobs per second was something that we could practically take into account when talking about this. But I think what you're saying now is that, like, on the test rig, it's the difference between...
B: And this is the part I'm still honing in on, so I want to have...
B: I want to come up with a fair experiment where I can say: this is what we get without queue-per-shard, and this is what we get with queue-per-shard, and have a good apples-to-apples number to compare. Because I think the outcome of the experiment would be a multiplier, saying, like, we...
B: We are stalled here, and once we're on queue-per-shard we see this multiplier at the saturation point. And then the other part of my argument is that if at 100% CPU we can do this, then at 80% CPU we can probably do 80 percent of that. So, like, if we're happy at 80% CPU, then we're doing, I don't know, 2,500 jobs... and, yeah, so to scale it... that suggests that it would need to be scaled down. Yeah.
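To make that headroom arithmetic concrete, under the linear-scaling assumption and using the best observed saturation number above purely as an illustration:

$$\text{throughput at } 80\%\ \text{CPU} \approx 0.8 \times 22{,}000 \approx 17{,}600\ \text{jobs/s}$$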
D: Yeah, the way I was thinking of this was: if we do queue-per-shard, we expect the CPU utilization to go down, obviously, but we also expect it to grow slower, which is what Jacob's demonstrating here, right? That's why we get so much more headroom: because of the much slower demand on CPU time as we increase job throughput.
B: Yeah, another thing that I find interesting is that, just judging by command rates, a lot of the work comes from the semi-reliable fetcher. There are different ways to approach this thing, and there's sort of an optimal way to do the reliable fetch that we can't do, because we have multiple queues.
A: I know that I've asked this before, but please refresh my memory. I know the semi-reliable fetcher is something that we wrote, but why did we need to write that? Because we had so many queues?
B: Yeah, and... well.
B: Well, I don't really know the answer. I can make up an answer, because I've looked at the difference between the two implementations, and what I noticed is that...
B: The idea of reliable fetch is that you can't drop a job on the floor if the Sidekiq process crashes. The problem right now is that if a Sidekiq server accepts a job, it's no longer in Redis; it's in the Sidekiq process. And then, if you take the Sidekiq process away, the job is lost.
B: So jobs need to move from the incoming queue to the in-progress set, or thing, and there's an atomic instruction for that in Sidekiq... sorry, in Redis, which is perfect, because then you can never lose the job: by the time you get it, Redis has already put it on the other thing. But that atomic instruction only works...
B
If
you
pull
from
exactly
one
queue
and
if
you
pull
from
more
than
one
queue,
you
need
to
ex
issue
that
instruction
for
every
queue,
so
we
would
have
to
issue
it
400
times
and
the
code
we're
using
the
reliable
fetcher
gem
does
something
naive
where
it
waits
five
seconds
in
between
doing
that,
so
it
it
would
400
times
try
to
safely
fetch
from
one
queue
and
then
wait
five
seconds
and
then
try
again.
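A minimal sketch of the two fetch shapes being contrasted, using the redis-rb client; the queue names and timeout are illustrative, and this shows the underlying Redis primitives rather than the reliable-fetcher gem's actual code:

```ruby
require 'redis'

redis = Redis.new

# Optimal reliable fetch, only possible with a single queue: BRPOPLPUSH
# atomically moves the job into a working list, so a crash between the
# pop and the push can never lose it.
job = redis.brpoplpush('queue:default', 'queue:default:in_progress', timeout: 2)

# With many queues there is no single atomic instruction, so a naive
# reliable fetch has to issue the move once per queue (400 times here):
QUEUES = ['queue:a', 'queue:b'] # ...imagine ~400 of these
QUEUES.each do |q|
  job = redis.rpoplpush(q, "#{q}:in_progress") # non-blocking per-queue move
  break if job
end
```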
B: ...the queues are empty, yeah. So it's a peculiarity of...
C: No, but they do mention you can only use a handful of queues. So if, like, we need to do this operation 400 times, or whatever you said, with no jobs in those queues, that's wasted time. If you need to do that six times, that's not as good as one, but it's not as bad as four hundred.
D: Yeah, I think they've recently updated their wiki page, or I misquoted it before, because it does now say... I can check that, because it's a wiki page... it now says a handful of queues per process. Which, you know, we have: different processes running different sets of queues. It's just that the catch-all one runs on a ridiculous number of queues at the moment.
B: Yeah. Anyway, what I'm trying to say is that...
B: It could be that there's also some extra room to be clawed back just by taking a closer look at the reliable fetcher, but we're not even close to that right now, because right now we need to do something about the queues; we're just stalled on the number of queues.
C: What are you going to try next to make the experiment more...
B: So, and I guess today I was trying to use a 60-core machine and run lots of Sidekiq servers on it. But then, I don't know, you start running out of file descriptors and other weird things happen, and then the system misbehaves for the wrong reasons.
B
So
now
I
am
hoping
I
can
generate
enough
load
with
two
machines.
Maybe
I
need
three
three
machines
at
some
point.
Do
I
need
to
remote
control
all
of
them
over
ssh
with
a
script
or
do
I
need
to
change
the
interface
of
the
script
where
you
need
to
press
enter
three
times
now?
So
it's
I'm
still
sort
of
feeling
my
way
around.
B: Yeah, there's a little bit too much copying in that repo; I'm trying to reduce the number of copies, yeah. I really would like to have this wrapped up by the end of the week, so that's sort of the rough time box I've given myself.
C: So the situation right now is: we have two middlewares, one that we use to record the metrics we use for the service SLIs (so the web, API, and Git services' SLIs), and then another middleware that does a measurement for the error budget Apdex. So we record the same duration twice, with different labels. In the first one, the one for the service metrics, we didn't record an Apdex measurement, so we didn't record the duration...
C
If
the
request
raised,
I
don't
know
if
that
means
that
we
didn't
record
it,
because
I
don't
know
if
that
app.call
thing
in
the
middleware
would
return
normally
with
the
status
500
or
if
that
would
blow
up
and
not
record
so
if
it
would
fall
back
to
the
insure
block
or
not
yeah
right
now,
I've
got
a
merger
quest
out
to
make
both
of
them
do
the
same
thing,
but
I
yeah
I
like,
I
think,
that's
a
good
first
step,
but
then
I
would
like
to
talk
about
what
we
want
that
to
be
actually.
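A minimal sketch of the question being raised, as a plain Rack middleware; the metric call is a placeholder, not the actual GitLab middleware. With this shape, the ensure block records the duration whether @app.call returns a 500 normally or raises (in the latter case, status is nil):

```ruby
# Illustrative Rack middleware: record a duration even when the
# downstream app raises, by measuring in an ensure block.
class DurationMiddleware
  def initialize(app)
    @app = app
  end

  def call(env)
    started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    status, headers, body = @app.call(env) # may raise instead of returning 500
    [status, headers, body]
  ensure
    elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
    record_duration(elapsed, status) # status is nil if @app.call raised
  end

  def record_duration(elapsed, status)
    # placeholder for the real metrics call
  end
end
```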
F: So, when you listen to lots of the conference talks... there's a really good one called "How to Measure Latency" or something like that, by a guy named Heinrich; I'll find it after this. You'll hear he's very specific: he always says "valid requests", and that's basically 2xx and 3xx, and not 4xx and 5xx. And so we've always known that we don't have that, but I guess the main reason is the cardinality of including the status on a histogram.
F: Categories, you can just... yeah, yes: valid and invalid. But we lost Jacob. The thing that I was going to say is that, actually, from an operational point of view, and just knowing the way that our systems in particular fail, I think that having that sum of invalid, or at least the histogram of the invalid things, is still really useful. Just take the problem that we're seeing at the moment with that one endpoint...
F
It
returns
a
four
something
and
it
takes
30
seconds,
and
we
don't
know
why,
and
if
we
didn't
have
a
histogram
of
that.
That
would
be
unfortunate,
I
think
so.
I
think
yeah
having
having
it
as
a
as
a
you
know,
not
excluding
it
but
having
it
as
a
as
a
boolean
or
very
low.
Cardinality
label
would
probably
be
a
good
approach.
B: I mean, well, there's also the classic failure of timeouts: you have 502s after 60 seconds. Those are a combination of bad requests and bad latency.
D: So each request could attain one point. The denominator is just one per request, and the numerator is the number of points we actually got. So at the moment we say (and I'm going to use "valid" and "invalid", but our definitions of valid and invalid don't match the ones Andrew mentioned): fast valid gets one point, slow invalid gets zero points, slow valid gets half a point, and fast invalid gets half a point.
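A minimal sketch of that scoring scheme exactly as described; the helper name and the sample data are illustrative:

```ruby
# Illustrative scoring: each request contributes 1 to the denominator;
# the numerator is the points it earns.
def request_points(fast:, valid:)
  return 1.0 if fast && valid    # fast and valid: full point
  return 0.0 if !fast && !valid  # slow and invalid: no points
  0.5                            # slow valid, or fast invalid: half a point
end

requests = [
  { fast: true,  valid: true  },
  { fast: false, valid: true  },
  { fast: true,  valid: false },
]
score = requests.sum { |r| request_points(**r) } / requests.size.to_f
puts score # => 0.666... (2.0 points earned out of 3 possible)
```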
D: I'm just using "point" to mean, like, this is one... yeah, it goes on what's on top of the fraction. What fraction? The fraction is the total availability score, right? So, like, the total number of... let's say, for the requests. Yeah, so say you have 100 requests that all succeed in good time; then you have 100 divided by 100, so you have 100%. But...
C: But we do the same thing for our SLA, like for our availability... service availability.
F: Instead of... yeah. So it's... because we have two different gates that people go through, right, and obviously it's just a different denominator, so it kind of comes out slightly differently.
F: It is easy to calculate, but I think it has benefits as well, and I'll tell you the main benefit. If we were doing it the exact way of, like, the SLO book and everything like that, we'd get one SLO, and that represents errors and latency, right? And I think that we have different failure modes when we have 500s and when we have things slowing down, even though they are sort of both bad user experience.
C: I linked it just now, in the demo doc: that's the equation we're talking about, and we're using the same one for both.
F: Yeah, we could merge them into one gate, right, and the gate is either zero or one, and you score one if... if the server fails with a 500, or your request takes longer than a certain amount of time, and that's basically bundled up. If you get a 400, it doesn't even count in the proper model, right?
F: If we wanted to go the whole hog, I think the only way you could really do that is by changing the code so that, effectively, we have a count that we count up on every request that's not a 4xx (we count up at the bottom), and then on every request that's not a 4xx, that takes less than a certain amount of time, and that's not a 500...
F: ...we count up one on the top. And that's the only way, and that's, like, a Ruby piece of code, and that's the only way you could really do it in Prometheus, effectively, for latency and valid requests, if you wanted to stick to the proper, proper definition. But I don't know if we want to do that. Yeah.
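A minimal sketch of that counter pair, assuming the prometheus-client Ruby gem; the metric names and latency threshold are illustrative:

```ruby
require 'prometheus/client'

registry = Prometheus::Client.registry

# Denominator: every request that is not a 4xx.
total = Prometheus::Client::Counter.new(
  :slo_requests_total, docstring: 'Valid (non-4xx) requests'
)
# Numerator: non-4xx requests that were fast and not a 5xx.
good = Prometheus::Client::Counter.new(
  :slo_requests_good_total, docstring: 'Valid requests that met the SLO'
)
registry.register(total)
registry.register(good)

THRESHOLD_SECONDS = 1.0 # illustrative latency threshold

def record(status, duration, total, good)
  return if (400..499).cover?(status) # 4xx: excluded entirely
  total.increment
  good.increment if status < 500 && duration < THRESHOLD_SECONDS
end
```

On the Prometheus side, the SLO ratio would then be the rate of the numerator counter divided by the rate of the denominator counter.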
B: I'm also a little bit lost on what problem we're trying to solve. I guess... I mean, what Bob was saying: that we should treat 500s and exceptions the same way. It shouldn't depend on where our middleware sits in Ruby whether something gets counted as a bad... as a 5xx, right? Because if something turns it into a 5xx before the middleware, then it shouldn't fly by; and if it's an exception, like, that ordering shouldn't matter. But that's straightforward.
D: Because I wonder if we should then just change this, put that in the docs, and say to people: this is... well, maybe not exactly like that because, like Jacob said, half a point is kind of confusing, but yeah, something like that. Something that says: these are the axes your request is judged upon, and, you know...
F: So, just... I haven't read through this whole thread, but what's a "handled error", sorry?
D: So this change makes it so that any error that bubbles up, like I mentioned earlier, any error that bubbles up out of Rails... well, out of the application, and gets handled at the Rails or Rack level, gets zero points, even if it's fast. No... is that right, Bob? Or does it just not get recorded? It doesn't get recorded, so that bottom table is wrong.
C
So,
where,
where
we're
at
now,
for,
if
something
raises
in
the
code,
we.
F: One out of two, because in that other one it would actually be, like, half out of... sorry, one out of two, or one, or...
B: Yeah, the bug is very clear; like, it shouldn't... yeah. We have code at the level of ApplicationController that... like, where does the Gitaly 503 come from? I think that's in ApplicationController. Yes.
C: There's a thing for that, like a mapping you can write out. I've looked into it with Mario, who left the company like a year ago, I think, but that doesn't exist.
C
That's
very
close
to
the
question
I
want
to
get
answered
here
like
right
now
we
have
a
thing
where
only
a
certain
type
of
500
gets
handled.
Should
we
change
both
of
those?
What
do.
C
Certain
type,
because
certain
type
as
in
render
503
versus
raise
something.
B: I mean, think about the rest of...
C: We know what we have to do, because we just need to move them. We just need to move... there I go again, because I just moved something. But, like, we just need, in the Rack request middleware, the one that we use for the entire observability stack, to move the Apdex measurement to the bottom and check the status code there. Yeah, so we record... no, actually: do we want to record it or not? I still haven't had an answer to that, actually.
F: I also think it might be really helpful just putting, like, good request / bad request on it, without all the different status codes, or maybe the status code class. That's what Google do a lot: they have 2xx, 3xx, 4xx, etc. And then, putting that on the latency, we can kind of progressively bring it in, right? It's not like a... you know, we can have two different recordings; we'll...
C: Yes... no, I don't know, I don't think so. No, we just have the method (GET, POST) kind of thing.
C: Everything that counts, like everything that counts for infrastructure, we're doing the bottom thing, so...
F: Measuring and all of that stuff is kind of... you can see a lot of other people who have the same kind of discussions, which is kind of nice to see, how things shake out.
D: Oh, sorry, I mean, yeah. Well, the error budgets are documented on the "dashboards for stage groups" page, right? It's not just about error budgets, so that's what I was talking about there. It's, like, you know, this is where we lead into talking about what the metrics that we use on GitLab.com are to developers, so, yeah.
C
The
page
that
I
linked
could
have
also
could
mention
this
as
in.
If
your
request
is
an
error,
it
will
not
count
as
an
uptext
like
there
will
no
not
be
an
aptx
up
like
in
the
the
equation
that
I
linked
there
and
we
could
add
that
to
the
dot
to
the
dot
there.
Yes,
I
was.
F: So, like, I really think it's important that all of the different definitions are as close to being in line with one another as possible. Like, the ones that we use for service monitoring and the error budgets are kind of... you know, we actively try not to diverge those too much, and where they do diverge, we bring them back in line, because it's just cognitive overhead when...
F: Like, maybe let's get this done here, but maybe a longer-term thing is to, like, actually start introducing another metric that is, you know, basically a one-and-a-zero counter, and then the flexibility you get there...
F: You know, this was actually in that SLO monitoring v2 proposal, but the proposal there is that you actually give the teams the ability to define those thresholds themselves, you know, with a... with a thing, and then it's actually in the code, as opposed to it being on... yeah.
B: Right, and then you could even do things like calculate the num...
B: ...general model of it.
C: Short term, I'm not going to change a lot, just change it so that we don't care anymore how an exception was raised, how an error, how a 500 was...